![[Assets/hadoop_logo.jpg|100]]
[Apache Hadoop](https://hadoop.apache.org/) is a data processing framework designed to batch process large amounts of data across a cluster of machines.
# Official Documentation
https://hadoop.apache.org/docs/current/
# Apache Hadoop Advantages
- **MapReduce** programming model for parallel batch processing
- *Divide and Conquer* strategy: work is split into tasks that run on different nodes
- HDFS, a distributed file system to store the data
# Apache Hadoop Disadvantages
- Hadoop MapReduce reads from and writes to disk between steps, which makes it slower than in-memory processing
- Low efficiency on small files
- High latency
# HDFS
**HDFS**, the Hadoop Distributed File System, is Hadoop's native storage layer. It lets us store structured and unstructured data across the nodes of a cluster. Although HDFS is the main option, and one that Apache Spark can also read from and write to, Hadoop supports other file systems and access protocols such as HFTP, HSFTP, WebHDFS, and Amazon S3.
# How does Hadoop work?
Apache Hadoop is built on a **Leader-Follower** architecture: a Leader node coordinates the work and Follower nodes carry it out. In HDFS, the Leader is known as the *NameNode* and keeps track of where the data lives, while the Followers, known as *DataNodes*, store the data itself.
Once the *NameNode* and the *DataNodes* are configured, the Leader runs a **job-tracker**, which controls the tasks through **task-trackers** running on the Follower nodes. This prevents Followers from trying to complete every task: the job-tracker **indicates exactly** which tasks each Follower node must run and which data each Follower needs to fetch from the DataNodes. (JobTracker and TaskTracker are the Hadoop 1 names; since Hadoop 2, YARN's ResourceManager and NodeManagers fill these roles.)
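The idea that the job-tracker tells each follower exactly which tasks to run can be illustrated with a toy model. This is only a sketch: the `assign_tasks` function, the split names, and the follower names are hypothetical, and the round-robin rule is a simplification rather than Hadoop's actual scheduling logic.

```python
from collections import defaultdict
from itertools import cycle


def assign_tasks(splits, followers):
    """Toy job-tracker: give every input split to exactly one follower.

    Round-robin is an assumption made for illustration only.
    """
    plan = defaultdict(list)
    for split, follower in zip(splits, cycle(followers)):
        plan[follower].append(split)
    return dict(plan)


# Hypothetical input splits and follower nodes.
splits = ["split-0", "split-1", "split-2", "split-3", "split-4"]
followers = ["follower-a", "follower-b"]

plan = assign_tasks(splits, followers)
for follower, assigned in plan.items():
    print(follower, "->", assigned)
```

Each split appears in exactly one follower's list, so no follower attempts work that belongs to another.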
This *Divide and Conquer* approach is also what defines MapReduce on Hadoop. Tasks are divided among the follower nodes, which improves processing speed on very large datasets. MapReduce consists of two functions:
- Map: Map reads the input data from HDFS, processes it record by record, and emits intermediate key-value pairs that are passed on to the Reduce step.
- Reduce: Reduce retrieves all the pairs, groups them by key, and combines the values of each key to produce the final pairs. Once it finishes, the output is stored in HDFS, that is, on the DataNodes.
When the MapReduce job is finished, the Leader is informed.
![[Assets/mapreduce.png]]
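The Map and Reduce steps can be sketched in plain Python with the classic word count example. This is a single-process illustration of the model, not Hadoop's API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample lines are assumptions made for the example. In a real cluster, each phase would run in parallel on different follower nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit one (word, 1) pair for every word in an input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single final pair."""
    return key, sum(values)


# Hypothetical input; on Hadoop these lines would be read from HDFS.
lines = ["big data big cluster", "data node", "big node"]

intermediate = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(intermediate)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both steps can be split across many nodes without changing the result.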
# When to use Hadoop?
| **_Use it if..._** | **_Don't use it if..._** |
|:---------------------------------------------------------------:|:-----------------------------------------------------------:|
| **You are working with tasks that can be divided into independent subtasks** | **You are working with serial tasks or low-latency tasks** |