9 main components to learn Apache Hadoop

Learning Hadoop

When organizations start thinking about Big Data, developers are often put in the difficult position of having to come up with quick, profitable solutions. One of the first buzzwords thrown around when brainstorming solutions to Big Data challenges is Apache Hadoop. Software developers might have an idea of what Hadoop is, but most of them are not experts in it. Bringing in a new technology might excite a developer, but unrealistic expectations from the business can complicate the journey to Hadoop as a Big Data solution.

Hadoop is not just a tool. It is an ecosystem in which each piece plays a part in cracking the Big Data challenges that medium and large companies face in this information era. That said, developers need to be careful about how they approach learning the ecosystem behind Hadoop.


Here are the 9 main components to learn Apache Hadoop:

1.- HDFS, the Hadoop Distributed File System: it is pretty much the central storage for ALL your data. This includes structured data (e.g. relational tables) and unstructured data (e.g. log files).

2.- MapReduce: a programming model designed to process big data. The key to MapReduce is that it runs on large clusters of commodity hardware. If you are getting close to capacity, you just add more “cheap” commodity hardware in order to scale. The MapReduce concept is easy to learn, but it is hard to become an expert.

3.- Spark: a fast and general engine for large-scale data processing. It is gaining traction in the Big Data community because it can process large amounts of data in memory.

4.- Pig: a data flow language. It resembles a functional language, and its purpose is to manipulate very large datasets. It reads the data directly from the Hadoop file system (HDFS) and processes it on the MapReduce cluster.

5.- Hive: a.k.a. SQL on Hadoop. It is a distributed data warehouse on top of HDFS. You can query the data easily, and there is not much of a learning curve if you know SQL. The queries are translated to MapReduce jobs under the covers.

6.- Sqoop: you need a way to transfer data to HDFS. Sqoop transfers structured data from relational databases to HDFS.

7.- Flume: transfers unstructured data (e.g. log files) to HDFS easily.

8.- HBase: the Hadoop database. People often confuse HDFS and HBase, but they are not the same thing. HBase is a distributed, column-oriented database on top of HDFS.

9.- Oozie: a workflow scheduler for Hadoop jobs. Once you have created your Pig/MapReduce/Hive jobs, you need a scheduling assistant so you can start your jobs and track them.
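
To make the MapReduce idea from item 2 a bit more concrete, here is a minimal sketch of the classic word count in plain Python. It only imitates the three phases (map, shuffle, reduce) inside a single process. It does not use the Hadoop API, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    # Assumption: everything runs sequentially in one process. On a real
    # cluster the mappers and reducers would run in parallel on many machines.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(sample))
```

The point of the split is that each mapper only looks at its own line and each reducer only looks at one key, which is what allows the work to be spread across more commodity machines as the data grows.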

Hadoop is an ever-growing ecosystem. People in the open source community are working very hard to bring new tools and improve the existing ones. With these 9 components, developers can start building a solid foundation in what Hadoop brings as a Big Data solution.