What is Hadoop? From the former CEO of Cloudera

Hadoop: an introduction to Big Data

Hadoop

  • According to Apache, it is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
  • It allows you to handle large amounts of data on very cheap commodity hardware.
  • It allows you to process large amounts of data very fast, letting you make faster data-driven decisions in your organization.
  • An ecosystem of tools that allows you to load, process, transform, analyze, and visualize data in ways never seen before.

Interview with Mike Olson

Mike Olson, former CEO of Cloudera, gives a more complete explanation of what Hadoop is and explains the history behind the Big Data movement. He also talks about a few of the main technologies in the Hadoop ecosystem.

This is a very good introductory video to understand Hadoop from the key player’s perspective. Cloudera is the top Hadoop distribution in the market now.

It will also give you a perspective on the past, present, and future of these Big Data technologies.

Main topics in the interview

  • History: how and where it all started
  • MapReduce: What it does in terms of data processing
  • Machine-generated data: how data is being generated
  • Price of data getting cheaper
  • Data processing/Complex data
  • HDFS (Hadoop Distributed File System)
  • Impact of Yahoo on Hadoop ecosystem
  • Advantages of being/working in Silicon Valley
  • Using Virtualization to play with Hadoop
  • What Cloudera does in the Hadoop World
  • How to use Hadoop in your organization
  • What newbies should know before building a system

[Tutorial] Hadoop Essentials

What is Big Data?

A collection of data sets so large and complex that it becomes difficult to process. Big Data includes many types and sources of data. Nowadays, data is growing exponentially, making it difficult to process and manage using traditional systems.

Big Data characteristics

Volume: very large amounts of data.

Velocity: data arriving fast and needing to be processed quickly.

Variety: different types and sources of data.

Variability: the ability to understand the context of your data and how it fits together.

Big Data Types

Structured: pre-defined schema, highly organized, e.g. relational databases.

Semi-structured: inconsistent structure, cannot be stored in rows and tables, e.g. logs, tweets, sensor feeds.

Unstructured: lacks structure, or only parts of it are structured; the fastest-growing category today, e.g. free-form text, reports, customer feedback.
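The difference between semi-structured and structured data can be made concrete with a small sketch: a web-server log line has no fixed schema like a database row, but it follows a pattern we can extract named fields from. The log line and field names below are hypothetical examples, not from the source.

```python
import re

# A semi-structured log line: meaningful, but not rows-and-columns.
log_line = '203.0.113.7 - - [12/Mar/2012:10:15:32] "GET /index.html" 200'

# A pattern turns it into a structured record with named fields,
# which could then be loaded into a table.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]+)" (?P<status>\d+)'
)
record = pattern.match(log_line).groupdict()
print(record["ip"])      # 203.0.113.7
print(record["status"])  # 200
```

This kind of extraction is exactly what Hadoop jobs often do at scale: turn piles of semi-structured logs into structured records for analysis.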

What is Hadoop?

  • Apache project.
  • Open-source scalable data processing platform that reliably runs on commodity hardware.
  • Allows you to store large amounts of data and analyze it quickly.
  • Allows you to make better decisions.
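The processing model behind Hadoop is MapReduce, mentioned in the interview topics above. A minimal single-process sketch of the idea in plain Python (not the Hadoop API, which runs this across a cluster) using the classic word-count example:

```python
from collections import defaultdict

# MapReduce in miniature: map emits (key, value) pairs, the framework
# groups pairs by key (the "shuffle"), and reduce aggregates each group.
# Hadoop distributes these phases over many machines; here everything
# runs locally for illustration only.

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hadoop stores data", "Hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["hadoop"])  # 2
print(counts["data"])    # 2
```

Because map and reduce work on independent pieces, Hadoop can run them in parallel on commodity hardware, which is where the speed comes from.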

Hadoop Characteristics

  • Open source.
  • Distributed data replication.
  • Uses commodity hardware.
  • Data and analysis co-location (computation runs on the nodes where the data is stored).
  • Reliable error handling.

Hadoop in the Enterprise

  • You can retain years of data.
  • Retain intermediate formats (not just raw data and the final product).
  • Allows you to explore data without the need to transform and extract it.
  • Turns unstructured data into structured data.

Hadoop vs Relational Database Systems

  • It will not replace databases.
  • They have different functionality.
  • They can coexist in a data warehouse.
  • Each enterprise will have different use cases where Big Data technologies and RDBMS coexist.

Credits

Thanks to Robert Scoble for the interview and video.