Apache Cassandra – The Hadoop NoSQL solution

About Apache Cassandra

  • Apache Cassandra is based on Google’s Bigtable and Amazon’s Dynamo.
  • Massively scalable open source NoSQL database.
  • Well suited for managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud.
  • Delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure.
  • Capable of handling petabytes of information and thousands of concurrent users/operations per second.

How does Cassandra fit into Hadoop?

Like Apache Cassandra, Apache Hadoop is a distributed system. Cassandra’s built-in Hadoop integration provides the JobTracker with data locality information so that tasks can be scheduled close to the data.

Cassandra also supports MapReduce and Apache Pig, which are widely used for parallel data processing in HDFS. Cassandra can be used as the store for data produced by MapReduce or Apache Pig jobs, in the following manner:

The Job Tracker/Resource Manager (JT/RM) receives a MapReduce job from the client application. The JT/RM distributes the job to the Task Trackers/Node Managers (TT/NM), which run the MapReduce or Pig tasks. The output data is then written to Cassandra and the results are sent back to the client.

How is it used in large-scale projects?

Apache Cassandra is capable of handling the big data challenges that arise in enterprise applications and real-time systems, e.g. massive scalability, high performance, strong security, fault tolerance, and ease of management.

Here are some use cases of Apache Cassandra, along with some of the large companies that rely on it:

Real-time messaging and collaboration: Real-time monitoring and alerts across channels, sites, and data centers; online discussions; social media engagement; and shared activities. Cassandra makes it easy to store, search, and analyze high volumes of dynamic user-activity data.

Users of Apache Cassandra for messaging: Comcast, Accenture, Instagram, etc.

Designed for the Internet of Things (IoT): Cassandra is built to ingest time-series data and sensor-based information at very high rates.

Users of Apache Cassandra for IoT: Aeris, Zonar, NREL, i2o, etc.

Fraud detection: Real-time monitoring across all channels and data centers makes it easy to spot suspicious activity before it escalates into compliance issues, identity theft, and other crimes. Massive data volumes can be analyzed with ease, while multi-data-center and cloud replication ensures that applications and data suffer no downtime.

Users of Apache Cassandra for fraud detection: eBay, Instagram, Barracuda

Key Features of Apache Cassandra

  • Partitioned Row Store database:  Cassandra’s architecture allows any authorized user to connect to any node in any data center and access data using the CQL language.
  • Automatic data distribution: Cassandra provides automatic data distribution across all nodes that participate in a ring or database cluster.
  • Linear Scalability: Cassandra supplies linear scalability, meaning that capacity may be easily added simply by adding new nodes online. For example, if 2 nodes can handle 100,000 transactions per second, 4 nodes will support 200,000 transactions/sec and 8 nodes will tackle 400,000 transactions/sec.
  • Cassandra architecture: The architecture is based on the understanding that system and hardware failures can and do occur. Cassandra addresses failures by employing a peer-to-peer distributed system across homogeneous nodes, with data distributed among all nodes in the cluster.

More about Apache Cassandra’s Architecture

  • There is a sequentially written commit log on each node which captures write activity to ensure data durability.
  • Data is then indexed and written to an in-memory structure called a memtable.
  • Once the memory structure is full, the data is written to disk in an SSTable data file. 
  • All writes are automatically partitioned and replicated throughout the cluster.
  • Using a process called compaction, Cassandra periodically consolidates SSTables, discarding obsolete data and keeping only the latest version in storage.

Reads and Writes in Apache Cassandra

Writes

Logging writes and memtable storage: When a write occurs, Cassandra stores the data in an in-memory structure, the memtable, and also appends the write to the commit log on disk.

The memtable stores writes until reaching a limit, and then is flushed.

Flushing data from the memtable: When the memtable contents exceed a configurable threshold, the memtable data, which includes indexes, is put in a queue to be flushed to disk.

Storing data on disk in SSTables: Data in the commit log is purged after its corresponding data in the memtable is flushed to an SSTable.

Reads

To satisfy a read, Cassandra must combine results from the active memtable and potentially multiple SSTables.

First, Cassandra checks the Bloom filter. Each SSTable has a Bloom filter associated with it that checks the probability of having any data for the requested partition in the SSTable before doing any disk I/O.

If the Bloom filter does not rule out the SSTable, Cassandra checks the partition key cache and takes one of these courses of action:

If an index entry is found in the cache, Cassandra goes to the compression offset map to find the compressed block containing the data, fetches the compressed data from disk, and returns the result set.

If an index entry is not found in the cache, Cassandra searches the partition summary to determine the approximate location on disk of the index entry.
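One way to observe this read path is to enable request tracing in cqlsh; the trace output reports, among other things, whether a Bloom filter allowed an SSTable to be skipped. A minimal illustration (the keyspace, table, and key below are placeholders):

cqlsh> TRACING ON;
cqlsh> SELECT * FROM demodb.users WHERE user_name = 'jsmith';
cqlsh> TRACING OFF;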


Cassandra Query Language (CQL)

CQL queries can be executed interactively using cqlsh, the CQL shell.
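For example, assuming a Cassandra node is running locally on the default port, you can start the shell and run a simple query against the built-in system.local table (the host and port shown are only an assumption):

cqlsh 127.0.0.1 9042
cqlsh> SELECT cluster_name, release_version FROM system.local;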

Creating and updating a keyspace

Creating a keyspace is the CQL counterpart to creating an SQL database, but a little different. A Cassandra keyspace is a namespace that defines how data is replicated on nodes. Typically, a cluster has one keyspace per application, and replication is controlled on a per-keyspace basis.

cqlsh> CREATE KEYSPACE demodb WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
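The same syntax also covers the "updating" part of this section: a single-data-center development cluster might use SimpleStrategy with a replication_factor instead, and replication settings can be changed later with ALTER KEYSPACE (the replication factor below is only illustrative):

cqlsh> ALTER KEYSPACE demodb WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 2 };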

Now use this keyspace to create tables and query data.

USE demodb;

Creating a table in the keyspace

CREATE TABLE users (
  user_name varchar,
  password varchar,
  gender varchar,
  session_token varchar,
  state varchar,
  birth_year bigint,
  PRIMARY KEY (user_name)
);
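Rows in this table can then be looked up by their primary key; for example (the user_name here is just a placeholder, assuming such a row has been inserted):

SELECT user_name, state, birth_year FROM users WHERE user_name = 'jsmith';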

Using a compound primary key

Use a compound primary key to create columns that you can query to return sorted results.

CREATE TABLE emp (
  empID int,
  deptID int,
  first_name varchar,
  last_name varchar,
  PRIMARY KEY (empID, deptID)
);

The compound primary key is made up of the empID and deptID columns in this example. The empID acts as the partition key, distributing the table’s data among the various nodes that comprise the cluster. The remaining component of the primary key, the deptID, acts as a clustering column and ensures that, within each partition, the data is stored on disk in sorted order of deptID.

Inserting data into a table

INSERT INTO emp (empID, deptID, first_name, last_name) VALUES (104, 15, 'jane', 'smith');
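Because deptID is a clustering column, rows within a partition come back sorted by deptID, and the order can be reversed at query time; a small illustration using the partition inserted above:

SELECT * FROM emp WHERE empID = 104;
SELECT * FROM emp WHERE empID = 104 ORDER BY deptID DESC;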

Using Collection Data Types

Set 

A set stores a group of unique elements; the values are stored unordered but are returned in sorted order. Using the set data type, you can solve the multiple-email problem in an intuitive way that does not require a read before adding a new email address.

Define a set, emails, in the users table to accommodate multiple email addresses:

CREATE TABLE users (
  user_id text PRIMARY KEY,
  first_name text,
  last_name text,
  emails set<text>
);

Insert data into the set, enclosing values in curly brackets. Set values must be unique.

INSERT INTO users (user_id, first_name, last_name, emails)
VALUES ('frodo', 'Frodo', 'Baggins', {'f@baggins.com', 'baggins@gmail.com'});
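A new address can later be added with an UPDATE that appends to the set, without first reading the existing values, which is the point made above (the address is just an example):

UPDATE users
SET emails = emails + {'fb@friendsofmordor.org'}
WHERE user_id = 'frodo';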

List

In a list, order is preserved and duplicates are allowed. Add a list declaration to a table by adding a column top_places of the list type to the users table.

ALTER TABLE users ADD top_places list<text>;

Use the UPDATE command to insert values into the list.

UPDATE users
SET top_places = [ 'rivendell', 'rohan' ]
WHERE user_id = 'frodo';
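Because order is preserved, elements can be prepended or appended to an existing list (the place names are illustrative):

UPDATE users SET top_places = [ 'the shire' ] + top_places WHERE user_id = 'frodo';
UPDATE users SET top_places = top_places + [ 'mordor' ] WHERE user_id = 'frodo';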

Map

A map is a collection of key-value pairs, where the keys and values each have a declared type. Each element of the map is internally stored as one Cassandra column that you can modify, replace, delete, and query. Add a todo list to every user profile in an existing users table using CREATE TABLE or ALTER TABLE:

ALTER TABLE users ADD todo map<timestamp, text>;

UPDATE users
SET todo = { '2012-9-24' : 'enter mordor', '2014-10-2 12:00' : 'throw ring into mount doom' }
WHERE user_id = 'frodo';
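Since each map entry is stored as its own column, individual entries can be set or removed without rewriting the whole map (the dates and tasks are just examples):

UPDATE users SET todo['2014-10-2 12:00'] = 'destroy the ring' WHERE user_id = 'frodo';
DELETE todo['2012-9-24'] FROM users WHERE user_id = 'frodo';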

User Defined Types

User-defined types are available in Cassandra 2.1 and later.

CREATE TYPE mykeyspace.address (
  street text,
  city text,
  zip_code int,
  phones set<text>
);

CREATE TYPE mykeyspace.fullname (
  firstname text,
  lastname text
);

Create a table for storing user data in columns of type fullname and address. Use the frozen keyword in the definition of the user-defined type columns:

CREATE TABLE mykeyspace.users (
  id uuid PRIMARY KEY,
  name frozen <fullname>,
  direct_reports set<frozen <fullname>>,  // a collection set
  addresses map<text, frozen <address>>   // a collection map
);

When using the frozen keyword, you cannot update parts of a user-defined type value.

Insert a user’s name into the fullname column: 

INSERT INTO mykeyspace.users (id, name)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, {firstname: 'Marie-Claude', lastname: 'Josset'});

Insert an address labeled home into the table:

UPDATE mykeyspace.users
SET addresses = addresses + {'home': { street: '191 Rue St. Charles', city: 'Paris', zip_code: 75015, phones: {'33 6 78 90 12 34'} }}
WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;
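The stored user-defined type values can then be read back like any other column; for example, retrieving the name and addresses for the row written above:

SELECT name, addresses FROM mykeyspace.users WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;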

Try Cassandra

http://www.planetcassandra.org/try-cassandra/

Conclusion

There are many NoSQL solutions available to Big Data developers, and Apache Cassandra is one of the most preferred in today’s Big Data world. Its ease of use and ease of scaling make it very attractive for web-scale applications in today’s rapidly growing Big Data projects.

[Video] Webinar: Introduction to Cassandra