Getting started with Hadoop on Cassandra

 


This guide explains how to build a massively scalable, BigData-crunching analytical system by overlaying Apache Hadoop on an Apache Cassandra database ring, replacing Hadoop's own file system (HDFS) as the primary store of both raw and processed data, as well as how to write and modify a MapReduce Java program so that it is compatible with this setup.

 

Why would you want to do this?

I can think of at least two reasons why you would want to do this. First, if you are using, or plan to use, Cassandra, you might be running into some of its limitations when dealing with analytical queries. The distributed nature of Cassandra matches the distributed nature of MapReduce very nicely, so if you need to run queries that touch data spanning multiple nodes, Hadoop and MapReduce are a logical fit: the right tool for the job.

The second reason is the limitations of the Hadoop Distributed File System (HDFS). To my mind it is not the greatest or most flexible of data storage engines, requiring chunks of data to be written out as files and transferred in. OK, I admit that it continues to evolve, and there are many tools in the Hadoop ecosystem to help, but to me nothing quite beats the simplicity and flexibility of having Cassandra, its query language (CQL) and its multiple driver support as your data storage layer for MapReduce, for both the input and output of data.

What about HBase, the column-oriented NoSQL database that comes pre-integrated with Hadoop? Well, some people simply prefer Cassandra, and some like the fact that Cassandra uses peer-to-peer rather than master-slave replication (unlike HDFS, and therefore HBase). I fall into both camps.

 

Who is this guide for?

Hadoop, MapReduce and Cassandra are not point-and-click technologies. There is quite a bit(!) of Linux configuration and some Java coding. Also, I'm going to assume you know a bit about Cassandra and can set up, or already have set up, a multi-node Cassandra ring. An understanding of the Cassandra Query Language (CQL) and of Cassandra's data model, Keyspaces and Column Families, will be useful for taking this forward. This guide does completely cover the setup of Hadoop, the setup of a Java development environment using Eclipse, and the writing of a MapReduce job, first using HDFS and then Cassandra as the data store. We don't cover the mechanics of MapReduce itself, though; that is a topic all of its own!

 

 

The Guide

The guide consists of 7 parts. We start with the preparation of your cluster. We then deploy Hadoop, set up our Java development environment for writing and debugging MapReduce code, then deploy this code to Hadoop.

Finally, we modify our Hadoop configuration to work with Cassandra and amend our MapReduce code accordingly.
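To give a flavour of where the later parts end up, here is a minimal sketch of the kind of job driver we build towards. The keyspace ("demo"), table ("raw_data") and node address are hypothetical placeholders, and the helper classes come from Cassandra's org.apache.cassandra.hadoop package, whose class names vary a little between Cassandra versions (for example CqlInputFormat versus the older CqlPagingInputFormat), so treat this as an outline rather than copy-and-paste code.

// Rough, hypothetical sketch only: keyspace "demo", table "raw_data" and the
// node address are placeholders; exact class names in the
// org.apache.cassandra.hadoop package differ between Cassandra versions.
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "cassandra-mapreduce-sketch");
        job.setJarByClass(CassandraJobSketch.class);

        // Read input rows from a Cassandra column family instead of HDFS files
        job.setInputFormatClass(CqlInputFormat.class);
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "demo", "raw_data");
        ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
        CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "100");

        // Mapper, reducer and output settings (e.g. writing results back into
        // Cassandra) are added in the later parts of the guide.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}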

Throughout, I have tried to be as clear as possible; however, no one, least of all me, is perfect, and I'm bound to have made some mistakes and omissions. I'll leave the comments section open, so if you spot a mistake, or think you can add something of interest, please feel free to post, both for my own learning and for the greater good of the community! :)

Also if you get stuck, feel free to contact me via one of the ways mentioned in the contact section and I’ll do my best to help.

 

 
