Monday, January 16, 2017

The quick and dirty on getting a Hadoop cluster up and running

The last time I tested out a Hadoop cluster was about four years ago. At the time, the setup was manual and I used two machines to build a two-node cluster, ran MapReduce jobs, and tested the system out. Fast forward to 2017 - there are a lot more animals in the Hadoop circus, and Cloudera has made things very convenient by offering several ways to test: a QuickStart VM, Docker images, etc. Still, getting the system up and running and executing the first Sqoop job took some effort. I have documented what I did, including the tweaks, so anyone running into the same issues can get help. Here are the steps:

  1. Get a system ready - at least 16GB of RAM. I had a Linux Mint box on hand that I used. Mint is generally similar to Ubuntu, but you do have to watch out for version-specific instructions.
  2. The next step is to install Docker. For Linux Mint 18, the steps here from Simon Hardy came in really handy (a sketch of the install commands appears after this list).
  3. Cloudera provides a Docker quickstart container - do not use that. There is a lot of documentation and many links pointing to it, and it's quite easy to go down that path. A better option is to use Cloudera's clusterdock. Clusterdock deploys a multi-node cluster on a single Docker host (by default it builds a two-node cluster; see the launch sketch after this list). The clusterdock documentation is very useful, but there are a few catches, which I note here:
    1. The Cloudera online tutorial is based on the quickstart Docker container or the QuickStart VM. It has several dependencies, including a MySQL database and the Flume files used in the demo. I would suggest keeping that container around for a bit and copying its data over to the clusterdock nodes as needed. The clusterdock setup is much more stable in Cloudera Manager, so the slight inconvenience (and the occasional hardware freeze) may be worth it. After launching the quickstart container, you can simply tar the /opt/examples folder, dump the MySQL retail_db database, transfer both to the host machine using docker cp, and then kill the quickstart container (see the data-copy sketch after this list).
    2. The clusterdock.sh script on the Cloudera website lacks 'sudo' in a couple of places - be aware of that. It's easy to spot if it causes a problem; the ssh helper function, for example, has this issue.
    3. I wanted to run a Sqoop command, and for that I needed a MySQL database to connect to. There was one installed on the host machine, but connecting the clusterdock containers to the host's MySQL looked almost impossible, or at least very time-consuming - you run into Docker networking issues. The easy way out is to run a MySQL Docker container and put it on the same user-defined network that the clusterdock nodes use. Note that you have to force the IP and the network on the MySQL container to do that, and also map the MySQL port 3306 so it is open for access (see the MySQL container sketch after this list). Otherwise you will waste a lot of time!
  4. The next step is to ssh into the master node and run a Sqoop job. At this point you will run into a lot of permission issues if you are not careful about where you store the imported target files; Sqoop generally reports these permission errors as exception stack traces. Best is to google the errors and fix any paths you give to the sqoop command (an import example appears after this list).
  5. You also need to copy the MySQL JDBC driver JAR into Sqoop's lib folder. The easiest way is to get it onto the host box and then use docker cp to move it to the desired location (see the sketch after this list).
  6. That should do it - it gets you past the Sqoop step, and from there you can run a query in Hue.
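
For step 2, here is roughly what the Docker install looks like on Linux Mint 18. Mint 18 tracks Ubuntu 16.04 (xenial), so you point apt at the Ubuntu repository and hard-code the codename instead of relying on lsb_release. Treat this as a sketch - package names and repository URLs have changed over time, so defer to Simon Hardy's post or the official Docker docs if something does not match.

    # Mint 18 is based on Ubuntu 16.04, so use the xenial repository explicitly
    # (lsb_release -cs on Mint returns a Mint codename Docker's repo doesn't know).
    sudo apt-get update
    sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    sudo add-apt-repository \
      "deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable"
    sudo apt-get update
    sudo apt-get install -y docker-ce

    # Optional: run docker without sudo (log out and back in afterwards).
    sudo usermod -aG docker "$USER"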
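
For step 3, launching clusterdock boils down to sourcing a helper script and calling the shell functions it defines. The helper names below (clusterdock_run, clusterdock_ssh) and the node and network names are what I recall from the Cloudera instructions - check the current clusterdock page before copy-pasting.

    # Pull down the clusterdock helper script and source it into the shell.
    curl -sL http://tiny.cloudera.com/clusterdock.sh -o clusterdock.sh
    source clusterdock.sh

    # Start the default two-node CDH cluster; the nodes land on a user-defined
    # Docker network ("cluster" by default, if I remember correctly).
    clusterdock_run ./bin/start_cluster cdh

    # SSH into the primary node. This is the helper that was missing a sudo
    # on my machine (see note 2 above).
    clusterdock_ssh node-1.cluster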
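
For note 1 above, this is the kind of copy I did to salvage the tutorial data from the quickstart container before killing it. The container name, the MySQL credentials (root/cloudera, if memory serves) and the paths are assumptions drawn from the quickstart image - adjust to whatever docker ps and the tutorial show on your machine.

    # Inside the quickstart container: bundle the tutorial files and dump retail_db.
    # "quickstart" is whatever name `docker ps` shows for that container.
    docker exec quickstart tar -czf /tmp/examples.tgz -C /opt examples
    docker exec quickstart sh -c 'mysqldump -u root -pcloudera retail_db > /tmp/retail_db.sql'

    # Pull both onto the host, then the quickstart container can go away.
    docker cp quickstart:/tmp/examples.tgz .
    docker cp quickstart:/tmp/retail_db.sql .
    docker rm -f quickstart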
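
For note 3 above, this is the shape of the MySQL container I ended up running. The network name ("cluster"), the fixed IP, and the credentials are assumptions - inspect the clusterdock network (docker network ls, docker network inspect) and pick a free address in its subnet.

    # Run MySQL on the same user-defined network as the clusterdock nodes,
    # with a fixed IP and port 3306 published so nothing has to guess.
    docker run -d --name sqoop-mysql \
      --net cluster --ip 192.168.123.100 \
      -p 3306:3306 \
      -e MYSQL_ROOT_PASSWORD=secret \
      -e MYSQL_DATABASE=retail_db \
      mysql:5.6

    # Load the dump taken from the quickstart container earlier.
    docker exec -i sqoop-mysql mysql -u root -psecret retail_db < retail_db.sql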
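
For step 4, a minimal import run from the master node might look like the following. It assumes the MySQL container sketched above, that the JDBC driver from step 5 is already in place, and that the retail_db tables from the tutorial were loaded; the permission fix is simply to give the user running the job an HDFS home directory and import under it.

    # Give the user running the job (root here) an HDFS home directory first;
    # most of the permission stack traces come from importing somewhere else.
    sudo -u hdfs hadoop fs -mkdir -p /user/root
    sudo -u hdfs hadoop fs -chown root /user/root

    # Import one retail_db table from the MySQL container set up above.
    sqoop import \
      --connect jdbc:mysql://192.168.123.100:3306/retail_db \
      --username root --password secret \
      --table categories \
      --target-dir /user/root/categories \
      -m 1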
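
For step 5, the driver copy is just a docker cp from the host into the node that runs the Sqoop client. The connector version and the Sqoop lib path below are the ones I had and are worth double-checking - on a parcel-based install the path is under /opt/cloudera/parcels/CDH/lib/sqoop/lib instead.

    # Find the container backing the primary node (name patterns vary), then
    # drop the MySQL connector JAR into Sqoop's lib directory on that node.
    NODE1=$(docker ps --filter "name=node-1" --format '{{.ID}}')
    docker cp mysql-connector-java-5.1.40-bin.jar "$NODE1":/usr/lib/sqoop/lib/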
