How to Deploy a Cluster

Posted in: Big Data, Hadoop, Technical Track

 

In this blog post I will cover how to deploy a Hadoop cluster: the methods I tried, and how I solved the prerequisites problem that kept causing my installations to fail.

I’m fairly new to the big data field. While learning about Hadoop, I kept hearing the term “cluster”: deploying a cluster, installing some services on the NameNode, others on the DataNodes, and so on. I also heard about Cloudera Manager, which helps deploy services across a cluster, so I set up a VM and followed several tutorials, including the Cloudera documentation, to install Cloudera Manager. However, every time I reached the “cluster installation” step, the installation failed. I later found out that a Cloudera Manager installation has several prerequisites, and missing them was the reason for the failures.

 

Deploy a Cluster

Though I discuss three other methods below (Method 3 in detail), ultimately I recommend Method 4.

Method 1:

I ran my very first MapReduce job on the “labianchin/hadoop” Docker image. It was easy to use, as it comes with Hadoop pre-installed; however, it did not contain any other services. I manually installed Hive and Oozie, but soon ran into configuration issues.

Method 2:

A lot of people recommend Cloudera’s QuickStart VM. Sadly, my system did not meet its memory requirements, which led me to look for other methods.

Method 3:  

Cloudera Manager installation on Google Cloud. I discuss this method in detail below.

Method 4:

The Cloudera Docker image, released on December 1st, 2015. This is by far the quickest and easiest way to use Cloudera services: all services come pre-installed and pre-configured. I recommend creating a single Google Compute Engine VM instance as described below, then installing Docker, pulling the image, and running the container.
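As a sketch, the Method 4 workflow on a fresh CentOS VM looks roughly like the following (I am assuming the image name `cloudera/quickstart`, which is what Cloudera published to Docker Hub; package names may vary by OS release):

```shell
# Install and start Docker (CentOS; package name may differ on other distros)
sudo yum install -y docker
sudo service docker start

# Pull the Cloudera QuickStart image (several GB -- this takes a while)
sudo docker pull cloudera/quickstart:latest

# Run the container. The fixed hostname and --privileged are required by
# the init script; -p 8888:8888 exposes Hue on port 8888.
sudo docker run --hostname=quickstart.cloudera --privileged=true \
    -t -i -p 8888:8888 cloudera/quickstart /usr/bin/docker-quickstart
```

Once the container's startup script finishes, Hue should be reachable on port 8888 of the VM.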

 

Step-by-Step Guide to Cloudera Manager Installation on Google Cloud (Method 3)

I will be creating a cluster of 4 nodes: 1 node will run Cloudera Manager and the remaining 3 will run services.

Create a Google compute engine VM instance:

  1. From the Developers console, under Resources, select ‘Compute Engine’.
  2. Click ‘New instance’ at the top of the page.
  3. Enter an instance name.
  4. Select a zone.
  5. Select machine type as ‘2vCPUs 13GB RAM n1-highmem-2’ with 80 GB disk and CentOS 6.7 image (Change disk space according to requirements).
  6. Create.
  7. When the instance is ready, select it and open in Edit Mode.
  8. Scroll down to SSH keys and enter your public key here. If you do not have one, generate a key pair with the following commands (or use PuTTYgen on Windows):
    • ssh-keygen -t rsa
    • cat ~/.ssh/id_rsa.pub
  9. Save.
  10. Create 3 clones of this instance.
  11. Start all 4 instances.
  12. Follow the steps in Google’s “Repartition a root persistent disk” guide to expand the disk to its allotted size [Repeat on all instances].
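If you prefer the command line to the Developers console, the steps above can be sketched with the `gcloud` CLI. The instance names, zone, and image family below are illustrative assumptions, not values from the original post:

```shell
# Create the Cloudera Manager node (n1-highmem-2: 2 vCPUs, 13 GB RAM)
# with an 80 GB CentOS 6 boot disk
gcloud compute instances create cloudera-cm \
    --zone us-central1-a \
    --machine-type n1-highmem-2 \
    --image-family centos-6 --image-project centos-cloud \
    --boot-disk-size 80GB

# "Clone" the same shape for the three service nodes
for host in cdh1 cdh2 cdh3; do
  gcloud compute instances create "$host" \
      --zone us-central1-a \
      --machine-type n1-highmem-2 \
      --image-family centos-6 --image-project centos-cloud \
      --boot-disk-size 80GB
done
```

SSH keys added to the project metadata are propagated to instances automatically, which replaces step 8 above.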

Prerequisites:

  • To allow nodes to SSH to each other, edit the /etc/hosts file to include the hostname and internal IP of every node [Repeat on all instances]:
  • Change swappiness to minimum value without disabling it [Repeat on all instances]:
  • Disable iptables:
  • Disable redhat_transparent_hugepage:
  • Install MySQL:
  • Install Java:
  • Download mysql-connector [Repeat on all instances]:
  • Install Cloudera Manager:
  • Update the database name, user, and password in ‘db.properties’:
  • Start cloudera-server and watch the logs until “Jetty Server started” appears. This may take a while:
  • Access Cloudera Manager from the browser to complete the installation:
    • Install PuTTY.
    • Open your private key file in PuTTY’s Pageant. By default it is located at C:/Users/username/.ssh/filename.ppk. The Pageant icon will appear in the system tray.
    • Fill in the external IP of the VM instance (the node where the Cloudera server is running) as the hostname in PuTTY.
    • In the category tree on the left, go to Connection > SSH > Tunnels.
    • Enter the internal IP of the VM instance in the destination textbox with port 7180, e.g. 10.240.0.2:7180.
    • For ease of remembering ports, set the source port to 7180 (the same as the destination port); 7180 is the default port for Cloudera Manager. You can redirect to another port if 7180 is not available.
    • Apply and Open the connection.
    • Open the browser and go to “localhost:7180”.
    • Proceed with cluster installation using Cloudera manager.
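The original post showed a command snippet under each prerequisite bullet; here is a hedged sketch of what those commands typically look like on CentOS 6. The IPs, hostnames, package versions, and download URLs are placeholders and assumptions, not values from the post:

```shell
# /etc/hosts entries so nodes can resolve each other [all instances]
cat <<'EOF' | sudo tee -a /etc/hosts
10.240.0.2  cloudera-cm
10.240.0.3  cdh1
10.240.0.4  cdh2
10.240.0.5  cdh3
EOF

# Minimize swappiness without disabling swap [all instances]
sudo sysctl vm.swappiness=1
echo 'vm.swappiness = 1' | sudo tee -a /etc/sysctl.conf

# Disable iptables
sudo service iptables stop
sudo chkconfig iptables off

# Disable transparent huge page defrag (CentOS 6 / RHEL path)
echo never | sudo tee /sys/kernel/mm/redhat_transparent_hugepage/defrag

# Install MySQL and Java; fetch the MySQL JDBC connector [all instances]
sudo yum install -y mysql-server java-1.7.0-openjdk-devel
sudo service mysqld start

# On the Cloudera Manager node: run the bin installer
curl -O https://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
chmod +x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin

# Point Cloudera Manager at the database, then start it and watch the
# logs until "Jetty Server started" appears
sudo vi /etc/cloudera-scm-server/db.properties
sudo service cloudera-scm-server start
sudo tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log
```

If you are not on Windows, the PuTTY tunnel described above can be replaced with plain OpenSSH port forwarding: `ssh -L 7180:<internal-ip>:7180 user@<external-ip>`, then browse to localhost:7180.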

Cluster Installation:

  • Log in as “admin” / “admin”.
  • Accept the Terms and Conditions, then continue.
  • Select “Cloudera Express” or “Cloudera Enterprise Data Hub Edition Trial”, then continue.
  • Search for your machines’ hostnames and select all 4 nodes.

[Screenshot: host search results]

  • On the “Cluster Installation” page continue with all default settings.
  • On the “JDK Installation Options” page select “Install Oracle Java SE Development Kit (JDK)” and “Install Java Unlimited Strength Encryption Policy Files”. Continue.
  • Do not select “Single User Mode”. Continue.
  • On the “Provide SSH login credentials” page, select “Login to all hosts as: Another user” with authentication method “All hosts accept same private key”. Enter the username associated with the SSH key that was added to the GCE instances (the same user that logged in to the PuTTY session). Select the private key file stored on your local machine. Continue without a passphrase.
  • On the next page, the cluster installation may take a while. (NOTE: if you install Cloudera Manager without completing the prerequisites, the installation will fail at this step.)

 


  • Once the installation is complete, continue to install parcels and inspect hosts. Finally, continue to the Cluster Setup.


  • Select ‘Core Hadoop’ when asked to select CDH5 services.
  • When assigning roles, I like to assign all ‘Cloudera Management Service’ roles to one node (in my case ‘cloudera-cm’) and distribute all other roles evenly across the remaining 3 nodes.

[Screenshot: one possible assignment of roles]

  • On the Database Setup page, set all ‘Database Host Name’ fields to the node running the Cloudera server. Enter the database names, usernames, and passwords that were created in MySQL earlier.


  • Review the changes. The cluster will now be set up and the services deployed. This is the final step.


  • You are now ready to use the services directly from the console, or access Hue on port 8888. Good luck :)
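Once the wizard finishes, a quick smoke test from any node confirms the cluster is working end to end (the jar path below is the CDH default; adjust it if your layout differs):

```shell
# List the HDFS root as the hdfs superuser
sudo -u hdfs hadoop fs -ls /

# Run the bundled pi example to exercise YARN and MapReduce
sudo -u hdfs hadoop jar \
    /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 2 10
```

If both commands succeed, HDFS and MapReduce are healthy and you can move on to Hue, Hive, and the other services.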

 

P.S. I would like to thank Manoj Kukreja for showing me the right way of deploying clusters.

 


About the Author

Zunaira is a software developer enrolled in a Masters in Computer Science program. Within the last year, she gained experience working as a Jr. C# developer on automated testing, and as a Big Data intern working with technologies like Hadoop, Hive, Oozie, and Docker. Zunaira's interests include using NLP to find answers to real-world problems and developing automation tools.

3 Comments

Thanks for sharing. It is really beneficial for those who are new to clusters and big data. I have some questions:
Can we run this image with 4 GB of RAM?
How do we deploy a cluster with it?
Does every cluster have all the Hadoop components installed?


@anwar I’m assuming you are referring to the Cloudera Docker image (mentioned in Method 4). I believe it will work with 4 GB of RAM; however, I will test it out to be sure. When you run the container, it deploys a single-node cluster with almost all services installed. You will be able to see the complete list while the container is starting. (However, Cloudera Manager is not installed by default.)

If you want to deploy a multi-node cluster, follow the steps mentioned above.

When setting up a cluster using Cloudera Manager, you will be able to select the services you want to install. These will be distributed across all nodes in the cluster, so the Hadoop components will not necessarily all be on one node.


@Zunaira, thanks, it is really useful for me.

