Ambari Blueprints and One-Touch Hadoop Clusters

Jun 20, 2014 / By Alan Gardner

Tags:

For those who aren’t familiar, Apache Ambari is the best open source solution for managing your Hadoop cluster: it’s capable of adding nodes, assigning roles, managing configuration and monitoring cluster health. Ambari is HortonWorks’ version of Cloudera Manager and MapR’s Warden, and it has been steadily improving with every release. As of version 1.5.1, Ambari added support for a declarative configuration (called a Blueprint) which makes it easy to automatically create clusters with many ecosystem components in the cloud. I’ll give an example of how to use Ambari Blueprints, and compare them with existing one-touch deployment methods for other distributions.

Why would I want that?

I’ve been working on improving the methodology used by the Berkeley Big Data Benchmark. Right now spinning up the clusters is a relatively manual process, where the user has to step through the web interfaces of Cloudera Manager and Ambari, copy-paste certificates and IPs, and assign roles to nodes. The benchmark runs on EC2 instances, so I’ve been focused on automatic ways to create clusters on Amazon:

  • Apache Whirr can create a Hadoop cluster (or a number of other Big Data technologies), including CDH5, MapR and HDP. Documentation is sparse, and there doesn’t appear to be support for installing ecosystem projects like Hive automatically.
  • Amazon EMR supports installing Hive and Impala natively, and other projects like Shark via bootstrap actions. These tend to be older versions which aren’t suitable for my purposes.
  • MapR’s distribution is also available on EMR, but I haven’t used that since the different filesystem (MapRFS vs. HDFS) would impact results.

Hive-on-Tez is only supported on HDP at the moment, so it’s crucial that I have a one-touch command to create both CDH5 clusters, but also HDP clusters. Ambari Blueprints provide a crucial piece of the solution.

The Blueprint

Blueprints themselves are just JSON documents you send to the Ambari REST API. Every Ambari Blueprint has two main parts: a list of “host groups”, and configuration.

Host Groups

Host groups are a set of machines with the same agents (“components” in Ambari terms) installed – a typical cluster might have host groups for the NameNode, SecondaryNameNode, ResourceManager, DataNodes and client nodes for submitting jobs. The small clusters I’m creating have a “master” node group with the NameNode, ResourceManager, and HiveServer components on a single server and then a collection of “slaves” running the NodeManager and DataNode components. Besides a list of software components to install, every host group has a cardinality. Right now this is a bit of a pain, since the cardinality is exact: your blueprint with 5 slave nodes must have 5 slaves Hopefully the developers will add an option for “many”, so we don’t have to generate a new blueprint for every different sized cluster.  Thanks to John from HortonWorks for a correction, cardinality is an optional hint which isn’t validated by Ambari. This wasn’t clear from the docs.

To provide a concrete example, the sample host groups I’m using look like this:

"host_groups" : [
 {
 "name" : "master",
 "components" : [
 {
 "name" : "NAMENODE"
 },
 {
 "name" : "SECONDARY_NAMENODE"
 },
 {
 "name" : "RESOURCEMANAGER"
 },
 {
 "name" : "HISTORYSERVER"
 },
 {
 "name" : "ZOOKEEPER_SERVER"
 },
 {
 "name" : "HIVE_METASTORE"
 },
 {
 "name" : "HIVE_SERVER"
 },
 {
 "name" : "MYSQL_SERVER"
 }
 ],
 "cardinality" : "1"
 },
{
 "name" : "slaves",
 "components" : [
 {
 "name" : "DATANODE"
 },
 {
 "name" : "HDFS_CLIENT"
 },
 {
 "name" : "NODEMANAGER"
 },
 {
 "name" : "YARN_CLIENT"
 },
 {
 "name" : "MAPREDUCE2_CLIENT"
 },
 {
 "name" : "ZOOKEEPER_CLIENT"
 },
 {
 "name" : "TEZ_CLIENT"
 },
 {
 "name" : "HIVE_CLIENT"
 }
 ],
 "cardinality" : "5"
 }

This host_groups describes a single node with all of the “master” components installed, and five slaves with just the DataNode, NodeManager and clients installed. Note that some components have depedencies: it’s possible to build an invalid blueprint which contains a HIVE_METASTORE but not a MYSQL_SERVER. The REST API provides appropriate error messages when such a blueprint is submitted.

Configuration

Configuration allows you to override the defaults for any services you’re installing, and it comes in two varieties: global, and service-specific. Global parameters are required for different services: to my knowledge Nagios and Hive require global parameters to be specified – these parameters apply to multiple roles within the cluster, and the API will tell you if any are missing. Most cluster configuration (your typical core-site.xml, hive-site.xml, etc. parameters) can be overriden in the blueprint by specifying a configuration with the leading part of the file name, and then providing a map of the keys to overwrite. The configuration below provides a global variable that Hive requires, and it also overrides some of the default parameters in hive-site.xml. These changes will be propagated to the cluster as if you changed them in the Ambari UI.

"configurations": [
  {
    "global": {
      "hive_metastore_user_passwd": "p"
    }
  },
  {
    "hive-site": {
      "javax.jdo.option.ConnectionPassword": "p",
      "hive.security.authenticator.manager": "org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator",
      "hive.execution.engine": "tez",
      "hive.exec.failure.hooks": "",
      "hive.exec.pre.hooks": "",
      "hive.exec.post.hooks": ""
    }
  }
]

This config will override some parameters in hive-site.xml, as well as setting the metastore password to ‘p’. Note that you can specify more configuration files to override (core-site.xml, hdfs-site.xml, etc.) but each file must be it’s own object in the configurations array, similar to how global and hive-site are handled above.

Once you’ve specified the host groups and any configuration overrides, the Blueprint also needs a stack – the versions of software to install. Right now Ambari only supports HDP – see this table for the stack versions supported in each Ambari release. As a weird constraint, the blueprint name is inside the blueprint itself, along with the stack information. This name must be the same as the name you provide to the REST endpoint, for some reason. To upload a new blueprint to an Ambari server you can use:

$ curl -X POST -H 'X-Requested-By: Pythian' <ambari-host>/api/v1/blueprints/<blueprint name> -d @<blueprint file>

The X-Requested-By header is required, and as noted the blueprint name must match the file.

You can see the entire blueprint file from this example here, feel free to use it as a baseline for your cluster.

Creating the Cluster

Once you’ve written a blueprint with the services and configuration you want, you need to:

  • Create EC2 instances with the correct security groups
  • Install ambari-master on one, and ambari-agent on the others
  • Configure the agents to report to the master
  • Write a file mapping hosts to host groups
  • Push both files (the blueprint and the mapping) to the REST API

Fortunately, we have a Python script that can do that for you! This script will create a benchmarking cluster with a specific number of data nodes, an Ambari master and a separate Hadoop master. It can easily be modified to create multiple classes of machines, if you want to have more host groups than “master” and “slave”. The core of the script (the EC2 interaction and Ambari RPM installation) is stolen from based on work by Ahir Reddy from Databricks, with the Ambari Blueprints support added by yours truly.

If you’re curious about the host mapping file: it has the blueprint name, and an array of host names for every host_group. Corresponding to the example above, the cluster definition would be:

{
  "blueprint":"hadoop-benchmark",
  "host_groups: [
    { 
      "name":"master",
      "hosts":[{"fqdn":"host-1"}]
    },
    {
      "name":"slaves",
      "hosts":[ 
        {"fqdn":"host-2"},
        {"fqdn":"host-3"}  
      ]
  ]
}

You could replace “host-n” with the real domain names for your Amazon instances (use the internal ones!), and create a new cluster over those machines using:

$ curl -X POST -H 'X-Requested-By: Pythian' <ambari-host>/api/v1/clusters/<cluster name> -d @<mapping file>

Conclusion

Ambari Blueprints have some rough edges right now, but they provide a convenient way to deploy all of the services supported by the HDP stack. Watch this space for more posts about my effort to create a repeatable, one-touch, cross-distribution Hadoop SQL benchmark on EC2.

2 Responses to “Ambari Blueprints and One-Touch Hadoop Clusters”

  • John Speidel says:

    Regarding the cardinality field specified for a host group in the blueprint, this field is optional and is only used as a hint to the deployer and can be any string. So, the same blueprint can be used for any size cluster as the cardinality field is only provided as meta-info for the blueprint user and is not validated. When a blueprint is exported for an existing cluster, the cardinality value specifies the number of hosts in the cluster which are represented by the host group.

    • Alan Gardner says:

      That’s excellent, thanks John. It wasn’t clear to me from the documentation, it seemed that the cardinality was always specified as an integer, but this makes much more sense. I’ll update the post to clarify this point.

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>