Roll your own Big Data Appliance
Feb 25, 2012 / By Gwen Shapira
One of the major drawbacks of Oracle Big Data Appliance is that I don’t have one yet. It’s a major drawback, because I’m itching to play with shiny new software on fairly impressive hardware.
The chances of getting my hands on 18 servers, each with 12 cores, 48 GB of RAM and 84 TB of storage, all connected by InfiniBand, are not that great. But I can play with the software, and so can you.
Unlike Oracle’s Exadata, almost every software component that is available on the Big Data Appliance is also available for download. Some components are completely open source and can be used freely (Hadoop, Oracle NoSQL Community Edition, open source R), some are available for download but require a license under Oracle’s usual terms (Hadoop Connectors, Enterprise R, Oracle NoSQL Enterprise Edition) and some seem to be unavailable in their full form but have a limited free version (Cloudera Manager).
So, let’s roll our own Big Data Appliance!
- Grab a bunch of (virtual) servers to use for your cluster. EC2 servers on Amazon cloud are not a bad option.
- Note that you will need a lot of open ports: https://ccp.cloudera.com/display/CDHDOC/Configuring+Ports+for+CDH3
- Install Cloudera Manager Free Edition – https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Download
- Use Cloudera Manager Free Edition to install your Hadoop cluster. If you are on Amazon Cloud, try Whirr instead:
- Install Oracle NoSQL. There is an open source community edition and a licensed Enterprise edition. For now they are identical, so grab the open source one: http://www.oracle.com/technetwork/database/nosqldb/downloads/index.html
- Oracle connectors are not part of the Big Data Appliance spec, but a separate option. In my opinion, the connectors are actually one of the more exciting things Oracle is doing with Hadoop, so you should at least give Oracle’s Hadoop Loader a go. You can download them here: http://www.oracle.com/technetwork/bdc/big-data-connectors/downloads/index.html
But remember that the OTN version is for fun and games only. If you want to use it for serious work, the connector package is licensed at $2,000 per CPU.
- The last component of the Big Data appliance is the R analytics platform. The open source version is free.
You can get it from the R Project: http://www.r-project.org/
Or get Oracle’s R distribution, which should be the same thing repackaged, but is exactly the version that ships with the Big Data Appliance:
If you want more speed, get the parallel version:
And don’t forget the cute GUI:
Or grab Oracle’s R Enterprise: http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/ore-downloads-1502823.html
- Enterprise R is more awesome because instead of reading tons of raw data from Oracle into R, potentially causing network contention on the way, it will translate your R functions to SQL and run them in the database. Kind of like storage offloading, only between the statistical analysis layer and the DB. I haven’t played with this yet, but it sounds interesting.
Again, you can play with what you download from OTN, but if you want to use it for real, it is part of the Oracle Advanced Analytics option for Oracle Database 11g R2. The price for this option is $23,000 per CPU.
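To make the "run it in the database" idea behind Enterprise R a bit more concrete, here is a minimal sketch in Python. It is not Oracle R Enterprise and it does not talk to Oracle at all: it uses an in-memory SQLite database as a stand-in, and the table name `measurements` and the helper functions are made up for illustration. It only shows the general difference between pulling every row to the client and letting the database compute the aggregate.

```python
import sqlite3


def build_demo_db(n_rows=10000):
    """Create an in-memory SQLite table (a stand-in for a real database)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements (grp TEXT, value REAL)")
    rows = [("a" if i % 2 == 0 else "b", float(i)) for i in range(n_rows)]
    conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)
    conn.commit()
    return conn


def mean_client_side(conn):
    """Fetch every row and compute per-group means in the client."""
    sums, counts = {}, {}
    rows_moved = 0
    for grp, value in conn.execute("SELECT grp, value FROM measurements"):
        rows_moved += 1
        sums[grp] = sums.get(grp, 0.0) + value
        counts[grp] = counts.get(grp, 0) + 1
    means = {grp: sums[grp] / counts[grp] for grp in sums}
    return means, rows_moved


def mean_pushed_down(conn):
    """Let the database aggregate; only one row per group comes back."""
    rows = conn.execute(
        "SELECT grp, AVG(value) FROM measurements GROUP BY grp"
    ).fetchall()
    return dict(rows), len(rows)


if __name__ == "__main__":
    conn = build_demo_db()
    client_means, client_rows = mean_client_side(conn)
    pushed_means, pushed_rows = mean_pushed_down(conn)
    print("client side:", client_means, "rows moved:", client_rows)
    print("pushed down:", pushed_means, "rows moved:", pushed_rows)
```

Both functions return the same means, but the second one moves one row per group instead of the whole table, which is the kind of saving the translate-to-SQL approach is after.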
Now that you have your own Big Data non-appliance, you should show it off on OTN:
Big Data forum: https://forums.oracle.com/forums/forum.jspa?forumID=1402
Big Data Connectors forum: https://forums.oracle.com/forums/forum.jspa?forumID=1403
To stay up to date with the latest news, you can follow the blogs:
Big Data blog: https://blogs.oracle.com/datawarehousing/
There’s a good post here on sizing Hadoop clusters that I recommend.
NoSQL blog: https://blogs.oracle.com/charlesLamb/
This has a good getting started guide.
And if anyone suspects that this blog post is a thin excuse for me to post all the links in one place so I won’t keep losing them, you may be on to something.