Hadoop FAQ – But What About the DBAs?
Jan 24, 2013 / By Gwen Shapira
There is one question I hear every time I present about Hadoop to an audience of DBAs. This question was also recently asked in LinkedIn’s DBA Manager forum, so I finally decided to answer it in writing, once and for all.
“As we all see there are lot of things happening on Big Data using Hadoop etc….
Can you let me know where do normal DBAs like fit in this :
DBAs supporting normal OLTP databases using Oracle, SQL Server databases
DBAs who support day to day issues in Datawarehouse environments .
Do DBAs need to learn Java (or) Storage Admin ( like SAN technology ) to get into Big Data ? ”
I hear a few questions here:
- Do DBAs have a place at all in the Big Data and Hadoop world? If so, what is that place?
- Do they need new skills? Which skills?
Let me start by introducing everyone to a new role that now exists in many organizations: Hadoop Cluster Administrator.
Organizations that have not yet adopted Hadoop sometimes imagine it as a developer-only system. I think this is why I get so many questions about whether we need to learn Java every time I mention Hadoop. Even within Pythian, when I first introduced the idea of Hadoop services, my managers asked whether we would need to learn Java or hire developers.
Organizations that did adopt Hadoop found out that any production cluster larger than 20-30 nodes requires a full-time admin. This admin’s job is surprisingly similar to a DBA job – he is responsible for the performance and availability of the cluster, the data it contains and the jobs that run there. The list of tasks is almost endless, and also strangely familiar – deployment, upgrades, troubleshooting, configuration, tuning, job management, installing tools, architecting processes, monitoring, backups, recovery, etc, etc.
I have not seen a single organization with a production Hadoop cluster that didn’t have a full-time admin, but if you don’t believe me – note that Cloudera is offering Hadoop Administrator Certification, and that O’Reilly is selling a book called “Hadoop Operations”.
So you are going to need a Hadoop admin.
Who are the candidates for the position? The best option is to hire an experienced Hadoop admin. In 2-3 years no one will even consider doing anything else. But right now there is an extreme shortage of Hadoop admins, so we need to consider less perfect candidates.
The usual suspects tend to be junior Java developers, sysadmins, storage admins and DBAs.
Junior Java developers tend not to do well in a cluster admin role, just like PL/SQL developers rarely make good DBAs. Operations and dev are two different career paths that tend to attract different types of personalities.
When we get to the operations personnel – storage admins are usually out of consideration because their skillset is too unique and too valuable in other parts of the organization. I’ve never seen a storage admin who became a Hadoop admin, or any place where it was even seriously considered.
I’ve seen both DBAs and sysadmins become excellent Hadoop admins. In my highly biased opinion, DBAs have some advantages:
- Everyone knows DBA stands for “Default Blame Acceptor”. Since the database is always blamed, DBAs typically have great troubleshooting skills, processes and instincts. All those are critical for good cluster admins.
- DBAs are used to managing a system with millions of knobs to turn, all of which have a critical impact on the performance and availability of the system. Hadoop is similar to databases in this sense – tons of configuration to fine-tune.
- DBAs, much more than sysadmins, are highly skilled in keeping developers in check and making sure no one accidentally causes critical performance issues on an entire system. This is a critical skill when managing Hadoop clusters.
- DBA experience with DWH (especially Exadata) is very valuable. There are many similarities between DWH workloads and Hadoop workloads, and similar principles guide the management of the system.
- DBAs tend to be really good at writing their own monitoring jobs when needed. Every production database system I’ve seen has a crontab file full of customized monitors and maintenance jobs. This skill continues to be critical for Hadoop systems.
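As an illustration of the kind of home-grown monitor mentioned in the last point, here is a minimal sketch in Python of a disk-space check that could be scheduled from cron. The function names and the 85% threshold are assumptions chosen for the example; nothing here is prescribed by Hadoop.

```python
import shutil


def usage_percent(used_bytes, total_bytes):
    """Return used space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def check_path(path, threshold_percent=85.0):
    """Return (ok, message) for the filesystem that holds `path`.

    The 85% default is an arbitrary, assumed alert threshold.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    percent = usage_percent(usage.used, usage.total)
    ok = percent < threshold_percent
    status = "OK" if ok else "ALERT"
    return ok, f"{status}: {path} is {percent:.1f}% full"
```

A small wrapper script would call check_path on each data directory, print the message and exit non-zero on an alert, so that cron or a monitoring system can pick up the failure.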
To be fair, sysadmins also have important advantages:
- They typically have more experience managing huge numbers of machines. Much more so than DBAs.
- They have experience working with configuration management and deployment tools (Puppet, Chef), which is absolutely critical when managing large clusters.
- They can feel more comfortable digging in the OS and network when configuring and troubleshooting systems, which is an important part of Hadoop administration.
Note that in both cases I’m talking about good, experienced admins – not those that can just click their way through the UI. Those who really understand their systems, and also understand much of what is going on outside the specific system they are responsible for. You need DBAs who care about the OS, who understand how hardware choices impact performance, who understand workload characteristics and how to tune for them.
There is another important role for DBAs in the Hadoop world: Hadoop jobs often get data from databases, or output data to databases. A good DBA is very useful in making sure this doesn’t cause issues (even small Hadoop clusters can easily bring down an Oracle database by starting too many full-table scans at once). In this role the DBA doesn’t need to be part of the Hadoop team, as long as there is good communication between the DBA and Hadoop developers and admins.
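One way to picture the safeguard a DBA might ask for is a cap on how many extract queries hit the database at the same time. The sketch below is plain Python, not Hadoop or Oracle code: extract_all, ScanCounter and the default limit of 4 are all hypothetical names and values chosen for illustration, and run_extract stands in for whatever actually reads a table.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ScanCounter:
    """Context manager that records how many scans ran at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self.active -= 1
        return False  # do not swallow exceptions


def extract_all(table_names, run_extract, max_concurrent=4):
    """Call run_extract(name) for every table, at most max_concurrent at once.

    The pool size is the throttle: remaining tables wait in the queue
    until a worker is free. Results come back in input order.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(run_extract, table_names))
```

Wrapping the body of run_extract in "with counter:" (where counter is a ScanCounter) makes it easy to confirm afterwards that counter.peak never exceeded the agreed limit.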
What about Java?
Hadoop is written in Java, and a fairly large number of Hadoop jobs will be written in Java too.
Hadoop admins will need to be able to read Java error messages (because this is typically what you get from Hadoop) and to understand the concepts of Java virtual machines and a bit about tuning them; being able to write small Java programs can also help in troubleshooting. On the other hand, most admins don’t need to write huge amounts of Hadoop code (you have developers for that), and for what they do write – non-Java solutions such as Streaming, Hive and Pig (and Impala!) can be enough. In my experience, good admins learn enough Java to work on a Hadoop cluster within a few days; there’s really not that much to know.
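To give a feel for what "reading Java error messages" involves, here is a small sketch that pulls the chain of exceptions out of a Java stack trace and reports the root cause, which is usually the last "Caused by" entry. It is written in Python to match the scripting an admin is likely to do anyway. The sample trace is synthetic and uses made-up com.example classes; it is not real Hadoop output, and the function names are my own.

```python
import re

# Matches lines such as "java.io.IOException: disk full" or
# "Caused by: java.lang.OutOfMemoryError: Java heap space".
_EXCEPTION_LINE = re.compile(
    r'^(?:Caused by: |Exception in thread "[^"]*" )?'
    r'(?P<cls>[A-Za-z_$][\w$]*(?:\.[A-Za-z_$][\w$]*)+)'
    r'(?:: (?P<msg>.*))?$'
)

# Synthetic example trace, invented for this sketch.
SAMPLE_TRACE = """\
Exception in thread "main" java.lang.RuntimeException: job failed
    at com.example.Job.run(Job.java:42)
Caused by: java.io.IOException: could not read input
    at com.example.Reader.read(Reader.java:17)
    ... 1 more
Caused by: java.lang.OutOfMemoryError: Java heap space
    at com.example.Buffer.grow(Buffer.java:88)
    ... 2 more
"""


def exception_chain(trace_text):
    """Return [(exception_class, message), ...] from outermost to root cause."""
    chain = []
    for raw in trace_text.splitlines():
        line = raw.strip()
        if line.startswith("at ") or line.startswith("..."):
            continue  # stack frames and "... N more" lines
        match = _EXCEPTION_LINE.match(line)
        if match:
            chain.append((match.group("cls"), match.group("msg") or ""))
    return chain


def root_cause(trace_text):
    """Return the innermost (class, message) pair, or None if none is found."""
    chain = exception_chain(trace_text)
    return chain[-1] if chain else None
```

Calling root_cause(SAMPLE_TRACE) picks out the OutOfMemoryError at the bottom of the chain, which points at JVM memory settings rather than at the job logic.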
What about SAN technology?
Hadoop’s storage system is very different from SAN and generally uses local disks (JBOD), not storage arrays and not even RAID. Hadoop admins will need to learn about HDFS, Hadoop’s file system, but not about traditional SAN systems. Although, if they are DBAs or sysadmins, I suspect they already know far too much about SAN storage.
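Instead of RAID, HDFS protects data by keeping several copies of each block on different nodes, three by default. That changes the capacity arithmetic a storage-minded DBA is used to. The sketch below works through it; the cluster size, disk size and the 25% reserved for non-HDFS use are assumed numbers for illustration only.

```python
def usable_capacity_tb(nodes, disks_per_node, disk_tb,
                       replication=3, reserved_fraction=0.25):
    """Estimate how much user data fits in a cluster of JBOD disks.

    replication: number of copies HDFS keeps of each block (default 3).
    reserved_fraction: share of raw disk kept aside for temporary and
        non-HDFS data. The 0.25 default is an assumption, not a rule.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if not 0 <= reserved_fraction < 1:
        raise ValueError("reserved_fraction must be in [0, 1)")
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb * (1 - reserved_fraction) / replication


# Hypothetical cluster: 20 nodes, 12 disks of 2 TB each.
# Raw: 480 TB. After the 25% reserve: 360 TB. Divided by 3 copies: 120 TB.
estimate = usable_capacity_tb(nodes=20, disks_per_node=12, disk_tb=2)
```

The point of the exercise is that the usable figure is a fraction of the raw disk purchased, so capacity planning has to start from the replication factor rather than from a RAID level.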
So what skills do Hadoop Administrators need?
First and foremost, Hadoop admins need general operational expertise: good troubleshooting skills, an understanding of system capacity and bottlenecks, and the basics of memory, CPU, OS, storage and networks. I will assume that any good DBA has these covered.
Second – good knowledge of Linux, especially for DBAs who have spent their lives working with Solaris, AIX or HP-UX. Hadoop runs on Linux. You need to learn Linux security, configuration, tuning, troubleshooting and monitoring. Familiarity with open source configuration management and deployment tools such as Puppet or Chef can help. Linux scripting (Perl / bash) is also important – you need to build a lot of your own tools here.
Third – Hadoop skills. No way to avoid this :) You need to be able to deploy a Hadoop cluster, add and remove nodes, figure out why a job is stuck or failing, configure and tune the cluster, find the bottlenecks, monitor critical parts of the cluster, configure NameNode high availability, pick a scheduler and configure it to meet SLAs, and sometimes even take backups.
So yes, there’s a lot to learn. But very little of it is Java, and there is no reason DBAs can’t do it. However, with Hadoop Administrator being one of the hottest jobs in the market (judging by my LinkedIn inbox), they may not stay DBAs for long after they become Hadoop Admins…
Any DBAs out there training to become Hadoop admins? Agree that Java isn’t that important? Let me know in the comments.
19 comments on “Hadoop FAQ – But What About the DBAs?”

Hi Gwen! I prefer DBA = Does Basically Anything. Agree on the Hadoop Admin role, and I am proceeding full speed ahead into the world of Hadoop! Excellent article.
Hi Gwen,
very interesting and helpful article!
I encounter some worried DBAs myself who want to know exactly what you addressed here: How is Hadoop affecting my job role? You make it clear that it is actually a great chance for them :)
Thank you &
Kind regards
Uwe
Hi Gwen,
I can relate to the Default Blame Acceptor :-). I encounter it every day. Great article. Hadoop sounds interesting. I will start doing some reading for this.
Thanks,
Levin
Excellent one, many thanks!
Great article. A few years ago I started saying that there should be a Hadoop DBA position in any company that uses Hadoop, the same as for any other database out there.
Thanks for sharing, Gwen. I am thinking about myself. Should I jump into the magical Hadoop world or stay an Oracle DBA for a while?
My current thoughts are:
– There are still so many things waiting for me in the Oracle DBA space. I would probably like to cover some of them before jumping somewhere else …
– Big Data arrived so rapidly that it reminds me of other buzzwords that arrived rapidly before Hadoop (e.g. SOA). Will Hadoop/Big Data stay for good? Or might they disappear in a few years?
– No matter what, I think there is going to be enough work for both DBAs and Hadoop admins in the datafication age :)
Yury
Hi Yury,
Nuno Pinto De Souto gave a similar sentiment in the Big Data SIG.
First of all, I agree that Oracle DBA is a huge world and there is always more to learn. I can easily point to areas where I can learn or improve myself. I also agree with Nuno who said that many skills are timeless and always needed. I totally agree on that.
Is Hadoop here to stay? If I could tell how market trends work, I would be a far richer woman. I’m just a simple DBA :)
I work with Hadoop because I love it. I love the brilliant simplicity of the platform, the rich eco-system, the flexibility, the tools. I feel very creative when I work with Hadoop, much more so than working with Oracle. But this is personal – everyone has his own favorite tools.
I try to encourage DBAs to learn Hadoop for two reasons:
1. Maybe some of them will love it as much as I do. I want to spread the joy!
2. Hadoop is being actively adopted by many organizations. Hadoop Admins are necessary. Someone has to do the job. I’m trying to encourage more people to study Hadoop, so the job market will become a bit more balanced.
Will Hadoop go away? Personally, I see it as a real solution to real problems. I don’t see it going away any time soon. It will probably become boring. When I started my career, XML was really hot – everyone was talking about it, learning new technologies around XML, re-designed processes, etc. Now, people just use XML and don’t talk about it much. The time I spent working on DOMs and XSLT and all was not wasted at all. It was fun back then, and is still sometimes useful.
Knowledge is never wasted, even when the trend is over.
Hi Gwen,
great post!
It is true that DBAs are already used to being at the center of attention – working with sysadmins, network admins, storage admins, developers and vendors to solve complex technical challenges.
I also find that Hadoop admin makes sense for DBAs as a career move, especially for the “infrastructure” DBAs – who focus more on DB infrastructure, not schema design. Since there currently is a shortage of such admins, it might also be financially beneficial – but the key factor should just be a passion to play with and own cool, new technologies…
Great post.
When I explain to folks what a lowly DBA does, I tell them
“I don’t get to drive the train, but when it jumps off the track, guess whose phone starts ringing itself off his desk!”
Amazing article, Gwen, which describes what the DBA future could be. It’s really useful and good to share.
You might find my recent posting interesting..
http://aswaniv.wordpress.com/2013/01/11/big-data-nosql-and-the-dba/
Excellent inputs. One question: would the experience of a RAC DBA, in your opinion, be helpful when administering Hadoop clusters?
Here are a few examples of when my RAC DBA experience was helpful in administering a Hadoop cluster:
1. When setting up a highly available namenode, you need to configure a STONITH method. As a RAC DBA, you probably know all about STONITH and why it’s important, and can easily choose the correct configuration.
2. In Hadoop, troubleshooting often involves figuring out which specific node is having trouble. RAC DBAs are pretty good at drilling down on randomly occurring issues to find the faulty server.
3. Troubleshooting also involves correlating messages from large numbers of logs and machines. RAC DBAs are usually experts on that too.
In general, RAC experience is experience in distributed systems – which is critical for Hadoop administration.
Very good information.
I have 4+ on Linux/Hadoop currently with 7LPA, is it fair enough to ask for 18LPA?
Hi Gwen
I’m a fresher, but I want to learn Hadoop admin. Please give me some suggestions.
Is it good for me????
Hi Gwen,
Very good & informative article.
I have 7+ years of experience as an iSeries developer. Would a Hadoop admin career be helpful for a developer as well?
Thanks in advance!!