Cloudera Challenge 2014

Jul 21, 2014 / By André Araújo

Tags: , , ,

Yesterday, Cloudera released the score reports for their Data Science Challenge 2014 and I was really ecstatic when I received mine with a “PASS” score! This was a real challenge for me and I had to put a LOT of effort into it, but it paid off in the end!

Note: I won’t bother you in this blog post with the technical details of my submission. This is just an account of how I managed to accomplish it. If you want the technical details, you can look here.

Once upon a time… I was a DBA

I first learned about the challenge last year, when Cloudera ran it for the first time. I was intrigued, but after reading more about it I realised I didn’t have what it would be required to complete the task successfully.

At the time I was already delving into the Hadoop world, even though I was still happily working as an Oracle DBA at Pythian. I had studied the basics and the not-so-basics of Hadoop, and the associated fauna and had just passed my first Hadoop certifications (CCDH and CCAH). However, there was (and is) still so much to learn! I knew that to take the challenge I would have to invest a lot more time into my studies.

“Data Science” was still a fuzzy buzzword for me. It still is, but at the time, I had no idea about what was behind it. I remember reading this blog post about how to become a data scientist. A quick look at the map in that post turned me off: apart from the “Fundamentals” track in it, I had barely idea what the rest of the map was about! There was a lot of work to do to get there.

There’s no free lunch

But as I started reading more about Data Science, I started to realise how exciting it was and how interesting were the problems it could help tackle. By now I had already put my DBA career on hold and joined the Big Data team. I felt a huge gap between my expertise as a DBA and my skills as a Big Data engineer, so I put a lot of effort in studying the cool things I wanted to know more about.

The online courses at Coursera, Edx, Stanford and the like were a huge help and soon I started wading through courses and courses, sometime many at once: Scala, R, Python, more Scala, data analysis, machine learning, and more machine learning, etc… That was not easy and it was a steep learning curve for me. The more I read and studied I realised there was many times more to learn. And there still is…

The Medicare challenge

But when Cloudera announced the 2014 Challenge, early this year, I read the disclaimer and realised that this time I could understand it! Even though I had just scratched the surface of what Data Science is meant to encompass, I actually had tools to attempt tackling the challenge.

“Studies shall not stop!!!”, I soon found, as I had a lot more to learn to first pass the written exam (DS-200) and then tackle the problem proposed by the challenge: to detect fraudulent claims in the US Medicare system. It was a large undertaking but I took it one step at a time, and eventually managed to complete a coherent and comprehensive abstract to submit to Cloudera, which, as I gladly found yesterday, was good enough to give me a passing score and the “CCP: Data Scientist” certification from Cloudera!

I’m a (Big Data) Engineer

What’s next now? I have only one answer: Keep studying. There’s so much cool stuff to learn. From statistics (yes, statistics!) to machine learning, there’s still a lot I want to know about and that keeps driving me forward. I’m not turning into a Data Scientist, at least not for a while. I am an Engineer at heart; I like to fix and break things at work and Data Science is one more of those tools I want to have to make my job more interesting. But I want to know more about it and learn how to use it properly, at least to avoid my Data Scientist friends cringing away every time I tell tell I’m going to run an online logistic regression!

7 Responses to “Cloudera Challenge 2014”

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>