Posts Categorized: Pythian

Small Files on MapR-FS

One of the well-known best practices for HDFS is to store data in few large files, rather than a large number of small ones. There are a few problems related to using many small files but the ultimate HDFS killer is that the memory consumption on the name node is proportional to the number of…

Cloudera Challenge 2014

Yesterday, Cloudera released the score reports for their Data Science Challenge 2014 and I was really ecstatic when I received mine with a “PASS” score! This was a real challenge for me and I had to put a LOT of effort into it, but it paid off in the end! Note: I won’t bother you…

Comparing CPU Throughput of Azure and AWS EC2

After observing CPU core sharing with Amazon Web Services EC2, I thought it would be interesting to see if Microsoft Azure platform exhibits the same behavior. Signing up for Azure’s 30-day trial gives $200 in credit to use over the next 30-day period: more than enough for this kind of testing. Creating a new virtual…

Essential Hadoop Concepts for Systems Administrators

Of course, everyone knows Hadoop as the solution to Big Data. What’s the problem with Big Data? Well, mostly it’s just that Big Data is too big to access and process in a timely fashion on a conventional enterprise system. Even a really large, optimally tuned, enterprise-class database system has conventional limits in terms of…

vCPU sharing in EC2: HVM to the rescue?

aws-cpu-hvm-600x235

I’ve been doing some testing to clarify what a vCPU in Amazon Web Services actually is. Over the course of the testing, I experienced inconsistent results on a 2-thread test on a 4-vCPU m3.xlarge system, due to the mislabeling of the vCPUs as independent single-core processors by the Linux kernel. This issue manifests itself in…

Page 9 of 207« First...7891011...203040...Last »