Currently browsing NoSQL


Oracle’s Big Data Machine – Details and Musings

Oracle announced the Big Data Appliance at the Monday morning keynote. Many people, me included, had long been waiting for this to happen. Others didn’t think it would ever happen. So naturally, there is a lot of buzz and excitement around the new device at OpenWorld. The keynote announcement was very short on details and certainly did not satisfy my technical curiosity. So I went to a few presentations to hear what exactly is included in the offering.
Read the rest of this entry . . .

Oracle Big Data Appliance — Oracle’s Bold Move Into Big Data Space

Oracle Big Data Appliance (BDA) is being announced at the Oracle OpenWorld keynote as I’m posting this. It will take some time for it to actually be available for shipment, and some details will likely change, but here is what we have so far about Oracle Big Data Appliance.

A rack with InfiniBand, full of 2U servers similar to Exadata Storage. No flash storage is needed, so a couple of sockets and a dozen disks will do. Maybe more RAM than the Exadata storage cells themselves. I suspect you could have as many servers as you want in a configuration, but since Hadoop clusters usually have dozens of nodes or more, a full rack with about 20 Hadoop compute nodes seems a reasonable starting point. Real deployments should easily grow into multiple racks stacked together.

Low latency, high bandwidth communication is critical for fast data loading and later data processing with Hadoop so InfiniBand will be there — same Exadata/Exalogic-like platform.

Oracle should also have its own NoSQL engine, Oracle NoSQL Database. If you know existing Oracle products, Berkeley DB seems to be a reasonable foundation to power Oracle’s new NoSQL engine.
Read the rest of this entry . . .

Trends and Data – Notes from Strata NYC 2011

I attended the Strata conference in NYC last week. It’s been many years since I last attended a conference without presenting at it. On one hand, attending only makes for a far more relaxed experience. On the other hand, I missed having random people come up to me and talk about my presentation. I decided to attend the conference since it is considered the foremost data science conference. And I was very much interested in what those data scientists are up to.

Good data scientists combine the abilities of business analysts, statisticians and software engineers. They have the skills, the tools and the mandate to mine and analyze all the data the organization collects to deliver valuable insights to the business and data-based features to the customers of the business. In addition, data scientist is considered the hottest job around. Of course, it is data scientists who mined job postings and job moves to come up with this conclusion, so maybe take it with a grain of salt.

Data scientists normally work with very large amounts of data, both structured (the enterprise data warehouse) and unstructured (web server logs, blog posts). Since I’m a big fan of big data, I was very curious to see what those data scientists care about.

So, in no particular order – stuff data scientists like:

Read the rest of this entry . . .

Hadoops Everywhere

We don’t pay enough attention to Hadoop.

By “we” I mean DBAs; the rest of the world is paying plenty of attention to Hadoop. Recently, I started asking my customers and fellow DBAs about Hadoop adoption in their companies. Turns out that many of them have Hadoop. Hadoop shows up in large companies and small ones, in established industries and in startups. It’s everywhere.

The way Hadoop shows up in all companies, and the way DBAs don’t pay Hadoop much attention, reminds me a lot of how MySQL started showing up in the enterprise. It didn’t start by DBAs showing up one morning and telling their managers:
“There’s this new open source database. It’s not as stable as Oracle and it doesn’t have all the features we need, but man – it’s going to save us tons of money, and it’s pretty simple to manage.”

Nope, this never happened. What happened instead is that developers learned about MySQL, and it seemed to them like an excellent way to go around this whole DBA thing. They could install it themselves, learn how to use it in a week and become happy and productive. Without ever having to discuss their schema, data model, requirements, capacity planning, availability, backups and all the other things that DBAs want to talk about.

By the time the application came out of development and had to be deployed in production, MySQL was a done deal. No one was going to re-write the app just because the DBAs didn’t know MySQL. Sometimes the Oracle DBAs were forced to learn and admin MySQL, but more often it was considered “not a database” and left for the sysadmins to manage, while the DBAs continued to pretend that the entire world is written by Oracle.

So that’s what Hadoop adoption looks like now – it’s usually introduced by the developers and administered by sysadmins, while DBAs continue to pretend it doesn’t exist or doesn’t matter. When pressed, some DBAs will even insist that all this “big data” thing can and should be done in a database, but the developers are too ignorant or lazy to work with a proper RDBMS.

I think the day has arrived when, just as DBAs can no longer ignore MySQL, we can no longer ignore Hadoop either. So let’s talk about it.

Read the rest of this entry . . .

Log Buffer #232, A Carnival of the Vanities for DBAs

These days, products based on database technologies are getting hatched at the speed of light. From giants like Oracle and Microsoft to the start-ups, there is an army of products that is growing by the week. It has become hard to stay abreast of all these technologies, but thanks to blogs, we get the latest and greatest news. This week’s edition, Log Buffer #232, has lumped some interesting posts together.
Read the rest of this entry . . .

PgEast 11 The End Game

Well, the last busy day here in The Big Apple again brought a number of very good technical talks. It is not often that the developer of a key piece of a technology gives an intro talk, so I grabbed the chance when it came up. Robert Haas gave a very informative talk on the theory behind WAL (Write-Ahead Logging) and how it is implemented in PostgreSQL as compared to other DBs. His talk never ventured into the netherworld of techno-babble but gave just enough of the technical side to get the understanding across. In the second part of his talk, Robert focused on an introduction to the WAL ‘buzz’ words that one might have to deal with. This was both very entertaining and armed one with a real understanding of WAL.

I next sat in on ‘Little Jim’ Mlodgenski’s ‘Scaling with GridSQL’ talk, another great technical talk that did not get bogged down in little details. Jim illustrated how GridSQL leverages the Power of Nodes to create a scalable parallel-query data warehouse: a controller splits off most of a large query to the different nodes in a cluster, takes the results from these nodes and then applies the final touches. Jim clearly demonstrated that with simple aggregation queries one sees linear gains in performance for each node added to the cluster. With more complex queries there was an exponential gain for the first few nodes, but one sees a fall-off after only 8. Jim was very open about the pitfalls of this form of scaling (e.g. backup can be problematic), but it is a very good solution for quick, scalable data warehousing.
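The controller pattern described in the talk (split the query across nodes, collect partial results, apply the final touches) can be sketched in a few lines of Python. This is only an illustration of the scatter-gather idea, not GridSQL's actual API: the `NODES` data, `partial_sum` and `scatter_gather` are all hypothetical names, and each "node" is just an in-memory list of rows.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "node" holds a slice of a sales table
# as (region, amount) rows. A real cluster would hold these on separate servers.
NODES = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 3)],
    [("west", 8), ("north", 1)],
]


def partial_sum(rows):
    """Runs on one node: aggregate only the rows stored locally."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return totals


def scatter_gather(nodes):
    """Controller: fan the query out to every node, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(partial_sum, nodes))
    merged = {}
    for part in partials:
        for region, subtotal in part.items():
            merged[region] = merged.get(region, 0) + subtotal
    return merged


if __name__ == "__main__":
    print(scatter_gather(NODES))
```

A simple SUM parallelizes this cleanly because the partial results merge by plain addition, which fits the linear gains reported for simple aggregation queries. Queries whose final step needs more than a cheap merge leave more work on the controller.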

The final talk of the conference, Jake Luciani’s comparison of Apache Cassandra to PostgreSQL, was a very good introduction to this rather novel NoSQL DB. Think of a ring of peer-to-peer hash tables that work together to scale, provide no single point of failure, automate replication and implement tunable consistency. Its basic concept is the opposite of the RDBMS, ‘Store Many! Read Once’, which makes some sense when used in such situations as large blogs, photo libraries or even diverse catalogs. Jake also introduced us to something he called CQL, a query language for the NoSQL DB.

The conference ended with one of the better open forums I have attended. I am sure next year will be even better.

Hopefully I will be able to make it next year as well.

Day one at PGEast 11

I guess I brought the snow with me to New York, as I awoke to a nice 10cm dump. Anyway, today would best be described as a day of ‘Disruptive Tech’.

I first attended Kevin Kempter’s intro to PostgreSQL High Availability. A very well-balanced presentation that gave a very good overview of what is available out of the box for both Warm Standbys and Hot Standbys and how they can be very easily implemented. He also gave a quick overview of other tools that can be used, including Slony for detailed fail-overs and PgPool for load balancing and replication. Not very disruptive, but it does show that Pg is on par with most of the heavy hitters such as MySQL and Oracle.

The keynote this year was by Ed Boyajian, the CEO of EnterpriseDB, and he gave a big-picture view of the DB market, which is a whopping $26 billion a year in the US alone, of which the top five players have 90% of the market, one having more than half.

He made the comparison with his time at Red Hat, when there was a huge untapped market. Much the same situation exists today for PostgreSQL, as it represents a ‘Disruptive player’ in the game: it is the last open source DB out there. In other words, we can only grow in the future.

To continue with my Disruptive theme, I also attended B. W. McAdams’ and Justin Dearing’s two talks on Mongo. Mongo is a truly disruptive technology as it is a non-SQL database. As an old-timer relational chap, I was a little skeptical. It is hard to think of a DB without SQL, schemas, joins or triggers, but they made a good case for it. It is all a question of building the correct tool for the job. Traditional relational DBs were never intended to be used to create blog web sites, and as many of us have found out, they might not be the ‘right’ tool. Mongo, with its ‘Document’ orientation, solves many of the ‘Blog’ problems very elegantly. Mongo is not just for blogs: both speakers gave a number of examples of its application, for example a quickie app that displays the nearest subway station to you and one that acts as the cache for a large PostgreSQL DB.

I also had the pleasure of hearing a first-time speaker, Vanessa Hurst, who presented on the topic of ORMs (Object Relational Mappers) and the problems they cause for DBs. It was good to hear some of these issues, and she made the very good point that it is always a compromise between speed to market and long-term goals. You might get an ORM DB out in two months, but one year from now your DB may not work anymore because of single object files, lack of planning for scalability or just poor design that was forced upon the team by the ORM.

Well, I’m off to enjoy ‘Le Comte Ory’ at the Met tonight.

Cheers

Log Buffer #208, A Carnival of the Vanities for DBAs

Welcome to Log Buffer, the weekly round up of news and happenings in the database world.

We’re planning our publishing calendar for 2011. Happy to announce that we’ll have a few guest hosts in the New Year. Don’t forget if you’d like to host or edit a future edition of Log Buffer on your own blog, send a note to the Log Buffer coordinator.

We’ve had several contributions of favorite reads from the team this week. Enjoy this issue, Log Buffer #208.

Gwen Shapira’s picks:

Iggy Fernandez uses GraphViz to visualize his explain plans – he thinks it makes them easier to read, but Gwen’s not sure she agrees. In the comments, Tim Hall and Charles Hooper give a lot of information on how to read explain plans correctly and are worth reading.

Jonathan Lewis, on Oracle Scratchpad, blogs about optimizer issues with collection types and suggests a work-around.

Asif Momen reports that Oracle released a nifty little tool for looking up DBA views and background processes.

Jared Stills ran into interesting date format issues while working on his latest book.

Pythian’s Alex, Christo and Dan were blogging live from UKOUG 2010. It looked like they were having so much fun, I’m not sure why they call it work! Welcome home, Paul and team – you made it, despite the snow.

Vadim Tkachenko blogs about a very scary InnoDB bug that can corrupt your data and crash your database. It can even allow your users to do it to you! Read and take steps to protect yourself.

In DB2 news, Fahd Mirza suggests:

Henrik Loeser explains how to build a full text index on PDF documents in DB2.

Raul F. Chong offers the chance to experience the next version of DB2 today!

Willie Favero appreciates the security offered by DB2 10.

Edwin Sarmiento writes his second post in a series on HADR, further building on his point that a good HADR strategy is more than just the underlying technology.

Giuseppe Maxia, the Data Charmer, starts a lively discussion on MySQL forks, and points out 5 arguments in favor of them.

Hard to believe it’s December already.

Log Buffer #206, A Carnival of the Vanities for DBAs

Welcome to Log Buffer, the weekly news blog about blogs in the datasphere… As we kick off Log Buffer #206, our own Gwen Shapira shares a few of her weekly favorites:

Oracle:

Arup Nanda posted an excellent script on how to summarize backup information from the rman catalog. He also posted a tool for automatically purging time-based partitions.

Pythian’s resident Exadata expert, Marc Fielding, posted links to the latest recordings of his Exadata webinars.
Read the rest of this entry . . .

No Silver Bullet – Sharding and MongoDB

FourSquare, the location-based social network, suffered an extended outage yesterday. They explained the causes in a blog post, which caused much discussion around the web.

Here’s the gist of the analysis: FourSquare are using MongoDB, which is a sharded database. Data is split between nodes based on a shard key, usually the User ID or something similar. One of the shards became overly loaded. After failing to resolve the issue in other ways, FourSquare decided to add another shard to share the load. This caused the entire cluster to fail.
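To make the shard-key idea concrete, here is a minimal sketch of routing by user ID and of how one very active user can overload a single shard. It assumes simple hash-modulo placement, which is not necessarily how MongoDB or FourSquare place data; the shard names and the `shard_for` and `load_per_shard` helpers are hypothetical.

```python
import hashlib
from collections import Counter

# Hypothetical shard names; a real deployment would map these to servers.
SHARDS = ["shard-0", "shard-1", "shard-2"]


def shard_for(user_id: str, shards=SHARDS) -> str:
    """Map a shard key (here, a user ID) to exactly one shard."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def load_per_shard(requests, shards=SHARDS) -> Counter:
    """Count how many requests land on each shard."""
    return Counter(shard_for(user_id, shards) for user_id in requests)


if __name__ == "__main__":
    # One user generating 100 requests, three users generating one each.
    requests = ["u1"] * 100 + ["u2", "u3", "u4"]
    print(load_per_shard(requests))
```

All of a user's traffic follows the shard key, so the shard that owns `u1` takes at least 100 of the 103 requests no matter how evenly the users themselves are spread.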

It is easy, and many have done it, to point the finger and blame MongoDB for the mess – after all, it is likely their bug that caused the entire cluster to fail when adding a shard.

I think that the root cause is still that sharding is inherently difficult. Although many NoSQL databases market themselves as the ultimate solution, having a silver bullet you don’t completely understand is actually more dangerous than solving the problem yourself. This is why many companies that are successful with a NoSQL solution get there by employing one of the developers of the solution.

I’ve seen a lot of shards in the last 10 years. I’ve worked for a SaaS provider, I have a social network provider as a customer, and my husband works for another social network provider. Here are a few lessons that may help you avoid the more obvious mistakes:

  1. Once the system is overloaded, it is too late to add shards, because re-balancing the load between the shards creates more load and usually involves lots of locking. Your planning and design must include another way of reducing load on a database server, and of course a way of recognizing load growth and adding shards before it is too late.

  2. It is better to design the system with a lot of queues, connection pools and caches between the app servers and the databases. Contrary to popular belief, app servers can be better than databases at handling excessive load.
    A database is all shared resources; there are lots of locks involved that can cause even mild contention to escalate into complete hangs. The database also can’t prioritize requests or serve partial data.
    If you can control the load on the app server, by using queues, connection pools and caches – you can prioritize the right requests and decide what is the right feedback to give to your users. Graceful degradation is all about handling the excessive requests in a smart way that only the application is capable of doing.
    Doing this also means you can add database shards, because the database does not get too loaded.

  3. Capacity planning is probably the biggest problem in sharding. The planning is necessary because once a shard is loaded, it’s too late. I’ve seen two ways of splitting data between shards, and each way has its own planning problems:


    You can add shards sequentially: start with one server; when it hits 70% load, add another and start creating new users/customers/whatevers on the new server. When that one gets to 70% load, add another. The upside is that very little planning is needed. The downside is that you can easily get screwed by various usage patterns. Are new users more or less active than existing users? What if one of the existing users becomes much more active all of a sudden?


    Another way is to use consistent hashing – you generate a random key for each user and use this to distribute users between your existing servers. When adding a new server, you need to move a few users from each existing server to the new one to rebalance the load. The upside is that you can take load from existing servers; the downside is that adding servers causes anything from a few minutes to a few hours of highly variable response times on all existing servers. Also, the problem of taking an equal portion of the load from each server is more difficult than it sounds.


  4. You sharded the DB, but what about your app servers?

    You can decide that all app servers will serve all traffic, connecting to the different shards as needed. This is usually the right thing to do in terms of app load balancing and redundancy.
    But your app will need to know how to handle one failing shard without impacting other customers. The other downside is that I’ve seen the amount of network traffic this generates overwhelm the LAN, causing performance issues all over the cluster.

    The alternative is to have a small group of app servers handle each shard specifically. This is suboptimal in terms of resource usage, but can be easier to manage. Issues with one shard will only impact a very well-defined subset of your product.

    Regardless, the app must know how to handle users that move from one shard to another.
  5. If your app does much more reading than writing, having a few read-only replicas for each shard is a good solution for further controlling load.
  6. Regardless of how you do the sharding, you will want a way to manually rebalance servers. Sometimes you *know* that a load issue will be resolved by moving a specific group of users to a new server. It will certainly happen. Make sure you have tools to do it.
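The consistent-hashing approach from lesson 3 and the manual rebalancing hook from lesson 6 can be sketched together. This is a minimal illustration under stated assumptions, not any particular product's implementation: the `HashRing` class, its `vnodes` parameter, the `pin` override and the server names are all hypothetical, and real systems also have to physically move the data, which is where the extra load comes from.

```python
import bisect
import hashlib


def _position(value: str) -> int:
    """Place a string on the ring using a stable hash."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes and manual overrides."""

    def __init__(self, servers=(), vnodes=100):
        self.vnodes = vnodes      # ring points per server (assumed default)
        self._points = []         # sorted ring positions
        self._owner = {}          # ring position -> server name
        self._pins = {}           # user -> server, set by hand
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        """Add a server by inserting its virtual nodes into the ring."""
        for i in range(self.vnodes):
            point = _position(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def pin(self, user: str, server: str) -> None:
        """Manual rebalancing: force one user onto a chosen server."""
        self._pins[user] = server

    def lookup(self, user: str) -> str:
        """Return the server that owns this user."""
        if user in self._pins:
            return self._pins[user]
        if not self._points:
            raise LookupError("ring has no servers")
        # First ring point clockwise from the user's position, wrapping around.
        idx = bisect.bisect(self._points, _position(user)) % len(self._points)
        return self._owner[self._points[idx]]


def moved_users(users, before: HashRing, after: HashRing):
    """List the users whose owning server differs between two rings."""
    return [u for u in users if before.lookup(u) != after.lookup(u)]


if __name__ == "__main__":
    users = [f"user{i}" for i in range(2000)]
    before = HashRing(["db1", "db2", "db3"])
    after = HashRing(["db1", "db2", "db3", "db4"])
    moved = moved_users(users, before, after)
    print(f"{len(moved)} of {len(users)} users move, all to db4")
```

Only users that land on the new server change owner; everyone else stays put, which is the property that makes adding a server tolerable. The `pin` override is there for the case where you already know which group of users has to move.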

Good luck sharding and don’t believe anyone who tells you their tool makes it easy.
