Microsoft Analytics Platform System: Name Overhaul in Big Data War!

May 21, 2014 / By Mohammed Mawla

Tags: , , , ,

I had the chance to attend a course about what used to be called Parallel Data Warehouse (PDW). PDW was introduced few years ago with the offering of SQL Server 2008 R2 Parallel Data Warehouse , something very few people could get their hands on. The appliance used to cost a ton of money, too many parts and only offered by HP or DELL in a pre-configured setup.

With SQL server 2012, Microsoft made many enhancements slashing the size of the appliance to almost half, and consequently the cost, and improving the performance as well.

Just while I was attending the course, Microsoft announced a name change and the introduction of new fancy brand name: Microsoft Analytics platform. CEO Satya Nadella announced the new platform with other products as well. I suggest reading this blog written by Satya himself

I’m sharing some of my personal (again, personal) opinions here about the platform and the appliance. So let’s take a step back and explore some of the basics here.

What is PDW (aka APS)?

Microsoft loves those three letters acronyms, although certifications may be an exception. Not to go to great lengths, it’s a Massively parallel Processing (MPP) “APPLIANCE” that is optimized for large scale enterprise data warehouses. The appliance is a logical entity that encompasses servers , switches , storage enclosed into a rack. Think of it as your typical environment of servers, switches and storage all brought together in one huge rack appliance. The idea behind the appliance is simple: We do all the dirty work for you and give you something “optimized” to handle huge amounts of data without the hassle of configuration, tuning and license worries; Of course the word “optimized” is according to Microsoft terms.

PDW MPP

The appliance is not only about SQL server but it also incorporates Hadoop and an engine, PolyBase, to simplify talking to Hadoop using conventional SQL server T-SQL and can also tap HDsinghts to reach out to data stored in the cloud.

The appliance is only available through few vendors, used to be only HP and DELL but now also includes Quantas. Please check resources section for more information about Microsoft Analytics Platform.

The following video vividly tries to demonstrate the usage of APS: https://www.youtube.com/watch?v=-FGiAHyRRIA
Why PDW/APS appliance ?

Customers with huge amounts of data that also spans heterogeneous sources want always to get meaningful information out of that data. The more the data they have , the harder and longer the time to extract key information. Appliances are tuned machines with massive resources to help analyze , aggregate and join data much faster. Conventional SMP machines can work up to a certain level with much needed tuning and optimization that may not always work. vendors take this tuning and optimizations responsibility and present you a sleek machine that is supposed to overcome multiple design and resources limitations. Some of the examples of existing appliances are Oracle Exadata , Teradata Data Warehouse Appliance and IBM PureData and Netezza.

Are you saying that conventional SQL server setup can’t achieve this? Not entirely. Think of this as car upgrades where they may be based on the same chassis but high-end models have more powerful engines, features and performance. Although SQL server keeps bringing enhancements like updatable clustered columnstore indexes and in-memory OLTP in SQL server 2014 , PDW/APS appliance differs from conventional SMP in the following areas:

  1. PDW/APS appliance is a bundle of hardware and software offering customized to scale out. You can add and remove(much harder though) nodes to scale out to your data needs.Each “node” runs on separate server with seperate SQL server and hardware resources and managed by a “control” node to distribute the workload. You can read about APS Solution Brief here
  2. You can’t buy a PDW SQL server licence and install in your environment and you can not even assemble the product even if you have the blueprint , you just get it from one of the vendors and plug-n-play it.
  3. PolyBase plugs in Hadoop. You may be able to connect existing SQL server with Hadoop but PolyBase provides easy to use T-SQL functions to extract data from Hadoop providing almost immediate ability to query Hadoop without a long learning curve.
  4. Many conventional SQL server features are suppressed. Yes , I can see your raised eyebrows but the idea is that Microsoft wanted to remove areas that can introduce bottlenecks such as CLR , Extended dlls and even SQL server agent. If you need to do something that can’t be done inside PDW , such as scheduled jobs, then move it to another tier.

Where does PDW/APS fit?

The cost of buying and running the appliance suggests that it’s not for everyone. The available resources and bundling Ploycase to connect to Hadoop shows that it’s for an enterprise with huge and heterogeneous amounts of data that is spread around. Bringing this data together with least customization is the goal of the appliance.

PDW/APS can help bring data together from following areas:

  • Social apps
  • Mobile
  • Sensors
  • Web
  • Other data sources such as RDBMS

The appliance fits a segment of industries, notably:

  • Retail
  • Healthcare
  • Financials
  • Manufacturing
  • Social media

Microsoft has some case studies about clients deploying PDW and realizing up to 100X performance gain. Here are the case studies:

  1. MEC -Media Firm Uncovers the Value of Digital Media with Parallel Data Warehouse
  2. The Royal Bank of Scotland – Leading UK Bank Gains Rapid Insight into Economic Trends with Analytics Platform System
  3. Progressive Insurance – Progressive Data Performance Grows by Factor of Four, Fueling Business Growth Online Experience

I believe Microsoft has to do more to get the appliance to more customers and I think the brand rename is part of this push. Training, support and deployment materials are also needed since there is not much resources online.
Is PDW worth it for clients?

Microsoft competes against TeraData, Netezza and Oracle Exadata. Cost is a very big factor: Licence and support. MS tends to do fairly well with MS shop customers and SQL server base clients. However, first version of PDW tended to be so expensive and bulky but current appliance is almost half price and half size than used to be. Expertise seems to be low still and this is what MS is working on.

Microsoft word is that instead of investing too much and too long on creating the same technology by trial and error and spend much time on tuning , here’s a working appliance that we have tuned it for you and just focus on your business. Per following chart, MS claims the cheapest price per TB compared to other vendors.

image

Regardless , the appliance makes more sense to SQL server clients even more ; however , I still see clients sticking to solutions from vendors of the main RDBMS technology they run since importing data will easier and learning curve will be less steep.

The appliance will make a case for mid to large enterprises with new Terabytes of data each month including unstructured data. SMP SQL and APS may correlate in the region of few to tens of terabytes but once we talk about hundreds of terabytes of data including unstructured data then APS starts to make sense.
PDW Against Cloud?

There are few reasons clients may opt for an in-premise appliance, including :

  1. Some data is sensitive to trust putting in the cloud.
  2. The amount of data is huge to upload to cloud.
  3. Cloud is not mature yet to support all features.

Most cloud vendors are not yet ready to offer such expensive appliance. Amazon, however, has something similar called Redshift. They actually make a case against on-premise solution saying you don’t need those bulky expensive appliances that can break, while you can use Amazon cloud solution and pay “less”. However, there are few points :

  1. MS appliance tries to play in the lower segment of cost.
  2. You don’t need lots of DBAs to manage. In fact, I was surprised that MS took out most of the the parts that can introduce problems with the appliance : no CLR, no extended modules, many other features were disabled. They applied best-practices and locked it so you can not temper with the appliance. I was told that the only thing that can cause APS to fail is a hardware failure and we already have redundancy so a failure is even less probable
  3. Not everyone wants their data in cloud , mostly for security. I was told about a customer who wanted to destroy , with a hammer, the hard drives after testing the appliance. It took few weeks to zero write the drives , few times.
  4. Transferring a lot of data to public cloud is not that fast , unless you are already hosting your data in same location such as on Amazon.

APS VS Hadoop

Hadoop excels in non-structred data such as text , sensors data , web crawling ..etc and whether you already have existing Hadoop lusters running or plan to , you may still have valuable relational data stored in your existing SQL server instances. APS makes it easy to bridge the gap between the two and use T-SQL to join the data from the two sources without worrying much about the design of Hadoop cluster.

APS region share

USA is top then EU then Asia.
APS and SQL 2014

The appliance still runs a customized version of SQL server 2012 , no word yet when SQL 2014 will be introduced. Upgrading the cluster is supported but it is not something end-customer can do yet.
PDW support

Many cases are still only supported by Microsoft Customer service and support (CSS) like failing back and downsizing but they are trying to automate some tasks to be done by experienced DBAs.
DBA role with APS

Don’t hold your breath! As outlined before, much of the configuration and optimizations are done for you in what MS believes is the best balance. You can not do many of the tasks a DBA does like changing Tempdb configurations, max degree of parallelism, etc. That’s the idea behind the appliance in fact. Focusing on bringing the data to the appliance and modifying your design to extract information.

When I did work with the appliance , I didn’t find it particularly complex from an operations point of view. Most of the work is about bringing data in, figuring out the best way to join data together without shuffling data around and that’s about understanding the data and business logic.

You can help in the following areas though:

  • Exporting> importing data from existing data sources into APS
  • Advise any changes to the design and business logic to comply with appliance requirements
  • Advise how to bring data together for analysis.
  • Design and implement data extraction practices.
  • Troubleshoot analysis routines and schedules.
  • Troubleshoot long running queries and schedules. APS has a web portal where you can see all running processes; you can also use DMVs to get this information and some of them are unique to APS.

Summary

APS appliance targets customers with huge amount of data that span heterogeneous sources who need read to plug solution.

With the new brand of the PDW appliance, Microsoft is getting more serious about big data and analytics. However, many vendors are well-established here and it’s still a long run in a market that’s expected to explode if it has not started yet.

Happy reading!

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>