Big Data on Microsoft Azure – HDInsight
Introduction
The best definition you are going to find for data is that data is the new oil in today’s world. Starting from that, we can define a new horizon and a new way of looking at how we treat and work with data. This process has become extremely challenging and compelling since our data spectrum has shifted from structured to unstructured forms. Different features and products have come along to help us handle these humongous data sets.
Companies that want to demonstrate a competitive advantage over others need to address one of the hardest IT tasks: understanding customer behavior. This is now the hottest and most challenging job for data scientists, because they must know how to wrangle, massage and conform vast chunks of data before applying any AI or ML algorithm.
However, it is not only that. Companies are missing a big point when designing and implementing their Big Data solutions. We usually describe Big Data as the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to, NoSQL, MapReduce and Machine Learning. But trusting and focusing only on those could blind your decisions, since the results miss out on the qualitative insights behind your company's vision. That is where “Thick Data” comes into play. The key here is to bring more value to the quantitative data that you have stored in your Big Data solution. Qualitative sources such as research, surveys, questionnaires, focus groups, interviews, journals, videos and social media analyses help your company thrive by supporting more assertive decisions and a better understanding of not only your key audience but also your customers' behavior.
HDInsight
Since 2013, Microsoft has been helping its customers get the best out of the Big Data ecosystem. Through its partnership with the Hadoop distributor Hortonworks, it expanded its capabilities and was able to enrich its solutions across the Big Data spectrum. HDInsight is a fully managed, open-source analytics service for enterprises that want to use the Hadoop technology stack to tackle Big Data problems. The platform offers a unique set of products that are entirely managed by Microsoft Azure. In a nutshell, Azure HDInsight is a cloud distribution of Hadoop components from the Hortonworks Data Platform (HDP), which makes it easy, fast and cost-effective to process massive amounts of data in a hyper-scale environment.
There are several reasons why companies are looking for managed Big Data solutions nowadays: low cost and scalability, security and compliance, monitoring, productivity, extensibility and, most important of all, the global availability of the selected products.
Cluster types
HDInsight offers different cluster types to address the different issues that you may struggle with in your business. Clusters are billed on an hourly basis and run on a decoupled architecture, in which compute is separate from storage. That means you can process the data you want and afterwards destroy the cluster, keeping the data in Azure Blob Storage or Azure Data Lake Store. The data remains there, neither removed nor changed, once the processing is over. Most of the companies that use HDInsight adopt this approach to achieve blazing-fast performance and, at the same time, reduce their infrastructure costs.
In an on-premises environment, we cannot turn off the compute layer, since HDFS and the processing layer are coupled. Using a PaaS (Platform-as-a-Service) solution makes it easy to work around this, and also gives you a set of tools to help you manage, orchestrate and monitor the entire data workflow.
HDInsight offers the following cluster types:
- Apache Hadoop
- Apache Spark
- Apache HBase
- R Server
- Apache Storm
- Interactive Query (Hive 2.0)
- Apache Kafka
* HDInsight is the only PaaS platform that offers this number of fully managed cluster types in a cloud environment.
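
To make the cost argument for destroying a cluster after processing more concrete, here is a minimal Python sketch that compares an always-on cluster with one that exists only while jobs run. The node count, the hourly rate and the job duration are hypothetical figures chosen for illustration; they are not Azure prices. The sketch also ignores storage charges, which apply in both cases because the data stays in Azure Blob Storage or Azure Data Lake Store.

```python
def cluster_cost(nodes: int, hourly_rate_per_node: float, hours: float) -> float:
    """Compute cost of running `nodes` nodes for `hours` hours."""
    return nodes * hourly_rate_per_node * hours


# Hypothetical figures, for illustration only (not real Azure prices).
NODES = 4                 # assumed cluster size
RATE = 0.50               # assumed price per node per hour
JOB_HOURS_PER_DAY = 3     # assumed daily processing window
DAYS = 30

# Coupled setup: compute must stay up around the clock.
always_on = cluster_cost(NODES, RATE, 24 * DAYS)

# Decoupled setup: the cluster is created for the job and deleted afterwards.
on_demand = cluster_cost(NODES, RATE, JOB_HOURS_PER_DAY * DAYS)

savings = always_on - on_demand
print(f"always-on: {always_on:.2f}  on-demand: {on_demand:.2f}  saved: {savings:.2f}")
```

Under these assumptions, compute cost scales with the hours the cluster actually exists, which is why short-lived clusters over persistent storage can be considerably cheaper than keeping compute running permanently.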
Common scenarios by cluster type
In this section, we are going to walk through the cluster types and review the best-fit solutions as well as the everyday use cases for each of them.
- Apache Hadoop
- Apache Spark
- Apache HBase
- R Server
- Apache Storm
- Interactive Query (Hive 2.0)
- Apache Kafka
Learn more about Pythian's services and solutions for Microsoft Azure.