Hadoop is the open source software framework at the heart of much of the Big Data and analytics revolution. It provides solutions for enterprise data storage and analytics with almost unlimited scalability. Since its first release in 2006 (version 1.0 followed in late 2011) it has rapidly grown in popularity, and a strong ecosystem of distributors, vendors and consultants has emerged to support its use across industry.
At its core, Hadoop is an open source system, which means, among other things, that it is essentially free for anyone to use. However, the need to tailor it to individual organisations has led to the emergence of many commercial distributions. These generally come packaged with support or additional features designed to streamline deployment or to let users build additional analytics, security or data handling into their framework.
Competition in this market is fierce and the landscape is constantly shifting. For example, all the top distributions now include the Apache Spark parallel processing framework, whereas a few years ago this was not the case. The growing prominence of Spark has led many vendors to increase the resources they dedicate to Spark deployment and support.
One important factor to consider in choosing a Hadoop distribution is whether you want an on-premises or cloud-based solution. If there is no room to compromise when it comes to maintaining complete control and ownership of your data, an on-site solution still theoretically offers the highest level of security. In recent years, though, cloud solutions have become less expensive, more flexible and easier to scale.
Most of the vendor products here can be installed either in the cloud or on-premises. However, some cannot be run on-site. These are generally products from web service providers, such as Amazon or Microsoft, running either their own distributions or Hadoop distributions from platform-focused vendors such as Hortonworks or MapR.
Beyond that, all of the top distributions have subtle differences which could make them more or less suitable for your business. Here’s a non-exhaustive guide to some of the most popular on the market today.
Cloudera was the first vendor to offer Hadoop as a package and continues to be a leader in the industry. Its Cloudera CDH distribution, which contains all the open source components, is the most popular Hadoop distribution. Cloudera is known for acting quickly to innovate with additions to the core framework – it was the first to offer SQL-for-Hadoop with its Impala query engine. Other additions include a user interface, security features and interfaces for integration with third-party applications. It offers support for the whole of the distribution through its Cloudera Enterprise subscription service.
Hortonworks’ platform is entirely open source – in fact the company is known for acquiring other companies with useful code and releasing that code into the open source community. What some have seen as the start of a trend towards consolidation in the market has prompted a growth in popularity of Hortonworks’ product. Recently Pivotal stopped development of its own distribution, and both Amazon and IBM now offer Hortonworks as an option on their platforms, alongside their own Hadoop distributions. Hortonworks’ platform is also at the core of the Open Data Platform Initiative – a group looking to simplify and standardise specifications in the Big Data ecosphere. In the long run this is likely to mean it will become even more widely supported.
Like Hortonworks and Cloudera, MapR is a platform-focused provider rather than a managed service provider like Amazon or Microsoft (see below). MapR integrates its own database system, MapR-DB, which it claims is between four and seven times faster than HBase, the stock Hadoop database, running on competing distributions. Due to its power and speed, MapR is often seen as a good choice for the biggest of Big Data projects.
Amazon Elastic MapReduce
Amazon offers a cloud-only Hadoop-as-a-service platform through its Amazon Web Services arm. A key advantage of the pay-as-you-go model offered by cloud-only service providers is scalability, with storage and data processing able to be ramped up or wound down as demands change. Amazon has recently announced that customers can now use the Apache Flink stream processing framework for real-time data analytics on the platform, along with other popular tools such as Kafka and Presto. As you would expect, it also connects seamlessly with Amazon’s other cloud services, such as EC2 for cloud processing, Amazon S3 and DynamoDB for storage, and AWS IoT for collecting data from Internet of Things-enabled devices.
Microsoft’s Azure HDInsight platform is a cloud-only service which offers managed installations of several open source Hadoop distributions including Hortonworks, Cloudera and MapR. It integrates them with its own Azure Data Lake platform to offer a complete solution for cloud-based storage and analytics. As well as the core Hadoop framework, HDInsight provides Spark, Hive, Kafka and Storm cloud services, and its own cloud security framework.
Acquired recently by SAP for $125 million, Altiscale is another company offering cloud-based, managed Hadoop-as-a-service. It continues to offer its Altiscale Data Cloud product, which includes additional operational services like automation, security, scaling and performance-tuning alongside the core Hadoop framework. Data Cloud also provides managed Spark, Hive and Pig services – like most of the other products here – but unlike the other as-a-service offerings, uses its own Hadoop distribution rather than that of one of the platform-focused vendors such as Hortonworks or MapR.
As with the entire big data ecosystem, things are constantly evolving and I will keep a close eye on the developments over the coming months. In the meantime, I hope that this article has provided some clarity about the current state of commercial Hadoop distributions.