Written by

Bernard Marr

Bernard Marr is a world-renowned futurist, influencer and thought leader in the fields of business and technology, with a passion for using technology for the good of humanity. He is a best-selling author of over 20 books, writes a regular column for Forbes and advises and coaches many of the world’s best-known organisations. He has a combined following of 4 million people across his social media channels and newsletters and was ranked by LinkedIn as one of the top 5 business influencers in the world.

Bernard’s latest books are ‘Future Skills’, ‘The Future Internet’, ‘Business Trends in Practice’ and ‘Generative AI in Practice’.

Spark Or Hadoop — Which Is The Best Big Data Framework?

2 July 2021

One question I get asked a lot by my clients is: Should we go for Hadoop or Spark as our big data framework? Spark has overtaken Hadoop as the most active open source Big Data project. While they are not directly comparable products, they both have many of the same uses.

To shed some light on the issue of “Spark vs. Hadoop”, I thought an article explaining the essential differences and similarities of each might be useful. As always, I have tried to keep it accessible to anyone, including those without a background in computer science.

Hadoop and Spark are both Big Data frameworks: they provide some of the most popular tools used to carry out common Big Data-related tasks.

Hadoop was, for many years, the leading open source Big Data framework, but recently the newer and more advanced Spark has become the more popular of the two Apache Software Foundation tools.

However, they do not perform exactly the same tasks, and they are not mutually exclusive, as they are able to work together. Although Spark is reported to work up to 100 times faster than Hadoop’s MapReduce in certain circumstances, it does not provide its own distributed storage system.

Distributed storage is fundamental to many of today’s Big Data projects as it allows vast multi-petabyte datasets to be stored across a very large number of everyday computer hard drives, rather than on hugely costly custom machinery which would hold it all on one device. These systems are scalable, meaning that more drives can be added to the network as the dataset grows in size.
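For readers who like to see an idea in code, here is a minimal Python sketch of how a distributed file system might spread a file across machines. It is a toy under stated assumptions, not how HDFS is implemented: the function name `place_blocks`, the node names and the simple round-robin placement are all made up for illustration (real systems use much larger blocks and smarter placement rules).

```python
def place_blocks(data: bytes, block_size: int, nodes: list, replicas: int = 3) -> dict:
    """Split `data` into fixed-size blocks and assign each block to
    `replicas` distinct nodes, walking the node list round-robin.

    Hypothetical helper for illustration only; the placement rule is an
    assumption, not the policy of any real file system.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    # Ceiling division: a final partial block still needs somewhere to live.
    n_blocks = (len(data) + block_size - 1) // block_size
    placement = {}
    for block in range(n_blocks):
        # Consecutive nodes, wrapping around, so copies land on different machines.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replicas)]
    return placement


if __name__ == "__main__":
    layout = place_blocks(b"x" * 10, block_size=4, nodes=["a", "b", "c", "d"], replicas=2)
    for block, holders in layout.items():
        print(block, holders)
```

Because every block lives on more than one machine, losing a single drive does not lose data, and adding a machine to `nodes` simply gives new blocks more places to go.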

As I mentioned, Spark does not include its own system for organising files in a distributed way (the file system) so it requires one provided by a third party. For this reason many Big Data projects involve installing Spark on top of Hadoop, where Spark’s advanced analytics applications can make use of data stored using the Hadoop Distributed File System (HDFS).

What really gives Spark the edge over Hadoop is speed. Spark handles most of its operations “in memory” – copying the data from the distributed physical storage into far faster RAM. This reduces the amount of time-consuming writing and reading to and from slow, clunky mechanical hard drives that needs to be done under Hadoop’s MapReduce system.
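The difference can be sketched in a few lines of Python. This is a toy comparison, assuming a made-up pipeline of simple steps; `run_via_disk` and `run_in_memory` are hypothetical names, and neither function uses Hadoop or Spark. The first saves and reloads the data after every step, the second keeps it in memory throughout.

```python
import json
import os
import tempfile


def run_via_disk(records, steps, workdir):
    """Apply each step, writing the result to disk and reading it back
    before the next step (the round trip described above)."""
    writes = 0
    for i, step in enumerate(steps):
        records = [step(r) for r in records]
        path = os.path.join(workdir, f"step_{i}.json")
        with open(path, "w") as f:
            json.dump(records, f)
        writes += 1
        with open(path) as f:
            records = json.load(f)
    return records, writes


def run_in_memory(records, steps):
    """Apply the same steps while keeping the data in RAM the whole time."""
    for step in steps:
        records = [step(r) for r in records]
    return records, 0


if __name__ == "__main__":
    steps = [lambda x: x + 1, lambda x: x * 2]
    with tempfile.TemporaryDirectory() as workdir:
        print(run_via_disk([1, 2, 3], steps, workdir))
    print(run_in_memory([1, 2, 3], steps))
```

Both versions produce the same answer; the only difference is how many trips to storage were needed to get there, and on real hardware those trips are where much of the time goes.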

MapReduce writes all of the data back to the physical storage medium after each operation. This was originally done to ensure a full recovery could be made in case something went wrong – as data held electronically in RAM is more volatile than that stored magnetically on discs. However, Spark arranges data in what are known as Resilient Distributed Datasets (RDDs), which keep a record of how they were derived, so any portion lost in a failure can be rebuilt from the original data.
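The idea of recovering by remembering the recipe, rather than by saving every intermediate result, can be illustrated with a small Python sketch. `ToyDataset` is a hypothetical class written for this article, not Spark's actual RDD implementation: it simply stores the source data plus the list of transformations, and rebuilds a lost partition by replaying them.

```python
class ToyDataset:
    """A toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source_partitions, transforms=()):
        self._source = source_partitions      # durable original data, by partition
        self._transforms = tuple(transforms)  # the recorded recipe
        self._cache = {}                      # in-memory results, which can be lost

    def map(self, fn):
        # Nothing is computed yet: we only extend the recipe.
        return ToyDataset(self._source, self._transforms + (fn,))

    def compute(self, partition):
        if partition not in self._cache:
            rows = list(self._source[partition])
            for fn in self._transforms:
                rows = [fn(r) for r in rows]
            self._cache[partition] = rows
        return self._cache[partition]

    def lose(self, partition):
        """Simulate a machine failure wiping one in-memory partition."""
        self._cache.pop(partition, None)


if __name__ == "__main__":
    ds = ToyDataset({0: [1, 2], 1: [3, 4]}).map(lambda x: x * 10)
    print(ds.compute(1))
    ds.lose(1)             # the in-memory copy is gone...
    print(ds.compute(1))   # ...but it is rebuilt from source plus recipe
```

The trade-off is that recovery costs some recomputation, in exchange for not having to write everything to disk after each operation.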

Spark’s functionality for handling advanced data processing tasks such as real-time stream processing and machine learning is way ahead of what is possible with Hadoop alone. This, along with the gain in speed provided by in-memory operations, is the real reason, in my opinion, for its growth in popularity. Real-time processing means that data can be fed into an analytical application the moment it is captured, and insights immediately fed back to the user through a dashboard, to allow action to be taken. This sort of processing is increasingly being used in all sorts of Big Data applications, for example recommendation engines used by retailers, or monitoring the performance of industrial machinery in the manufacturing industry.
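To make the machinery-monitoring example concrete, here is a minimal Python sketch of processing readings one at a time as they arrive. It is an illustration only and does not use Spark; the class name `RollingMonitor`, the window size and the temperature threshold are all assumptions chosen for the example.

```python
from collections import deque


class RollingMonitor:
    """Track the average of the most recent readings and flag when it
    rises above a threshold."""

    def __init__(self, window: int, threshold: float):
        self.readings = deque(maxlen=window)  # old readings drop off automatically
        self.threshold = threshold

    def update(self, value: float) -> bool:
        """Take one new reading; return True if an alert should be raised."""
        self.readings.append(value)
        average = sum(self.readings) / len(self.readings)
        return average > self.threshold


if __name__ == "__main__":
    monitor = RollingMonitor(window=3, threshold=80.0)
    for temperature in [70, 75, 90, 95, 100]:
        print(temperature, monitor.update(temperature))
```

Each reading is dealt with the moment it arrives, so the alert can reach a dashboard while there is still time to act, rather than turning up in a report the next day.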

Machine learning – creating algorithms which can “think” for themselves, allowing them to improve and “learn” through a process of statistical modelling and simulation, until an ideal solution to a proposed problem is found – is an area of analytics which is well suited to the Spark platform, thanks to its speed and ability to handle streaming data. This sort of technology lies at the heart of the latest advanced manufacturing systems used in industry which can predict when parts will go wrong and when to order replacements, and will also lie at the heart of the driverless cars and ships of the near future. Spark includes its own machine learning library, called MLlib, whereas Hadoop systems must be interfaced with a third-party machine learning library, for example Apache Mahout.
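As a very small taste of the “predict when parts will go wrong” idea, the Python sketch below fits a straight line to wear measurements and estimates when the part will reach a failure level. It is a deliberately simple stand-in written in plain Python, not Spark's machine learning library; the function names, the wear figures and the failure level are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def hours_at_failure(slope, intercept, failure_level):
    """Running hours at which the fitted line reaches the failure level."""
    return (failure_level - intercept) / slope


if __name__ == "__main__":
    hours = [0, 100, 200, 300]   # hypothetical running hours
    wear = [1.0, 2.0, 3.0, 4.0]  # hypothetical wear measurements
    slope, intercept = fit_line(hours, wear)
    print(hours_at_failure(slope, intercept, failure_level=10.0))
```

Real systems learn from far more signals than one wear figure, but the principle is the same: fit a model to past data, then use it to decide when to order the replacement.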

The reality is, although the existence of the two Big Data frameworks is often pitched as a battle for dominance, that isn’t really the case. There is some crossover of function, but both are non-commercial products so it isn’t really “competition” as such, and the corporate entities which do make money from providing support and installation of these free-to-use systems will often offer both services, allowing the buyer to pick and choose which functionality they require from each framework.

Many of the big vendors (e.g. Cloudera) now offer Spark as well as Hadoop, so will be in a good position to advise companies on which they will find most suitable, on a job-by-job basis. For example, if your Big Data simply consists of a huge amount of very structured data (e.g. customer names and addresses) you may have no need for the advanced streaming analytics and machine learning functionality provided by Spark. This means you would be wasting time, and probably money, having it installed as a separate layer over your Hadoop storage. Spark, although developing very quickly, is still in its infancy, and the security and support infrastructure is not as advanced.

The increasing amount of Spark activity taking place (when compared to Hadoop activity) in the open source community is, in my opinion, a further sign that everyday business users are finding increasingly innovative uses for their stored data. The open source principle is a great thing, in many ways, and one of them is how it enables seemingly similar products to exist alongside each other – vendors can sell both (or rather, provide installation and support services for both, based on what their customers actually need in order to extract maximum value from their data).
