Written by

Bernard Marr

Bernard Marr is a world-renowned futurist, influencer and thought leader in the fields of business and technology, with a passion for using technology for the good of humanity. He is a best-selling author of 20 books, writes a regular column for Forbes and advises and coaches many of the world’s best-known organisations. He has over 2 million social media followers, 1 million newsletter subscribers and was ranked by LinkedIn as one of the top 5 business influencers in the world and the No 1 influencer in the UK.


What is Hadoop?

2 July 2021

When you learn about Big Data, you will sooner or later come across this odd-sounding word: Hadoop. But what exactly is it?

Put simply, Hadoop can be thought of as a set of open source programs and procedures (meaning that, with a few exceptions, they are free for anyone to use or modify) which anyone can use as the “backbone” of their big data operations.

I’ll try to keep things simple as I know a lot of people reading this aren’t software engineers, so I hope I don’t over-simplify anything – think of this as a brief guide for someone who wants to know a bit more about the nuts and bolts that make big data analysis possible.






The 4 Modules of Hadoop

Hadoop is made up of “modules”, each of which carries out a particular task essential for a computer system designed for big data analytics.


1. Distributed File System

The two most important modules are the Distributed File System, which allows data to be stored in an easily accessible format across a large number of linked storage devices, and MapReduce, which provides the basic tools for poking around in the data.

(A “file system” is the method used by a computer to store data so that it can be found and used. Normally this is determined by the computer’s operating system. A Hadoop system, however, uses its own file system which sits “above” the file system of the host computer, meaning it can be accessed from any computer running any supported OS.)


2. MapReduce

MapReduce is named after the two basic operations this module carries out: reading data from the database and putting it into a format suitable for analysis (map), and performing mathematical operations on it, for example counting the number of males aged 30+ in a customer database (reduce).
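To make the two steps concrete, here is a minimal sketch of the "males aged 30+" count in plain Python. It does not use Hadoop itself. The customer records, the field names and the function names are all made up for illustration, and "30+" is assumed to mean 30 or older. A real Hadoop job would run the same two steps across many machines at once.

```python
from collections import defaultdict

# Hypothetical customer records; field names are assumptions for illustration.
customers = [
    {"name": "Alice", "gender": "F", "age": 34},
    {"name": "Bob", "gender": "M", "age": 41},
    {"name": "Carlos", "gender": "M", "age": 27},
    {"name": "Dmitri", "gender": "M", "age": 30},
    {"name": "Eve", "gender": "F", "age": 52},
]


def map_record(record):
    """Map step: turn one raw record into a (key, value) pair."""
    is_match = record["gender"] == "M" and record["age"] >= 30
    key = "male_30_plus" if is_match else "other"
    return (key, 1)


def group_by_key(pairs):
    """Collect the values for each key, ready for the reduce step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_values(values):
    """Reduce step: combine all values for one key into a single total."""
    return sum(values)


def count_customers(records):
    pairs = [map_record(record) for record in records]
    grouped = group_by_key(pairs)
    return {key: reduce_values(values) for key, values in grouped.items()}


result = count_customers(customers)
print(result["male_30_plus"])  # number of males aged 30 or older
```

The map step only ever looks at one record at a time, which is what allows the work to be split across many machines before the totals are added up.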


3. Hadoop Common

The third module is Hadoop Common, which provides the tools (in Java) needed for the user’s computer systems (Windows, Unix or whatever) to read data stored under the Hadoop file system.


4. YARN

The final module is YARN, which manages resources of the systems storing the data and running the analysis.

Various other procedures, libraries or features have come to be considered part of the Hadoop “framework” over recent years, but Hadoop Distributed File System, Hadoop MapReduce, Hadoop Common and Hadoop YARN are the principal four.


How Hadoop Came About

Development of Hadoop began when forward-thinking software engineers realised that it was quickly becoming useful for anybody to be able to store and analyze datasets far larger than can practically be stored and accessed on one physical storage device (such as a hard disk).

This is partly because as physical storage devices become bigger it takes longer for the component that reads the data from the disk (which in a hard disk, would be the “head”) to move to a specified segment. Instead, many smaller devices working in parallel are more efficient than one large one.
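A rough back-of-the-envelope calculation shows why this matters. The figures below (a 1 TB dataset and disks that each read 100 MB per second) are assumptions chosen to keep the arithmetic simple, and the sketch ignores the overhead of coordinating the disks.

```python
def read_time_seconds(dataset_gb, throughput_mb_per_s, num_disks=1):
    """Time to read a dataset split evenly across disks read in parallel.

    Assumes every disk reads at the same speed and ignores any
    coordination or network overhead.
    """
    dataset_mb = dataset_gb * 1000
    per_disk_mb = dataset_mb / num_disks
    return per_disk_mb / throughput_mb_per_s


# Assumed figures: 1 TB of data, 100 MB/s per disk.
single_disk = read_time_seconds(1000, 100)
hundred_disks = read_time_seconds(1000, 100, num_disks=100)

print(f"One disk: {single_disk / 3600:.1f} hours")
print(f"100 disks in parallel: {hundred_disks / 60:.1f} minutes")
```

Under these assumptions, one disk needs close to three hours to read the whole dataset, while 100 disks reading their own share at the same time finish in under two minutes.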

Hadoop was released in 2005 by the Apache Software Foundation, a non-profit organization that produces open source software powering much of the Internet behind the scenes. And if you’re wondering where the odd name came from, it was the name given to a toy elephant belonging to the son of one of the original creators!


The Usage of Hadoop

The flexible nature of a Hadoop system means companies can add to or modify their data system as their needs change, using cheap and readily-available parts from any IT vendor.

Today, it is the most widely used system for providing data storage and processing across “commodity” hardware – relatively inexpensive, off-the-shelf systems linked together, as opposed to expensive, bespoke systems custom-made for the job in hand. In fact it is claimed that more than half of the companies in the Fortune 500 make use of it.

Just about all of the big online names use it, and as anyone is free to alter it for their own purposes, modifications made to the software by expert engineers at, for example, Amazon and Google, are fed back to the development community, where they are often used to improve the “official” product. This form of collaborative development between volunteer and commercial users is a key feature of open source software.

In its “raw” state, using the basic modules supplied by Apache at https://hadoop.apache.org/, it can be very complex, even for IT professionals. This is why various commercial versions, such as Cloudera, have been developed. These simplify the task of installing and running a Hadoop system, and also offer training and support services.

So that, in a (fairly large) nutshell, is Hadoop. Thanks to the flexible nature of the system, companies can expand and adjust their data analysis operations as their business grows. And the support and enthusiasm of the open source community behind it has led to great strides towards making big data analysis more accessible for everyone.

