Written by

Bernard Marr

Bernard Marr is a world-renowned futurist, influencer and thought leader in the fields of business and technology, with a passion for using technology for the good of humanity. He is a best-selling author of 20 books, writes a regular column for Forbes and advises and coaches many of the world’s best-known organisations. He has over 2 million social media followers, 1 million newsletter subscribers and was ranked by LinkedIn as one of the top 5 business influencers in the world and the No 1 influencer in the UK.

Bernard’s latest book is ‘Business Trends in Practice: The 25+ Trends That Are Redefining Organisations’

View Latest Book

Follow Me

Bernard Marr ist ein weltbekannter Futurist, Influencer und Vordenker in den Bereichen Wirtschaft und Technologie mit einer Leidenschaft für den Einsatz von Technologie zum Wohle der Menschheit. Er ist Bestsellerautor von 20 Büchern, schreibt eine regelmäßige Kolumne für Forbes und berät und coacht viele der weltweit bekanntesten Organisationen. Er hat über 2 Millionen Social-Media-Follower, 1 Million Newsletter-Abonnenten und wurde von LinkedIn als einer der Top-5-Business-Influencer der Welt und von Xing als Top Mind 2021 ausgezeichnet.

Bernards neueste Bücher sind ‘Künstliche Intelligenz im Unternehmen: Innovative Anwendungen in 50 Erfolgreichen Unternehmen’

View Latest Book

Follow Me

What is Hadoop?

2 July 2021

When you learn about Big Data you will sooner or later come across this odd sounding word: Hadoop – but what exactly is it?

Put simply, Hadoop can be thought of as a set of open source programs and procedures (meaning essentially they are free for anyone to use or modify, with a few exceptions) which anyone can use as the “backbone” of their big data operations.

I’ll try to keep things simple as I know a lot of people reading this aren’t software engineers, so I hope I don’t over-simplify anything – think of this as a brief guide for someone who wants to know a bit more about the nuts and bolts that make big data analysis possible.






The 4 Modules of Hadoop

Hadoop is made up of “modules”, each of which carries out a particular task essential for a computer system designed for big data analytics.


1. Distributed File-System

The most important two are the Distributed File System, which allows data to be stored in an easily accessible format, across a large number of linked storage devices, and the MapReduce – which provides the basic tools for poking around in the data.

(A “file system” is the method used by a computer to store data, so it can be found and used. Normally this is determined by the computer’s operating system, however a Hadoop system uses its own file system which sits “above” the file system of the host computer – meaning it can be accessed using any computer running any supported OS).


2. MapReduce

MapReduce is named after the two basic operations this module carries out – reading data from the database, putting it into a format suitable for analysis (map), and performing mathematical operations i.e counting the number of males aged 30+ in a customer database (reduce).


3. Hadoop Common

The other module is Hadoop Common, which provides the tools (in Java) needed for the user’s computer systems (Windows, Unix or whatever) to read data stored under the Hadoop file system.


4. YARN

The final module is YARN, which manages resources of the systems storing the data and running the analysis.

Various other procedures, libraries or features have come to be considered part of the Hadoop “framework” over recent years, but Hadoop Distributed File System, Hadoop MapReduce, Hadoop Common and Hadoop YARN are the principle four.


How Hadoop Came About

Development of Hadoop began when forward-thinking software engineers realised that it was quickly becoming useful for anybody to be able to store and analyze datasets far larger than can practically be stored and accessed on one physical storage device (such as a hard disk).

This is partly because as physical storage devices become bigger it takes longer for the component that reads the data from the disk (which in a hard disk, would be the “head”) to move to a specified segment. Instead, many smaller devices working in parallel are more efficient than one large one.

It was released in 2005 by the Apache Software Foundation, a non-profit organization which produces open source software which powers much of the Internet behind the scenes. And if you’re wondering where the odd name came from, it was the name given to a toy elephant belonging to the son of one of the original creators!


The Usage of Hadoop

The flexible nature of a Hadoop system means companies can add to or modify their data system as their needs change, using cheap and readily-available parts from any IT vendor.

Today, it is the most widely used system for providing data storage and processing across “commodity” hardware – relatively inexpensive, off-the-shelf systems linked together, as opposed to expensive, bespoke systems custom-made for the job in hand. In fact it is claimed that more than half of the companies in the Fortune 500 make use of it.

Just about all of the big online names use it, and as anyone is free to alter it for their own purposes, modifications made to the software by expert engineers at, for example, Amazon and Google, are fed back to the development community, where they are often used to improve the “official” product. This form of collaborative development between volunteer and commercial users is a key feature of open source software.

In its “raw” state – using the basic modules supplied here https://hadoop.apache.org/ by Apache, it can be very complex, even for IT professionals – which is why various commercial versions have been developed such as Cloudera which simplify the task of installing and running a Hadoop system, as well as offering training and support services.

So that, in a (fairly large) nutshell, is Hadoop. Thanks to the flexible nature of the system, companies can expand and adjust their data analysis operations as their business expands. And the support and enthusiasm of the open source community behind it has led to great strides towards making big data analysis more accessible for everyone.

Where to go from here

If you would like to know more about Hadoop or other big data tools, check out my articles on:

Or browse the Big Data section of this site to find more articles and many practical examples.

Business Trends In Practice | Bernard Marr
Business Trends In Practice | Bernard Marr

Related Articles

Google’s New Performance Management Update

American companies are in the midst of dealing with what has been dubbed “The Great Resignation,” an exodus of employees seeking higher pay[...]

What You Need To Know Before You Start Working With Artificial Intelligence

It seems like everyone is talking about artificial intelligence at the moment, and there’s good reason for that. We are seeing its revolutionary impact across just about every industry.[...]

Why Is Data Governance So Important To Every Organisation?

In business today, data is understood to be the key to improving every aspect of how we plan, administer, design, build, sell and look after our customers.[...]

Why External Data Is So Important For Every Business

Internal data is often the first place that companies look when they start to think about analytics and insights.[...]

How To Make Money From Data: The Essential Data Monetization Tips

Data has become the main raw material of the 4th Industrial Revolution, and making money from data has become a huge business opportunity.[...]

Is Space The Next Frontier For Agriculture And Biology?

Space exploration is very much in vogue again in recent years thanks to the exploits of billionaires like Jeff Bezos and Richard Branson.[...]

Stay up-to-date

  • Get updates straight to your inbox
  • Join my 1 million newsletter subscribers
  • Never miss any new content

Social Media

0
Followers
0
Followers
0
Followers
0
Subscribers
0
Followers
0
Subscribers
0
Yearly Views
0
Readers

Podcasts

View Podcasts