What is Hadoop?
2 July 2021
When you learn about Big Data you will sooner or later come across this odd-sounding word: Hadoop – but what exactly is it?
Put simply, Hadoop can be thought of as a set of open source programs and procedures (meaning they are essentially free for anyone to use or modify, with a few exceptions) that anyone can use as the “backbone” of their big data operations.
I’ll try to keep things simple as I know a lot of people reading this aren’t software engineers, so I hope I don’t over-simplify anything – think of this as a brief guide for someone who wants to know a bit more about the nuts and bolts that make big data analysis possible.
The 4 Modules of Hadoop
Hadoop is made up of “modules”, each of which carries out a particular task essential for a computer system designed for big data analytics.
1. Distributed File System
The two most important are the Distributed File System, which allows data to be stored in an easily accessible format across a large number of linked storage devices, and MapReduce, which provides the basic tools for poking around in the data.
(A “file system” is the method a computer uses to store data so that it can be found and used. Normally this is determined by the computer’s operating system; a Hadoop system, however, uses its own file system which sits “above” the file system of the host computer – meaning it can be accessed from any computer running any supported OS.)
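To make the idea less abstract, here is a toy sketch in Python – purely for illustration, and not how HDFS is actually implemented – of the core trick a distributed file system performs: cutting one large file into fixed-size blocks and spreading those blocks across several machines. The 128 MB block size matches HDFS’s default; the node names and file are made up for the example.

```python
# A toy sketch of the idea behind a distributed file system: one large
# file is cut into fixed-size blocks, and each block is assigned to a
# storage node. (Real HDFS also replicates each block, three copies by
# default, so losing a single disk loses no data.)

BLOCK_SIZE = 128 * 1024 * 1024          # HDFS's default block size: 128 MB
NODES = ["node-1", "node-2", "node-3"]  # hypothetical storage machines

def place_blocks(path, block_size=BLOCK_SIZE):
    """Return a mapping from node name to the block numbers it would hold."""
    placement = {node: [] for node in NODES}
    block_number = 0
    with open(path, "rb") as f:
        while f.read(block_size):  # read one block at a time until end of file
            # Simple round-robin placement; real HDFS balances more cleverly.
            placement[NODES[block_number % len(NODES)]].append(block_number)
            block_number += 1
    return placement

# Demo: write a small dummy file and "distribute" it with a tiny block
# size so the placement is visible.
with open("customers.csv", "wb") as f:
    f.write(b"x" * 1000)
print(place_blocks("customers.csv", block_size=100))
# -> {'node-1': [0, 3, 6, 9], 'node-2': [1, 4, 7], 'node-3': [2, 5, 8]}
```

Because each machine holds only some of the blocks, analysis jobs can later be sent to the machines where the data already lives, rather than dragging terabytes of data across the network.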
2. MapReduce
MapReduce is named after the two basic operations this module carries out: reading data from the database and putting it into a format suitable for analysis (map), and performing mathematical operations, e.g. counting the number of males aged 30+ in a customer database (reduce).
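To give a feel for what that looks like in practice, here is a minimal sketch of that “males aged 30+” count written as a pair of scripts for Hadoop Streaming, a utility that lets you write the map and reduce steps in everyday languages like Python (production jobs are more often written in Java). The customer file’s layout – one comma-separated record of name, gender and age per line – is an assumption made purely for this example.

```python
#!/usr/bin/env python3
# mapper.py -- the "map" step: read raw customer records from standard
# input and emit a key/value pair for every male customer aged 30 or over.
# Assumed record layout (illustrative only): name,gender,age
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if len(fields) != 3 or not fields[2].isdigit():
        continue  # skip header rows and malformed records
    name, gender, age = fields
    if gender.lower() == "male" and int(age) >= 30:
        # Hadoop Streaming expects tab-separated key/value output.
        print("males_30_plus\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- the "reduce" step: Hadoop groups and sorts the mapper's
# output by key before it arrives here; with only one key in play, the
# reducer simply adds up the 1s.
import sys

total = 0
for line in sys.stdin:
    key, _, value = line.strip().partition("\t")
    total += int(value)
print("males_30_plus\t%d" % total)
```

On a real cluster, Hadoop would run a copy of the mapper on every machine holding a block of the customer file and funnel the combined output through the reducer – which is what lets the same simple logic work on terabytes of data.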
3. Hadoop Common
The third module is Hadoop Common, which provides the tools (in Java) needed for the user’s computer systems (Windows, Unix or whatever) to read data stored under the Hadoop file system.
4. YARN
The final module is YARN (Yet Another Resource Negotiator), which manages the resources of the systems that store the data and run the analysis.
Various other procedures, libraries or features have come to be considered part of the Hadoop “framework” over recent years, but Hadoop Distributed File System, Hadoop MapReduce, Hadoop Common and Hadoop YARN are the principal four.
How Hadoop Came About
Development of Hadoop began when forward-thinking software engineers realised that it was quickly becoming valuable for just about anybody to be able to store and analyze datasets far larger than can practically be stored and accessed on a single physical storage device (such as a hard disk).
This is partly because, as physical storage devices become bigger, it takes longer for the component that reads the data from the disk (which, in a hard disk, would be the “head”) to move to a specified segment. Instead, many smaller devices working in parallel are more efficient than one large one: at a typical hard-disk read speed of around 100 MB per second, scanning a full terabyte takes close to three hours, while a hundred disks each scanning a hundredth of the data in parallel can finish in under two minutes.
It was released in 2005 by the Apache Software Foundation, a non-profit organization which produces open source software that powers much of the Internet behind the scenes. And if you’re wondering where the odd name came from, it was the name given to a toy elephant belonging to the son of one of the original creators!
How Hadoop Is Used
The flexible nature of a Hadoop system means companies can add to or modify their data system as their needs change, using cheap and readily-available parts from any IT vendor.
Today, it is the most widely used system for providing data storage and processing across “commodity” hardware – relatively inexpensive, off-the-shelf systems linked together, as opposed to expensive, bespoke systems custom-made for the job in hand. In fact, it is claimed that more than half of the companies in the Fortune 500 make use of it.
Just about all of the big online names use it, and as anyone is free to alter it for their own purposes, modifications made to the software by expert engineers at, for example, Amazon and Google, are fed back to the development community, where they are often used to improve the “official” product. This form of collaborative development between volunteer and commercial users is a key feature of open source software.
In its “raw” state – using the basic modules supplied by Apache at https://hadoop.apache.org/ – it can be very complex to install and run, even for IT professionals, which is why various commercial versions such as Cloudera have been developed that simplify the task of installing and running a Hadoop system, as well as offering training and support services.
So that, in a (fairly large) nutshell, is Hadoop. Thanks to the flexible nature of the system, companies can expand and adjust their data analysis operations as their business expands. And the support and enthusiasm of the open source community behind it has led to great strides towards making big data analysis more accessible for everyone.
Where to go from here
If you would like to know more about Hadoop or other big data tools, check out my articles on:
- The 6 Best Hadoop Vendors For Your Big Data Project
- Spark or Hadoop — Which is the best Big Data Framework?
- What Is Spark – An Easy Explanation For Absolutely Anyone
- What Is Kafka? A Super-Simple Explanation Of This Important Data Analytics Tool
Or browse the Big Data section of this site to find more articles and many practical examples.