Written by

Bernard Marr

Bernard Marr is a world-renowned futurist, influencer and thought leader in the fields of business and technology, with a passion for using technology for the good of humanity. He is a best-selling and award-winning author of over 20 books, writes a regular column for Forbes and advises and coaches many of the world’s best-known organisations. He has a combined following of 5 million people across his social media channels and newsletters and was ranked by LinkedIn as one of the top 5 business influencers in the world.

Bernard’s latest books are ‘Future Skills’, ‘Generative AI in Practice’, ‘Data Strategy 3rd Ed’ and ‘AI Strategy’.

20+ Amazing (And Free) Data Sources Anyone Can Use To Build AIs

2 June 2023

When we talk about artificial intelligence (AI) in business and society today, what we really mean is machine learning (ML). This refers to applications that use algorithms (sets of instructions) to become increasingly good at performing a particular task as they are exposed to more and more data relating to that task.
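To make "improving with more data" concrete, here is a minimal sketch rather than a real business application. It fits a straight line by ordinary least squares to noisy synthetic points generated from an assumed rule, y = 2x + 1. The function names, the noise level, and the sample sizes are all illustrative assumptions.

```python
import random


def make_samples(n, rng):
    """Generate n noisy points from the assumed rule y = 2x + 1."""
    samples = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        y = 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)  # Gaussian noise, std. dev. 1
        samples.append((x, y))
    return samples


def fit_slope(samples):
    """Estimate the slope of the line with ordinary least squares."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    variance = sum((x - mean_x) ** 2 for x, _ in samples)
    return covariance / variance


rng = random.Random(42)  # fixed seed so repeated runs give the same numbers
for n in (5, 50, 5000):
    slope = fit_slope(make_samples(n, rng))
    print(f"{n:>5} samples -> estimated slope {slope:.3f} (true value 2.0)")
```

With only a handful of points the estimate can be noticeably off; with thousands it typically lands very close to the true slope. Real ML systems are far more elaborate, but the same principle applies.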

These tasks could be anything from answering questions and creating text or images (as demonstrated by apps like ChatGPT or DALL-E) to recognizing images (computer vision) or navigating autonomous vehicles from A to B.

All of these tasks require data, and businesses that want to train their own ML algorithms in order to automate their day-to-day tasks need sources of data.

What types of data are there?

Business data is commonly divided into two categories – internal and external data.

Internal data is data collected by organizations themselves from within their own operations. This commonly includes financial data, customer feedback data, HR data, operational data, and many more sources. Data collected by an organization monitoring its own operations is said to be proprietary data, and is valuable because it gives information specific to that business.

External data comes from sources outside of the organization and is typically collected from third-party data sources such as those listed below. If data is freely available to anyone, it is called open data.

Further to this, data can also be classified as either structured, unstructured, or semi-structured data.

Structured data is information that fits neatly into tables. For example, sales data showing what products a business sold, when, where, and at what price is internal, structured data. Alternatively, a business might analyze historical market data and economic indicators to predict future movements in the markets it operates in (structured, external data).

Unstructured data is everything else – for example, pictures, videos, text, and social media posts. It can certainly contain valuable insights but is more difficult to analyze. AI, however, has proven particularly useful for extracting meaning from unstructured data. Image recognition algorithms, for example, might tell a business useful facts about customer behavior by analyzing in-store CCTV images (internal, unstructured data). They might also find valuable insights by analyzing images related to the business posted on social media (unstructured, external data).
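The three classifications can be illustrated with a short sketch. Semi-structured data is named above without a definition; the example assumes the common reading, which is data with labeled fields whose records need not share one fixed shape (JSON is a typical case). All product names, values, and variable names here are made up for illustration.

```python
import csv
import io
import json

# Structured: every record has the same columns, so it fits a table.
sales_csv = io.StringIO(
    "product,date,store,price\n"
    "kettle,2023-05-01,Leeds,24.99\n"
    "toaster,2023-05-01,York,19.50\n"
)
rows = list(csv.DictReader(sales_csv))
total_revenue = sum(float(row["price"]) for row in rows)

# Semi-structured: fields are labeled, but records can differ in shape
# (the second record carries an extra "tags" field).
feedback_json = (
    '[{"customer": 17, "rating": 4},'
    ' {"customer": 23, "rating": 5, "tags": ["delivery", "price"]}]'
)
feedback = json.loads(feedback_json)
average_rating = sum(item["rating"] for item in feedback) / len(feedback)

# Unstructured: free text with no fields at all. A keyword check is
# about the simplest analysis possible; real systems would use ML here.
review = "Arrived late, but the kettle itself is great."
mentions_kettle = "kettle" in review.lower()

print(f"Total revenue: {total_revenue:.2f}")
print(f"Average rating: {average_rating}")
print(f"Review mentions kettle: {mentions_kettle}")
```

The structured and semi-structured cases can be queried directly by field name, while the free-text review has to be interpreted before it yields anything, which is why unstructured data is harder to analyze.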

Luckily, data is everywhere. Whatever you’re trying to do, if it requires external data, there’s likely to be a source for it online. Governments, research institutions, private companies, and non-governmental organizations all routinely make data freely available for research and even commercial purposes. So here are some of the best sources of free online data available in 2023.

Data Search Engines and Repositories

Google Dataset Search – A search engine for datasets indexed by Google; use it to find data on just about anything you could need.

AWS Open Data Search – Another dataset search engine, this one provided by Amazon's AWS service.

Microsoft Research Open Data – Free, open datasets collected by Microsoft, with a mainly scientific focus.

UCI Machine Learning Repository – A repository of more than 600 open datasets curated and maintained by the University of California, Irvine, and made available for the purpose of training machine learning algorithms.

Kaggle Datasets – Online data science platform Kaggle also offers a curated catalog of datasets covering everything from university rankings to trending Google searches, retail sales, online movie reviews, and crime statistics.

Reddit r/datasets – A large collection of datasets submitted by users of the online community site Reddit, covering hundreds of subjects.

Government and Inter-Governmental Organization Datasets

Data.Gov – Open data portal provided by the US government, hosting nearly a quarter of a million datasets published by all government agencies.

Data.Census.Gov – If you’re specifically looking for US demographic data, this is a good place to start!

Data.EU – The European Union's open data portal contains data from EU organizations and member state governmental data.

Data.gov.uk – Open data sets published by UK government agencies.

World Health Organization Data – Datasets related to global health and wellbeing.

World Bank Open Data – Datasets related to economic development, international financial markets, social indicators, and environmental issues.

Image Data

Google Open Images – Millions of images classified and labeled in various ways, suitable for training many different types of computer vision algorithms.

ImageNet Open Dataset – Another dataset consisting of labeled images that’s free to use for non-commercial machine learning applications.

COCO Dataset – Common Objects in Context (COCO) is a dataset consisting of over 200,000 images selected for training object detection and captioning algorithms.
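Labeled image datasets usually ship their labels as a separate annotation file. As a rough sketch, assuming the JSON layout commonly documented for COCO's object detection annotations (top-level "images", "categories", and "annotations" lists, with each bounding box given as [x, y, width, height]), counting labeled objects per category might look like this. The sample record is hand-written with made-up values, and count_objects_by_category is a hypothetical helper, not part of any official COCO tooling.

```python
import json
from collections import Counter

# A tiny hand-written stand-in for a COCO-style annotation file.
annotation_json = """
{
  "images": [{"id": 1, "file_name": "street.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "person"}, {"id": 3, "name": "car"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [12, 40, 60, 180]},
    {"id": 11, "image_id": 1, "category_id": 3, "bbox": [200, 150, 220, 110]},
    {"id": 12, "image_id": 1, "category_id": 3, "bbox": [430, 160, 180, 90]}
  ]
}
"""


def count_objects_by_category(coco):
    """Count annotated objects per category name (hypothetical helper)."""
    names = {category["id"]: category["name"] for category in coco["categories"]}
    return Counter(names[ann["category_id"]] for ann in coco["annotations"])


coco = json.loads(annotation_json)
counts = count_objects_by_category(coco)
print(dict(counts))
```

A count like this is a quick way to check whether a dataset contains enough examples of the objects a computer vision model needs to learn before committing to it.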

Sound Data

Mozilla Common Voice – An open dataset of voice recordings that can be used to train any AI application that involves speech.

AudioSet – Another Google-curated dataset, this one focusing on sounds and containing hundreds of thousands of 10-second samples broken down into categories such as musical instruments, vehicles, and vocals.

Million Song Dataset – Audio features and metadata for one million contemporary popular music tracks.

Text Data

Wikidata – A free, structured knowledge base that supplies data to Wikipedia and its sister projects, with database downloads available in a number of different formats.

Common Crawl – An open repository of data scraped from the world wide web, famously used as part of the training data for the GPT large language models behind ChatGPT, as well as for many other chatbots.

Other and Miscellaneous Datasets

Amazon Reviews – A database of around 35 million reviews for Amazon products, including product information and ratings.

Waymo Open Dataset – Alphabet’s autonomous driving subsidiary Waymo makes a huge amount of data collected via self-driving vehicles publicly accessible, including sensor data from cameras and LiDAR.

ApolloScape Dataset – More autonomous driving data, this time provided by Baidu’s open-source Apollo platform.