These tasks could be anything from answering questions and creating text or images (as demonstrated by apps like ChatGPT or DALL-E) to recognizing images (computer vision) or navigating self-driving vehicles from A to B.
All of these tasks require data, and businesses that want to train their own ML algorithms in order to automate their day-to-day tasks need sources of data.
What types of data are there?
Business data is commonly divided into two categories – internal and external data.
Internal data is data collected by organizations themselves from within their own operations. This commonly includes financial data, customer feedback data, HR data, operational data, and more. Data collected by an organization monitoring its own operations is said to be proprietary data, and is valuable because it gives information specific to that business.
External data comes from sources outside of the organization and is typically collected from third-party data sources such as those listed below. If data is freely available to anyone, it is called open data.
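Beyond this, data can also be classified as structured, unstructured, or semi-structured. Semi-structured data, such as JSON or XML files, sits in between the other two: it carries some organizing tags or keys but does not fit a fixed table.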
Structured data is information that fits neatly into tables. For example, sales data showing which products a business sold, when, where, and at what price is internal, structured data. Alternatively, a business might analyze historical market data and economic indicators to predict future movements in the markets it operates in (structured, external data).
Unstructured data has no predefined format – for example, pictures, videos, text, and social media posts. It can certainly contain valuable insights but is more difficult to analyze. AI, however, has proven particularly useful for extracting meaning from unstructured data. Image recognition algorithms, for example, might tell a business useful facts about customer behavior by analyzing in-store CCTV images (internal, unstructured data). The business might also find valuable insights by analyzing images related to it that are posted on social media (unstructured, external data).
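To make the distinction concrete, here is a minimal Python sketch of how the same business might handle each kind of data. The sample records, field names, and helper functions are all hypothetical and invented for illustration. The keyword matcher is only a crude stand-in for the AI models that would normally be used on unstructured data.

```python
import csv
import io
import json
from collections import defaultdict

# Structured: fixed columns, one record per row (hypothetical sales data).
SALES_CSV = """product,store,date,price
widget,London,2023-01-05,9.99
widget,Leeds,2023-01-06,9.99
gadget,London,2023-01-06,24.50
"""

# Semi-structured: tagged fields, but records need not share the same shape.
FEEDBACK_JSON = (
    '[{"customer": 1, "rating": 4},'
    ' {"customer": 2, "rating": 5, "comment": "Fast delivery"}]'
)

# Unstructured: free text with no predefined fields.
SOCIAL_POST = "Loved the new gadget, but the London store queue was far too long!"


def revenue_by_product(csv_text):
    """Sum the price column per product; simple because the columns are fixed."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["product"]] += float(row["price"])
    return {product: round(total, 2) for product, total in totals.items()}


def average_rating(json_text):
    """Average the rating field, tolerating records with extra or missing keys."""
    ratings = [r["rating"] for r in json.loads(json_text) if "rating" in r]
    return sum(ratings) / len(ratings)


def mentions(text, keywords):
    """Crude keyword matching; real analysis of free text would use an ML model."""
    lowered = text.lower()
    return [k for k in keywords if k in lowered]
```

The structured and semi-structured cases can be queried directly by column or key. The free-text post has to be interpreted before it yields anything, which is why unstructured data is harder to analyze.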
Luckily, data is everywhere. Whatever you’re trying to do, if it requires external data, there’s likely to be a source for it online. Governments, research institutions, private companies, and non-governmental organizations all routinely make data freely available for research and even commercial purposes. So here are some of the best sources of free online data available in 2023.
Data Search Engines and Repositories
Google Dataset Search – This is actually a search engine for datasets cataloged by Google; use this to find data on just about anything you could need.
AWS Open Data Search – Another dataset search engine, this one provided by Amazon Web Services (AWS).
Microsoft Research Open Data – Free, open datasets collected by Microsoft, with a mainly scientific focus.
UCI Machine Learning Repository – A repository of more than 600 open datasets curated and maintained by the University of California, Irvine, and made available for the purpose of training machine learning algorithms.
Kaggle Datasets – Online data science platform Kaggle also offers a curated catalog of datasets covering everything from university rankings to trending Google searches, retail sales, online movie reviews, and crime statistics.
Reddit r/datasets – A large collection of datasets submitted by users of the online community site Reddit, covering hundreds of subjects.
Government and Inter-Governmental Organization Datasets
Data.Gov – Open data portal provided by the US government, hosting nearly a quarter of a million datasets published by all government agencies.
Data.Census.Gov – If you’re specifically looking for US demographic data, this is a good place to start!
Data.EU – The European Union's open data portal contains data from EU institutions and member state governments.
Data.gov.uk – Open datasets published by UK government agencies.
World Health Organization Data – Datasets related to global health and wellbeing.
World Bank Open Data – Datasets related to economic development, international financial markets, social indicators, and environmental issues.
Image, Audio, and Text Datasets
Google Open Images – Millions of images classified and labeled in various ways, suitable for training many different types of computer vision algorithms.
ImageNet Open Dataset – Another dataset consisting of labeled images that’s free to use for non-commercial machine learning applications.
COCO Dataset – Common Objects in Context (COCO) is a dataset consisting of over 200,000 images selected for training object detection and captioning algorithms.
Mozilla Common Voice – An open dataset of voice recordings that can be used to train any AI application that involves speech.
AudioSet – Another Google-curated dataset, this one focusing on sounds and containing hundreds of thousands of 10-second samples broken down into categories such as musical instruments, vehicles, and vocals.
Million Song Dataset – Audio features and metadata for one million contemporary popular music tracks.
Wikidata – A free, structured knowledge base from the Wikimedia Foundation that supplies data to Wikipedia, with database downloads available in a number of different formats.
Common Crawl – An open repository of data crawled from the world wide web, famously used as part of the training data for the GPT large language models behind ChatGPT and many other chatbots.
Other and Miscellaneous Datasets
Amazon Reviews – A database of around 35 million reviews for Amazon products, including product information and ratings.
Waymo Open Dataset – Alphabet’s autonomous driving subsidiary Waymo makes a huge amount of data collected via self-driving vehicles publicly accessible, including sensor data from cameras and LiDAR.
ApolloScape Dataset – More autonomous driving data, this time provided by Baidu’s open-source Apollo platform.