Data Science is the theory and practice powering the data-driven transformations we are seeing across industry and society today.
Artificial Intelligence, self-driving cars and predictive analytics are just a few of the breakthroughs that have been made thanks to our ever-growing ability to collect and analyse data.
Just as with Big Data and Artificial Intelligence, the field of Data Science has developed its own lexicon, which can be confusing at first for beginners. An understanding of the basic terminology is essential for anyone thinking about how this technology can be applied. So here is my run-through of some of the technologies, phrases and buzzwords you are likely to come across.
When carrying out scientific data analysis using personal data (data which identifies a person), anonymization refers to the process of removing or obfuscating indicators in the data which show who it specifically refers to. This is not always as simple as it sounds, as people can be identified by more than just their name. Properly anonymized data is no longer considered “personal” and there are commonly fewer legal and ethical restrictions on how it can be used.
Repeatable sets of instructions which people or machines can use to process data. Typically, algorithms are constructed by feeding data into them and adjusting variables until a desired outcome is achieved. Thanks to breakthroughs in AI such as machine learning and neural networks, machines generally do this today, as they can do it far more quickly than any human.
Today’s AIs are built on concepts developed through the study and application of data science. One way to categorize the latest wave of “intelligent” machines is as machines which are capable of carrying out data science for themselves. Rather than simply processing the data they are fed in the way they are told, they can learn from it and adapt to become better at processing it. This is how Google Translate becomes better at understanding language, and how autonomous cars will navigate areas they have never visited before.
A mathematical formula used to predict the probability of one event occurring, given whether or not another event has occurred. It is a technique commonly used in data science to establish probabilities and outcomes which depend on unknown variables, and it is used to build Bayesian Networks, where the principle is applied across large datasets.
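To make the formula concrete, here is a minimal sketch in Python. It applies the usual statement of the theorem, P(A|B) = P(B|A) × P(A) / P(B). The function name, parameter names and the numbers are all made up for illustration.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(A | B) using Bayes' theorem.

    prior               -- P(A): probability of the event before seeing evidence
    likelihood          -- P(B | A): probability of the evidence if A is true
    false_positive_rate -- P(B | not A): probability of the evidence if A is false
    """
    # Total probability of seeing the evidence B at all
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% of machines are faulty, a sensor flags 90% of
# faulty machines, and it also wrongly flags 5% of healthy ones.
posterior = bayes_posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(round(posterior, 3))  # probability a flagged machine is really faulty
```

With these illustrative figures, a flagged machine is still more likely to be healthy than faulty, because faults are rare to begin with.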
The use of data on a person or object’s behaviour to make predictions on how it might change in the future (see predictive modelling in Part II), or to determine the variables which affect it, so that more favourable or efficient outcomes might be achieved.
Big Data is the “buzzword” term which has come to represent the vast increase in the amount of data which has become available in recent years, particularly as the world has increasingly moved online and become connected through the internet. This data is distinguished from data previously available not just by its size but also by the high speed at which it is generated, and the large variations in the forms it can take. It greatly expands the potential of what can be achieved with data science, which before widespread digitization was hampered by slow computer processing speeds and the difficulty of capturing accurate information in large volumes.
Citizen Data Scientist
Sometimes also referred to as an “armchair data scientist”. One of the increasing number of people who, although not academically trained or professionally employed primarily as data scientists, are able to use data science tools and techniques to improve the use of information in their own field of study or work. This is increasingly becoming possible thanks to the growing number of automated or “self service” tools and platforms for data analytics.
The ability to use data (about an object, event or anything else) to determine which of a number of predetermined groups an item belongs in. For a basic example, an image recognition analysis might classify all shapes with four equal sides as squares, and all shapes with three sides as triangles.
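The shapes example can be sketched in a few lines of Python. This is only an illustration of sorting items into predetermined groups: the function name and the "unknown" fallback are my own additions, and the rule follows the simplified definition above (strictly speaking, four equal sides also describes a rhombus).

```python
def classify_shape(side_lengths):
    """Assign a shape to one of a fixed set of predetermined groups."""
    number_of_sides = len(side_lengths)
    if number_of_sides == 3:
        return "triangle"
    # Four sides that all have the same length
    if number_of_sides == 4 and len(set(side_lengths)) == 1:
        return "square"
    # Anything that fits none of the known groups
    return "unknown"


print(classify_shape([2, 2, 2, 2]))
print(classify_shape([3, 4, 5]))
```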
Analysis of the way humans interact with computers or use machinery. The name refers to recording and analyzing where a mouse is clicked on a screen (with the sequence of interactive actions taken by the user known as the “clickstream”), but it can be applied to any method of interaction that can be measured, such as manual operation of machinery using a joystick or control panel, or voice recognition.
Clustering is also about grouping objects together, but it differs from classification because it is used when there are no predetermined groups. Objects (or events) are clustered together due to similarities they share, and algorithms determine what the common relationship between them may be. Clustering is a data science technique which makes unsupervised learning possible.
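As a rough illustration, here is a small sketch of k-means, one well-known clustering method (the text above does not name a specific algorithm, so this choice is mine). It works on plain numbers, and the data and starting centroids are invented for the example.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Assign each number to its nearest centroid, then move every centroid
    to the mean of its group. Repeats a fixed number of times."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Index of the centroid closest to this value
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Recompute each centroid; keep the old one if its group is empty
        centroids = [
            sum(group) / len(group) if group else centroids[i]
            for i, group in enumerate(clusters)
        ]
    return centroids, clusters


values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centroids, clusters = kmeans_1d(values, centroids=[0.0, 5.0])
print(clusters)
```

Notice that no group labels are supplied: the two groups emerge from how close the numbers are to each other.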
Rules which establish how data should be used, in order to both comply with legislation and ensure the integrity of data and data-driven initiatives.
The process of examining a set of data to determine relationships between variables which could affect outcomes – generally at large scale and by machines. Data mining is an older term used by computer scientists and in business to describe the basic function of a data scientist or a data science initiative.
The entire collection of data that will be used in a particular data science initiative. In modern complex Big Data projects, it can involve many types of data gathered from different sources.
A person who applies the scientific method of observing, recording, analysing and reporting results to understand information and use it to solve problems.
Democratization of Data Science
The idea that data science tools and techniques are increasingly accessible to a growing number of people, rather than only those in academia or industry with access to large budgets. See also Citizen Data Scientist.
A basic decision-making structure which can be used by a computer to understand and classify information. A decision tree asks a series of questions about each data item fed into it, channelling the item along different branches that lead to different outcomes, typically a label or classification for that piece of data.
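A tiny hand-built tree shows the idea. This is a sketch only: the questions, labels and dictionary layout are hypothetical, and in practice trees are usually learned from data rather than written by hand.

```python
# Each internal node asks one yes/no question; each leaf is a plain label.
tree = {
    "question": lambda item: item["legs"] > 2,
    "yes": {
        "question": lambda item: item["barks"],
        "yes": "dog",
        "no": "cat",
    },
    "no": "bird",
}


def classify(item, node):
    """Follow the branches until a leaf (a string label) is reached."""
    while isinstance(node, dict):
        branch = "yes" if node["question"](item) else "no"
        node = node[branch]
    return node


print(classify({"legs": 4, "barks": True}, tree))
```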
Data can be stored in a database which has one dimension (a list) or two dimensions (a grid made up of rows and columns). It can also be stored in multi-dimensional databases, which can take the form of a grid with three axes or, thanks to the power of modern processors, even more complex arrangements which cannot be related to everyday physical shapes. More complex dimensional structures typically allow for more connections to be observed between the data objects which are being analyzed.
A database which is held in a computer’s RAM, where it can be accessed and operated on far more quickly than if the data has to be read from a disk each time it is needed. This was very difficult to do with large data sets in the past, but has become possible in recent years thanks to the increase in the size of available memory and the fall in the cost of physical RAM chips.
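For a quick feel of the idea, Python's built-in sqlite3 module can create a database that lives entirely in memory when it is given the special name ":memory:". This is a small-scale sketch rather than one of the large in-memory platforms described above, and the table and values are invented.

```python
import sqlite3

# ":memory:" tells SQLite to keep the whole database in RAM, not on disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.5), ("a", 2.5), ("b", 10.0)],
)

# Query it like any other database
avg_a = conn.execute(
    "SELECT AVG(value) FROM readings WHERE sensor = ?", ("a",)
).fetchone()[0]
print(avg_a)

# The data disappears as soon as the connection is closed
conn.close()
```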
Data about data, or data attached to other data – for example with an image file this would be information about its size, when it was created, what camera was used to take it, or which version of a software package it was created in.
A variable where the value is very different from that which is expected considering the value of other variables in the dataset. These can be indicators of rare or unexpected events, or of unreliable data.
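One common way to flag such values, offered here as an illustration rather than the only method, is the interquartile range rule: anything more than 1.5 times the interquartile range beyond the first or third quartile is treated as suspicious. The readings below are made up.

```python
import statistics

readings = [10, 12, 11, 13, 12, 11, 95]

# Cut points that split the sorted data into four equal-sized groups
q1, _median, q3 = statistics.quantiles(readings, n=4)
iqr = q3 - q1

# Values far outside the middle half of the data are flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [r for r in readings if r < lower_fence or r > upper_fence]
print(outliers)
```

Whether a flagged value is a rare event or simply bad data is a judgement the code cannot make for you.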
Using data to predict the future. Rather than a crystal ball or tea leaves, data scientists use probability and statistics to determine what is most likely to happen next. The more data that is available from past events, the more likely it is that algorithms can give a prediction with a high probability of proving correct. Predictive modelling involves running a large number of simulated events in order to determine the variables most likely to produce a desired outcome.
Python is a programming language which has become highly popular with data scientists in recent years due to its relative ease of use and the sophisticated ways it can be used to work with large, fast-moving datasets. Because it is open source (anyone can add to it or change it), its capabilities are constantly being expanded and new resources are becoming available.
A group of objects which have been ranked according to a shared characteristic, and then distributed evenly between a number of groups. These are distinguished as “quartile” if there are four such groups, “quintile” if there are five such groups, etc. In statistics the “first quartile” conventionally refers to the lowest quarter of entries in a list which has been sorted and split into four equal groups, although in everyday use people often speak of the “top quartile” to mean the highest quarter.
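Here is a short sketch of splitting a list into quartiles with Python's standard library. The scores and the helper name quartile_of are invented for the example, and the groups are numbered from the lowest values upwards, which is the usual statistical convention.

```python
import statistics
from bisect import bisect_left

scores = list(range(1, 12))  # the numbers 1 to 11

# Three cut points divide the sorted scores into four groups
cuts = statistics.quantiles(scores, n=4)
print(cuts)


def quartile_of(value):
    """Return 1 to 4: the quartile a value falls in (1 = lowest values)."""
    return bisect_left(cuts, value) + 1


print(quartile_of(2), quartile_of(10))
```

Changing n to 5 would give the cut points for quintiles instead.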
R is another programming language, which has been around for longer than Python and was traditionally the choice of statisticians working with large data sets. Although Python is quickly gaining in popularity, R is still heavily used by data scientists and is commonly taught on data science courses at universities.
A random forest is a method of statistical analysis which involves taking the output of a large number of decision trees (see above) and analyzing them together, to provide a more complex and detailed understanding or classification of data than would be possible with just one tree. As with decision trees this is a technique that has been around in statistics for a long time but modern computers allow for far more complex trees and forests, leading to more accurate predictions.
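The "analyzing them together" step is often a simple majority vote, which can be sketched as below. This is a deliberate simplification: the three "trees" are hypothetical hand-written rules, whereas in a real random forest each tree is learned from a random sample of the training data.

```python
from collections import Counter


# Three stand-in "trees", each looking at a different feature
def tree_by_weight(fruit):
    return "apple" if fruit["weight_g"] > 100 else "cherry"


def tree_by_diameter(fruit):
    return "apple" if fruit["diameter_cm"] > 4 else "cherry"


def tree_by_stem(fruit):
    return "cherry" if fruit["long_stem"] else "apple"


forest = [tree_by_weight, tree_by_diameter, tree_by_stem]


def forest_predict(fruit):
    """Collect one vote per tree and return the most common label."""
    votes = Counter(tree(fruit) for tree in forest)
    return votes.most_common(1)[0][0]


# Two trees say "apple", one says "cherry", so the forest says "apple"
print(forest_predict({"weight_g": 150, "diameter_cm": 7, "long_stem": True}))
```

A single tree can be misled by one unusual feature; combining several makes the overall answer more robust.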
A common calculation in data science used to measure how far the values in a dataset typically lie from the average. This can be used to determine how closely a piece of data fits the norm of whatever it represents (speed of movement, temperature of a piece of machinery, population size of a developed area) and allows inferences to be made on why it differs from the norm.
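A minimal sketch using Python's statistics module shows the calculation. The numbers are invented, and the last step (dividing a value's distance from the mean by the standard deviation, often called a z-score) is my own addition to show how the figure can be used.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)
# pstdev treats the list as the whole population, not a sample of it
spread = statistics.pstdev(values)
print(mean, spread)

# How many standard deviations the value 9 sits above the average
z = (9 - mean) / spread
print(z)
```

For data that is only a sample of a larger population, statistics.stdev is normally used instead of pstdev.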