When a conversation turns to analytics or big data, the terms structured, semi-structured and unstructured might get bandied about. These are classifications of data that are now important to understand with the rapid increase of semi-structured and unstructured data today as well as the development of tools that make managing and analysing these classes of data possible. Here’s what you need to know.
Data that is the easiest to search and organise, because it is usually contained in rows and columns and its elements can be mapped into fixed pre-defined fields, is known as structured data. Think about what data you might store in an Excel spreadsheet and you have an example of structured data. Structured data can follow a data model a database designer creates – think of sales records by region, by product or by customer. In structured data, entities can be grouped together to form relations (‘customers’ that are also ‘satisfied with the service). This makes structured data easy to store, analyse and search and until recently was the only data easily usable for businesses. Today, most estimate structured data accounts for less than 20 percent of all data.
Often structured data is managed using Structured Query Language (SQL)—a programming software language developed by IBM in the 1970s for relational databases.
Structured data can be created by machines and humans. Examples of structured data include financial data such as accounting transactions, address details, demographic information, star ratings by customers, machines logs, location data from smart phones and smart devices, etc.
A much bigger percentage of all the data is our world is unstructured data. Unstructured data is data that cannot be contained in a row-column database and doesn’t have an associated data model. Think of the text of an email message. The lack of structure made unstructured data more difficult to search, manage and analyse, which is why companies have widely discarded unstructured data, until the recent proliferation of artificial intelligence and machine learning algorithms made it easier to process.
Other examples of unstructured data include photos, video and audio files, text files, social media content, satellite imagery, presentations, PDFs, open-ended survey responses, websites and call centre transcripts/recordings.
Instead of spreadsheets or relational databases, unstructured data is usually stored in data lakes, NoSQL databases, applications and data warehouses. The wealth of information in unstructured data is now accessible and can be automatically processed with artificial intelligence algorithms today. This technology has elevated unstructured data to an extremely valuable resource for organisations.
Beyond structured and unstructured data, there is a third category, which basically is a mix between both of them. The type of data defined as semi-structured data has some defining or consistent characteristics but doesn’t conform to a structure as rigid as is expected with a relational database. Therefore, there are some organisational properties such as semantic tags or metadata to make it easier to organise, but there’s still fluidity in the data. mail messages are a good example. While the actual content is unstructured, it does contain structured data such as name and email address of sender and recipient, time sent, etc. Another example is a digital photograph. The image itself is unstructured, but if the photo was taken on a smart phone, for example, it would be date and time stamped, geo tagged, and would have a device ID. Once stored, the photo could also be given tags that would provide a structure, such as ‘dog’ or ‘pet.’
A lot of what people would usually classify as unstructured data is indeed semi-structured, because it contains some classifying characteristics.
The Difference Between Structured, Unstructured, And Semi-Structured Data
To easily understand the differences between the classifications of data, let’s use this analogy to illustrate. When interviewing for a job, let’s say there are three different classifications of interviews: structured, semi-structured and unstructured.
In a structured interview, the interviewer follows a strict script that was defined by the human resources department and is followed for every candidate. Another form of interview is an unstructured interview. In an unstructured interview, it is entirely up to the interviewer to determine the questions and the order they will be asked (or even if they will be asked) for every candidate. A semi-structured interview takes elements from both structured and unstructured interview classifications. It uses the consistency and quantitative elements allowed with the structured interview but offers the freedom to customise based on the circumstances that are more in line with an unstructured interview.
So, for data, structured data is easily organizable and follows a rigid format; unstructured is complex and often qualitative information that is impossible to reduce to or organise in a relational database and semi-structured data has elements of both.