You probably know that the new generation of generative AI tools that have exploded onto the scene can generate words, pictures and even videos that closely resemble those created by humans. But did you know that they can also be used to generate data itself?
Modern artificial intelligence (AI) works by recognizing patterns in data and using them to answer questions or predict what comes next. Generative AI like OpenAI’s ChatGPT uses those patterns to create more data that follows the rules of the data it was trained on.
But real data comes with complications – it can be difficult and expensive to collect and brings security and privacy obligations.
Think about a dataset comprising thousands of human faces, for example – as used to train facial recognition algorithms. You have to find and photograph thousands of people and then get their permission to store and use their data. Then, myriad checks and balances must be followed to ensure your data isn't harmfully biased.
One solution is synthetic data. This is data that is created by machines, closely resembles real-world data, and can be used for many of the same purposes.
Snowflake is one of the world's biggest "data-as-a-service" companies. In addition to its analytics services, it offers a data marketplace covering thousands of topics, including healthcare, finance and retail.
Now, it’s augmenting these offerings with synthetic, AI-generated datasets and putting generative AI to use in several other interesting applications. Let's take a look!
First, What Is Synthetic Data?
Synthetic data is information that has been artificially generated in order to have the same characteristics as a real-world dataset but without including any real-world data.
Generative AI is particularly suited to this task as it can learn the patterns in a dataset and then create synthetic data that closely matches it. This means businesses can train AI algorithms and perform tests and simulations without exposing private or sensitive information that might be contained in real-world data.
Synthetic data is used in finance to train fraud detection algorithms to spot deliberately falsified transactions, in healthcare to avoid using sensitive patient data, and in retail and marketing to create synthetic customers and analyze their buying behavior.
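To make the "learn the patterns, then generate" idea concrete, here is a minimal sketch in Python. It is not how Snowflake or any of its partners generate data: a simple per-column Gaussian stands in for a real generative model, and the records, column meanings and function names are all hypothetical.

```python
import random
import statistics

def fit_column_stats(rows):
    """Learn the mean and standard deviation of each numeric column."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]

def sample_synthetic(stats, n, seed=0):
    """Draw n new rows from independent Gaussians fitted to each column."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sigma) for mu, sigma in stats] for _ in range(n)]

# Hypothetical "real" records: (age, annual income in thousands).
real_rows = [(34, 52.0), (45, 61.5), (29, 48.0), (52, 75.0), (41, 58.5)]

stats = fit_column_stats(real_rows)               # step 1: analyze the real data
synthetic_rows = sample_synthetic(stats, n=1000)  # step 2: generate look-alike rows
```

The synthetic rows share the averages and spread of the originals, but none of them is a copy of a real record. Because each column is sampled independently, this toy version loses any relationship between columns (older customers earning more, say), which is exactly the kind of structure a real generative model is meant to preserve.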
According to Gartner research, business leaders are most likely to turn to synthetic data because of difficulties with accessibility, complexity and availability of real-world data. It also found that partially synthetic datasets – where real-world data is augmented with synthetic data – are more commonly used than fully synthetic datasets.
By generating synthetic data, companies can create any information they need to plug gaps in existing records or create entirely new datasets. It doesn’t remove the need for real-world data, which is required to create synthetic data in the first place. But when used effectively, it can reduce costs, speed up the training of machine learning models, and help businesses automate processes and make better decisions.
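Plugging gaps in existing records produces a partially synthetic dataset. The sketch below shows one simple way this could look in Python, assuming missing values are marked as None and that a Gaussian fitted to the observed values is a good enough stand-in for a real generative model. The field name and figures are made up for illustration.

```python
import random
import statistics

def fill_gaps(values, seed=0):
    """Replace None entries with draws from a Gaussian fitted to the observed values.

    Returns the filled list plus a parallel list of flags marking
    which entries are synthetic.
    """
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    rng = random.Random(seed)

    filled, flags = [], []
    for v in values:
        if v is None:
            filled.append(rng.gauss(mu, sigma))  # synthetic value
            flags.append(True)
        else:
            filled.append(v)                     # real value, kept as-is
            flags.append(False)
    return filled, flags

# Hypothetical monthly spend per customer, with two missing records.
monthly_spend = [120.0, None, 95.5, 130.25, None, 110.0]
filled, is_synthetic = fill_gaps(monthly_spend)
```

Keeping the flags alongside the data is a deliberate choice: anyone using the dataset later can still tell real values from generated ones.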
Generative Synthetic Data At Snowflake
Snowflake sells data to businesses via its Snowflake marketplace, which is one of the largest B2B data brokerages in the world.
Alongside its thousands of real-world datasets, Snowflake now offers access to synthetic datasets created by generative AI algorithms. One example is San Francisco-based Synthesis AI’s synthetic human face dataset, comprising 5,000 individual images of diverse human faces.
In the past, facial recognition algorithms have been criticized and even banned due to concerns over biases in the datasets used to train them. These biases have led to differences in the algorithms' ability to identify people of different ethnic backgrounds and to accusations that they could be unfair or prejudiced.
Using synthetic data in this way can help to tackle those problems (note – I will not say it solves them entirely) as datasets can be created in line with whatever level of representation or inclusiveness is needed.
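As a rough illustration of what "whatever level of representation is needed" can mean in practice, the Python sketch below works out how many synthetic samples would have to be added per group to reach a target mix without discarding any real data. The group labels, counts and target shares are hypothetical, and this is only the arithmetic, not the image generation itself.

```python
import math
from fractions import Fraction

def synthetic_samples_needed(counts, target_shares):
    """Return how many synthetic samples to add per group to hit the target shares.

    counts: real samples per group.
    target_shares: desired share per group as Fractions summing to 1.
    """
    # Smallest total size at which every group can reach its share
    # using only additions (no real samples are removed).
    total = max(math.ceil(Fraction(counts[g]) / target_shares[g]) for g in counts)
    return {g: math.ceil(total * target_shares[g]) - counts[g] for g in counts}

# Hypothetical face dataset that over-represents one group.
real_counts = {"group_a": 700, "group_b": 200, "group_c": 100}
equal_shares = {g: Fraction(1, 3) for g in real_counts}

to_generate = synthetic_samples_needed(real_counts, equal_shares)
print(to_generate)  # {'group_a': 0, 'group_b': 500, 'group_c': 600}
```

Here the largest group sets the bar: nothing is added to it, and the other two groups are topped up with synthetic faces until all three hold 700 samples each.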
While synthetic data existed before the emergence of generative AI, the new class of generative algorithms means that datasets can quickly be scaled to any size that's needed. Datasets created in this way can also be easily customized to fit the needs of different customers around the world.
Snowflake also offers synthetic financial data from Clearbox AI, consisting of simulated mortgage applications designed to mimic both legitimate and fraudulent applications. The data in these sets has been augmented with data created by generative AI.
Snowflake has made it clear that it expects synthetic data generated by AI to play an important role in its business going forward. As generative models such as large language models (LLMs) become more sophisticated, we will see them becoming capable of creating synthetic data that more and more accurately reflects the real world, leading to cheaper and more efficient insights for businesses.
How Else Is Generative AI Used At Snowflake?
As well as offering access to AI-generated synthetic data, Snowflake has created a number of tools based on generative AI for its customers to use.
Thanks to its acquisition this year of Neeva – a search startup founded by former employees of Google – it is implementing natural language querying of its datasets. Effectively, this will let users talk to their data, getting insights by asking straightforward questions rather than running traditional data science analysis. CEO Frank Slootman told VentureBeat, "Engaging with data through natural language is becoming popular … this will increase our opportunity to allow non-technical users to extract value from their data."
It has also launched a partnership with Nvidia, using the chip maker’s NeMo LLM to create a platform that lets Snowflake users build generative AI applications like chatbots and search engines with the ability to access Snowflake data.
Another LLM initiative is its Document AI tool, which allows users to query documents – legal contracts or invoices, for example – and extract meaning from them. This was developed with technology that Snowflake acquired when it bought the Polish natural language platform Applica in 2022.
Altogether, it's clear that Snowflake has big hopes for generative AI to create synthetic data and build tools to help us analyze and extract value from it.