Artificial intelligence, “abandoning” real data sets?

Artificial intelligence is now applied in many aspects of daily life, such as face recognition, voice recognition, and virtual digital humans.

A common problem, however, is that training a machine learning model for a specific task (such as image classification) often requires a large amount of training data, and such datasets are not always easy to obtain.

For example, if researchers are training a computer vision model for a self-driving car, the real data may not include samples of a person and their dog running on the highway. If the car encounters such a scene, the model will not know what to do, which may lead to unwanted consequences.

Moreover, building a dataset from real data can cost millions of dollars.

Additionally, even the best datasets often contain biases that negatively impact model performance.

So, given how expensive it is to obtain and use a dataset, is it possible to train on artificially synthesized data while maintaining model performance?

Recently, a research team at the Massachusetts Institute of Technology (MIT) showed that an image classification model trained on synthetic data can perform comparably to, or even better than, a model trained on real data.

The related research paper is titled "Generative models as a data source for multiview representation learning" and was published as a conference paper at ICLR 2022.

No worse than real data

The synthetic data comes from a special kind of machine learning model called a generative model. Compared with a dataset, a generative model requires much less memory to store or share. It also avoids some privacy and usage-rights concerns, and it can be free of some of the biases, such as racial or gender bias, found in traditional datasets.

According to the paper, during training the generative model is first given millions of images containing specific objects (such as cars or cats), then learns what cars or cats look like, and finally generates similar objects.

In simple terms, the researchers used a pre-trained generative model to output a large stream of unique, realistic images based on the images in the model training dataset.
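The idea of treating a generative model as a stream of training samples can be sketched roughly as follows. This is a toy illustration, not the researchers' code: `ToyGenerator` and `sample_stream` are hypothetical names, the "images" are short lists of numbers, and the sketch assumes the generator takes a random latent vector as input.

```python
import random


class ToyGenerator:
    """Hypothetical stand-in for a pre-trained generative model.

    A real generator would output an image; this toy version outputs a short
    list of floats so the example runs with the standard library only.
    """

    def __init__(self, latent_dim=4, output_dim=8, seed=0):
        rng = random.Random(seed)
        self.latent_dim = latent_dim
        # Fixed weights play the role of what the model learned from real data.
        self.weights = [
            [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
            for _ in range(output_dim)
        ]

    def generate(self, z):
        # Deterministic mapping: the same latent vector gives the same sample.
        return [sum(w * zi for w, zi in zip(row, z)) for row in self.weights]


def sample_stream(generator, seed=0):
    """Yield an endless stream of synthetic samples from random latent vectors."""
    rng = random.Random(seed)
    while True:
        z = [rng.gauss(0.0, 1.0) for _ in range(generator.latent_dim)]
        yield generator.generate(z)


generator = ToyGenerator()
stream = sample_stream(generator)
batch = [next(stream) for _ in range(5)]  # draw as many samples as needed
```

Because the stream never ends, a training loop could keep drawing fresh samples instead of cycling through a fixed dataset.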

The researchers say that once a generative model is trained on real data, it can generate synthetic data that is almost indistinguishable from real data.

In addition, a generative model can extrapolate beyond the data it was trained on.

If a generative model is trained on images of cars, it can “imagine” what a car looks like in different situations and then output images of cars with different colors, sizes, and states.
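One way to picture this "imagining" is as small changes to the generator's input. The sketch below assumes the generator is driven by a latent vector, which the article does not state, and uses a hypothetical `toy_generator` whose output is three numbers rather than an image. Slightly shifting the latent vector yields several related but distinct views of the same underlying object.

```python
import random


def toy_generator(z):
    # Hypothetical stand-in for a trained image generator: latent -> "image".
    return [2.0 * z[0] + z[1], z[0] - z[1], 0.5 * z[1]]


def nearby_views(z, num_views=3, scale=0.1, seed=0):
    """Create several views of one sample by perturbing its latent vector."""
    rng = random.Random(seed)
    views = []
    for _ in range(num_views):
        z_shifted = [zi + rng.gauss(0.0, scale) for zi in z]
        views.append(toy_generator(z_shifted))
    return views


anchor_latent = [1.0, -0.5]
anchor = toy_generator(anchor_latent)  # the original sample
views = nearby_views(anchor_latent)    # variations of that sample
```

The `scale` parameter controls how far each view drifts from the original; its value here is arbitrary.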

One of the many advantages of generative models is that they can theoretically create an infinite number of samples.

Based on this, the researchers tried to figure out how the number of samples affects model performance. The results showed that in some cases, a large number of unique samples does bring additional improvements.
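The kind of experiment described here can be mimicked on a toy problem. The sketch below is not the study's setup: it assumes a hypothetical two-class generator (`sample`), a simple nearest-centroid classifier, and a held-out set drawn from the same toy distribution, which is a simplification. It trains on increasing numbers of synthetic samples and records the accuracy for each size.

```python
import random


def sample(label, rng):
    # Hypothetical generator: class 0 is centered at -1, class 1 at +1.
    center = -1.0 if label == 0 else 1.0
    return [rng.gauss(center, 1.0), rng.gauss(center, 1.0)]


def make_dataset(n_per_class, rng):
    return [(sample(label, rng), label)
            for label in (0, 1) for _ in range(n_per_class)]


def fit_centroids(data):
    # "Training" here is just averaging the points of each class.
    centroids = {}
    for label in (0, 1):
        points = [x for x, y in data if y == label]
        centroids[label] = [sum(p[i] for p in points) / len(points)
                            for i in range(2)]
    return centroids


def predict(centroids, x):
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, data):
    correct = sum(1 for x, y in data if predict(centroids, x) == y)
    return correct / len(data)


rng = random.Random(0)
test_set = make_dataset(500, rng)  # held-out evaluation samples
results = {}
for n in (2, 20, 200, 2000):       # synthetic samples per class
    train_set = make_dataset(n, rng)
    results[n] = accuracy(fit_centroids(train_set), test_set)
```

Printing `results` shows the accuracy for each training size. On a problem this simple the gains are expected to level off quickly, so it only illustrates the shape of the experiment, not the paper's findings.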

In the researchers' opinion, the coolest thing about generative models is that they can be found in online repositories and used as they are, giving good performance without any intervention in the model.

But generative models also have some drawbacks. For example, in some cases, they may reveal the source data, posing privacy risks, and if not properly audited, they may amplify biases in the datasets they were trained on.

Is Generative AI the Trend?

The scarcity of effective data and sampling bias have become key bottlenecks in the development of machine learning.

In recent years, generative AI has become one of the hot topics in artificial intelligence as a way to address this problem, and the industry has high expectations for it.

At the end of last year, Gartner released its top strategic technology trends for 2022, calling generative AI "one of the most compelling and powerful artificial intelligence technologies."

According to Gartner, generative AI is expected to account for 10% of all data produced by 2025, up from less than 1% today.

Figure: Gartner's top strategic technology trends for 2022 (Source: Gartner official website)

In 2020, generative AI was first listed as a new technology hotspot in Gartner's "Hype Cycle for Artificial Intelligence, 2020".

In the latest "Hype Cycle for Artificial Intelligence, 2021" report, generative AI appears as a technology expected to mature in 2 to 5 years.

(Source: Gartner Hype Cycle for Artificial Intelligence, 2021)

The breakthrough of generative AI is that it can learn from existing data (images, text, and so on) and generate new, similar, original data. In other words, it can not only make judgments but also create, and it can be used for automatic programming, drug development, visual arts, social interaction, business services, and more.

However, generative AI can also be abused for scams, fraud, political rumors, and forged identities. Deepfakes, for example, frequently make negative headlines.

So the question is, if we have a good enough generative model, do we still need a real dataset?

Reference links:

https://openreview.net/pdf?id=qhAeZjs7dCL

https://news.mit.edu/2022/synthetic-datasets-ai-image-classification-0315

https://www.gartner.com/en/documents/4004183

Academic headlines