Synthetic Data: Solution or Risk for the AI Industry?


The field of artificial intelligence is increasingly using synthetic data, but is this a sustainable path?

Many websites now block the data-scraping tools that AI companies use. According to Epoch AI, if this trend continues, the supply of AI training data may be exhausted between 2026 and 2032. In response, artificial intelligence (AI) companies such as Anthropic, Meta and OpenAI have started using synthetic data to train models such as Claude 3.5 Sonnet, Llama 3.1 and Orion.

This not only helps reduce the cost and time of data collection, but also expands the ability to create rich datasets without relying on real data. Synthetic data plays a crucial role in training AI, especially in data labeling, a key factor that helps models recognize and predict more accurately.

The synthetic data market is expected to reach $2.34 billion by 2030, and Gartner predicts that 60% of the data used for AI and analytics this year will be synthetically generated. However, over-reliance on synthetic data also brings challenges in terms of data quality and diversity.

Research from Rice University and Stanford shows that AI models can gradually lose quality and diversity if they rely solely on synthetic data. The AI industry also faces a data bias problem, because synthetic data may reflect the limitations and biases of the original data it was generated from. Models trained on flawed data produce more flawed data, and that output in turn degrades the next generation of models, forming a self-reinforcing loop.

The recurring problem of reusing AI-generated data for training. Source: Ilia Shumailov et al.

Luca Soldaini, a senior research scientist at the Allen Institute for AI, argues that "raw" synthetic data is not reliable. Using it safely requires careful review, curation and filtering, ideally in combination with new real-world data.
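
The feedback loop and the value of mixing in fresh real data can be illustrated with a toy simulation. The sketch below is not the method used in the research cited above: it assumes the "model" is just a Gaussian fitted to a small dataset, and the function name simulate and all of its parameters are hypothetical. Each generation fits the model to the current data, then builds the next dataset from the model's own samples plus an optional share of new real data.

```python
import random
import statistics


def simulate(n_samples=20, generations=500, real_fraction=0.0, seed=0):
    """Return the fitted standard deviation at each generation.

    Toy assumption: the "model" is a Gaussian fitted by maximum likelihood,
    and "real" data always comes from a standard normal distribution.
    """
    rng = random.Random(seed)
    n_real = round(real_fraction * n_samples)

    # Generation 0 is trained on real data only.
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    stds = []
    for _ in range(generations + 1):
        # "Train" the model: estimate mean and spread from the current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        stds.append(sigma)

        # Build the next training set: some fresh real data (possibly none),
        # with the remainder sampled from the model that was just fitted.
        fresh_real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples - n_real)]
        data = fresh_real + synthetic
    return stds


if __name__ == "__main__":
    pure = simulate(real_fraction=0.0)
    mixed = simulate(real_fraction=0.5)
    print(f"synthetic only: std {pure[0]:.3f} -> {pure[-1]:.3e}")
    print(f"half real data: std {mixed[0]:.3f} -> {mixed[-1]:.3f}")
```

In this toy setting, the spread of the synthetic-only data shrinks toward zero over many generations, a simple analogue of lost diversity, while the run that keeps receiving real data stays close to the original distribution. Real language models are far more complex, so the sketch only conveys the intuition.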

To fully leverage the benefits of synthetic data, the technology industry needs to keep researching and developing methods that ensure data quality, while also addressing workforce issues, in order to achieve sustainable development.

Although OpenAI CEO Sam Altman once predicted that AI would eventually generate synthetic data good enough to train itself, this technology has not yet emerged. The industry needs to weigh the benefits against the risks to realize the potential of AI in the future.
