Startup life under the AI data shortage: "stealing" GPT-4 output to train models has investors worried


According to a report by The Information on April 15, many chatbots developed by startups in the AI field are actually built on data and technology provided by large companies such as OpenAI. These low-cost services can imitate the performance of GPT-4 and Llama to some extent, but the practice may violate the tech giants' terms of use. Beyond that, this low-cost imitation may also threaten the market share and revenue of the AI giants themselves.

AI giants themselves are also not immune to copyright disputes, and some unauthorized use of data has caused a lot of controversy and lawsuits. Fortunately, the overall copyright awareness of the industry has changed, and OpenAI and Google have taken the lead in reaching data licensing agreements with publishers and websites.

In addition, amid today's complex market competition, investors have their own considerations. They want to see rapid progress in the AI industry, but are unwilling to back startups that cut corners in technology research and development, worrying that such rule-breaking could damage a startup's long-term sustainability and reputation.

1. A new way for AI startups: training models on GPT-4-generated content

Developers use OpenAI's most advanced model, GPT-4, as a resource to accelerate their research and development. They pose questions to the model to get insights and suggestions on specific issues (for example: "What's wrong with this line of code?") and then use the answers to improve their own models.

One founder who helps developers build conversational AI estimates that about half of his clients have generated some data from OpenAI’s GPT-4 or Anthropic’s Claude models and used that data to improve their own models.

Many developers don’t need to train models from scratch. Small-scale models are often developed based on popular open source models that are freely available, such as those from Meta or Mistral AI. These small-scale models are then significantly improved by incorporating the answers from the OpenAI model.
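The workflow described above — collecting a stronger model's answers and using them to fine-tune a smaller open-source model — is a form of knowledge distillation. Below is a minimal sketch of just the data-preparation step, with hard-coded placeholder answers standing in for real GPT-4 or Claude responses; the function name and the chat-style JSONL record format are illustrative, not any vendor's actual pipeline.

```python
import json

# Hypothetical teacher outputs: in practice these would be collected from
# an API such as GPT-4 or Claude; here they are hard-coded placeholders.
teacher_pairs = [
    {"question": "What's wrong with this line of code?",
     "answer": "The variable is used before it is assigned."},
    {"question": "How do I reverse a list in Python?",
     "answer": "Use list slicing: items[::-1]."},
]

def to_finetune_jsonl(pairs):
    """Format question/answer pairs as chat-style JSONL records,
    a common input format for supervised fine-tuning."""
    lines = []
    for p in pairs:
        record = {"messages": [
            {"role": "user", "content": p["question"]},
            {"role": "assistant", "content": p["answer"]},
        ]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

dataset = to_finetune_jsonl(teacher_pairs)
print(dataset.count("\n") + 1)  # prints 2 (number of training records)
```

The resulting JSONL file would then be fed to a fine-tuning run on an open-source base model such as Llama or Mistral.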

For some companies, the risk of violating written or unwritten rules may be worth it. In the competitive field of generative AI, obtaining high-quality data for training or refining models is crucial. Any AI startup knows that if it lacks data sources for training, it will fall behind.

Even large tech companies can’t resist the temptation of such “convenience.” According to The Times, examples include Google transcribing YouTube videos to train its AI models and Meta hiring African contractors to summarize copyrighted books for AI training. In addition, Bloomberg reported that Adobe used AI-generated images supplied by the startup Midjourney to train its image generation software Firefly.

Last year, a senior AI engineer at Google resigned in protest after expressing concerns about the company's use of OpenAI's ChatGPT data to train Google's own models, The Information reported.

But some developers are reluctant to admit to building on open-source models; once such behavior becomes public, their companies are put in an embarrassing position. Mistral AI in Paris and Zero One Everything in Beijing, for example, were forced to admit that they had used Meta's open-source model Llama 2 as the basis for their products only after the information leaked.

As more companies develop models derived from other models, the models may become indistinguishable from one another. This could erode the competitive advantage of leaders such as OpenAI, which would then have to compete on price as customers choose cheaper, more convenient models over the most advanced and expensive ones.

2. Altman relaxes restrictions on ChatGPT use; OpenAI itself previously embroiled in copyright disputes

OpenAI, like other leading AI companies such as Anthropic and Google, technically prohibits such behavior. Still, OpenAI CEO Sam Altman mentioned in a conversation with startup founders at a conference that small business founders can use OpenAI's technology to a certain extent.

While Altman’s response relieved some of the founders present, OpenAI could change its mind at any time if the practice starts to hurt its business. It’s unclear how long OpenAI, Google, Anthropic, and other large developers will allow smaller competitors to effectively copy their AI.

Still, what the startups are doing with OpenAI’s data has similarities to what OpenAI and other leading AI developers do when training their own models. OpenAI’s chief technology officer, Mira Murati, was somewhat coy in an interview last month when asked whether her colleagues used data from Google’s YouTube and Meta Platforms’ Facebook and Instagram to train Sora.

It wouldn’t be surprising if OpenAI did use the data. A recent New York Times report described how OpenAI created the speech recognition tool Whisper to transcribe YouTube videos to improve its GPT-4 model. The Information previously reported that the company secretly used YouTube data to train its previous AI models. Earlier this month, YouTube CEO Neal Mohan said he would not agree to OpenAI using YouTube videos to develop models like Sora.

This has triggered accusations from news publishers and some writers. Last December, The New York Times sued OpenAI and its largest backer, Microsoft, accusing them of illegally copying New York Times articles to train models. The lawsuit claims that OpenAI's chatbot can reproduce New York Times articles nearly verbatim.

In its response, OpenAI argued that it had sought to forge partnerships with news publishers and that its training activities were permitted under the U.S. copyright doctrine of “fair use.”

Still, both OpenAI and Google have struck multimillion-dollar licensing deals with publishers including Axel Springer, and even larger agreements with large sites like Reddit.

But not every AI developer is operating in a gray area. Jonathan Frankle, chief scientist at Databricks, said the company did not rely on the work of competitors when developing powerful open-source large-scale language models. A spokesperson for Anthropic also said the company did not use the output of other models to train its own large models.

3. Investors don’t want startups to “take shortcuts”; synthetic data may become a new source of training

Some investors are uncomfortable with companies that “cut corners” or, lacking technology of their own, build products indistinguishable from their competitors’. Investors prefer to see rapid progress in AI and research that genuinely outpaces peers.

Some companies that have raised hundreds of millions of dollars in funding still won't admit to using other AI companies' open-source models. This has exacerbated investors' dissatisfaction; they see it as a question of the company's integrity. Matt Murphy, managing director of Menlo Ventures, explained that such situations arise in a new ecosystem without a clear set of rules.

Synthetic data is an alternative: companies generate training data with their own AI models or programs instead of taking content from the internet. For example, Google and Meta say they use synthetic data to build models that solve geometry problems and generate computer code. Because the data is machine-generated, the approach avoids many of the legal issues that come with using human-created content.
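The idea can be illustrated with a toy generator (a hypothetical example, not any company's actual pipeline): training examples for a math task are produced and answered entirely in code, so no scraped or copyrighted text is involved.

```python
import random

def make_geometry_examples(n, seed=0):
    """Generate synthetic right-triangle hypotenuse problems whose
    answers are computed programmatically rather than scraped."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    examples = []
    for _ in range(n):
        a, b = rng.randint(3, 20), rng.randint(3, 20)
        c = round((a * a + b * b) ** 0.5, 4)  # Pythagorean theorem
        examples.append({
            "prompt": (f"A right triangle has legs of length {a} and {b}. "
                       "What is the length of the hypotenuse?"),
            "answer": c,
        })
    return examples

data = make_geometry_examples(3)
print(len(data))  # prints 3
```

Because every answer is derived from a formula, the labels are guaranteed correct — one reason synthetic data is attractive for math and code tasks, where outputs can be verified automatically.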

Meanwhile, dozens of AI startups are acquiring private data from industries such as health care and law firms to develop models for specific purposes.

Conclusion: the generative AI imitation controversy continues; OpenAI remains tolerant for now

Many startups are developing large AI models that likely draw on data from OpenAI and other companies, even as they try to undercut OpenAI. The practice has become an open secret in the industry, producing a competitive landscape where the technology is similar but the price is half.

While companies like OpenAI tolerate small-scale use, some startups still don't proactively disclose that they built on others' technology, arguing that admitting it could put the company at risk.

In any case, the shortage of training data for large models and the competitive pressure around it continue to grow. Synthetic data remains at an exploratory stage; how AI companies train cutting-edge models and acquire data from here is worth watching.

This article comes from the WeChat public account "Smart Things" (ID:zhidxcom), translated by Giraffe, edited by Li Shuiqing, and published by 36Kr with authorization.
