Big Tech’s Data Addiction Is Breaking AI

Meta’s Llama 4 launched to high expectations. Instead, it disappointed: compared with its predecessor, it delivered weaker reasoning, more hallucinations, and diminished overall performance. According to D-GN CEO Johanna Cabildo, the reason wasn’t a lack of compute or innovation; it was data.

Having exhausted the internet’s supply of clean, diverse, high-quality text, Meta turned to synthetic data: AI-generated content used to train newer AI. This creates a feedback loop, often called model collapse, in which models learn from their own outputs and lose accuracy and depth with each cycle.
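
This degradation is easy to reproduce in miniature. The toy Python simulation below is an illustrative sketch, not Meta’s pipeline: it fits a simple Gaussian “model” to a dataset, then trains each new generation purely on samples drawn from the previous model. Over repeated cycles the fitted distribution narrows, and the tails, where rare and distinctive knowledge lives, disappear.

```python
import numpy as np

# Toy model-collapse simulation: fit a Gaussian to data, replace the data
# with samples from the fitted model, and repeat. With a finite sample, the
# maximum-likelihood variance estimate shrinks by (n-1)/n on average each
# generation, so the distribution narrows over time. This is a stylized
# illustration of the recursive-training loop, not an LLM.

rng = np.random.default_rng(0)
n = 100                          # contributions per generation (small on purpose)
data = rng.normal(0.0, 1.0, n)   # generation 0: "real" data, std = 1.0

for gen in range(501):
    mu, sigma = data.mean(), data.std()   # "train" on the current corpus
    if gen % 100 == 0:
        print(f"generation {gen:3d}: std = {sigma:.3f}")
    data = rng.normal(mu, sigma, n)       # next corpus is purely synthetic
```

In a typical run the standard deviation drifts from 1.0 toward zero: the model keeps producing fluent-looking samples, but they cover less and less of the original distribution.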

Other major players—OpenAI, Google, Anthropic—face the same dilemma. The age of abundant, real-world training data has ended. What’s left is synthetic filler. As a result, progress is stalling, and the illusion of advancement is masking a quiet decline.

Who Owns the Data?

The 2024 Stanford AI Index reported that eight companies now control 89% of global AI training data and infrastructure. This isn’t just about market power. It affects what knowledge is embedded in AI and whose perspectives are excluded.

Models trained on biased or narrow datasets can reinforce real-world harm. Diagnostic tools trained on American healthcare records misdiagnose patients in other countries. Hiring systems penalize applicants with non-Western names. Facial recognition is less accurate on darker skin, particularly for women. Content-moderation filters flag minority dialects as offensive or irrelevant.

As models lean more heavily on synthetic data, these errors compound. Researchers warn of recursive loops that produce “polished nonsense”: text that sounds correct but contains fabricated facts. In early 2025, the Columbia Journalism Review found that Google’s Gemini provided fully accurate citations only 10% of the time. The more these systems train on their own flawed outputs, the faster they decay.

Locked In, Locked Out

AI companies built their models on a foundation of publicly available knowledge: books, Wikipedia, forums, and even news articles. Now the same firms are walling off their models and monetizing access.

In late 2023, The New York Times sued OpenAI and Microsoft over unauthorized use of its content. Meanwhile, Reddit and Stack Overflow struck paid licensing deals, giving OpenAI access to user-generated content that had previously been open to all.

This strategy is clear: harvest free public knowledge, monetize it, and lock it behind APIs. The same companies that benefitted from open ecosystems now restrict access while promoting synthetic data as a sustainable alternative—despite the mounting evidence that it degrades model performance. AI can’t evolve by learning from itself. There’s no insight in a mirror.

A Different Path

Fixing AI’s data crisis doesn’t require more compute or bigger models—it requires a shift in how data is collected, valued, and governed.

Web3 technologies offer one possible way forward. Blockchain can record where data comes from, and tokenized systems can compensate the people who contribute their knowledge. Projects like Morpheus Labs have used these tools to improve Swahili-language AI performance by 30%, simply by incentivizing community input.
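
To make the provenance half of that idea concrete, here is a minimal, hypothetical sketch in plain Python: no real blockchain, and the contributor names are invented. Each contribution is hashed and chained to the previous record, so anyone can verify that the corpus and its attributions haven’t been silently altered, which is the property that makes transparent compensation possible.

```python
import hashlib
import json
from dataclasses import dataclass

# Hypothetical sketch of on-chain-style data provenance: a hash-chained
# ledger of contributions. Real systems differ; this only illustrates the
# core idea of tamper-evident attribution.

@dataclass
class Record:
    contributor: str   # who supplied the data (a payable identity)
    content_hash: str  # hash of the contributed text, not the text itself
    prev_hash: str     # link to the previous record in the chain

    def digest(self) -> str:
        payload = json.dumps(self.__dict__, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

ledger: list[Record] = []

def contribute(contributor: str, text: str) -> Record:
    """Append a contribution; its hash commits to all prior records."""
    prev = ledger[-1].digest() if ledger else "genesis"
    rec = Record(contributor, hashlib.sha256(text.encode()).hexdigest(), prev)
    ledger.append(rec)
    return rec

def verify(chain: list[Record]) -> bool:
    """Recompute the chain; any edited record breaks every later link."""
    prev = "genesis"
    for rec in chain:
        if rec.prev_hash != prev:
            return False
        prev = rec.digest()
    return True

contribute("asha@example", "Habari ya leo ...")   # a Swahili text snippet
contribute("juma@example", "Mambo vipi ...")
print(verify(ledger))          # True
ledger[0].contributor = "bot"  # tamper with the attribution
print(verify(ledger))          # False: the chain no longer validates
```

A real deployment would add signatures, token payouts, and on-chain anchoring; the point here is only that tamper-evident attribution is a small, well-understood primitive.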

Privacy-preserving techniques like zero-knowledge proofs add another layer of trust. They make it possible to verify and use sensitive information, such as medical records, in training pipelines without exposing the raw data, so models can learn ethically while still delivering high performance.

These ideas aren’t speculative. Startups are already using decentralized tools to build culturally accurate, privacy-respecting AI systems around the world.

Reclaiming the Future

AI is shaping the systems that shape society—education, medicine, work, and communication. The central question is no longer whether AI will dominate but who controls what it becomes.

Will we allow a handful of companies to recycle their own outputs, degrade model quality, and entrench bias? Or will we invest in building a new kind of data ecosystem—one that values transparency, fairness, and shared ownership?

The problem is not that machines don’t have enough data. The problem is that the data they’re using is increasingly synthetic, narrow, and controlled. The solution is to return power to the people who create meaningful content—and reward them for it. Better AI starts with better data. And better data starts with us.
