Title: My Data is Not Mine: The Emergence of Data Layers
Author: 0xJeff (@Defi0xJeff)
Compiled by: Asher (@Asher_0210)
With online attention at an all-time high, data has become the digital gold of this era. Global average screen time reached 6 hours and 40 minutes per day in 2024, up from previous years; in the US the figure is even higher, at 7 hours and 3 minutes per day.
With such high engagement, the amount of data generated is staggering: counting all data newly generated, captured, replicated, or consumed, it reached approximately 0.4 ZB per day in 2024, on the order of 400 million TB (1 ZB = 1,000,000,000 TB).
However, despite the large amount of data generated and consumed daily, users own very little:
Social media: Data on platforms like X and Instagram is controlled by companies, even though it is user-generated;
IoT: Data from smart devices usually belongs to the device manufacturer or service provider, unless there are specific agreements;
Health data: While individuals have rights over their medical records, most data from health apps or wearables is controlled by the providers of these services.
Crypto and Social Data
In the crypto space, we have seen the rise of Kaito AI, which indexes social data on the X platform and converts it into actionable sentiment data for projects, KOLs, and thought leaders. The terms "yap" and "mindshare" were popularized by the Kaito team, thanks to their growth-hacking expertise (notably their popular mindshare and yapper dashboards) and their ability to attract organic interest on Crypto Twitter.
"Yap" aims to incentivize the creation of high-quality content on the X platform, but many questions remain unanswered:
How are "yaps" accurately scored?
Do mentions of Kaito earn additional "yaps"?
Is Kaito truly rewarding high-quality content, or is it more biased towards controversial popular opinions?
In addition to social data, discussions about data ownership, privacy, and transparency are becoming increasingly heated. With the rapid development of AI, new questions arise: Who owns the data used to train AI models? Who can benefit from the results generated by AI? These questions pave the way for the emergence of Web3 data layers - a step towards a decentralized, user-driven data ecosystem.
The Emergence of Data Layers
In the Web3 space, a growing ecosystem of data layers, protocols, and infrastructure is forming, aiming to achieve personal data sovereignty, empower individuals to better control their data, and provide monetization opportunities.
Vana
Vana's core mission is to give users control over their data, especially in the context of AI, where data is invaluable for training models. Vana has launched DataDAOs: community-driven entities where users pool their data in pursuit of shared goals. Each DataDAO focuses on a specific dataset:
r/datadao: Focused on Reddit user data, enabling users to control and monetize their contributions;
Volara: Handles X platform data, allowing users to benefit from their social media activities;
DNA DAO: Aims to manage genetic data with a focus on privacy and ownership.
Vana tokenizes data through Data Liquidity Pools (DLPs). Each DLP aggregates data from a specific domain, and users can stake tokens into these pools to earn rewards, with top pools rewarded based on community support and data quality. Vana's standout feature is the simplicity of data contribution: users select a DataDAO, add their data via direct API integration or manual upload, and earn DataDAO tokens and VANA tokens as rewards.
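To make the flow concrete, here is a minimal sketch of what a contribution might look like, assuming a hypothetical REST endpoint; each Vana DataDAO defines its own integration, so treat the names below as illustrative only.

```python
# Hypothetical DataDAO contribution flow (endpoint and fields are assumptions,
# not Vana's actual API; each DataDAO ships its own integration).
import requests

DATADAO_API = "https://api.example-datadao.org"  # placeholder endpoint

def contribute_data(wallet_address: str, data_file: str) -> dict:
    """Upload a data export to a DataDAO and return the recorded contribution."""
    with open(data_file, "rb") as f:
        resp = requests.post(
            f"{DATADAO_API}/contributions",
            files={"data": f},
            data={"wallet": wallet_address},
            timeout=30,
        )
    resp.raise_for_status()
    return resp.json()  # e.g. {"contribution_id": ..., "reward_estimate": ...}

if __name__ == "__main__":
    print(contribute_data("0xYourWallet", "reddit_export.zip"))
```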
Ocean Protocol
Ocean Protocol is a decentralized data marketplace that allows data providers to share, sell, or license their data, while consumers can access this data for AI and research purposes. Ocean Protocol uses "datatokens" (ERC-20 tokens) to represent access rights to datasets, allowing data providers to monetize their data while maintaining control over access conditions.
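Because datatokens are standard ERC-20 tokens, the on-chain side of an access check reduces to a token balance query. The sketch below uses web3.py with placeholder RPC and contract addresses; Ocean's own SDK (ocean.py) wraps this along with the ordering and download steps.

```python
# Checking whether a consumer holds a datatoken (a plain ERC-20 balance query).
# RPC URL and token address are placeholders.
from web3 import Web3

# Minimal ERC-20 ABI: balanceOf is all an access check needs.
ERC20_ABI = [{
    "name": "balanceOf",
    "type": "function",
    "stateMutability": "view",
    "inputs": [{"name": "owner", "type": "address"}],
    "outputs": [{"name": "", "type": "uint256"}],
}]

w3 = Web3(Web3.HTTPProvider("https://rpc.example.org"))  # placeholder RPC
DATATOKEN = "0x0000000000000000000000000000000000000000"  # replace with a real datatoken
token = w3.eth.contract(address=Web3.to_checksum_address(DATATOKEN), abi=ERC20_ABI)

def has_access(consumer: str) -> bool:
    # By convention, one dataset access order costs 1 datatoken (18 decimals).
    balance = token.functions.balanceOf(Web3.to_checksum_address(consumer)).call()
    return balance >= 10**18
```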
The types of data traded on the Ocean Protocol include:
Public data, such as open data sets like weather information, public demographics, or historical stock data, which are valuable for AI training and research;
Private data, including medical records, financial transactions, IoT sensor data, or personalized user data, which require strict privacy controls.
Compute-to-Data is another key feature of Ocean Protocol, allowing computations to be performed on data without moving the data, ensuring the privacy and security of sensitive data sets.
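Conceptually, Compute-to-Data inverts the usual flow: the consumer's algorithm travels to the provider's environment, runs next to the data, and only aggregate results come back. The sketch below illustrates that contract with invented endpoint names; it is not Ocean's actual compute API.

```python
# Conceptual Compute-to-Data flow: raw data never leaves the provider.
# Endpoint names are illustrative, not Ocean's actual SDK.
import requests

PROVIDER = "https://provider.example.org"  # data provider's compute endpoint

def run_compute_job(dataset_did: str, algorithm_code: str) -> str:
    """Submit an algorithm to run where the data lives; returns a job id."""
    resp = requests.post(
        f"{PROVIDER}/compute",
        json={"dataset": dataset_did, "algorithm": algorithm_code},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["job_id"]

def fetch_results(job_id: str) -> dict:
    """Only aggregate outputs are returned, never the underlying rows."""
    return requests.get(f"{PROVIDER}/compute/{job_id}/result", timeout=30).json()
```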
Masa
Masa focuses on creating an open layer for AI training data, providing real-time, high-quality, and low-cost data for AI agents and developers.
Masa has launched two subnets on the Bittensor network:
Subnet 42 (SN42): Aggregates and processes millions of data records daily, providing a foundation for AI agents and application development;
Subnet 59 (SN59) - "AI Agent Arena": A competitive environment where AI agents leverage real-time data from SN42 and compete for TAO emissions based on metrics like mindshare, user engagement, and self-improvement.
Additionally, Masa collaborates with Virtuals Protocol to provide real-time data capabilities to Virtuals Protocol agents, and has launched the TAOCAT token (currently on Binance Alpha) to showcase these capabilities.
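From an agent's perspective, consuming such a feed might look like the sketch below; the endpoint and response shape are assumptions for illustration, not Masa's actual subnet API.

```python
# Illustrative client for a Masa-style real-time data feed (hypothetical API).
import requests

DATA_API = "https://data.example-masa.ai"  # placeholder endpoint

def fetch_recent_posts(query: str, limit: int = 50) -> list[dict]:
    """Pull fresh social posts matching a query for an agent's context window."""
    resp = requests.get(
        f"{DATA_API}/search",
        params={"q": query, "limit": limit},
        timeout=15,
    )
    resp.raise_for_status()
    return resp.json()["results"]

# An SN59 agent might refresh its view of the market before each action:
posts = fetch_recent_posts("TAO sentiment")
```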
OpenLedger
OpenLedger is building a blockchain specifically designed for data, particularly for AI and machine learning applications, ensuring secure, decentralized, and verifiable data management, with highlights including:
Datanets: An internal network of curated data sources within OpenLedger, enriching and providing real-world data for AI applications;
SLMs: Specialized language models customized for specific industries or applications. The idea is to provide models that are not only more accurate in niche use cases but also adhere to privacy requirements and are less prone to the biases found in general-purpose models;
Data Validation: Ensuring the accuracy and reliability of the data used to train SLMs, so the resulting models are dependable for their target use cases (a minimal validation sketch follows this list).
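The sketch below shows the kind of gate such validation implies: records are filtered for basic quality and deduplicated before entering a training set. The specific rules here are generic assumptions, not OpenLedger's actual pipeline.

```python
# Generic pre-training validation gate: field checks plus exact-duplicate removal.
# Thresholds and rules are illustrative assumptions.
import hashlib

def validate_records(records: list[dict]) -> list[dict]:
    seen: set[str] = set()
    clean: list[dict] = []
    for rec in records:
        text = (rec.get("text") or "").strip()
        if not 20 <= len(text) <= 10_000:  # drop empty or oversized samples
            continue
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:                 # drop exact duplicates
            continue
        seen.add(digest)
        clean.append({**rec, "text": text})
    return clean
```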
AI Training's Demand for Data
Demand for high-quality data is surging as it powers the development of AI and autonomous agents. Beyond initial training, AI agents also require real-time data for continuous learning and adaptation. The key challenges and opportunities are:
Quality over Quantity: AI models need high-quality, diverse, and relevant data to avoid bias and poor performance;
Data Sovereignty and Privacy: Projects like Vana are driving the monetization of user-owned data, which may reshape how AI training data is acquired;
Synthetic Data: With rising privacy concerns, synthetic data is gaining attention as a way to train AI models while mitigating ethical issues;
Data Markets: The rise of data markets, both centralized and decentralized, is creating an economy in which data is a tradable asset;
AI in Data Management: AI itself is now used to manage, clean, and enhance datasets, improving their quality for AI training (a sketch of model-assisted cleaning follows this list).
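As an example of that last point, a small embedding model can catch near-duplicates that exact-match deduplication misses. The sketch below uses the sentence-transformers library with a common public checkpoint; the similarity threshold is an illustrative choice.

```python
# Model-assisted corpus cleaning: embed samples, drop near-duplicates.
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_dedup(texts: list[str], threshold: float = 0.95) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(texts, normalize_embeddings=True)  # unit vectors
    keep: list[int] = []
    for i in range(len(texts)):
        # Cosine similarity vs. kept samples (dot product of unit vectors).
        if keep and np.max(emb[keep] @ emb[i]) >= threshold:
            continue  # near-duplicate of an already-kept sample
        keep.append(i)
    return [texts[i] for i in keep]
```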
As AI agents become more autonomous, their ability to access and process real-time, high-quality data will directly impact their performance. This increasing demand has given rise to data markets specifically designed for AI agents, where both AI agents and humans can access high-quality data.
Web3 Agent Data Markets
Cookie DAO aggregates social sentiment data on AI agents, along with token-related information, and converts it into insights that both humans and AI agents can act on. The Cookie DataSwarm API gives AI agents access to real-time, high-quality data for trading-related insights, one of the most common applications in the crypto space. With 200,000 monthly active users and 20,000 daily active users, Cookie is one of the largest AI agent data markets, with the COOKIE token at its core.
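A consumer-side call might look like the sketch below; the endpoint path, headers, and response fields are assumptions for illustration, so consult Cookie's API documentation for the real interface.

```python
# Illustrative query against a Cookie-style DataSwarm API (paths are assumed).
import requests

COOKIE_API = "https://api.cookie.fun"  # base URL; path below is hypothetical
API_KEY = "your-api-key"

def get_agent_mindshare(agent_name: str) -> dict:
    """Fetch mindshare/sentiment metrics for a named agent."""
    resp = requests.get(
        f"{COOKIE_API}/v1/agents/{agent_name}",  # hypothetical path
        headers={"x-api-key": API_KEY},
        timeout=15,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. {"mindshare": ..., "sentiment": ...}
```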
Finally, other notable projects in this space include:
GoatIndex.ai, which focuses on insights into the Solana ecosystem;
Decentralised.Co, which focuses on niche data dashboards, such as GitHub and project-specific analytics.