SN 33: Contributing high-quality datasets to open source AI


Data is the oil of the AI era: AI models cannot improve without large, high-quality datasets. Open-source AI models, however, are often held back by a shortage of such data. To cut data collection costs, closed-source AI developers rely on large numbers of workers who perform intensive cognitive labor for less than $2 per hour. The benefits of the resulting models are concentrated in the hands of a few, which deepens the inequality between those who profit and those who contribute the data.

In the Bittensor ecosystem, Subnet 33 is committed to solving this shortage of high-quality datasets. How does SN 33 operate, and how has it performed so far?

Subnet 33 ReadyAI

Emission: 2.51% (2024-10-13)

Github: https://github.com/afterpartyai/bittensor-conversation-genome-project

Team: The team behind SN33 comes from Afterparty AI, a startup founded in 2021 that received $5 million in funding from Blockchange Ventures in September 2023.

Staked $TAO amount by Root Network validators on SN 33 (Amount = Validator's total staked * Validator's weight on SN 33)
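
The formula in the caption above is a simple product. As a minimal sketch, assuming the weight is expressed as a fraction between 0 and 1, it can be computed as follows. The validator names and figures are hypothetical placeholders, not real on-chain data.

```python
def stake_on_subnet(total_stake_tao: float, subnet_weight: float) -> float:
    """Amount = validator's total stake * validator's weight on the subnet."""
    if not 0.0 <= subnet_weight <= 1.0:
        raise ValueError("subnet_weight must be a fraction between 0 and 1")
    return total_stake_tao * subnet_weight


# Hypothetical validators: (total stake in TAO, weight assigned to SN 33).
validators = {
    "validator_a": (1_000_000.0, 0.03),
    "validator_b": (500_000.0, 0.02),
}

amounts = {
    name: stake_on_subnet(total, weight)
    for name, (total, weight) in validators.items()
}

for name, amount in amounts.items():
    print(f"{name}: {amount:,.0f} TAO staked on SN 33")
```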

The Goal

SN33 aims to give individuals and enterprises a low-cost, resource-light process for structuring data and labeling it semantically. To that end, SN33 focuses on annotating and structuring text data, turning large volumes of raw conversation data into structured data that AI applications can use.

The Execution

SN33 combines fractal data mining with Bittensor's Validator-Miner architecture to produce a more complete and reliable structured dataset.

https://github.com/afterpartyai/bittensor-conversation-genome-project?tab=readme-ov-file#introduction-to-readyai

The specific process includes:

1. Validator:

  • Pulls a segment of raw conversation data to be annotated from a data store it has configured itself or from the CGP API
  • Annotates the full raw conversation
  • Splits the raw conversation into multiple overlapping short windows and distributes them to Miners

2. Miner:

  • Uses LLMs to process each short window, generating semantic labels, participant profiles, and a vector embedding for each semantic label
  • Pushes the resulting metadata back to the Validator

3. Validator:

  • Uses its own annotation of the full conversation as the ground truth and scores the Miners' output against it
  • Pushes all metadata back to the data store or the CGP API
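
The windowing and scoring steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the subnet's actual code: the window size, the overlap, the function names, and the use of Jaccard set overlap as the score are all hypothetical stand-ins for whatever the ReadyAI repository really does.

```python
from typing import List, Set


def split_into_windows(lines: List[str], size: int = 4, overlap: int = 2) -> List[List[str]]:
    """Split a conversation into overlapping windows of up to `size` lines.

    `size` and `overlap` are illustrative defaults, not values from SN 33.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(lines), step):
        windows.append(lines[start:start + size])
        if start + size >= len(lines):
            break  # the last window already reaches the end of the conversation
    return windows


def score_miner(miner_tags: Set[str], ground_truth_tags: Set[str]) -> float:
    """Score miner tags against the validator's tags with Jaccard overlap.

    Jaccard is an assumed stand-in metric chosen for simplicity.
    """
    if not miner_tags and not ground_truth_tags:
        return 1.0
    return len(miner_tags & ground_truth_tags) / len(miner_tags | ground_truth_tags)


conversation = [f"turn {i}" for i in range(6)]
windows = split_into_windows(conversation)

# Hypothetical tags: the validator's ground truth and one miner's answer.
ground_truth = {"travel", "food", "budget"}
miner_answer = {"travel", "food", "weather"}
score = score_miner(miner_answer, ground_truth)
print(len(windows), score)
```

Because adjacent windows share lines, each part of the conversation is seen by more than one Miner, which is what makes the later comparison against the Validator's annotation meaningful.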

This approach not only improves the efficiency of data processing, but also enhances the robustness of the data through cross-validation, preventing a single error or inaccurate result from having a significant impact on the overall dataset.
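
To picture how cross-checking dampens a single bad result, consider the following sketch. It is purely illustrative: the majority-style vote, the `min_votes` threshold, and the function name are assumptions made for this example and are not taken from ReadyAI's implementation.

```python
from collections import Counter
from typing import Iterable, Set


def consensus_tags(miner_outputs: Iterable[Set[str]], min_votes: int = 2) -> Set[str]:
    """Keep only the tags reported by at least `min_votes` miners.

    A tag produced by a single miner (for example a hallucinated label)
    is dropped, so one faulty output cannot pollute the final dataset.
    """
    votes = Counter(tag for tags in miner_outputs for tag in set(tags))
    return {tag for tag, count in votes.items() if count >= min_votes}


# Hypothetical outputs from three miners that processed overlapping windows.
outputs = [
    {"travel", "food"},
    {"travel", "budget"},
    {"travel", "food", "spam"},
]

agreed = consensus_tags(outputs)
print(sorted(agreed))
```

Here the labels reported by only one miner are discarded, while the labels that several miners agree on survive.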

The Product

ReadyAI is a tool platform built on SN33 and aimed at AI application developers. Through ReadyAI's services, AI developers can turn the raw data they want to use into structured data and thereby improve their product experience.

https://conversations.xyz/

For example, the website provides a Demo for the Docs Wizards scenario, where users can directly chat with the AI avatar of the Afterparty CEO to learn about SN33.

Super Dave AI Chat

In addition, for more diverse scenarios, the Personas API lets AI developers build custom chatbots that meet their own needs.

An example of Personas API

The Update

On September 12, 2024, ReadyAI announced new results, claiming that the top Miners on SN 33 annotate data far better than human annotators on Mechanical Turk (MTurk), Amazon's crowdsourcing platform, and even outperform GPT-4o, at significantly lower cost.

This experiment selected 1270 dialogue samples, used the models of the Top 5 Miners on SN 33 for annotation, and compared the performance with MTurk workers and GPT-4o. The results show that the annotation accuracy of the Miners is 71% higher than MTurk and 37% higher than GPT-4o. Moreover, the annotation cost of the Miners is far lower than manual labor, about 1/660 of MTurk.

According to ReadyAI, this experiment further supports the competitive advantage of LLMs in data annotation tasks and indicates that the output of SN 33 outperforms GPT-4o on this task.

The Conclusion

High-quality datasets are an indispensable part of AI model training and fine-tuning. SN 33 provides high-quality, customizable datasets at low cost, which is very valuable for the development of open-source AI models. Especially for small and medium-sized enterprises, this affordable annotation solution can help them acquire quality structured data at a lower cost, thereby promoting AI applications and automation, and enhancing their competitiveness. This innovation allows more companies to participate in the development of AI and benefit from it.

Medium
Disclaimer: The content above is only the author's opinion which does not represent any position of Followin, and is not intended as, and shall not be understood or construed as, investment advice from Followin.