Folding@home reached a major milestone during the COVID-19 pandemic: 2.4 exaFLOPS of computing power, contributed by 2 million volunteer devices worldwide.
That was roughly fifteen times the processing power of the world's largest supercomputers at the time, allowing scientists to simulate COVID protein dynamics at scale. Their work advanced our understanding of the virus and its pathogenesis, especially early in the pandemic.

Global distribution of Folding@home users, 2021
Crowdsourcing computing resources to solve problems
Folding@home builds on a long history of volunteer computing projects that crowdsource computing resources to solve large-scale problems. The idea gained widespread attention in the late 1990s with SETI@home, a project that brought together more than 5 million volunteer computers in the search for extraterrestrial life.
The idea has since been applied to a variety of fields, including astrophysics, molecular biology, mathematics, cryptography, and gaming. In each case, the collective scale extended these projects' capabilities well beyond what any single participant could achieve, driving progress and enabling research to be conducted in a more open and collaborative way.
Can this crowdsourcing model be used for deep learning?
Many people wonder whether we can apply this crowdsourcing model to deep learning. In other words, can we train a large neural network across the crowd? Frontier model training is one of the most computationally intensive tasks in human history, and, as with the problems tackled by the @home projects, its current costs are beyond the reach of all but the largest players.
This could hinder future progress as we rely on fewer and fewer companies to find new breakthroughs. This also concentrates control of our AI systems in the hands of a few. No matter how you feel about the technology, this is a future worth watching.
Most critics dismiss the idea of decentralized training as incompatible with current training technology. However, this view is increasingly outdated. New technologies have emerged that reduce the need for communication between nodes, allowing efficient training on devices with poor network connectivity.
These technologies include DiLoCo, SWARM Parallelism, lo-fi, and Decentralized Training of Foundation Models in Heterogeneous Environments. Many of them are fault-tolerant and support heterogeneous compute. There are also new architectures designed specifically for decentralized networks, including DiPaCo and decentralized mixture-of-experts models.
We're also seeing a variety of cryptographic incentive systems begin to mature, enabling networks to coordinate resources on a global scale. These systems power applications such as digital currencies, cross-border payments, and prediction markets. Unlike earlier volunteer projects, such networks can aggregate staggering amounts of computing power, often orders of magnitude more than the largest cloud training clusters currently being planned.
Together, these elements form a new paradigm for model training. This paradigm takes advantage of the world's computing resources, including the vast fleet of edge devices that could be used if wired together. It would reduce the cost of most training workloads by introducing new competition. It could also unlock new forms of training, making model development collaborative and modular rather than siloed and monolithic.
Models could source compute and data from the public and learn on the fly. Individuals could own pieces of the models they help build. And researchers could once again share novel findings openly, without needing to monetize them to cover enormous compute budgets.
This report examines the current state of large model training and associated costs. It reviews previous decentralized computing efforts—from SETI to Folding to BOINC—for inspiration in exploring alternative paths. The report discusses the historical challenges of decentralized training and turns to recent breakthroughs that may help overcome these challenges. Finally, it summarizes future opportunities and challenges.
The current state of frontier model training
The cost of frontier model training has become prohibitive for all but the largest players. This trend is not new, but it is becoming more acute as frontier labs continue to push the limits of scaling.
According to reports, OpenAI will spend more than $3 billion on training this year. Anthropic's CEO predicts that we will begin training $10 billion models by 2025, and that $100 billion models are not far behind.

This trend leads to industry concentration, as only a few companies can afford to participate. It raises a core policy question for the future: can we accept a world in which all leading AI systems are controlled by one or two companies? It also limits the rate of progress, which is evident in the research community, where smaller labs cannot afford the compute required to scale up their experiments.
Industry leaders have mentioned this many times:
Meta’s Joe Spisak:
To really understand the capabilities of [model] architecture, you have to explore it at scale, and I think that's what's missing in the current ecosystem. If you look at academia -- there's a lot of brilliant people in academia, but they lack access to computing resources, and that becomes a problem because they have these great ideas but don't really have the tools to implement them at the required level.
Max Ryabinin from Together:
The need for expensive hardware puts a lot of pressure on the research community. Most researchers are unable to participate in large-scale neural network development because conducting the necessary experiments would be too costly for them. If we continue to grow model sizes through scaling, eventually even fewer will be able to take part in developing them.
Francois Chollet from Google:
We know that large language models (LLMs) have not yet achieved artificial general intelligence (AGI). Meanwhile, progress toward AGI has stalled. The limitations we face with large language models are exactly the same limitations we faced five years ago. We need new ideas and breakthroughs.
I think the next breakthrough is likely to come from an outside team while all the big labs are busy training ever-bigger large language models.

Some are skeptical of these concerns, arguing that hardware improvements and cloud computing capital expenditures will solve the problem.
But this seems unrealistic. Yes, by the end of this decade, new generations of Nvidia chips will deliver significantly more FLOPs, perhaps 10 times that of today's H100s, and this will likely cut the price per FLOP by 80-90%.
Likewise, total FLOP supply is expected to increase roughly 20-fold over the next decade, alongside improvements in networking and related infrastructure. All of this will improve training efficiency per dollar.

Source: SemiAnalysis AI Cloud TCO Model
At the same time, total FLOP demand will also rise significantly as labs look to scale further. If the ten-year trend in training compute holds, frontier training runs will require approximately 2e29 FLOPs by 2030. Training at this scale would require roughly 20 million H100-equivalent GPUs, based on current training run durations and utilization rates.
Assuming multiple frontier labs remain in the race, the total number of FLOPs required will be several times this figure, since the overall supply will be divided among them. EpochAI predicts we will need about 100 million H100-equivalent GPUs by then, roughly 50 times 2024 shipments. SemiAnalysis makes a similar prediction, expecting frontier training demand and GPU supply to grow roughly in step over this period.
Capacity could become even tighter for a number of reasons: if manufacturing bottlenecks delay estimated shipments, if we fail to produce enough energy to power the data centers, if we struggle to connect those energy sources to the grid, or if growing scrutiny of capital expenditure forces the industry to scale down, among other factors. At best, our current approach allows only a handful of companies to keep pushing research forward, and even that may not be enough.

Clearly, we need a different approach: one that does not require continually scaling data centers, capital expenditure, and energy consumption to find the next breakthrough, but instead efficiently uses our existing infrastructure and can flex as demand fluctuates. This would allow far more experimentation in research, since training runs would no longer need to guarantee a return on multi-billion-dollar compute budgets.
Once free of this limitation, we can move beyond the current large language model (LLM) paradigm, as many believe is necessary to achieve artificial general intelligence (AGI). To understand what this alternative might look like, we can draw inspiration from past decentralized computing practices.
Crowd Computing: A Brief History
SETI@home popularized the concept in 1999, allowing millions of participants to analyze radio signals in search of extraterrestrial intelligence. SETI collected electromagnetic data from the Arecibo telescope, divided it into batches, and distributed them to users over the Internet. Users analyzed the data during their computers' idle time and sent the results back.
No communication was required between users, and batches could be audited independently, making the workload highly parallel. At its peak, SETI@home had over 5 million participants and more processing power than the largest supercomputers of the day. It eventually shut down in March 2020, but its success inspired the volunteer computing movement that followed.
Folding@home continued the idea in 2000, using edge computing to simulate protein folding in diseases such as Alzheimer's, cancer, and Parkinson's. Volunteers donated spare cycles on their PCs to run protein simulations, helping researchers study how proteins misfold and lead to disease. At various points in its history, its computing power exceeded that of the largest supercomputers of the day, including in the late 2000s and during COVID, when it became the first distributed computing project to exceed one exaFLOPS. Since its inception, Folding researchers have published more than 200 peer-reviewed papers, each relying on volunteers' computing power.
The Berkeley Open Infrastructure for Network Computing (BOINC) generalized the idea in 2002, providing a crowdsourced computing platform for a variety of research projects. It hosted SETI@home and Folding@home, as well as new projects in fields such as astrophysics, molecular biology, mathematics, and cryptography. As of 2024, BOINC lists 30 active projects and nearly 1,000 published scientific papers produced using its computing network.
Outside of scientific research, volunteer computing has been used to train game engines for Go (LeelaZero, KataGo) and chess (Stockfish, LeelaChessZero). LeelaZero was trained from 2017 to 2021 through volunteer computing, playing over 10 million games against itself and becoming one of the strongest Go engines available today. Similarly, Stockfish has been trained continuously on a volunteer network since 2013, making it one of the most popular and powerful chess engines.
The challenges for deep learning
But can we apply this model to deep learning? Could we network edge devices around the world to create a low-cost public training cluster? Consumer hardware—from Apple laptops to Nvidia gaming graphics cards—is getting better and better at deep learning. In many cases, the performance of these devices exceeds the performance per dollar of data center graphics cards.

However, to effectively utilize these resources in a decentralized environment, we need to overcome various challenges.
First, current distributed training techniques assume frequent communication between nodes.
Current state-of-the-art models have grown so large that training must be split across thousands of GPUs. This is achieved through a variety of parallelization techniques, typically splitting the model, data set, or both at the same time across the available GPUs. This usually requires a high-bandwidth and low-latency network, otherwise nodes will sit idle, waiting for data to arrive.
For example, distributed data parallelism (DDP) splits the dataset across GPUs; each GPU trains a complete copy of the model on its shard and then shares its gradient updates to produce new model weights at each step. This requires relatively little communication overhead, since nodes only share gradient updates once per backward pass, and the collective communication can partially overlap with computation.
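As a rough illustration of this pattern, here is a minimal data-parallel training step in PyTorch; the model, loss function, optimizer, and per-rank data shard are assumed to exist, and process-group setup (e.g. via torchrun) is omitted.

```python
# Minimal sketch of a data-parallel step: every rank computes gradients on its
# own shard, then averages them with all other ranks before updating weights.
# Assumes torch.distributed is already initialized (e.g. via torchrun).
import torch
import torch.distributed as dist

def data_parallel_step(model, loss_fn, batch, optimizer):
    optimizer.zero_grad()
    loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum gradients across all ranks, then average so every replica
            # applies an identical update and stays in sync.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
    optimizer.step()
```

In practice, PyTorch's DistributedDataParallel wrapper performs these all-reduces in buckets via backward hooks, which is what allows the communication to overlap with the backward pass.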
However, this approach only works for smaller models, because it requires every GPU to hold the entire model's weights, activations, and optimizer state in memory. For example, GPT-4 reportedly required over 10TB of memory during training, while a single H100 has only 80GB.
To address this, various techniques are used to split the model itself across GPUs. For example, tensor parallelism splits the individual weight matrices within a single layer, so each GPU performs part of the computation and passes its output to the others. This reduces each GPU's memory requirements but demands constant communication between them, and therefore high-bandwidth, low-latency connections to remain efficient.
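The sketch below illustrates the idea with a column-parallel linear layer; the class and shapes are illustrative, and autograd handling of the gather is omitted for clarity.

```python
# Column-parallel linear layer: each rank stores only a slice of the weight
# matrix, computes its slice of the output, and the slices are reassembled
# with an all-gather every forward pass -- hence the need for fast links.
# Assumes torch.distributed is initialized; gradient handling is omitted.
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        world = dist.get_world_size()
        assert out_features % world == 0
        self.local_weight = torch.nn.Parameter(
            torch.randn(out_features // world, in_features) * 0.02
        )

    def forward(self, x):
        local_out = x @ self.local_weight.t()        # (batch, out / world)
        pieces = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(pieces, local_out)           # rebuild the full activation
        return torch.cat(pieces, dim=-1)
```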
Pipeline parallelism instead distributes the layers of a model across GPUs, with each GPU performing its portion of the work and passing its outputs to the next GPU in the pipeline. Although this requires less communication than tensor parallelism, "bubbles" (i.e., idle time) can occur while GPUs later in the pipeline wait for outputs from earlier ones before they can begin their work.
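A common back-of-the-envelope for the size of this bubble, assuming a GPipe-style schedule with p stages and m micro-batches, is sketched below.

```python
# GPipe-style estimate of pipeline idle time: with p stages and m micro-batches,
# roughly (p - 1) / (m + p - 1) of each step is spent waiting in the bubble.
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

# e.g. 8 pipeline stages fed with 32 micro-batches -> ~18% idle time
print(f"{bubble_fraction(8, 32):.0%}")
```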
To reduce these costs, various techniques have been developed. For example, ZeRO (Zero Redundancy Optimizer) is a memory optimization technique that trades extra communication for lower memory usage, allowing larger models to be trained on a given set of devices. ZeRO shards model parameters, gradients, and optimizer state across GPUs, but relies on heavy communication so that each device can fetch the shards it needs. It underpins popular techniques such as Fully Sharded Data Parallel (FSDP) and DeepSpeed.
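In PyTorch, wrapping a model in FSDP is enough to get this sharding behavior; the model class below is hypothetical and distributed setup is assumed.

```python
# Minimal FSDP usage sketch: parameters, gradients, and optimizer state are
# sharded across ranks and gathered on demand during compute.
# Assumes torch.distributed is initialized and each rank has a GPU.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = MyLargeTransformer().cuda()          # hypothetical model class
model = FSDP(model)                          # shards state across all ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```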
These techniques are often combined in large model training to maximize resource utilization, a configuration known as 3D parallelism. In this configuration, tensor parallelism typically distributes weights across the GPUs within a single server, because of the heavy communication required between the split layers.
Pipeline parallelism is then used to distribute layers across servers (but within the same island of the data center), since it requires less communication. Finally, data parallelism or Fully Sharded Data Parallelism (FSDP) splits the dataset across server islands, since it can tolerate longer network latencies by sharing updates asynchronously and/or compressing gradients. Meta used this combined approach to train Llama 3.1, as shown in the diagram below.
These approaches pose core challenges for decentralized training networks that rely on devices connected through the (slower and more volatile) consumer-grade Internet. In this environment, communication costs can quickly outweigh the benefits of edge computing because devices are often idle, waiting for data to arrive.
As a simple example, training a 1-billion-parameter model in half precision with distributed data parallelism requires each GPU to share 2GB of gradient data at every optimization step. Over a typical Internet connection (say, 1 gigabit per second), and assuming computation and communication do not overlap, transmitting the gradient update takes at least 16 seconds, creating significant idle time. Techniques like tensor parallelism, which require far more communication, would of course fare even worse.
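The arithmetic behind this figure is reproduced below.

```python
# Back-of-the-envelope for the example above: 1B parameters in half precision
# is ~2 GB of gradients per step, which takes ~16 s over a 1 Gbit/s link if
# communication does not overlap with compute.
params = 1e9
bytes_per_param = 2                       # fp16 / bf16
payload_bits = params * bytes_per_param * 8
link_bits_per_second = 1e9                # 1 gigabit per second
print(payload_bits / link_bits_per_second, "seconds")   # -> 16.0
```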
Second, current training techniques lack fault tolerance. Like any distributed system, training clusters become more prone to failure as they grow. The problem is exacerbated in training, however, because our current techniques are largely synchronous, meaning the GPUs must work in lockstep to train the model.
The failure of a single GPU among thousands can halt the entire training run, forcing every other GPU to sit idle while training restarts. In some cases, a GPU does not fail outright but becomes sluggish for various reasons, slowing down thousands of other GPUs in the cluster. Given the size of today's clusters, this can mean tens to hundreds of millions of dollars in additional cost.
Meta detailed these issues during their Llama 3.1 training run, reporting over 400 unexpected interruptions, an average of about 8 per day. These outages were primarily attributed to hardware issues, such as GPU or host failures, and resulted in GPU utilization of only 38-43%. OpenAI fared even worse when training GPT-4, at only 32-36%, also due in part to frequent failures during the run.
In other words, frontier labs still struggle to exceed 40% utilization even when training in fully optimized environments with homogeneous, state-of-the-art hardware, networking, power, and cooling. This is primarily due to hardware failures and network issues, which are only exacerbated in edge training environments, where devices vary in processing power, bandwidth, latency, and reliability. Not to mention, decentralized networks are vulnerable to malicious actors who may try to disrupt the overall project or cheat on specific workloads for various reasons. Even SETI@home, a purely volunteer network, saw cheating by some participants.
Third, frontier model training requires massive amounts of compute. While projects like SETI and Folding reached impressive scale, they pale in comparison to the compute required for frontier training today. GPT-4 was reportedly trained on a cluster of 20,000 A100s with a peak throughput of 6.28 exaFLOPS at half precision, roughly three times Folding@home's computing power at its peak.
Llama 405B was trained on 16,000 H100s with a peak throughput of 15.8 exaFLOPS, about 7 times Folding's peak. This gap will only widen as multiple labs plan clusters of over 100,000 H100s, each delivering a staggering 99 exaFLOPS.

This makes sense, since @home projects are volunteer-driven. Contributors donate their memory and processor cycles and bear the associated costs, which naturally limits their scale relative to commercial projects.
Recent developments
While these problems have historically plagued decentralized training efforts, they no longer appear insurmountable. New training technologies have emerged that reduce the need for communication between nodes, allowing for efficient training on Internet-connected devices.
Many of these technologies come from large labs that want to scale model training across data centers and therefore need communication-efficient techniques. We are also seeing progress on fault-tolerant training methods and on cryptographic incentive systems that could support larger-scale training in edge environments.
Efficient communication techniques
DiLoCo is recent work from Google that reduces communication overhead by performing local optimization before passing an updated model state between devices. Their approach (which builds on earlier federated learning research) achieves results comparable to traditional synchronous training while reducing inter-node communication by a factor of 500.
The approach has since been replicated by other researchers and scaled to larger models (over 1 billion parameters). It has also been extended to asynchronous training, meaning nodes can share their updates at different times rather than all at once, which better accommodates edge hardware with varying processing capabilities and network speeds.
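A rough sketch of this local-optimization pattern is shown below, assuming an inner AdamW optimizer and a fixed number of local steps per round; the function and variable names are hypothetical.

```python
# Sketch of a DiLoCo-style round: take many local steps, then communicate only
# the "pseudo-gradient" (how far the local weights moved from the shared ones)
# once per round, instead of gradients at every step.
import copy
import torch

def local_round(global_model, local_batches, inner_steps=500, inner_lr=1e-4):
    local_model = copy.deepcopy(global_model)
    inner_opt = torch.optim.AdamW(local_model.parameters(), lr=inner_lr)
    for _, (inputs, targets) in zip(range(inner_steps), local_batches):
        inner_opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(local_model(inputs), targets)
        loss.backward()
        inner_opt.step()
    # The only thing sent over the network: one delta per parameter per round.
    return [g.detach() - l.detach()
            for g, l in zip(global_model.parameters(), local_model.parameters())]
```

In the paper, the pseudo-gradients from all workers are averaged and applied to the shared weights by an outer optimizer (SGD with Nesterov momentum); that aggregation step is omitted here.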
Other data-parallel methods, such as lo-fi and DisTrO, aim to reduce communication costs even further. Lo-fi proposes purely local fine-tuning: nodes train independently and only pass their weights on at the end. This approach matches baseline performance when fine-tuning language models with over 1 billion parameters while eliminating communication overhead entirely.
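A minimal sketch of that end-of-run merge is shown below, assuming a simple uniform weight average; the helper name is hypothetical.

```python
# Lo-fi-style merge: nodes fine-tune with zero communication, and their weights
# are combined once at the very end (here, a plain uniform average).
import copy
import torch

@torch.no_grad()
def merge_finetuned_models(models):
    merged = copy.deepcopy(models[0])
    for merged_param, *worker_params in zip(merged.parameters(),
                                            *[m.parameters() for m in models]):
        stacked = torch.stack([p.detach() for p in worker_params])
        merged_param.copy_(stacked.mean(dim=0))
    return merged
```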
In a preliminary report, DisTrO claims to use a new family of decentralized optimizers that it believes can reduce communication requirements by four to five orders of magnitude, although the method has yet to be independently confirmed.
New model-parallel methods have also emerged, making greater scale possible. DiPaCo (also from Google) divides the model into multiple modules, each containing different expert modules, to facilitate training for specific tasks. Training data is then sharded by "paths," which are sequences of experts corresponding to each data sample.
Given a shard, each worker can train its path almost entirely independently, apart from the communication needed to share modules, which is handled with DiLoCo. This architecture more than halves the training time of a billion-parameter model.
SWARM Parallelism and Decentralized Training of Foundation Models in Heterogeneous Environments (DTFMHE) also propose model-parallel methods for training large models in heterogeneous environments. SWARM found that pipeline-parallel communication constraints ease as model size grows, making it possible to train larger models efficiently with lower network bandwidth and higher latency.
To apply this idea in a heterogeneous environment, they use temporary "pipelines" between nodes that can be updated on the fly at each iteration, allowing a node to send its output to any peer serving the next pipeline stage.
This means that if one peer is faster than others, or if any participant disconnects, outputs can be dynamically rerouted to keep training going, as long as there is at least one active participant per stage. They used this approach to train a model with over 1 billion parameters on low-cost heterogeneous GPUs with slow interconnects (as shown in the image below).
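The routing idea can be illustrated with the sketch below, which prefers live, low-latency peers in the next stage; the peer registry and latency tracking are stand-ins for the real system.

```python
# Illustrative SWARM-style routing: pick any live peer serving the next stage,
# weighting the choice toward peers that have recently responded quickly, so
# slow or disconnected peers are skipped and load spreads across the stage.
import random

def pick_next_stage_peer(peers, latency_ms, max_latency_ms=2000):
    alive = [p for p in peers if latency_ms.get(p, float("inf")) < max_latency_ms]
    if not alive:
        raise RuntimeError("no live peers available in the next pipeline stage")
    weights = [1.0 / latency_ms[p] for p in alive]
    return random.choices(alive, weights=weights, k=1)[0]
```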
DTFMHE likewise proposes a novel scheduling algorithm, combined with pipeline and data parallelism, to train large models on devices spread across three continents. Although their network is 100 times slower than a standard data-center interconnect, their approach is only 1.7-3.5 times slower than running standard DeepSpeed inside a data center. Similar to SWARM, DTFMHE shows that communication costs can be effectively hidden as model size grows, even in geographically distributed networks. Weak connections between nodes can be mitigated through various techniques, including larger hidden layers and more layers per pipeline stage.
Fault tolerance
Many of the data parallel methods described above are fault tolerant by default because each node stores the entire model in memory. This redundancy usually means that nodes can still work independently even if other nodes fail. This is important for decentralized training, as nodes are often unreliable, heterogeneous, and may even behave maliciously. However, as mentioned before, purely data-parallel methods are only suitable for smaller models, so the model size is constrained by the memory capacity of the smallest node in the network.
To address this, researchers have proposed fault-tolerant techniques for model-parallel (and hybrid-parallel) training. SWARM responds to peer failures by prioritizing stable, low-latency peers and rerouting tasks within a pipeline stage when a failure occurs. Other approaches, such as Oobleck, take a similar tack by creating multiple "pipeline templates" that provide redundancy against partial node failures. Although tested in data centers, Oobleck's approach provides strong reliability guarantees that apply equally to decentralized environments.
We are also seeing new model architectures, such as Decentralized Mixture-of-Experts (DMoE), designed to support fault-tolerant training in decentralized environments. Like traditional mixture-of-experts models, DMoE consists of multiple independent "expert" networks distributed across a set of worker nodes.
DMoE uses a distributed hash table to track and integrate asynchronous updates in a decentralized manner. This mechanism (also used in SWARM) is resilient to node failures, since experts that go down or fail to respond in time can simply be excluded from the averaging step.
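The sketch below illustrates that exclusion logic with a timeout-based aggregation; query_expert and the surrounding plumbing are hypothetical placeholders.

```python
# Fault-tolerant aggregation sketch: collect expert outputs with a timeout and
# drop experts that fail or respond too slowly, rather than stalling the step.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def aggregate_expert_outputs(experts, request, timeout_s=1.0):
    results = []
    with ThreadPoolExecutor(max_workers=len(experts)) as pool:
        futures = [pool.submit(query_expert, e, request) for e in experts]
        for future in futures:
            try:
                results.append(future.result(timeout=timeout_s))
            except (FutureTimeout, ConnectionError):
                continue  # dead or slow expert: exclude it from the average
    if not results:
        raise RuntimeError("no experts responded in time")
    return sum(results) / len(results)
```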
Scale
Finally, cryptographic incentive systems like those behind Bitcoin and Ethereum could help achieve the required scale. Both networks crowdsource their computation by paying contributors in a native asset whose value grows with adoption. This design rewards early contributors generously, and those rewards can taper off once the network reaches a minimum viable size.
There are, of course, pitfalls in this mechanism to avoid. Chief among them is over-stimulating supply without generating corresponding demand. It can also raise regulatory issues if the underlying network is not sufficiently decentralized. But when designed well, decentralized incentive systems can reach considerable scale and sustain it over long periods.
For example, Bitcoin's annual power consumption is about 150 terawatt-hours (TWh), two orders of magnitude more than the largest AI training cluster currently conceived (100,000 H100s running at full capacity for a year).
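The back-of-the-envelope behind this comparison, assuming roughly 700W per H100 and ignoring cooling and other data-center overhead, looks like this:

```python
# 100,000 H100-class GPUs at ~700 W each, running flat-out for a year, consume
# on the order of 0.6 TWh -- two orders of magnitude below Bitcoin's ~150 TWh.
# (Ignores cooling and other facility overhead, so treat it as a lower bound.)
gpus = 100_000
watts_per_gpu = 700
hours_per_year = 24 * 365
twh_per_year = gpus * watts_per_gpu * hours_per_year / 1e12
print(f"{twh_per_year:.2f} TWh per year")   # -> ~0.61 TWh
```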
For reference, OpenAI's GPT-4 was trained on 20,000 A100s, and Meta's flagship Llama 405B model was trained on 16,000 H100s. Likewise, at its peak, Ethereum's power consumption was roughly 70 TWh, spread across millions of GPUs. Even allowing for rapid growth in AI data centers over the coming years, incentivized computing networks like these would exceed their scale many times over.
Of course, not all computation is fungible, and training has unique requirements relative to mining that need to be considered. Nonetheless, these networks demonstrate the scale that can be achieved through these mechanisms.
The road ahead
Tying these pieces together, we can see the beginnings of a new path forward.
Soon, new training techniques will allow us to move beyond the confines of the data center, as devices will no longer need to be co-located to be useful. This will take time: our current decentralized training methods still operate at a smaller scale, mostly in the range of 1 billion to 2 billion parameters, far smaller than models like GPT-4.
Further breakthroughs are needed to scale these methods without sacrificing key properties such as communication efficiency and fault tolerance. Or we may need new model architectures that differ from today's large monolithic models - perhaps smaller, more modular, and run on edge devices rather than in the cloud.
In any case, it is reasonable to expect further progress in this direction. The costs of our current methods are unsustainable, which creates a strong market incentive for innovation. We are already seeing this trend, with manufacturers like Apple building more powerful edge devices to run more workloads locally rather than relying on the cloud.
We're also seeing increasing support for open source solutions - even within companies like Meta, to promote more decentralized research and development. These trends will only accelerate over time.
At the same time, we need new network infrastructure that connects edge devices so they can be used in this way: laptops, gaming desktops, and eventually perhaps even phones with high-performance GPUs and large amounts of memory.
This will allow us to build a "global cluster" of low-cost, always-on computing power that can process training tasks in parallel. It is also a challenging problem that requires progress in multiple areas.
We need better scheduling techniques for training in heterogeneous environments. There is currently no way to automatically parallelize a model for optimal training, especially when devices can disconnect or join at any time. This is a critical next step in optimizing training while retaining the scale advantages of edge-based networks.
We also have to deal with the general complexities of decentralized networks. To maximize scale, the network should be built as an open protocol: a set of standards and instructions governing interactions between participants, like TCP/IP but for machine learning compute. This would let any device that meets certain specifications join the network, regardless of owner or location. It would also ensure the network remains neutral, allowing users to train whichever models they prefer.
While this maximizes scale, it also requires a mechanism to verify the correctness of all training work without relying on a single entity. This is critical because there are inherent incentives to cheat, for example claiming payment for a training task that was never actually performed. It is particularly challenging because different hardware often executes machine learning operations in subtly different, non-deterministic ways, making it difficult to verify correctness through simple replication. Solving this properly requires deep research in cryptography and other disciplines.
Fortunately, we continue to see progress on all of these fronts. These challenges no longer seem insurmountable compared to years past. They also pale in comparison to the opportunity. Google summarizes it best in their DiPaCo paper, pointing out the negative feedback loop that decentralized training has the potential to break:
Advances in the distributed training of machine learning models may facilitate simpler infrastructure, ultimately leading to more widespread availability of compute. Currently, infrastructure is designed around standard methods for training large monolithic models, and models are designed around the current infrastructure and training methods. This feedback loop can trap the community in a deceptive local minimum, where computing resources are more constrained than they actually need to be.
Perhaps most exciting is the growing enthusiasm among the research community to address these questions. Our team at Gensyn is building the network infrastructure described above. Teams like Hivemind and BigScience apply many of these techniques in practice.
Projects like Petals, sahajBERT, and Bloom demonstrate the capabilities of these technologies and the growing interest in community-based machine learning. Many others are also driving research forward, with the goal of building a more open and collaborative model training ecosystem. If you are interested in this work, please contact us to get involved.




