AI Training and Inference Tech Stack: From Silicon to Sentience

The rapid advancement of artificial intelligence has been underpinned by a complex technological infrastructure. This AI tech stack, a layered architecture of hardware and software, forms the backbone of today’s AI revolution. Here, we delve into the main layers of the stack and explain how each contributes to AI development and deployment. At the end, we reflect on why understanding these primitives matters when evaluating opportunities at the intersection of crypto and AI, especially DePIN (decentralized physical infrastructure) projects such as GPU networks.

Hardware Layer: The Silicon Foundation

At the base lies the hardware, the physical compute power driving AI.

CPUs (Central Processing Units) are the foundational processors in computing. They excel at sequential tasks and are crucial for general-purpose computing, including data preprocessing, small-scale AI tasks, and coordinating other components.

GPUs (Graphics Processing Units), originally designed for rendering graphics, have become essential for AI due to their ability to perform many simple calculations simultaneously. This parallel processing capability makes them ideal for training deep learning models; without advances in GPUs, modern GPTs would not have been possible.
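
As a rough illustration of that parallelism, the sketch below (assuming PyTorch and a CUDA-capable GPU are available) times the same matrix multiplication on the CPU and on the GPU; the GPU runs the operation across thousands of threads at once.

```python
# Minimal sketch: the same matrix multiplication on CPU vs. GPU.
# Assumes PyTorch is installed and a CUDA-capable GPU is present.
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

t0 = time.time()
c_cpu = a @ b                          # executed on CPU cores
cpu_seconds = time.time() - t0

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()  # copy data to GPU memory
    torch.cuda.synchronize()           # GPU kernels launch asynchronously
    t0 = time.time()
    c_gpu = a_gpu @ b_gpu              # executed by thousands of GPU threads in parallel
    torch.cuda.synchronize()
    gpu_seconds = time.time() - t0
    print(f"CPU: {cpu_seconds:.3f}s  GPU: {gpu_seconds:.3f}s")
```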

AI Accelerators are specialized chips designed specifically for AI workloads. They optimize common AI operations, offering high performance and energy efficiency for both training and inference tasks.

FPGAs (Field-Programmable Gate Arrays) offer flexibility through their reprogrammable nature. They can be optimized for specific AI tasks, particularly in inference scenarios where low latency is crucial.

Low-Level Software: The Intermediaries

This layer of the AI tech stack is crucial because it bridges the gap between high-level AI frameworks and the underlying hardware. Toolchains such as CUDA, ROCm, oneAPI, and SNPE translate high-level framework operations into code tuned for specific hardware architectures, enabling optimized performance.

CUDA, NVIDIA’s proprietary software layer, stands as the cornerstone of the company’s remarkable ascendancy in the AI hardware market. NVIDIA’s dominance is not merely a function of superior hardware, but a testament to the power of the network effects created by its software and the resulting ecosystem integration.

CUDA’s influence comes from its deep entrenchment in the AI tech stack, offering an extensive array of optimized libraries that have become de facto standards in the field. This software moat has created a formidable network effect: AI researchers and developers, well-versed in CUDA during their training, propagate its use in both academia and industry.

The resulting virtuous cycle reinforces NVIDIA’s market leadership, as the ecosystem of CUDA-based tools and libraries becomes increasingly indispensable to AI practitioners.

This software-hardware symbiosis has not only cemented NVIDIA’s position at the forefront of AI computing but has also endowed the company with significant pricing power, a rare feat in the typically commoditized hardware market.

The dominance of CUDA and the relative obscurity of its competitors can be attributed to a confluence of factors that have created significant barriers to entry. NVIDIA’s first-mover advantage in GPU-accelerated computing allowed CUDA to establish a robust ecosystem before rivals could gain a foothold. Although competitors such as AMD and Intel ship impressive hardware, their software layers lack comparable libraries and tooling and do not integrate as seamlessly with the existing tech stack, which is why a wide gap remains between NVIDIA/CUDA and every other contender.
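
To make the “optimized libraries” point concrete, the hedged sketch below uses CuPy, a NumPy-like Python library that dispatches array operations to NVIDIA’s CUDA libraries (cuBLAS for matrix algebra, cuFFT for Fourier transforms) under the hood. It assumes CuPy and a CUDA GPU are installed and is only meant to show how thin the layer between ordinary Python code and CUDA’s libraries has become.

```python
# Minimal sketch of CUDA's library ecosystem seen from Python.
# Assumes CuPy and a CUDA-capable GPU are available.
import cupy as cp

a = cp.random.rand(4096, 4096).astype(cp.float32)
b = cp.random.rand(4096, 4096).astype(cp.float32)

c = a @ b                   # matrix multiply, dispatched to cuBLAS
spectrum = cp.fft.fft2(a)   # 2-D FFT, dispatched to cuFFT

# Results live in GPU memory; copy back to the host when needed.
c_host = cp.asnumpy(c)
```

Frameworks such as PyTorch and TensorFlow sit on the same libraries (plus cuDNN and NCCL), which is precisely what makes this ecosystem so hard to displace.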

Compilers: The Translators

TVM (Tensor Virtual Machine), MLIR (Multi-Level Intermediate Representation), and PlaidML offer distinct approaches to the challenge of optimizing AI workloads across diverse hardware architectures.

TVM, born from research at the University of Washington, has rapidly gained traction for its ability to optimize deep learning models for a wide array of devices, from high-performance GPUs to resource-constrained edge devices. Its strength lies in its end-to-end optimization pipeline, which has proven particularly effective in inference scenarios. It abstracts away vendor and hardware differences so that inference workloads can run seamlessly on non-uniform hardware, from NVIDIA devices to AMD, Intel, and beyond.
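
As a hedged sketch of what that looks like in practice (API details vary across TVM releases, and the model file name here is hypothetical), the same imported model can be compiled for different backends simply by changing the target string:

```python
# Minimal sketch: compile one model for several hardware targets with TVM.
# Assumes TVM and ONNX are installed; "model.onnx" is a hypothetical file.
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("model.onnx")
shape_dict = {"input": (1, 3, 224, 224)}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# The same high-level module is lowered to different backends by
# swapping the target: CPU (LLVM), NVIDIA GPU (CUDA), AMD GPU (ROCm).
for target in ["llvm", "cuda", "rocm"]:
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
    lib.export_library(f"model_{target}.so")
```

The exported library is then loaded by TVM’s lightweight runtime on the target device, which is what lets heterogeneous machines serve the same model.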

Beyond inference, however, things become more complicated. The holy grail of fungible compute for AI training remains unresolved, but a couple of initiatives are worth mentioning in this context.

MLIR, which originated at Google and now lives within the LLVM project, takes a more foundational approach. By providing a unified intermediate representation spanning multiple levels of abstraction, it aims to streamline the entire compiler infrastructure, targeting both inference and training use cases.

PlaidML, now under Intel’s leadership, positions itself as a dark horse in this race. Its focus on portability across diverse hardware architectures, including those beyond traditional AI accelerators, speaks to a future where AI workloads are ubiquitous across computing platforms.

Should any of these compilers become well integrated into the tech stack, without hurting model performance or requiring additional work from developers, it could jeopardize CUDA’s moat by providing common ground for the various AI frameworks and hardware backends. For now, however, MLIR and PlaidML are neither mature enough nor well integrated enough into the AI tech stack to pose an obvious threat to CUDA’s dominance.

Distributed Computing: The Orchestrators

Ray and Horovod represent two distinct approaches to distributed computing in the AI landscape, each addressing the critical need for scalable processing in large-scale AI applications.

Ray, developed by UC Berkeley’s RISELab, is a general-purpose distributed computing framework. It excels in its flexibility, allowing for the distribution of various types of workloads beyond just machine learning. Ray’s actor-based model enables developers to easily parallelize Python code, making it particularly useful for reinforcement learning and other AI tasks that require complex, heterogeneous workflows.
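
A minimal sketch of that actor model follows (assuming Ray is installed; the parameter-server class and gradient function are illustrative placeholders, not a real training loop):

```python
# Minimal sketch of Ray's task and actor model. Assumes `pip install ray`.
import ray

ray.init()  # starts a local cluster; in production this attaches to a multi-node cluster

@ray.remote
class ParameterServer:
    """A stateful actor holding model weights (a toy scalar for illustration)."""
    def __init__(self):
        self.weight = 0.0

    def apply_gradient(self, grad):
        self.weight -= 0.1 * grad
        return self.weight

@ray.remote
def compute_gradient(shard):
    """A stateless task; stands in for a real gradient computation."""
    return sum(shard) / len(shard)

ps = ParameterServer.remote()
shards = [[1, 2], [3, 4], [5, 6]]
grads = [compute_gradient.remote(s) for s in shards]    # tasks run in parallel
updates = [ps.apply_gradient.remote(g) for g in grads]  # the actor serializes state updates
print(ray.get(updates))
```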

Horovod, originally developed by Uber, focuses specifically on distributed deep learning. It provides a simple, efficient way to scale deep learning training across multiple GPUs and nodes. Horovod’s strength lies in its ease of use and performance optimization for data-parallel training of neural networks. It integrates seamlessly with TensorFlow, PyTorch, and other major frameworks, allowing developers to distribute their existing training scripts with minimal code changes.
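
A hedged sketch of that data-parallel pattern with PyTorch (assuming Horovod is installed with its PyTorch bindings and the script is launched with something like `horovodrun -np 4 python train.py`):

```python
# Minimal sketch of Horovod data-parallel training with PyTorch.
# Assumes Horovod's PyTorch bindings and one GPU per worker process.
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # pin each worker to its own GPU

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via allreduce,
# and make sure every worker starts from identical weights.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

for step in range(100):
    x = torch.randn(32, 10).cuda()  # stand-in for each worker's data shard
    y = torch.randn(32, 1).cuda()
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```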

Closing Thoughts: The Crypto Angle

Integration with existing AI stacks is critical for DePIN projects aiming to build distributed computing systems. Such integration ensures compatibility with current AI workflows and tools, lowering the barrier to adoption.

The current state of GPU networks in the crypto space, functioning essentially as decentralized GPU rental platforms, represents a preliminary step towards more sophisticated distributed AI infrastructure. Rather than working as a distributed cloud, the existing networks resemble an Airbnb-style marketplace for GPUs. While useful for certain applications, these platforms fall short of supporting true distributed training, a crucial requirement for advancing large-scale AI development.

Current distributed computing standards like Ray and Horovod were not designed with globally distributed networks in mind; for decentralized networks to truly work, a new framework is needed at this layer. Skeptics go as far as to say that Transformers are incompatible with decentralized training because of their intensive communication requirements and the need to optimize a global objective during learning. Optimists, on the other hand, are trying to design new distributed computing frameworks that work well with globally distributed hardware. Yotta is one of the startups trying to solve this problem.

NeuroMesh goes even further. Its approach to redesigning the machine learning process is particularly innovative: by leveraging Predictive Coding Networks (PCNs) to replace a global loss function with local error minimization, NeuroMesh addresses a fundamental bottleneck in distributed AI training. This approach not only enables unprecedented parallelization but also democratizes AI training by making it feasible on more widely available hardware such as RTX 4090 GPUs. The 4090 offers raw compute comparable to an H100, but its lack of bandwidth has kept it largely out of training workloads; because PCN reduces the importance of bandwidth, it becomes possible to leverage these lower-end GPUs, which could bring significant cost savings and efficiency gains.
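
To illustrate the local-error idea only (this is a toy linear sketch of predictive coding in general, not NeuroMesh’s actual algorithm; the layer sizes, learning rates, and clamped input/output are illustrative assumptions), each layer below updates its weights from the prediction error of its immediate neighbours, so no global backward pass is required:

```python
# Toy sketch of local error minimization in a linear predictive coding network.
# Illustrative only: not NeuroMesh's algorithm, no nonlinearities, no batching.
import numpy as np

rng = np.random.default_rng(0)
sizes = [8, 16, 4]  # input, hidden, output widths
W = [rng.normal(scale=0.1, size=(sizes[i + 1], sizes[i])) for i in range(2)]

def pcn_step(x, y, lr_a=0.05, lr_w=0.01, infer_steps=20):
    """One training step: relax the hidden activity, then update weights locally."""
    a = [x, W[0] @ x, y]  # activities; input and output layers are clamped
    for _ in range(infer_steps):
        e1 = a[1] - W[0] @ a[0]             # prediction error at the hidden layer
        e2 = a[2] - W[1] @ a[1]             # prediction error at the output layer
        a[1] += lr_a * (-e1 + W[1].T @ e2)  # gradient step on the local energy
    e1 = a[1] - W[0] @ a[0]
    e2 = a[2] - W[1] @ a[1]
    # Each weight update needs only the activity and error of adjacent layers,
    # so different layers could in principle live on different machines.
    W[0] += lr_w * np.outer(e1, a[0])
    W[1] += lr_w * np.outer(e2, a[1])

x, y = rng.normal(size=8), rng.normal(size=4)
for _ in range(200):
    pcn_step(x, y)
```

In a full PCN the layers are nonlinear and the inference relaxation is more elaborate, but the key property, that every update is local, is what makes the scheme attractive for bandwidth-constrained, geographically distributed hardware.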

GenSyn, another ambitious crypto x AI startup, has set itself the goal of building a set of compilers that could make compute fungible for AI training, essentially allowing any type of computing hardware to be used seamlessly for AI workloads. By analogy, GenSyn is trying to build for training what TVM provides for inference. If successful, it could dramatically expand the capabilities of decentralized AI computing networks, enabling them to tackle more complex and diverse AI tasks by efficiently utilizing a wide range of hardware. This moonshot vision carries high technical risk, given how hard it is to optimize across diverse hardware architectures, but it aligns with the broader trend towards more flexible and scalable AI infrastructure. Should GenSyn execute on this vision and overcome hurdles like maintaining performance across heterogeneous systems, the technology could weaken the moat of CUDA and NVIDIA by providing a hardware-agnostic alternative for AI training.

With respect to inference, Hyperbolic’s approach, combining verifiable inference with a decentralized network of heterogeneous compute resources, exemplifies this pragmatic strategy. By leveraging compiler standards like TVM, Hyperbolic can tap into a wide range of hardware configurations while maintaining performance and reliability. It can aggregate chips from multiple vendors (NVIDIA, AMD, Intel, and others), spanning both consumer-grade and high-performance hardware.

These developments in the crypto-AI intersection suggest a future where AI computation could become more distributed, efficient, and accessible. The success of these projects will depend not only on their technical merits but also on their ability to integrate seamlessly with existing AI workflows and address the practical concerns of AI practitioners and businesses.

