Original author: @BlazingKevin_, Researcher at Movemaker
Nvidia has quietly recovered all of the decline triggered by DeepSeek and even set a new high. The evolution of multimodal models has not caused chaos; instead, it has deepened the technical moat of Web2 AI. From semantic alignment to visual understanding, from high-dimensional embeddings to feature fusion, complex models are integrating the expressive forms of multiple modalities at unprecedented speed, building an increasingly closed AI highland. The US stock market has voted with its feet as well, with both crypto and AI stocks enjoying a small bull market. Yet this wave of enthusiasm has nothing to do with Crypto.

The Web3 AI attempts we have seen, especially the evolution of the Agent direction in recent months, are almost entirely misguided: naively trying to assemble Web2-style multimodal modular systems on a decentralized structure is a dual misalignment of technology and thinking. With today's tightly coupled modules, highly unstable feature distributions, and ever more centralized computing power, multimodal modularization simply cannot stand in Web3. What we want to point out is that the future of Web3 AI lies not in imitation but in a strategic detour. From semantic alignment in high-dimensional space, to the information bottleneck of attention mechanisms, to feature alignment under heterogeneous computing power, I will explain why Web3 AI should adopt a strategy of encircling the cities from the countryside.
Web3 AI Built on Flattened Multimodal Models: Semantic Misalignment Leads to Poor Performance
In modern Web2 AI multimodal systems, "semantic alignment" refers to mapping information from different modalities (images, text, audio, video, and so on) into the same, or mutually convertible, semantic space, so that the model can understand and compare the intrinsic meaning behind these otherwise disparate signals. For example, a cat photo and the text "a cute cat" need to be projected to nearby positions in a high-dimensional embedding space, so that the model can "talk about the picture" or "recall the image from the sound" during retrieval, generation, or reasoning.
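To make this concrete, here is a minimal, self-contained sketch of projecting two modalities into a shared space. The two `torch.nn.Linear` encoders are placeholders for real pretrained vision and text towers (a CLIP-style pair, for instance), and the feature dimensions are purely illustrative.

```python
# A minimal sketch of semantic alignment in a shared embedding space.
# image_encoder / text_encoder are hypothetical stand-ins for pretrained
# towers; here they are untrained projections so the script runs standalone.
import torch
import torch.nn.functional as F

DIM = 512  # dimensionality of the shared semantic space

image_encoder = torch.nn.Linear(2048, DIM)   # placeholder vision tower
text_encoder = torch.nn.Linear(768, DIM)     # placeholder text tower

image_feat = torch.randn(1, 2048)            # e.g. pooled features of a cat photo
text_feat = torch.randn(1, 768)              # e.g. pooled features of "a cute cat"

# Project both modalities into the same space and L2-normalize,
# so cosine similarity becomes a meaningful cross-modal distance.
img_emb = F.normalize(image_encoder(image_feat), dim=-1)
txt_emb = F.normalize(text_encoder(text_feat), dim=-1)

similarity = (img_emb @ txt_emb.T).item()
print(f"cross-modal similarity: {similarity:.3f}")
# After contrastive training (e.g. an InfoNCE loss), matching image/text
# pairs land close together, enabling retrieval and "talking about the picture".
```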
Only when a shared high-dimensional embedding space is in place does modularizing the workflow actually reduce cost and improve efficiency. In Web3 Agent protocols, however, high-dimensional embedding cannot be achieved, because modularization in Web3 AI is an illusion.
Requiring Web3 AI to achieve a high-dimensional space is therefore equivalent to demanding that an Agent protocol independently develop every relevant API interface, which defeats its original modular intent. The modular multimodal systems depicted by small and medium Web3 AI teams do not withstand scrutiny. A high-dimensional architecture requires end-to-end unified training or collaborative optimization: from signal capture to policy computation to execution and risk control, every stage must share the same representation and the same loss function. The "module as plugin" approach of Web3 Agents instead exacerbates fragmentation: each Agent is upgraded, deployed, and tuned within its own silo, synchronized iteration is difficult, and there is no effective centralized monitoring or feedback, so maintenance costs soar and overall performance stays capped.
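The gap between end-to-end feedback and "module as plugin" silos can be shown with a toy gradient-flow sketch. The two `Linear` stages and the MSE loss below are hypothetical stand-ins for the signal-capture and policy stages described above; the point is only that detaching the upstream output, which is what an opaque API boundary effectively does, cuts that stage off from the shared loss.

```python
# Toy illustration: a shared loss only improves upstream stages when
# gradients can reach them; an API-style boundary (simulated by detach)
# blocks that feedback entirely.
import torch
import torch.nn.functional as F

def run(detach_upstream: bool) -> bool:
    """Return True if the upstream stage received gradient from the shared loss."""
    perception = torch.nn.Linear(32, 16)   # hypothetical signal-capture stage
    policy = torch.nn.Linear(16, 4)        # hypothetical policy/execution stage
    signal, target = torch.randn(8, 32), torch.randn(8, 4)

    feats = perception(signal)
    if detach_upstream:                    # simulate an opaque external API call
        feats = feats.detach()
    loss = F.mse_loss(policy(feats), target)
    loss.backward()
    return perception.weight.grad is not None

print("end-to-end feedback reaches upstream:", run(detach_upstream=False))   # True
print("'module as plugin' feedback reaches upstream:", run(detach_upstream=True))  # False
```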
Building an intelligent agent with genuine industry barriers requires end-to-end joint modeling, unified cross-module embeddings, and the systematic engineering of collaborative training and deployment to break through the current situation. The current market, however, has no such pain point, and therefore no demand.
In Low-Dimensional Spaces, Attention Mechanisms Cannot Be Precisely Designed
In Web2 AI, the downstream task loss is continuously fed back through the attention and fusion layers to every part of the model, automatically adjusting which features should be enhanced or suppressed and forming a closed optimization loop. Web3 AI, by contrast, relies on manual or external processes to evaluate API call results and adjust parameters after the fact; lacking automated end-to-end feedback, its fusion strategies are hard to iterate and optimize online.
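As an illustration of that closed loop, here is a toy gated-fusion sketch (hypothetical modalities and dimensions): the downstream task loss alone re-weights how much each modality contributes, with no human re-tuning between iterations.

```python
# A minimal sketch of learned fusion: the task loss decides which modality
# to enhance or suppress, the closed loop Web2 multimodal stacks get "for
# free" and that API-orchestrated pipelines must replicate by hand.
import torch
import torch.nn.functional as F

class GatedFusion(torch.nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.gate = torch.nn.Linear(2 * dim, 2)   # one score per modality
        self.head = torch.nn.Linear(dim, 1)       # downstream task head

    def forward(self, vision: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        w = F.softmax(self.gate(torch.cat([vision, text], dim=-1)), dim=-1)
        fused = w[..., :1] * vision + w[..., 1:] * text
        return self.head(fused)

model = GatedFusion()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
vision, text = torch.randn(16, 64), torch.randn(16, 64)
target = torch.randn(16, 1)

for _ in range(10):                     # each step re-tunes the fusion weights
    loss = F.mse_loss(model(vision, text), target)
    opt.zero_grad(); loss.backward(); opt.step()
# The gate's parameters are adjusted purely by the task loss; nobody
# re-weights the modalities by hand between iterations.
```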
Barriers in the AI Industry Are Deepening, but Pain Points Have Not Yet Emerged
Because end-to-end training must simultaneously handle cross-modal alignment, precise attention computation, and high-dimensional feature fusion, Web2 AI's multimodal systems are invariably enormous engineering projects. They not only require massive, diverse, and precisely annotated cross-modal datasets, but also commit thousands of GPUs to training runs lasting weeks or even months. In model architecture, they integrate the latest network design ideas and optimization techniques. In engineering, they must build scalable distributed training platforms, monitoring systems, model version management, and deployment pipelines. In algorithmic research, they continuously pursue more efficient attention variants, more robust alignment losses, and lighter fusion strategies. Such comprehensive, full-stack, systemic work places extremely high demands on capital, data, computing power, talent, and even organizational coordination, which is why it constitutes a strong industry barrier and a core competence held by only a few leading teams.
In my April review of Chinese AI applications and their comparison with Web3 AI, I made one point: Crypto's breakthrough opportunities lie in industries with strong barriers, meaning industries that are already very mature in traditional markets but in which huge pain points have emerged. High maturity means users are sufficiently familiar with the business model; significant pain points mean users are willing to try new solutions and are therefore strongly willing to accept Crypto. Both are indispensable. Conversely, if an industry is not yet mature in traditional markets and has no significant pain points, Crypto cannot take root there: users have little willingness to understand it and never perceive its potential, so it has no room to survive.
Web3 AI, or any Crypto product that claims Product-Market Fit, needs to develop through the tactic of encircling the cities from the countryside: test at small scale in edge positions, secure a solid base, and then wait for the core scenarios, the target cities, to emerge. The core of Web3 AI is decentralization, and its evolutionary path shows up as high parallelism, low coupling, and compatibility with heterogeneous computing power. This gives Web3 AI an advantage in edge-computing scenarios and makes it suited to tasks that are structurally lightweight, easy to parallelize, and easy to incentivize: LoRA fine-tuning, post-training behavior alignment, crowdsourced data training and annotation, small foundation model training, and collaborative training on edge devices (a minimal LoRA sketch follows at the end of this section). Products built for these scenarios have lightweight architectures and flexible roadmaps.

However, this does not mean the opportunity exists right now. Web2 AI's barriers are only beginning to form, and DeepSeek's emergence has in fact accelerated progress on complex multimodal AI tasks; this is competition among top enterprises, the early stage of the Web2 AI dividend. I believe Web3 AI's entry point will come only when the Web2 AI dividend is exhausted, much as was the case with the birth of DeFi. Until that moment arrives, Web3 AI projects entering the market will keep manufacturing pain points rather than solving real ones. We need to distinguish carefully which protocols actually follow the "encircling the cities from the countryside" approach: whether they can take root in weakly contested, under-served "countryside" scenarios (small markets or niche use cases) and gradually accumulate resources and experience, and whether they can combine point and surface, advance step by step, and keep iterating their products within sufficiently small application scenarios. A project that cannot do this will find it extremely difficult to reach a $1 billion market capitalization through Product-Market Fit, and such projects will not be on the watch list. We also need to assess whether a protocol can fight a protracted war with flexibility, because Web2 AI's potential barriers are changing dynamically and the corresponding potential pain points are evolving with them. We must observe whether Web3 AI protocols are flexible enough to adapt quickly to different scenarios, move rapidly between "countryside" niches, and approach the target cities at the fastest possible speed. If a protocol is too infrastructure-heavy, with a massive network architecture, the likelihood of its elimination is high.
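For reference, here is the LoRA sketch promised above, under illustrative dimensions: the frozen base weight stays put while only two small low-rank matrices are trained and shipped, which is why this kind of fine-tuning parallelizes naturally across heterogeneous, loosely coordinated nodes.

```python
# A minimal LoRA-style adapter sketch (dimensions are illustrative, not from
# any specific model): the pretrained weight W is frozen, and only the
# low-rank matrices A and B are trained, so each edge node can tune and
# contribute a tiny adapter independently of the others.
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = torch.nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)          # frozen pretrained weight
        self.A = torch.nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(d_out, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the trainable low-rank update B @ A.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d_in=4096, d_out=4096, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable:,} of {total:,}")   # roughly 0.4% of the layer
```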
About Movemaker
Movemaker is the first official community organization authorized by the Aptos Foundation, jointly initiated by Ankaa and BlockBooster, focusing on promoting the construction and development of the Aptos ecosystem in the Chinese-speaking region. As the official representative of Aptos in the Chinese-speaking area, Movemaker is committed to creating a diverse, open, and prosperous Aptos ecosystem by connecting developers, users, capital, and numerous ecosystem partners.
Disclaimer:
This article/blog is for reference only and represents the author's personal views, not Movemaker's position. This article does not intend to provide: (i) investment advice or recommendations; (ii) an offer or solicitation to buy, sell, or hold digital assets; or (iii) financial, accounting, legal, or tax advice. Holding digital assets, including stablecoins and Non-Fungible Tokens, carries extremely high risks with significant price volatility and potential total loss of value. You should carefully consider whether trading or holding digital assets is suitable for your financial situation. For specific questions, please consult your legal, tax, or investment advisor. Information provided (including market data and statistics, if any) is for general reference only. Reasonable care has been taken in writing these data and charts, but no responsibility is accepted for any factual errors or omissions.



