It is faster than OpenAI's most powerful model, GPT-4o, matches GPT-4's function calling capability, is N times smaller, and needs only a single GPU for inference.
That was the "shock" Nexa AI delivered when it debuted.
Four months ago, Octopus v2, a 500-million-parameter small model developed by Nexa AI, attracted wide attention in Silicon Valley AI circles. The Functional Token technique the team developed delivers inference that is 4 times faster than GPT-4o and 140 times faster than a RAG-based solution, while matching GPT-4's performance with function calling accuracy above 98%.
Octopus v2 was named "No.1 Product of the Day" on Product Hunt the day it launched and accumulated 12,000 downloads on Hugging Face in its first month. It also drew recognition from AI figures such as Hugging Face CTO Julien Chaumond, Hugging Face tech lead Philipp Schmid, and Figure AI founder Brett Adcock.
Nexa AI was co-founded by two young Stanford alumni born in the 1990s, Alex Chen and Zack Li, and currently has eight full-time employees. Charles (Chuck) Eesley, professor of management science and engineering at Stanford and associate director of the Stanford Technology Ventures Program, and Diyi Yang, a Stanford NLP professor and Sloan Research Fellow, serve as company advisors.
In a short time, the company has reportedly signed more than ten leading corporate clients across consumer electronics, automotive, cybersecurity, fashion retail, and other sectors. It serves more than 1,000 registered users and recently closed a seed round of over $10 million.
Less than a month after releasing Octopus v2, Nexa AI released Octopus v3, the first multimodal AI model with fewer than 1 billion parameters.
While maintaining function calling accuracy comparable to GPT-4V and GPT-4, it runs efficiently on edge devices such as the Raspberry Pi, accepts both text and image input, and understands Chinese and English. It was followed by Octo-planner, a 3.8-billion-parameter model that can carry out multi-step query tasks across different knowledge domains.
Next, Nexa AI extended its "ambition" to the entire edge model market.
Recently, it launched Model Hub, its first comprehensive on-device AI development platform. At its core is a library of AI models designed and optimized for local deployment, spanning the self-developed Octopus series as well as Llama 3.1, Gemma 2, Stable Diffusion, and Whisper, all able to run efficiently on a range of devices with no internet connection and no API fees.
Beyond the model library, Model Hub provides a comprehensive open-source SDK that lets developers deploy models locally and fine-tune or customize them to their own needs, plus a large set of practical examples to help users get started quickly. A developer community has also been established around it.
In other words, a Hugging Face for on-device models.
"What we really want to build is an on-device version of Hugging Face," Alex Chen told Silicon Star. By integrating models, tools, resources and communities, they are trying to build a complete end-side AI ecosystem.
Recently, Silicon Star also chatted with Alex Chen and Zack Li, the two co-founders of Nexa AI, about their thoughts on edge AI.
The following is a transcript of the conversation:
From the Stanford campus to a startup
Silicon Star: Alex and Zack, please introduce yourselves.
Alex Chen: I'm Alex, co-founder and CEO of Nexa AI. Before founding the company, I was a PhD student at Stanford doing research in AI and math. Zack and I are both Tongji alumni and have known each other for about 10 years. We collaborated on research and work many times before; for example, we both served as chairs of the Stanford Chinese Entrepreneurs Association, and during that time we put many entrepreneurial ideas into practice. But Nexa is the first time we've formally established a startup to pursue one.
Zack Li: I'm Zack, co-founder and CTO of Nexa AI. I've worked in industry since graduating from Stanford, first at Amazon Lab126 on Echo and Alexa, then at Google on Google Assistant and Google Glass, so I've accumulated four years of industry experience. I started working on Nexa AI with Alex last year. Because our direction aligns closely with Alex's research and my own past work, we have a big advantage in model training, customer delivery, and model deployment.
Silicon Star: What was the journey from the Stanford campus to starting a company, especially choosing small edge models as your direction?
Alex Chen: The idea of starting a business first came to us after we both joined the Stanford Chinese Entrepreneurs Association. It is much more formal than an ordinary student club, and every year many alumni come out of it to found their own companies: Yin Le, partner at Zhen Fund; Zhang Yutong, former partner at Jinshajiang; Li Zhifei, CEO of Mobvoi; Mao Wenchao, founder of Xiaohongshu; and so on. After joining, we met many entrepreneurs and investors day to day and also ran entrepreneurial events in the Bay Area. During this period we came to understand the overall landscape of entrepreneurship and became more inclined to build something of our own.
That was the earliest seed. As our technical skills and understanding of entrepreneurship deepened, we did some side projects, all closely tied to this wave of generative AI. In fact, we noticed the trends in generative AI very early. For example, when GPT-3 first came out, Jasper built on its API and reached $50 million in revenue. So we focused our energy on generative AI. The initial idea was application-oriented: set the core technology aside at first and use existing technology to build good products, for instance by calling GPT-3's API or open-source models such as Stable Diffusion.
But our thinking later shifted, which is also part of why we chose edge AI.
At that time, we analyzed the entire generative AI market. First, there are now many application companies, covering email generation, marketing, AI interview tools, and more, with perhaps hundreds of similar products in each vertical. The space is bloated, and with so many competitors and no technical barriers, long-term profitability is doubtful.
That was our read on the market, and the fierce competition was the main reason we changed course: we wanted to work on things with higher technical barriers. In addition, Zack had by then spent four years working on on-device AI and had accumulated deep industry insight. When we analyzed this field, we found that while everyone was chasing ever-larger cloud models, there were actually very good opportunities on the device side.
We weighed two trends at the time:
First, as algorithms keep improving, more and more of what large models do can be handled by small models. GPT-3 launched with 175B parameters, but a state-of-the-art 7B model today can basically match GPT-3 in many respects. OpenAI's own models are actually getting smaller; as far as we know, GPT-3.5 is smaller than GPT-3. This trend comes from better algorithms and squeezing more out of the data.
Second, on-device computing power keeps improving. As the chips in computers and phones evolve, they can support local deployment of somewhat larger models. So those are the two broad trends.
Later, we also did some field research. In January this year the whole company went to Las Vegas for CES, where we saw many examples of local AI model deployment; Qualcomm, for example, has been pushing model deployment onto all kinds of edge chips.
Silicon Star: So improvements in algorithms and computing power convinced you small models were feasible, you saw the market firsthand at CES, and you finally decided to pivot to edge AI.
Alex Chen: Yes.
Small models can solve 99% of the problems
Silicon Star: Do you think scaling laws are outdated now?
Alex Chen: Scaling laws are not outdated, and I believe they still hold for most players.
Silicon Star: Compared with large models, where are the opportunities for small models?
Alex Chen: That's a very good question, and it comes back to the scaling laws we just mentioned. Under scaling laws, the larger the model, the stronger its overall capability, but that is an across-the-board improvement. Take the MMLU benchmark as an example: a large model may be strong across MMLU subjects such as Chinese, mathematics, and English. In many cases, though, you don't need strength everywhere, only excellence in specific areas. Our company has small models focus on specific domains, becoming particularly good at mathematics, say, or particularly good at law. For people working in math or law, that is enough; they don't need an especially large model to solve their problems.
Another point: as scaling laws keep pushing the boundaries of models, the remaining 1% of particularly hard problems they unlock may not be ones you encounter in daily life. For example, I could use a trillion-parameter GPT-4 to answer "1+1=2", yet GPT-2 answers that perfectly well, and the two may differ in parameter count by a factor of thousands to ten thousand. When models at opposite ends of the scale give the same answer, the small model is decisively better on speed and power consumption.
To sum up the advantages of a small model: first, it is faster and more power-efficient. Second, on-device deployment is essentially free, because it uses local compute. More importantly, it fully protects personal privacy. For example, we have a large software customer whose app processes identity documents: ID cards, driver's licenses, and similar image data. That cannot go through a cloud API because of privacy, so a local model has to implement the whole pipeline.
Silicon Star: What makes a useful small model?
Alex Chen: First, it must be fast. Second, it must be comparable to large models in the areas users care about. Third, it must deploy fully and easily on local hardware, while guaranteeing privacy and keeping costs very low.
Functional Tokens solve small-model function calling and "beat" GPT-4o
Silicon Star: What is the current product framework of NEXA?
Zack Li: Let me take this one. Our customers include both developers and large enterprises. For enterprise customers we provide an end-to-end solution. For example, an e-commerce company gave us a clear requirement: automate outreach emails to influencers with potential for business cooperation. Our model meets that need; we help them deploy it through the accompanying SDK and deliver a usable product that slots into their workflow. That said, our products are quite general, so there is relatively little customization.
For developers, they can browse our Model Hub for the models they want, such as ones for e-commerce or travel scenarios, and run them locally through our SDK. Besides Octopus, we also support classic, standard open-source on-device models such as the Gemma series and the Phi series.
Alex Chen: Our applicable scenarios cover all the problems mentioned above, except the 1% of particularly hard problems that only large models can currently solve. Emotional companionship, helping you write emails, polishing articles: all of these can be handled by a small model deployed locally. Every language-model use case that is not that difficult but meets everyday needs is something our product can put in people's hands.
Beyond that, the most powerful capability we provide, and the biggest highlight of the Octopus model, is strong function calling.
Silicon Star: That was my next question: what are NEXA's core technological advantages?
Alex Chen: Yes. Our uniqueness is that a very small, locally deployed model can match the function calling of a large model. It converts the user's natural language into executable commands. For example, if you want to buy a Samsung phone on Amazon, you just type the request into a dialog box, and it automatically opens Amazon and enters the Samsung phone query, sparing you many steps in the graphical interface. In effect, Octopus turns many GUI interactions into natural-language interactions.
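To make this concrete, here is a minimal, hypothetical sketch of what on-device function calling looks like from the host application's side: the model emits a structured call instead of free-form text, and the app dispatches it. The function name, schema, and output format below are illustrative assumptions, not Nexa's actual interface.

```python
import json

# Illustrative device action the model is allowed to trigger (hypothetical).
def open_shopping_search(app: str, query: str) -> None:
    print(f"Opening {app} and searching for: {query}")

# Registry mapping function names to host-side implementations.
FUNCTIONS = {"open_shopping_search": open_shopping_search}

# Suppose the local model turns "Buy a Samsung phone on Amazon" into this
# structured output instead of a paragraph of text:
model_output = '{"name": "open_shopping_search", "arguments": {"app": "Amazon", "query": "Samsung phone"}}'

call = json.loads(model_output)
FUNCTIONS[call["name"]](**call["arguments"])  # executes the GUI-free action
```

The point of the design is that the model's job ends at producing the structured call; the surrounding app stays in control of what actually runs.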
Silicon Star: Your paper proposes the innovative concept of the Functional Token. Can you explain it, and how it optimizes the AI inference process?
Zack Li: In the past, approaches based on RAG (retrieval-augmented generation) worked like this: when a query came in, relevant information was first retrieved from API documentation or a database, then fed as context to a large model for decision-making. Retrieval takes time, and the model must process a large number of semantic tokens. Because the context window grows so long, inference becomes very slow, especially on devices with limited compute and size, which caps both model accuracy and response speed.
Our solution is to output directly from an end-to-end model. We introduced the concept of the Functional Token for the first time: a single token represents an entire function's information, including its name, parameters, and documentation, cutting context length by 95%. When a user enters a natural-language instruction, the system skips the cumbersome retrieval step, quickly identifies the key points of the task, triggers the corresponding Functional Token, and directly generates the required output or executes the specific function call.
On the output side, since the Functional Token replaces the full function expression, the output can generally be kept within 10 tokens, which is far more concise. This saves substantial compute and context space while greatly improving processing speed, making it especially suitable for mobile or edge devices in scenarios that demand fast responses.
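As a rough sketch of the mechanism just described: each callable function gets one dedicated token added to the vocabulary during fine-tuning, so a generation is just that token plus arguments, and the host decodes it with no retrieved documentation in the context. The token names and call format below are assumptions for illustration, not the exact format from the paper.

```python
# One reserved token per function, learned during fine-tuning (illustrative).
FUNCTIONAL_TOKENS = {
    "<nexa_0>": "take_photo",
    "<nexa_1>": "set_alarm",
    "<nexa_2>": "open_shopping_search",
}

def decode_call(generation: str) -> tuple[str, str]:
    """Parse a short generation like '<nexa_2>("Amazon", "Samsung phone")'.

    Because the single token stands in for the function's name, signature,
    and docs, the whole output stays within a handful of tokens.
    """
    token, _, rest = generation.partition("(")
    return FUNCTIONAL_TOKENS[token.strip()], rest.rstrip(")")

name, args = decode_call('<nexa_2>("Amazon", "Samsung phone")')
print(name, args)  # open_shopping_search "Amazon", "Samsung phone"
```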
Silicon Star: How does it perform in actual verification?
Zack Li: GPT-4o is a very large model, with parameters at the trillion scale, and it runs inference on multi-GPU clusters, while we used only a single A100 for the comparison. Even under such lopsided hardware conditions, our Octopus v2 model is still 4 times faster than GPT-4o.
Silicon Star: Octopus v2 got a strong response on X. I see you also have Octo-net, Octopus v3, and Octo-planner. Do these models have distinct strengths, or are they a series of iterations?
Zack Li: v2, v3, and the planner are one line of iterations: v3 adds multimodal capability and the planner adds multi-step planning. Octo-net is a branch that supports device-cloud collaboration.
Silicon Star: What is the current level of capabilities of your most advanced model?
Zack Li: Our v3 model is the latest for enterprise use, and it supports multimodality at under 1B parameters. Excellent on-device model companies may be emerging at home and abroad, but no competitor achieves multimodality below 1B while matching our function calling accuracy; we haven't seen one below 2B yet.
Make a "hugging face for the client side"
Silicon Star: In fact, beyond startups, many giants like OpenAI, Google, and Meta have also begun pushing into small models. Do you feel threatened?
Zack Li: Of course, we can feel that competition is fierce. But first of all, we hold a sharp weapon: function calling, the hardest problem for on-device models. At the same time, we keep using Model Hub to bring more developers on board, which is essentially the Hugging Face route. So even as on-device models become a crowded race, we build good models and a good platform so more developers can use them. That is our differentiation.
Alex Chen: Actually, what we really want to build is an on-device version of Hugging Face. Hugging Face is an AI research community for cloud developers: its model search and usage frameworks are built around Python and NVIDIA GPUs and serve server-side developers. What is different about us is that we deploy models locally, so both the model file formats and the software support needed for deployment differ. For example, Hugging Face centers on Python; we use C or C++. Those are the core differences.
You can see that we offer software libraries such as our SDK, our own Octopus models, and support for locally deploying other small models from the likes of Microsoft and Google. Here is how we think about the whole thing: in the cloud, the two archetypal, most valuable companies are OpenAI and Hugging Face. On the device side, we are like a combination of the two: we build on-device models ourselves, and through the platform we also help everyone else use on-device models.
Accordingly, our future business model is to maintain this on-device AI community, earn subscription revenue from on-device developers, and provide enterprise services to the companies behind those developers.
Silicon Star: So on your platform, users can not only use Octopus but also find edge AI models released by many individual developers and companies.
Zack Li: Yes. The platform is just starting to build momentum. We tested it in May with more than 1,000 developers, then kept polishing it internally to prepare for the official launch. We hope to introduce the product to more people and are providing a test link to gather feedback.
Model Hub will become the main NEXA AI website. The core product is a platform where you can find the on-device models you need; our earlier research demonstrates our independent R&D capability, and there is also an entry point for enterprises.
In Model Hub you can see on-device models from various companies. Because the device side is what we know best, we focus on the formats commonly used there, such as GGUF and ONNX. For example, we can quantize Meta's Llama 3.1-8B to different precisions, such as int4 and int8. A compressed model like that is specifically suited to running on-device, unlike PyTorch and Python workflows in a cloud environment.
Consumer-grade GPUs top out at around 24 GB of memory, so developers cannot run a full-size model locally. We help publishers batch-compress and quantize models. We also provide SDK tools that let users easily run models of various modalities on a laptop or phone, complete with a UI, relying entirely on local compute and running very fast.
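For a sense of the workflow this enables, here is a hedged sketch of running an int4-quantized GGUF model locally with the open-source llama-cpp-python package, the kind of step a Model Hub SDK would wrap. The model filename is a placeholder, and Nexa's own SDK calls may differ.

```python
from llama_cpp import Llama

# Load a quantized GGUF file; the path is a placeholder for any int4
# quantization of a small model downloaded from a model hub.
llm = Llama(
    model_path="./llama-3.1-8b-instruct.Q4_K_M.gguf",
    n_ctx=2048,       # modest context window to fit edge-device memory
    n_gpu_layers=-1,  # offload all layers if a local GPU is available
)

# Inference runs entirely on local hardware: no internet, no API fees.
out = llm("Draft a short outreach email to a fashion influencer.", max_tokens=128)
print(out["choices"][0]["text"])
```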
It's just like Hugging Face, which became popular because of its transformers package: you can not only find models there but also run them and build on them. That is the core of its user retention, right? We built the on-device equivalent.
Entrepreneurship is all about products
Silicon Star: The next question may have come up before. Investors always ask "why you?" So what gives you confidence that your target customers will choose NEXA over other competitors?
Zack Li: The first source of confidence is the model advantage: our model's function calling accuracy is very high and its footprint is very light. The second is the deployment advantage: we can tailor acceleration solutions to users' different hardware, operating platforms, memory, and overhead. In other words, not only is our model better than others, we also have a framework that helps them deploy it better.
Silicon Star: Do these advantages apply when facing OpenAI or Google?
Zack Li: I think OpenAI will not directly enter the on-device model field for a long time; its GPT-4o mini is still a cloud model. Google might. Google certainly has advantages in talent and hardware, plus its own ecosystem. But it is hard to imagine Google serving customers outside the Android ecosystem, especially on device hardware beyond its own Pixel line, and it will not build something like Model Hub.
Silicon Star: Can you share the latest product progress and the next optimization direction?
Zack Li: Besides the Model Hub and SDK mentioned above, we have a pipeline of research ahead. Compressed models that support long-text processing are under development. Going forward we will serve different scenarios; the device side has many of them. Function calling is one; there are also capabilities such as question answering and multimodal skills such as image understanding and audio processing. These directions will be our focus.
Silicon Star: As an edge AI startup, where do your challenges come from?
Zack Li: They include, but are not limited to, the large companies. They can build their own on-device models; in particular, those with experience developing trillion-scale models can reuse much of it through distillation or pruning. But we have our own unique insight and understanding of on-device models, so each side has its own advantages.
Then there are the established community players; Hugging Face is a good example. If they move onto the device side, that will be a challenge for us. But today Hugging Face's entire ecosystem, including all of its past architecture, is cloud architecture, and all of its services are cloud services, so I think the transition would be painful for them. And if they treat it as a side project, the momentum and speed will not be there.
Silicon Star: You combined on-device models with a community and entered the market fairly early. Have you held offline developer events to promote it?
Zack Li: Alex and I currently have to spend a lot of time on model training and infra-related work, so our product and marketing colleagues run the events, and we have accumulated plenty of resources in the Bay Area over the years. On August 25, Nexa will co-host a hackathon at Stanford with Hugging Face, StartX, Stanford Research Park, Groq, and AgentOps. It is our first offline event, and everyone is welcome to come take a look.
Scene from the Super AI Agent Hackathon hosted by Nexa AI. Image source: NEXA AI
Silicon Star: Two last questions. After so many years in Silicon Valley, are there any companies or people that you particularly admire?
Zack Li: I'd still pick Elon Musk. He has a saying, "tough and calm": hold things to high standards and stay calm in the face of huge difficulties. I am working to improve myself in that direction. And think about it: he runs that many companies at once, and each one has its own methods for tackling different challenges. I think he has long-term vision and strong execution.
But if I'm being more down-to-earth, I actually prefer Lei Jun. I'm from Hubei, and Lei Jun is from Xiantao, Hubei. He is hardworking and approachable, and he thinks through problems hands-on; he has a typical developer's temperament. Whether as an executive, an investor, or an entrepreneur, he is excellent.
Silicon Star: What is your biggest feeling since starting your business?
Zack Li: I think entrepreneurship is all about the product. The market gives the fairest, most impartial feedback, so getting things done matters most. You must have long-term goals and insist on doing things that are hard but right. For example, some of our company's earliest work was very product-oriented, without much underlying innovation. The fundamental reason we suddenly gained so much traffic and momentum is that we optimized the underlying on-device model, proposed an unprecedented training method, and published papers and filed for patent protection. Without that technology, it would have been impossible to stand out and reach our current influence. As for so-called wrapper companies, I deeply feel they have almost no way to break through unless they have extremely strong product insight.
Silicon Star: So what kind of company do you think Perplexity is?
Zack Li: It has extremely strong insights into products.
*Nexa AI's latest on-device AI model community, Model Hub, went live on its official website on August 22. Direct link: https://www.nexaai.com/models
This article comes from the WeChat public account "Silicon Star Pro", author: Jessica, published by 36Kr with authorization.