Think GPT generates content too slowly? A company is finally solving that problem

36kr · 02-21

As soon as I got to work today, I came across something new from a company overseas.

This company, called Groq, has launched a chatbot page that looks rougher than ChatGPT's. There are no instructions on the page, and at first glance it doesn't spark much interest.

Until I watched the demo video below...

It's hard to believe this is the speed at which AI "generates" content; it looks as if the answer had simply been found somewhere and copy-pasted.

When Mr. Silicon asked GPT a question, there was time to reply to a few WeChat messages while it was still typing out its answer...

If you watch the video above carefully, you can spot a parameter in the Groq demo that other large-model websites generally do not display: 325.68 T/s.

Groq also emphasizes this parameter on its promotion page. It means how many tokens the large model can process per second.

Let's briefly talk about what tokens do in a large model. During training, inference, and generation, text is divided into minimal units called tokens. For example, when you ask ChatGPT a question, ChatGPT first cuts your sentence into tokens and then performs its computations on them. And when ChatGPT answers, it does not output everything at once, but one token after another.

To see how ChatGPT segments text, you can refer to OpenAI's tokenizer page. For example, the sentence "I hope Jensen Huang will give me a 4090 graphics card to play Minesweeper" is split into 22 tokens.

https://platform.openai.com/tokenizer
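The token-by-token flow described above can be sketched with a toy whitespace tokenizer. This is only an illustration: real tokenizers like OpenAI's use byte-pair encoding over subwords, which is why the sentence above becomes 22 tokens rather than a simple word count.

```python
# Toy illustration of tokenization and token-by-token generation.
# Real models use subword tokenizers (e.g. BPE); whitespace splitting
# here is only a stand-in to show the flow, not how ChatGPT tokenizes.

def toy_tokenize(text: str) -> list[str]:
    """Split text into 'tokens' by whitespace (a gross simplification)."""
    return text.split()

def generate_token_by_token(answer: str):
    """Yield an answer one token at a time, the way a chat model streams."""
    for token in toy_tokenize(answer):
        yield token

tokens = toy_tokenize("I hope Jensen Huang gives me a 4090")
print(len(tokens))  # 8 "tokens" under this toy scheme
print(list(generate_token_by_token("Groq is fast")))
```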

According to evaluation data from ArtificialAnalysis.ai, the Mixtral 8x7B endpoint provided by Groq set a new large-model throughput record of 430 tokens per second.

Of course, a full picture of how fast Groq is requires comparison on several fronts. There is a test on GitHub measuring how fast a 70B model runs on different platforms, and Groq comes out far ahead in both tokens generated per second and response latency.
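Throughput translates directly into waiting time. A quick back-of-envelope calculation makes the gap concrete (the 430 T/s figure is from the article; the 30 T/s figure is an assumed ballpark for a typical GPU-backed service, used here only for contrast):

```python
# Rough wall-clock time to stream an answer at a given throughput.
# 430 tokens/s is the Groq figure cited above; 30 tokens/s is an
# assumed ballpark for a typical GPU-backed API, for contrast only.

def generation_time(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds to stream num_tokens at a steady throughput."""
    return num_tokens / tokens_per_second

answer_tokens = 1000  # a fairly long answer
print(f"Groq @ 430 T/s: {generation_time(answer_tokens, 430):.1f} s")  # 2.3 s
print(f"GPU  @  30 T/s: {generation_time(answer_tokens, 30):.1f} s")   # 33.3 s
```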

In martial arts, only speed is unbeatable, and the same holds for large models. Not long ago there was an online discussion about whether ChatGPT had slowed down; some said OpenAI was throttling free users. Whatever the actual reason, it shows that generation speed is a real pain point for users.

Imagine an e-commerce company that introduces AI customer service to improve user experience. With exactly the same reply, the user's experience is very different depending on whether it arrives within a second or after ten seconds or so.

The same goes for AI livestreaming, AI writing, and so on. In real applications of large models, generation speed clearly matters a great deal.

That said, Groq's answer accuracy is genuinely worrying. It basically fails on even slightly complex questions, a bit like the "nonsense-spouting auntie" who recently went viral.

However, Groq does not sell large models, it sells AI chips.

To put it simply, the point they want to promote is, "With my chip, your model can generate content so fast."

Groq even called out Jensen Huang directly, claiming its chip runs inference 10 times faster than Nvidia's!

Groq's self-developed chip is called LPU.

According to the official website, Groq is a generative AI solutions company and the creator of the LPU inference engine, the fastest language processing accelerator on the market.

It is built from the ground up to achieve low latency, energy efficiency, and repeatable inference performance at scale. Customers rely on the LPU inference engine as an end-to-end solution to run large language models (LLMs) and other generative AI applications up to 10x faster.

In other words, any model running on LPU can be improved in speed.

To promote its LPU, Groq even called out AI-industry heavyweights, Meta's Zuckerberg and OpenAI's Altman, on its official website.

On the technical side, according to the official website, the LPU aims to overcome two major LLM bottlenecks: compute density and memory bandwidth.

For LLMs, the LPU offers more compute than GPUs and CPUs, reducing the time needed to compute each word and allowing text sequences to be generated faster. In addition, eliminating the external-memory bottleneck lets the LPU inference engine outperform GPUs on LLMs by orders of magnitude.

According to Twitter users, the main reasons the LPU is faster than a GPU are its memory technology and architecture design.

The LPU uses SRAM (Static Random Access Memory) instead of the HBM (High Bandwidth Memory) commonly used in GPUs. SRAM's access speed is roughly 20 times that of HBM, which lets the LPU fetch and process data faster. In addition, the temporal instruction set computer architecture the LPU adopts reduces the need for repeated memory accesses, further improving processing efficiency.

In plain terms, here is a vivid analogy:

Think of the LPU and the GPU as two chefs. The LPU chef has an efficient toolbox (SRAM) holding all the ingredients he needs, so everything is within arm's reach. The GPU chef keeps his ingredients in a large warehouse (HBM): every time he needs something, he has to walk over to fetch it. Even though the warehouse is big and holds a lot (high bandwidth), the round trips slow down the whole cooking process.
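One can also put rough numbers on why memory access dominates: single-stream LLM decoding is largely memory-bandwidth bound, since generating each token requires reading roughly all the model weights once. That gives a crude upper bound of tokens/s ≈ bandwidth ÷ model size. The bandwidth and model-size figures below are illustrative assumptions, not Groq's or Nvidia's actual specs:

```python
# Crude bandwidth-bound estimate: each generated token requires reading
# (roughly) all model weights once, so tokens/s <= bandwidth / model_bytes.
# All numbers below are illustrative assumptions, not vendor specs.

def max_tokens_per_second(bandwidth_gb_s: float, params_billion: float,
                          bytes_per_param: int = 2) -> float:
    """Upper bound on decode throughput for a memory-bound fp16 model."""
    model_gb = params_billion * bytes_per_param  # weight size in GB
    return bandwidth_gb_s / model_gb

# Assumed: ~2000 GB/s for HBM on a single GPU vs. a much higher aggregate
# figure for on-chip SRAM spread across many LPU chips.
print(f"70B model, 2000 GB/s (HBM)  : {max_tokens_per_second(2000, 70):.0f} tok/s")
print(f"70B model, 40000 GB/s (SRAM): {max_tokens_per_second(40000, 70):.0f} tok/s")
```

The point of the sketch is only that, once compute is plentiful, whoever reads weights faster wins, which is exactly the chef-and-warehouse analogy above.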

If SK Hynix saw HBM being dismissed as not good enough like this, wouldn't it be anxious to death?

Beyond the LPU's technology, the team behind Groq also has a strong background.

Groq did not come out of nowhere; Google's shadow looms behind its founding (and combined with Sora, another recent hot topic, one can't help feeling a bit sorry for Google).

Groq was founded in California in 2016 by former Google employee Jonathan Ross, who was also one of the earliest members of Google's TPU team.

For Google, TPUs cover most of its computing needs. Gemini, the most powerful and versatile AI model Google has announced, is reportedly trained and served on TPUs.

Back to the model itself: generally speaking, changes in computing hardware should only affect inference speed. But since large models involve enormous amounts of computation, small differences in numerical precision can accumulate. So, compared with a GPU, does Groq's LPU affect the quality of what the model generates?

Mr. Silicon put the same question, "Introduce Elon Musk in 100 words," to Llama-2-70b on Groq and Llama-2-70b on Poe.

The outputs on the two platforms were similar in quality; the results differed slightly, but both were basically fluent.

Groq currently supports API access and offers three models: Llama 2 70B, Llama 2 7B, and Mixtral 8x7B SMoE. Pricing is also quite cheap: for Llama 2 70B, input and output cost $0.70 and $0.80 per million tokens respectively, and Groq promises to come in below any equivalent price on the market.
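For reference, here is a per-request cost estimate at the prices quoted above, plus a sketch of what a chat request body might look like. Groq's API is reported to be OpenAI-compatible, but the model identifier and payload layout below are assumptions; no network call is made, so check Groq's own documentation for the real endpoint and schema.

```python
# Sketch: estimate request cost at the quoted Llama 2 70B prices and
# assemble an OpenAI-style chat payload. The model name and payload
# shape are assumptions based on the article, not verified against
# Groq's documentation; nothing is sent over the network here.

PRICE_PER_M_INPUT = 0.70   # $ per 1M input tokens (Llama 2 70B, from article)
PRICE_PER_M_OUTPUT = 0.80  # $ per 1M output tokens (Llama 2 70B, from article)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the quoted Llama 2 70B prices."""
    return (input_tokens * PRICE_PER_M_INPUT +
            output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000

def build_chat_request(prompt: str, model: str = "llama2-70b-4096") -> dict:
    """Assemble an OpenAI-style chat payload (hypothetical model id)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

# 500 input tokens + 1000 output tokens: well under a cent.
print(f"${estimate_cost(500, 1000):.6f}")  # $0.001150
print(build_chat_request("Introduce Elon Musk in 100 words")["model"])
```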

Sora has swept the Internet over the past few days, but other companies have not been idle either. Google released Gemini 1.5 Pro, which supports a 1,000K-token context, greatly extending the "width" of large models; Groq brings the LPU, which boosts generation speed by 10x.

Combined with earlier improvements in compute and scale, Mr. Silicon is very much looking forward to the continued evolution of large models.


This article comes from the WeChat public account "New Silicon NewGeek" (ID: XinguiNewgeek); author: Dong Daoli; editor: Zhang Zeyi; visual design: Shu Rui. Published by 36Kr with authorization.
