Goodbye, Devin: Genie, the most powerful "AI engineer," is born, built on GPT-4o.

36kr
08-13

The crown of AI coding has changed hands again. Genie surpasses Devin to become the most powerful "AI software engineer" on the planet. Genie is not a programming assistant, but a "colleague" who can think independently and fight alongside you.

You may remember Devin, the first "AI programmer," launched in March this year by Cognition AI, a startup whose team holds 10 IOI gold medals.

It runs on GPT-4 on the backend, accepts natural-language instructions as text, and writes code autonomously.

When it first launched, it could not replace programmers, but it still made a strong impression.

Five months on, the fast-moving GenAI field looks very different. GPT-4 has been succeeded by GPT-4o, and newly released models such as Claude 3.5 Sonnet and Codestral show excellent coding performance.

Devin’s direct challenger is Genie, an autonomous AI engineer developed by the startup Cosine.

A report released by Cosine shows that Genie scored 30.08% on SWE-Bench, well ahead of Devin's 13.8%.

Alistair Pullen, co-founder and CEO of Cosine, said: "The capabilities of the (Genie) model cannot be summarized by a benchmark score: it is trained from the beginning to think and act like a human software engineer (SWE)."

"I’m happy to share that we have built the strongest AI software engineer in the world, scoring 30.08% on SWE-Bench, ahead of Amazon and Cognition."

Since the CEO claimed that Genie can think and act like a human software engineer, netizens joked, "You mean it can't talk to women and it will sweat if you call it?"

01 What is Genie? What can it do?

Similar to Devin, Genie can also autonomously complete various coding tasks under the guidance of human engineers, including bug fixing, feature building, code refactoring, and code verification through comprehensive testing.

In addition to operating autonomously, Genie can also collaborate with users.

Currently, Genie is still in the internal testing stage, and you can apply for a trial after registering your information on the official website.

Cosine claims Genie can simulate the cognitive processes of human engineers.

“My idea was simple: have it observe how human engineers do their work and mimic that process,” Pullen explained in a blog post.

The code generated by Genie is stored in the user's GitHub repo, so Cosine does not keep a copy and avoids the security risks that keeping one would bring.

Cosine’s software platform also integrates with Slack and system notifications, which Genie can use to alert users, ask questions or flag issues, just as a human colleague would.

“Genie can also ask users clarifying questions and respond to comments on the Pull Requests it generates.”

“We’re trying to make Genie behave like a colleague, so it makes the most sense for the model to use the channels of a colleague,” Pullen said.

Working with OpenAI and using the latest GPT-4o

Unlike many products that rely on a base model supplemented by a few tools, Genie was developed through a proprietary process that involved training and fine-tuning a model from OpenAI.

When Genie was first developed, it could only be fine-tuned on models with relatively small context windows of 16k to 32k tokens.

In early exploration, the team found that even with a large dataset of more than 100 million tokens, a well-designed architecture, and various compression/chunking methods, Genie was still limited by the amount of information the model could express at any one moment. The only way forward was to use a model with a larger context window.

Fortunately, they gained access to OpenAI’s long-context model shortly afterwards, which proved to be a breakthrough for Genie’s capabilities.

“Genie is (currently) a non-general-availability GPT-4o variant that OpenAI has allowed us to train as part of an experimental access program,” Pullen told VentureBeat.

“The model performed well, and we shared our learnings with the fine-tuning team and engineering leadership at OpenAI. This was a real turning point for us because it convinced them to invest resources and attention in our new technique.”

Although Cosine did not name the specific model, OpenAI recently announced limited availability of a GPT-4o long-output model, with an output length of up to 64k tokens, a 16-fold increase from the original 4k.

Training data is key

In recent training runs, Genie was trained on billions of tokens, with the mix chosen to make the model as competent as possible in the languages current users care about most, Pullen wrote in a technical report.

The technical report lists the 15 languages in the training data, including popular ones such as Java, JS, C, C++, C#, Rust and Python, as well as Scala, Kotlin, Swift, PHP and others.

JavaScript, Python, TypeScript and TSX have the largest share of the dataset; each of the remaining languages accounts for 3%.

Cosine's blog post stated that the team spent nearly a year compiling the dataset, which includes a large amount of software development activities from real engineers.

It is extremely difficult to obtain and effectively use this data because, essentially, it does not exist.

Their data pipeline starts by tracking the development trajectories of software engineers, collecting data such as pull requests, commits, and issues from MIT-licensed OSS repositories.

This data is then run through the pipeline to forensically derive the reasoning process and reconstruct how humans reached the final conclusion.
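
The collection step described above can be pictured with a small sketch. Cosine has not published its pipeline, so everything here is an assumption made for illustration: the `Event` record, the `build_trajectories` function, the license filter, and the sample data are all hypothetical. The later step of reconstructing the engineer's reasoning is not modeled at all.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One hypothetical development artifact from a public repository."""
    repo: str
    license: str
    kind: str        # "issue", "commit", or "pull_request"
    timestamp: int   # any sortable time value, e.g. Unix time
    text: str


ALLOWED_LICENSES = {"MIT"}  # assumption: only MIT-licensed repos are kept
ALLOWED_KINDS = {"issue", "commit", "pull_request"}


def build_trajectories(events: list[Event]) -> dict[str, list[Event]]:
    """Keep permitted events and order them chronologically per repository."""
    by_repo: dict[str, list[Event]] = defaultdict(list)
    for event in events:
        if event.license in ALLOWED_LICENSES and event.kind in ALLOWED_KINDS:
            by_repo[event.repo].append(event)
    return {
        repo: sorted(repo_events, key=lambda e: e.timestamp)
        for repo, repo_events in by_repo.items()
    }


# Made-up sample data, not taken from any real repository.
sample = [
    Event("acme/api", "MIT", "pull_request", 30, "Fix null check"),
    Event("acme/api", "MIT", "issue", 10, "Crash on empty payload"),
    Event("acme/api", "MIT", "commit", 20, "Guard against empty payload"),
    Event("other/lib", "GPL-3.0", "commit", 5, "Unrelated change"),
]
trajectories = build_trajectories(sample)
for event in trajectories["acme/api"]:
    print(event.kind, "-", event.text)
```

Ordering the artifacts by time is what turns a pile of issues, commits and pull requests into something closer to a trajectory from problem report to merged fix.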

(Figure: the proportion of various task types in the dataset)

This proprietary dataset served as the basis for training the first version of the model, and the rest of the work was done through self-play and self-improvement.

Genie’s autonomy loop consists of four main processes: planning, searching, writing code, and running code. These are not novel in themselves, but they are improved upon because Genie is trained to perform tasks like a human.
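
The four-step loop can be sketched in a few lines. This is a generic illustration of a plan, search, write, run cycle, not Cosine's implementation: `TaskState`, `run_loop`, the retry limit, and the stub steps are all hypothetical names and choices, and in a real system each step would call a model or a test runner.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TaskState:
    """Hypothetical working state carried through one task."""
    issue: str
    plan: list[str] = field(default_factory=list)
    context: list[str] = field(default_factory=list)
    patch: str = ""
    passed: bool = False
    log: list[str] = field(default_factory=list)


def run_loop(
    state: TaskState,
    plan: Callable[[TaskState], list[str]],
    search: Callable[[TaskState], list[str]],
    write: Callable[[TaskState], str],
    run: Callable[[TaskState], bool],
    max_rounds: int = 3,
) -> TaskState:
    """Cycle through plan -> search -> write -> run until the run step passes."""
    for round_no in range(1, max_rounds + 1):
        state.plan = plan(state)
        state.context = search(state)
        state.patch = write(state)
        state.passed = run(state)
        state.log.append(f"round {round_no}: passed={state.passed}")
        if state.passed:
            break
    return state


# Stub steps standing in for model calls and a test runner.
def stub_plan(state: TaskState) -> list[str]:
    return [f"locate code related to: {state.issue}", "edit", "verify"]


def stub_search(state: TaskState) -> list[str]:
    return ["src/app.py"]


def stub_write(state: TaskState) -> str:
    return f"fix-v{len(state.log) + 1}"


def stub_run(state: TaskState) -> bool:
    return state.patch == "fix-v2"  # pretend the second attempt passes


result = run_loop(
    TaskState(issue="crash on empty input"),
    stub_plan, stub_search, stub_write, stub_run,
)
print(result.log)
```

Feeding the result of the run step back into the next round of planning is what makes this a loop rather than a single pass.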

“The impact of data annotation should not be underestimated. It is very difficult to obtain high-quality data from capable software engineers, but the results are worth it because it gives us insights into the way developers think about solving problems that are not easily discovered.”

This dataset not only embodies perfect information context and progressive knowledge discovery, but also captures the step-by-step decision-making process of human engineers.

Pullen asserts, “By actually training our model using this dataset, rather than simply prompting a base model (which is what others are doing), we found that we were no longer just randomly generating code, but approaching the problem like a human would.”

Benchmark Assessment Results

During the model development process, the team mainly used two benchmarks for evaluation - SWE-Bench and HumanEval.

The former covers a more comprehensive range of problems, including decomposing problems, finding relevant code, classifying code, and implementing workable solutions; the latter focuses more on writing code, involves no retrieval, and places less emphasis on understanding the problem.

However, the official blog disclosed only the SWE-Bench results: Genie scored 30.08% on SWE-Bench and 50.67% on SWE-Lite.

Genie's SWE-Bench result is especially impressive: it is the highest score so far, more than 10 percentage points ahead of the second-place score of 19.27%.
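
The lead over second place can be read either as an absolute gap in percentage points or as a relative improvement, and the two are easy to confuse. A minimal sketch using only the two scores quoted above (the function names are illustrative):

```python
def point_gap(top: float, runner_up: float) -> float:
    """Absolute gap between two benchmark scores, in percentage points."""
    return top - runner_up


def relative_gain(top: float, runner_up: float) -> float:
    """Relative improvement of `top` over `runner_up`, as a percentage."""
    return (top - runner_up) / runner_up * 100


genie, second_place = 30.08, 19.27  # SWE-Bench scores quoted in the article
print(f"{point_gap(genie, second_place):.2f} percentage points")      # 10.81
print(f"{relative_gain(genie, second_place):.1f}% relative improvement")  # 56.1
```

The absolute gap is about 10.81 percentage points, which corresponds to a relative improvement of roughly 56% over the second-place score.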

In addition, the team also tested the model's information retrieval capabilities in isolation, specifically its ability to retrieve the correct parts of a desired code file.

This is one of the core components of an AI engineer - if the model cannot reliably and skillfully find the right code to edit, then the ability to edit code cannot be fully utilized.

A simple measure of retrieval ability compares the number of lines of code the model needed to find to complete the task with the number it actually found.

In the test, Genie retrieved 91,475 of the 142,338 required lines of code, a score of 64.27%. There is clearly room for improvement here; compared with problem decomposition, retrieval is an aspect that has received less attention.
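
The retrieval score is a plain ratio of lines found to lines required. A minimal sketch that reproduces the reported figure; the function name `retrieval_score` is an assumption for illustration, and the two line counts are the ones given in the article:

```python
def retrieval_score(lines_found: int, lines_needed: int) -> float:
    """Share of the required lines that were actually retrieved, in percent."""
    if lines_needed <= 0:
        raise ValueError("lines_needed must be positive")
    return lines_found / lines_needed * 100


# Line counts reported for Genie's retrieval test.
score = retrieval_score(lines_found=91_475, lines_needed=142_338)
print(f"{score:.2f}%")  # 64.27%
```

Because the ratio counts only required lines that were found, it says nothing about how much irrelevant code was retrieved along the way.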

02 Backed by YC, with an Oxford-educated Chinese co-founder

Cosine came out of Y Combinator, the well-known Silicon Valley startup accelerator.

The company is a human reasoning lab focused on studying and codifying the way humans perform tasks with the goal of teaching artificial intelligence to imitate, excel at, and scale those tasks.

Alistair Pullen, Sam Stenner and Yang Li co-founded Cosine in 2022.

They hope to start with software engineering, studying and codifying the way humans perform tasks so as to teach AI to imitate, excel at, and scale those tasks, and to advance the development of intelligence.

Cosine has raised $2.5 million in seed funding from Uphonest and SOMA Capital, with participation from Lakestar, Focal and others.

With a small but highly skilled team, Cosine has made great strides in the field of artificial intelligence, and Genie is just the beginning.

“We truly believe we can reproduce human reasoning for any job and industry,” Pullen said in the announcement blog post.

“Software engineering is just the most intuitive starting point, and we can’t wait to show you everything else we’re working on.”

Notably, the founding team includes a Chinese co-founder, Yang Li.

Li graduated from the Department of Sociology at Oxford University and was named to the Forbes 30 Under 30 list in 2021.

Before founding Cosine, he had held six jobs or entrepreneurial roles, including business director of Meituan’s Mobike business.

Before 2022, in other words, Yang Li moved on to a new opportunity roughly once a year.

Yang Li now describes himself on his Twitter profile as having been through 1 IPO, 2 acquisitions and 3 unicorns.

The IPO refers to his time growing Mobike's monthly active users to 220 million, which was followed by an IPO valued at US$55 billion.

03 The Future of Genie

Pullen revealed the pricing model that Genie may adopt in an email to VentureBeat. In the early stages, product pricing will be divided into two types:

The first tier is for individuals and small teams. It is priced competitively with existing AI tools, at about $20, and comes with some limits on features and usage.

The second tier is for enterprises. It offers more features and almost unlimited usage, and can create a perfect AI colleague and code expert. This tier will cost more.

“We’ve been pursuing a dream of creating an artificial colleague that can truly automate end-to-end programming tasks without intervention and with high reliability. Genie is the first step in realizing this dream,” Pullen wrote in a blog post on Cosine.

The launch of Genie has far-reaching implications for software development teams, especially those looking to increase productivity and reduce time spent on routine tasks.

With the ability to autonomously handle complex programming challenges, Genie has the potential to change the way engineering resources are allocated, allowing teams to focus on more strategic initiatives.

“The idea that engineering resources were no longer a constraint was a huge driver for me, especially after starting a company,” Pullen wrote.

Artificial intelligence can jump into an unknown code base and solve unknown problems several times faster than humans. Its value is self-evident and will have a huge impact on the world.

Cosine has ambitious plans for Genie's future development.

“We are accelerating and revolutionizing the technology team through Genie. Our main goal is to balance real products with cutting-edge research.”

- Improve the data set to enhance Genie's capabilities. By broadening the data and introducing new features, Genie will be proficient in more programming languages and the latest frameworks to accurately meet the work needs of developers.

- Expand its model portfolio, including small models for simple tasks and large models capable of handling more complex challenges. Its unique dataset will enable Cosine to convert any state-of-the-art base model into a Genie model.

- Extend work to the open source community, for example by extending the context window of a leading open source model and pre-training it on a large dataset.

- Fine-tune Genie with a specific code base. This is an enterprise feature that enables Genie to have a perfect understanding of large, legacy code bases, even if the code is written in less popular or proprietary languages.

Pullen said that as the company continues to refine Genie, it will continue to release updates to customers, optimize interactions with this artificial colleague and collect valuable feedback.

Li wrote on Twitter that Cosine aims to codify human reasoning, and that the future will have no more oversampling and no more copilots.

References:

https://venturebeat.com/programming-development/move-over-devin-cosines-genie-takes-the-ai-coding-crown/

https://cosine.sh/blog/genie-technical-report

https://cosine.sh/blog/state-of-the-art

This article comes from the WeChat public account "Xinzhiyuan" , author: Xinzhiyuan, published by 36Kr with authorization.
