When a large model performs graph reasoning tasks, do we want the large model to only give results, or **do we want it to output a detailed reasoning process while giving accurate answers?**

Let’s first look at the performance of **GPT-4**:

GPT-4 gave a very short and incorrect answer, judging that the graph contains no cycle. This may stem from the model's limitations in processing long inputs, or from a misunderstanding of the graph's complex structure, and it illustrates the challenges large models face when adapting to graph theory problems.

In contrast, **GraphWiz**, developed by the HKUST team, not only gives the correct answer but also provides a clear and detailed reasoning path.

GraphWiz is designed to improve the ability of current open-source large models to solve various graph reasoning tasks:

By fine-tuning large models in a targeted manner, GraphWiz can handle graph reasoning tasks of varying complexity while outputting clear and coherent reasoning paths.

It is extremely challenging for humans to detect cycles in graphs of this size. Relying on mental calculation alone is impractical, so a person would normally need external tools or a great deal of time to complete this task.

This highlights GraphWiz's ability in spatial reasoning and memory retention. It shows that the model has effectively absorbed the basic principles of graph theory and is able to autonomously navigate and reason in large-scale and complex graph structures. GraphWiz's ability to handle complex graph problems demonstrates its great potential in practical applications.

In general, the main contributions of this article are as follows:

Created GraphInstruct, a large-scale dataset for training language models to handle graph tasks, providing clear reasoning paths and improving interpretability.

Launched GraphWiz, an open-source large language model that excels at solving various graph problems through explicit reasoning, outperforming GPT-4.

Studied the effects of training-data volume and the DPO framework's sampling strategy on model performance, and explored GraphWiz's cross-task transfer capabilities, providing guidance for subsequent model optimization and performance improvement.

## **Introduction to graph reasoning tasks**

In this study, the team carefully selected nine graph problems of different computational complexity levels, covering the breadth and depth of research, including:

Four linear-complexity tasks: **connectivity, cycle detection, bipartite graph checking, and topological sorting**;

Three polynomial-complexity tasks: **shortest path, maximum triangle sum, and maximum flow**;

And two NP-complete tasks: **Hamiltonian path and subgraph matching**.
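To make one of the linear-complexity tasks concrete, here is a minimal cycle-detection sketch in Python (illustrative only, not the paper's code): a depth-first search that reports a cycle whenever it meets an already-visited neighbor other than the node it arrived from.

```python
def has_cycle(n, edges):
    """Return True if the undirected graph on nodes 0..n-1 contains a cycle."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    visited = [False] * n

    def dfs(node, parent):
        visited[node] = True
        for nxt in adj[node]:
            if not visited[nxt]:
                if dfs(nxt, node):
                    return True
            elif nxt != parent:  # back edge to a non-parent node => cycle
                return True
        return False

    # Check every connected component.
    return any(not visited[v] and dfs(v, -1) for v in range(n))
```

On a graph given as an edge list this runs in O(V + E) time, which is why the task sits in the linear-complexity group.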

By selecting these nine graph problems, the team explores graph theory comprehensively, from simple to complex and from tractable to hard. This diverse selection not only supports a theoretical understanding of graph algorithms but also covers a wide range of practical application problems.

## **GraphInstruct dataset construction**

The construction of GraphInstruct includes the following key steps:

**Graph problem generation.** To create a diverse and challenging library of graph problems for model training and testing, the team used a programming-assisted approach to generate random graph problems for each preset task. A unique template was designed for each task to capture the relevant properties of the graph, such as whether it is directed or undirected and whether its edges carry weights. For random graph generation, the team used the Erdős–Rényi (ER) model.
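A minimal sketch of the Erdős–Rényi G(n, p) generation step (the function name, weight range, and seeding are illustrative assumptions, not the paper's code): each possible edge is included independently with probability p, and the template flags for directedness and edge weights map naturally onto parameters.

```python
import random

def erdos_renyi(n, p, directed=False, weighted=False, seed=None):
    """Generate a random G(n, p) graph as an edge list on nodes 0..n-1."""
    rng = random.Random(seed)  # seeding makes generated problems reproducible
    edges = []
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            if not directed and v <= u:
                continue  # undirected: emit each unordered pair once
            if rng.random() < p:
                if weighted:
                    edges.append((u, v, rng.randint(1, 10)))
                else:
                    edges.append((u, v))
    return edges
```

Sweeping n and p yields problem instances of controllable size and density for each task template.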

**Explicit reasoning path generation.** GraphInstruct is equipped with an explicit reasoning path for each graph-question pair. Considering that manually annotating the reasoning paths of these graph tasks is both complex and time-consuming, the team chose to use GPT-4 to generate preliminary reasoning paths.

**Data augmentation and rejection sampling.** Because GPT-4 performed poorly on many graph tasks (for example, fewer than 100 samples were correct on the maximum-flow task in the initial dataset), the team adopted a rejection-sampling strategy to augment the dataset with more diverse reasoning paths.
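The rejection-sampling step can be sketched as follows (a hypothetical skeleton: `generate` stands in for sampling one reasoning path from a model, and `extract_answer` for parsing the final answer out of that path). Candidates whose final answer disagrees with the programmatically computed ground truth are simply discarded.

```python
def rejection_sample(problem, ground_truth, generate, extract_answer, k=20):
    """Sample k candidate reasoning paths and keep only the correct ones.

    generate(problem) -> one sampled reasoning path (a string)
    extract_answer(path) -> the final answer parsed from that path
    """
    accepted = []
    for _ in range(k):
        path = generate(problem)
        if extract_answer(path) == ground_truth:
            accepted.append(path)
    return accepted
```

Because the ground truth for each generated graph problem is computed by a program, correctness of a sampled path can be checked automatically, with no manual annotation.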

**Select diverse reasoning paths.** This step requires striking a balance between accuracy and diversity. To this end, the team adopted a series of refined strategies, divided into string-based and semantics-based methods, to screen the generated reasoning paths.
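A minimal sketch of the string-based side of this screening (the greedy scheme, the threshold, and the use of Python's `difflib` are illustrative assumptions, not the paper's method): keep a correct reasoning path only if it is sufficiently dissimilar from every path already selected.

```python
from difflib import SequenceMatcher

def select_diverse(paths, max_similarity=0.9):
    """Greedily keep paths whose character-level similarity to every
    already-kept path is below max_similarity."""
    selected = []
    for path in paths:
        if all(SequenceMatcher(None, path, kept).ratio() < max_similarity
               for kept in selected):
            selected.append(path)
    return selected
```

A semantics-based variant would replace the character-level ratio with a distance between sentence embeddings, trading speed for a better notion of "different reasoning".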

## **GraphWiz Training**

Based on GraphInstruct, the team trained GraphWiz, aiming to optimize the ability of current large models to solve graph problems and give explicit reasoning paths. The training method of GraphWiz is an innovative two-stage process:

**Mixed-Task Instruction Tuning**: In the first phase, the team focused on improving the model’s ability to interpret and solve various graph problems. Through this approach, GraphWiz learns to handle multiple subtasks, including understanding the problem, identifying graph properties, and applying graph algorithms.

**Direct Preference Optimization (DPO) Alignment**: In the second phase, the team further sharpened the model’s reasoning by training it to distinguish more effective problem-solving paths from less effective ones. DPO alignment enables the model to identify and generate more desirable reasoning paths, improving both the efficiency and the accuracy of problem solving.
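For reference, a sketch of the standard DPO objective used in this kind of preference alignment (conventional notation, not copied from the paper): $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ the frozen reference model from the instruction-tuning phase, $y_w$ and $y_l$ a preferred and a dispreferred reasoning path for the same problem $x$, $\sigma$ the sigmoid, and $\beta$ the hyperparameter controlling the strength of the preference signal:

```latex
\mathcal{L}_{\mathrm{DPO}}
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    \;-\;
    \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```

Minimizing this loss pushes the model to raise the likelihood of preferred paths relative to dispreferred ones, with $\beta$ setting how aggressively it does so.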

## **GraphWiz Performance Evaluation**

The team evaluated GraphWiz to answer the following key questions:

Q1: How does GraphWiz perform on graph problems of different complexity, especially how does it compare with GPT-4, the most powerful closed-source model currently?

Q2: What impact does the change in the amount of training data have on the performance of GraphWiz?

Q3: How well does GraphWiz transfer across different graph problems?

Q4: How does changing the number of nodes in the graph affect the performance of GraphWiz? Also, what is the largest complex graph it can handle efficiently?

Q5: How does the hyperparameter β affect model performance?

As can be seen in the table above, the team's model outperforms a variety of open-source models and significantly exceeds the performance of GPT-4, consistently across tasks from the easy to the difficult categories. DPO further improves average model performance, though it may hurt specific tasks. This suggests that while DPO generally helps model reasoning, it may need further tuning to avoid negative effects on certain problem types.

According to the table above, both models improve as the training corpus grows; for example, the average accuracy of GraphWiz (Mistral-7B) increased from 46.56% at a 1:1 ratio to 53.75% at a 1:5 ratio. This indicates that more diverse reasoning paths are generally beneficial to the model's overall performance on graph reasoning problems.

However, the team noticed that on some tasks, such as the triangle-sum and Hamiltonian-path problems, accuracy did not improve significantly and even decreased slightly as data increased. For example, GraphWiz (Mistral-7B) reached 47.00% accuracy on the triangle-sum problem at a 1:1 ratio, then dropped to 38.75% at a 1:5 ratio. This could indicate overfitting, where the model starts to memorize patterns in the training data that do not generalize to unseen data.

In summary, while increasing the amount of data and the diversity of reasoning paths generally leads to better model performance, there are signs of potential overfitting on some complex tasks. This underscores the need to design training carefully and to validate on different graph problem tasks to ensure broad generalization.

To explore GraphWiz's transferability across different graph tasks, the team built an additional model variant, **GraphWiz-High**. This model is trained only on the two high-complexity (NP-complete) graph tasks: Hamiltonian path and subgraph matching. To study its transferability, the team conducted two comparative experiments:

**Comparison on high-complexity tasks.** The team first compared GraphWiz-High with regular GraphWiz on the high-complexity tasks. Panel (a) of the figure above shows that GraphWiz performs better, verifying the effectiveness of mixed-task training. This result also shows that the model can transfer knowledge learned from other tasks to specific high-complexity tasks.

**Zero-shot transfer capability.** The team further tested the zero-shot transfer capability of GraphWiz-High on low- and medium-complexity tasks it had never been trained on. As shown in panel (b) of the figure above, GraphWiz-High improves significantly over Mistral-Base, and it even maintains performance comparable to ChatGPT. Considering the huge difference in parameter count between ChatGPT and GraphWiz-High, this shows that the team's model has commendable cross-task generalization, demonstrating significant potential for practical applications.

To answer the questions of how the model’s performance changes with graph size, and what the largest graph the model can effectively solve is, the team shows GraphWiz's performance on its best-performing task, (a) cycle detection, and its worst-performing task, (b) shortest path.

From the graph, the team concluded that:

Both GraphWiz and GPT-4 show a drop in performance as the size of the graph increases. However, at the same graph size, the team’s model outperforms GPT-4 most of the time, indicating a stronger ability to understand and process graph structure.

The team observed a significant performance drop on shortest path as the number of nodes increases. This drop can most likely be attributed to two main factors: the task's high reasoning and memory requirements, which strain the model's capacity as time complexity grows, and its reliance on precise computation. In fact, the team found that both models rely mainly on enumerating paths to arrive at a solution. The required enumeration therefore grows exponentially with graph size, causing accuracy to fall sharply once the number of nodes exceeds 60 and to approach zero thereafter.
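For contrast with the enumeration behavior described above: classically, shortest path is solvable in polynomial time, e.g. with Dijkstra's algorithm, so the sharp accuracy drop reflects how the models reason rather than any inherent hardness of the task. A standard textbook sketch in Python:

```python
import heapq

def dijkstra(n, weighted_edges, src, dst):
    """Shortest-path distance from src to dst on an undirected weighted
    graph with nodes 0..n-1; runs in O((V + E) log V), not exponential time."""
    adj = [[] for _ in range(n)]
    for u, v, w in weighted_edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    dist = [float("inf")] * n
    dist[src] = 0
    heap = [(0, src)]  # min-heap of (distance-so-far, node)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale entry, a shorter route was already found
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist[dst]
```

The gap between this polynomial procedure and exponential path enumeration is exactly where the models' accuracy collapses as graphs grow.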

These observations suggest that while GraphWiz significantly outperforms GPT-4 on graph-related tasks, there is a complexity threshold, particularly in tasks that require computation beyond simple reasoning, at which the performance of even state-of-the-art models begins to degrade significantly.

Finally, the team also explored the impact of the hyperparameter β on model performance. Higher β seemed to benefit difficult tasks to some extent, but the relationship was not strictly linear and was inconsistent across model sizes. This suggests that careful tuning of β is necessary to strike the best balance between tasks of different difficulty levels and improve the model's overall accuracy.

## **More examples**

The team also demonstrated more GraphWiz reasoning examples on different tasks.

Connectivity tasks:

Hamiltonian Path Task:

Shortest path task:

Subgraph matching task:

Paper link: https://arxiv.org/abs/2402.16029

Project homepage: https://graph-wiz.github.io/

This article comes from the WeChat public account “Quantum Bit” (ID: QbitAI); the author is Chen Nuo from the Hong Kong University of Science and Technology. It is published by 36Kr with authorization.