A LoRA achieves GPT-4o-level image editing, and a new model from Zhejiang University and Harvard ranks second on the Hugging Face list

36kr
05-07

[Introduction] As commercial large models like Gemini and GPT-4o push text-based image editing to new heights, obtaining higher-quality editing data and training models with more parameters seem to be the only way to improve image editing performance. However, this Zhejiang University and Harvard team took a different approach, achieving high-quality image editing with only 0.1% of the previous data volume (obtained from public datasets) and 1% of the trainable parameters, matching or even surpassing commercial large models in some aspects!


During denoising, the model receives a context prompt that integrates the editing instruction, such as "A diptych containing two side-by-side images of the same man... the same man, but {make this man hold a basketball}". Meanwhile, the inversion features of the original image are continuously injected into the noise on the left side of the diptych, while the right-side noise is left untouched. The final result reconstructs the original image on the left, while the right side generates an edited result that follows the instruction (the man holding a basketball).
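A minimal sketch of this training-free loop, assuming hypothetical `invert` and `denoise_step` callables standing in for a DiT flow-matching sampler and its inversion (not the authors' released code):

```python
# Sketch of diptych editing with left-side inversion-feature injection.
# `invert` and `denoise_step` are hypothetical stand-ins for the sampler.
import torch

def edit_diptych(denoise_step, invert, src_latent, prompt_emb, steps=50):
    """src_latent: latents of the original image, shape (1, C, H, W)."""
    # Invert the original image to noise, recording the whole trajectory
    # so its features can be re-injected at every denoising step.
    inv_traj = invert(src_latent, steps)      # list of (1, C, H, W) latents
    b, c, h, w = src_latent.shape
    x = torch.randn(b, c, h, 2 * w)           # diptych-sized noise canvas
    for t in range(steps):
        # Inject inversion features into the LEFT half only; the right
        # half is denoised freely under the contextual editing prompt.
        x[..., :w] = inv_traj[steps - 1 - t]
        x = denoise_step(x, t, prompt_emb)
    return x[..., w:]                         # right half = edited image
```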


Another training-free framework is based on an inpainting DiT (image completion, such as Flux.1 Fill), and is quite concise. It only requires placing the original image on the left side of the diptych and setting the entire right side as the inpainting area. The prompt remains a context prompt integrated with the editing instruction, and the inpainted right side is the edited image.
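A sketch of this inpainting-based setup, assuming the `FluxFillPipeline` API from the diffusers library (file names and the exact prompt wording here are illustrative):

```python
# Build a diptych canvas: source image on the left, the whole right half
# marked as the inpainting region, then let Flux.1 Fill complete it.
import torch
from PIL import Image
from diffusers import FluxFillPipeline

src = Image.open("man.jpg").convert("RGB").resize((512, 512))
canvas = Image.new("RGB", (1024, 512))
canvas.paste(src, (0, 0))                               # left: reference
mask = Image.new("L", (1024, 512), 0)
mask.paste(Image.new("L", (512, 512), 255), (512, 0))   # right: inpaint

prompt = ("A diptych containing two side-by-side images of the same man; "
          "on the right the man is the same, but make this man hold a basketball")

pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16
).to("cuda")
out = pipe(prompt=prompt, image=canvas, mask_image=mask,
           height=512, width=1024, num_inference_steps=50).images[0]
out.crop((512, 0, 1024, 512)).save("edited.jpg")        # keep right half
```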

Overall, both frameworks aim to let the model receive the reference image while editing according to the contextual instruction. Although they demonstrate strong editing effects, the identity of the subject still drifts (the man holding the basketball no longer looks like the same man), unintended attributes such as pose also change, and the generation success rate is low.

LoRA Expert Fine-tuning and Test-time Scaling Significantly Improve Performance

Although the training-free method has limited performance and a low generation success rate, it can be significantly improved through subsequent fine-tuning.

Based on the simplicity of the inpainting framework, the authors used public editing datasets from the internet (MagicBrush 9k + OmniEdit 40k) for LoRA fine-tuning. The fine-tuning strategy is simple: just convert the editing instructions in the dataset into a unified contextual instruction form: "A diptych containing two side-by-side images of the same scene, where the right image is the same as the left but {editing instruction}".
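A trivial sketch of this conversion step (the exact template wording is an assumption reconstructed from the article's examples):

```python
# Wrap a raw dataset editing instruction into the unified diptych-style
# contextual instruction used for LoRA fine-tuning.
DIPTYCH_TEMPLATE = (
    "A diptych containing two side-by-side images of the same scene, "
    "where the right image is the same as the left but {instruction}"
)

def to_context_prompt(edit_instruction: str) -> str:
    """e.g. 'make this man hold a basketball' -> full diptych prompt."""
    return DIPTYCH_TEMPLATE.format(instruction=edit_instruction.rstrip("."))

print(to_context_prompt("make this man hold a basketball"))
```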

The authors found that after fine-tuning, the model's editing success rate significantly improved, and it could generalize to editing types not present in the training datasets.

However, the authors discovered that a single ordinary LoRA still had insufficient success rates across different editing tasks, with some tasks such as object removal and style edits performing poorly.

The authors believe this is because different editing tasks require different feature processing modes, and a single LoRA struggles to learn processing methods for all editing types. Therefore, mixed training of multiple LoRA experts might be key to improving editing effects.

Thus, the authors borrowed the Mixture of Experts (MoE) method from the LLM domain, applied it to the DiT multimodal large model, and treated separate LoRA modules as different experts during training, resulting in the final model.
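A minimal sketch of routing between LoRA experts inside one linear layer of the DiT, under my own assumptions (expert count, rank, and token-wise gating are illustrative, not the released architecture):

```python
# MoE-LoRA: a frozen base linear layer plus several low-rank expert
# updates, mixed per token by a learned router.
import torch
import torch.nn as nn

class MoELoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, n_experts: int = 4, rank: int = 16):
        super().__init__()
        self.base = base                          # frozen pretrained weight
        for p in self.base.parameters():
            p.requires_grad = False
        d_in, d_out = base.in_features, base.out_features
        # Expert low-rank factors: B starts at zero so training begins
        # from the unmodified base behavior.
        self.A = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_out))
        self.router = nn.Linear(d_in, n_experts)  # token-wise gating

    def forward(self, x):                         # x: (batch, tokens, d_in)
        gates = self.router(x).softmax(dim=-1)    # (batch, tokens, n_experts)
        # Each expert's low-rank update, then mix by the gate weights.
        delta = torch.einsum("bti,eir,ero->bteo", x, self.A, self.B)
        delta = (gates.unsqueeze(-1) * delta).sum(dim=2)
        return self.base(x) + delta
```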

Despite using the MoE+LoRA approach, the model's trainable parameters remain far fewer than those of SOTA models (0.2B vs 17B).

Table 1: Model Parameter and Performance Comparison

Table 2: Training Data Volume and Performance Comparison

Table 3: Performance significantly improved after LoRA fine-tuning compared to training-free, and further improved with the MoE architecture

With training complete, is there still room to improve model performance at inference time? The authors found that different random initial noise produces different editing results, some good and some bad. How can the model automatically and quickly deliver the best result to users?

To solve the problem that editing quality varies with the initial noise, the authors designed an early-filter inference-time scaling strategy suited to image editing tasks.

Simply put, the most commonly used DiT text-to-image models today, such as Flux and SD3, are trained with flow matching, which follows nearly straight probability-flow trajectories and can therefore produce high-quality results in few inference steps; many works have even explored one-step image generation with DiT models. The authors therefore use the first few denoising steps to judge whether the current initial noise will satisfy the editing requirement, skipping to the next candidate if it will not.
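For context, the standard (rectified) flow-matching objective these models are trained with, written in its common textbook form rather than taken from the paper; the near-straight trajectory is why a few early steps already reveal the final content:

```latex
% The model v_theta regresses the straight-line velocity between noise
% x_0 and data x_1 along the linear interpolation path.
x_t = (1 - t)\, x_0 + t\, x_1, \qquad
\mathcal{L} = \mathbb{E}_{t,\, x_0,\, x_1}
  \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2
```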

Early Filter Inference Time Scaling

The case requires changing the sky to a night scene. Some noise candidates appear bright in the first 4 steps and remain bright even after the full 50-step inference, failing to meet the editing requirement. Therefore, a VLM can be used as a judge to remove unsuitable candidates within the first few steps, saving the cost of the remaining inference steps.

Additionally, the VLM can select the best option among the remaining candidates. Even if two candidates both complete the sky-to-night transformation, one edit might have stars twinkling in the sky, better capturing the night atmosphere, and that is the one the VLM will keep.
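A sketch of the whole early-filter loop, with hypothetical `denoise` and `vlm_judge` helpers (the candidate count, probe-step count, and the judge's pass/fail-vs-score interface are all assumptions):

```python
# Early-filter inference-time scaling: cheaply preview each noise seed
# for a few steps, keep only seeds whose preview already follows the
# instruction, fully denoise the survivors, and let the VLM pick the best.
import torch

def edit_with_early_filter(denoise, vlm_judge, prompt, n_candidates=8,
                           probe_steps=4, full_steps=50):
    survivors = []
    for seed in range(n_candidates):
        g = torch.Generator().manual_seed(seed)
        preview = denoise(prompt, steps=probe_steps, generator=g)
        # vlm_judge returns True if the rough preview already matches the
        # instruction (e.g. the sky is darkening for "make it night").
        if vlm_judge(preview, prompt):
            survivors.append(seed)
    finals = [denoise(prompt, steps=full_steps,
                      generator=torch.Generator().manual_seed(seed))
              for seed in survivors]
    # With score=True the judge returns a numeric quality score instead,
    # so the single best final result can be selected.
    return max(finals, key=lambda img: vlm_judge(img, prompt, score=True))
```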

VIEScore evaluation shows that the inference-time scaling strategy brings a significant improvement in editing quality

Beyond quantitative evaluation, ICEdit's qualitative comparisons with other models also demonstrate superior editing quality in instruction following and background preservation.

Since the proposed method uses external LoRA modules without changing the DiT model's original generation capability, it has strong generalizability and retains DiT's inherent ability to produce harmonious results, such as automatically adding shadows, reflections, and matching font styles.

Project page: https://...io/ICEdit-gh-pages/

This article is from the WeChat public account "New Intelligence Element", edited by LRST, and published by 36Kr with authorization.

