Google Vision Banana: The "GPT-3 Moment" for Computer Vision? Raw Image Model Outperforms Dedicated Visual Understanding Models

According to ME News, on April 23 (UTC+8), per Beating's monitoring, a Google team (with authors including Kaiming He and Saining Xie) published a paper proposing Vision Banana. The model is produced by lightweight instruction fine-tuning of the team's own image generation model, Nano Banana Pro (Gemini 3 Pro Image), turning it into a general-purpose visual understanding model. The core idea is to uniformly parameterize the output of every visual task as an RGB image, so that perception tasks such as segmentation, depth estimation, and surface normal estimation are all completed through image generation, with no dedicated architecture or training loss per task type.

Evaluations cover two major task categories: image segmentation and 3D geometric inference. In segmentation, semantic segmentation (labeling each pixel with a category such as "road surface," "pedestrian," or "vehicle") beats the dedicated segmentation model SAM 3 by 4.7 percentage points on Cityscapes, and referring expression segmentation (locating and segmenting an object from a natural-language description such as "the dog wearing a hat on the left") also outperforms SAM 3 Agent. The model still lags behind SAM 3 in instance segmentation (distinguishing individuals of the same category, such as separately labeling five dogs in one image).

In 3D, metric depth estimation (recovering the actual physical distance from each pixel to the camera from a single image) reaches an average accuracy of 0.929 across four standard datasets, above the dedicated model Depth Anything V3's 0.918, while being trained entirely on synthetic data with no real depth maps and requiring no camera parameters at inference. Surface normal estimation (inferring the orientation of object surfaces) achieves the best results on three indoor benchmarks.
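The "everything as an RGB image" parameterization can be made concrete with a small sketch. The paper's exact encodings are not described in this article; the functions below use common conventions (the (n + 1) / 2 mapping for unit normals, a log-scaled grayscale for metric depth with hypothetical range parameters), purely for illustration.

```python
import numpy as np

def normals_to_rgb(normals):
    """Map unit surface normals in [-1, 1]^3 to RGB in [0, 255].

    Illustrative only: uses the common (n + 1) / 2 convention,
    not necessarily the paper's encoding.
    """
    return ((normals + 1.0) * 0.5 * 255.0).round().astype(np.uint8)

def rgb_to_normals(rgb):
    """Invert the encoding back to (approximately) unit normals."""
    n = rgb.astype(np.float32) / 255.0 * 2.0 - 1.0
    # Re-normalize: 8-bit quantization slightly perturbs vector length.
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def depth_to_rgb(depth, d_min=0.1, d_max=80.0):
    """Pack a metric depth map (meters) into a grayscale RGB image.

    d_min / d_max are hypothetical range parameters; a log scale
    spreads precision across near and far depths.
    """
    d = np.clip(depth, d_min, d_max)
    t = (np.log(d) - np.log(d_min)) / (np.log(d_max) - np.log(d_min))
    g = (t * 255.0).round().astype(np.uint8)
    return np.stack([g, g, g], axis=-1)

# A 2x2 map of normals pointing straight at the camera round-trips
# to nearly the same vectors after quantization.
n = np.zeros((2, 2, 3), dtype=np.float32)
n[..., 2] = 1.0
rgb = normals_to_rgb(n)          # each pixel encodes as (128, 128, 255)
recovered = rgb_to_normals(rgb)
```

With targets in this form, a single image generation model can emit a segmentation map, a depth map, or a normal map the same way it emits any other picture, which is what removes the need for per-task heads and losses.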
Fine-tuning simply mixes a small amount of visual task data into the original image generation training data, and the model's image generation ability is largely unaffected: it matches the original Nano Banana Pro in generation quality evaluations. The paper argues that image generation pre-training plays a role in vision analogous to text generation pre-training in language: in learning to generate images, the model has already acquired the internal representations needed to understand them, and fine-tuning merely unlocks those representations. (Source: ME)
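The data-mixing recipe described above can be sketched as a simple sampler. The article only says "a small amount" of task data is mixed in, so the `task_frac` ratio below is a hypothetical placeholder, not a figure from the paper.

```python
import random

def mixed_batches(gen_data, task_data, task_frac=0.1, batch_size=8, seed=0):
    """Yield training batches that are mostly image-generation examples,
    with a small fraction of visual-task (image -> RGB-target) examples.

    task_frac is a hypothetical mixing ratio for illustration only.
    """
    rng = random.Random(seed)
    while True:
        batch = [
            rng.choice(task_data) if rng.random() < task_frac
            else rng.choice(gen_data)
            for _ in range(batch_size)
        ]
        yield batch

# Toy corpora: 100 generation examples, 10 perception-task examples.
gen = [("gen", i) for i in range(100)]
task = [("task", i) for i in range(10)]
first_batch = next(mixed_batches(gen, task))
```

Because most of each batch is still ordinary generation data, this kind of mix would plausibly preserve generation quality while teaching the new output formats, consistent with the article's claim.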
