What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
Abstract
Large language models (LLMs) have been effectively used for many computer vision tasks, including image classification. In this paper, we present a simple yet effective approach to zero-shot image classification using multimodal LLMs. We use a multimodal LLM to generate comprehensive textual representations of input images, embed these representations as fixed-dimensional features in a cross-modal embedding space, and fuse the resulting features to perform zero-shot classification with a linear classifier. Our method requires no per-dataset prompt engineering; instead, we use a single, straightforward set of prompts across all datasets. We evaluated our method on several benchmarks, and the results demonstrate its effectiveness, surpassing the benchmark accuracy on multiple datasets. Averaged over ten benchmarks, our method achieves an accuracy gain of 6.2 percentage points over prior methods re-evaluated in the same setup, including a gain of 6.8 percentage points on ImageNet. These findings highlight the potential of multimodal LLMs to enhance computer vision tasks such as zero-shot image classification, offering a significant improvement over traditional methods.
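To make the pipeline described in the abstract concrete, the sketch below shows one plausible instantiation: an MLLM produces textual descriptions of an image, those descriptions (and the image itself) are embedded in a shared cross-modal space, the features are fused, and classification is done against class-name text embeddings acting as a zero-shot linear classifier. The choice of CLIP as the embedding space, the placeholder `describe_image` function, the mean fusion, and the "a photo of a ..." class prompts are all assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of MLLM-description-based zero-shot classification.
# Assumes CLIP (via Hugging Face transformers) as the cross-modal embedding space;
# `describe_image` is a placeholder for any multimodal LLM.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def describe_image(image: Image.Image) -> list[str]:
    """Placeholder for a multimodal LLM that returns textual descriptions of the image,
    e.g. responses to a fixed prompt set such as "What do you see in this image?"."""
    raise NotImplementedError


@torch.no_grad()
def classify(image: Image.Image, class_names: list[str]) -> str:
    # 1. Generate textual representations of the image with the MLLM.
    descriptions = describe_image(image)

    # 2. Embed the image and its descriptions in the shared cross-modal space.
    img_inputs = processor(images=image, return_tensors="pt")
    img_feat = model.get_image_features(**img_inputs)
    txt_inputs = processor(text=descriptions, return_tensors="pt",
                           padding=True, truncation=True)
    desc_feats = model.get_text_features(**txt_inputs)

    # 3. Fuse the fixed-dimensional features (simple mean fusion here; an assumption).
    feats = torch.cat([img_feat, desc_feats], dim=0)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    fused = feats.mean(dim=0, keepdim=True)
    fused = fused / fused.norm(dim=-1, keepdim=True)

    # 4. Zero-shot linear classifier: weights are class-name text embeddings.
    cls_inputs = processor(text=[f"a photo of a {c}" for c in class_names],
                           return_tensors="pt", padding=True)
    cls_weights = model.get_text_features(**cls_inputs)
    cls_weights = cls_weights / cls_weights.norm(dim=-1, keepdim=True)

    logits = fused @ cls_weights.T
    return class_names[logits.argmax(dim=-1).item()]
```

In this reading, the "linear classifier" is the matrix of normalized class-name embeddings applied to the fused feature; the paper's actual fusion and classifier details may differ.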