Eager Embed V1

eager-embed-v1 is a multimodal dense embedding model built upon a Vision-Language Model (VLM). It is designed to efficiently index documents using both their visual and textual features.

Compared to multi-vector (ColBERT-like) architectures, eager-embed-v1 produces a single dense vector per query or document, offering a strong balance between embedding dimensionality and retrieval accuracy while remaining efficient. Unlike those approaches, it does not require a MaxSim scoring function, so retrieval reduces to a plain dot product or cosine similarity.
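
To make the contrast concrete, here is a minimal, illustrative sketch (all shapes, sizes, and tensors below are placeholders, not part of the eager-embed-v1 API): a single-vector model scores every document with one dot product, whereas a ColBERT-style model keeps one vector per token and aggregates token-level similarities with MaxSim.

import torch

# Dense, single-vector scoring (eager-embed-v1 style): one L2-normalized vector per side.
q = torch.nn.functional.normalize(torch.randn(1, 1024), dim=-1)           # (1, dim) -- placeholder dim
docs = torch.nn.functional.normalize(torch.randn(100, 1024), dim=-1)      # (num_docs, dim)
dense_scores = q @ docs.T                                                  # (1, num_docs), plain dot product

# Multi-vector (ColBERT-like) scoring: one vector per token, aggregated with MaxSim.
q_tok = torch.nn.functional.normalize(torch.randn(1, 32, 128), dim=-1)     # (1, query_tokens, dim)
d_tok = torch.nn.functional.normalize(torch.randn(100, 256, 128), dim=-1)  # (num_docs, doc_tokens, dim)
token_sims = torch.einsum("qnd,cmd->qcnm", q_tok, d_tok)                   # all query-token x doc-token similarities
multi_scores = token_sims.max(dim=-1).values.sum(dim=-1)                   # (1, num_docs)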

Model Details

Model Description

  • Developed by: Juan Pablo Balarini
  • Funded by: Eagerworks
  • Model type: Embedding model
  • License: Apache 2.0
  • Finetuned from model: Qwen3-VL-4B-Instruct

Model Sources

  • Repository: https://github.com/eagerworks/eager-embed

How to Get Started with the Model

Load the model and define a helper function to encode messages:

import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from transformers.utils.import_utils import is_flash_attn_2_available
from qwen_vl_utils import process_vision_info

MODEL_NAME = "eagerworks/eager-embed-v1"
DEVICE = torch.device("cpu")
if torch.cuda.is_available():
    DEVICE = torch.device("cuda:0")
elif torch.backends.mps.is_available():
    DEVICE = torch.device("mps")
DTYPE = torch.bfloat16

processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    MODEL_NAME,
    attn_implementation=(
        "flash_attention_2" if is_flash_attn_2_available() else None
    ),
    dtype=DTYPE
).to(DEVICE).eval()

# Encode a chat-formatted message (text and/or image) into a single normalized embedding
def encode_message(message):
    with torch.no_grad():
        # Build the prompt and append <|endoftext|>; the hidden state of this final token becomes the embedding
        texts = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True) + "<|endoftext|>"
        image_inputs, video_inputs = process_vision_info(message)

        inputs = processor(
            text=texts,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding="longest",
        ).to(DEVICE)

        model_outputs = model(**inputs, return_dict=True, output_hidden_states=True)

        # Last-token (EOS) pooling over the final hidden layer, followed by L2 normalization
        last_hidden_state = model_outputs.hidden_states[-1]
        embeddings = last_hidden_state[:, -1]
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=-1)
        return embeddings
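
Because encode_message returns L2-normalized vectors, scoring a query against an indexed corpus is a single matrix multiplication. A minimal sketch (the passages below are placeholders, not from the original examples):

corpus_texts = ["First candidate passage.", "Second candidate passage."]
corpus = torch.cat([
    encode_message([{'role': 'user', 'content': [{'type': 'text', 'text': t}]}])
    for t in corpus_texts
])  # (num_docs, dim)

query_emb = encode_message([{'role': 'user', 'content': [{'type': 'text', 'text': "Query: ..."}]}])
scores = query_emb @ corpus.T        # cosine similarity, since both sides are normalized
best = scores.argmax(dim=-1).item()  # index of the best-matching passage
print(scores, corpus_texts[best])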

🌍 Multilingual Text Retrieval

example_query = "Query: What is the capital city of Uruguay?"
example_text_1 = "Montevideo es la capital y la ciudad más poblada de la República Oriental del Uruguay, así como la capital del departamento homónimo"
example_text_2 = "El río Uruguay es un río internacional que forma parte de la cuenca del Plata. Nace en Brasil, recorre unos 1.800 km y desemboca en el Río de la Plata"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
text_1 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_1}]}]
text_2 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(text_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(text_2))

print("Similarities:", sim1.item(), sim2.item())

📈 Image Document Retrieval (Image, Chart, PDF)

MAX_IMAGE_SIZE = 784
example_query = 'Query: Where can we find the animal llama?'
example_image_1 = "https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/animal-llama.png"
example_image_2 = "https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/meta-llama.png"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
image_1 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_1, 'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}]
image_2 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_2, 'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(image_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(image_2))

print("Similarities:", sim1.item(), sim2.item())

Training Details

Compute Infrastructure

Training was performed on a single machine with the following specifications:

  • 8× RTX 5090 GPUs (256 GB of VRAM in total)
  • AMD EPYC 9534 64-core CPU (128 threads)
  • 256 GB of RAM
  • 2 TB SSD

Training Procedure

Training was done using the Tevatron framework, with DeepSpeed for distributed multi-GPU training.

Training Hyperparameters

More information on the training hyperparameters can be found here.

Evaluation

The model was evaluated on the ViDoRe 1, 2, and 3 benchmarks. More information can be found here.
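
The ViDoRe benchmarks report nDCG@5 as their main retrieval metric; the following is a minimal, generic sketch of that computation (not the benchmarks' official implementation, and the example tensors are placeholders):

import torch

def ndcg_at_k(scores, relevance, k=5):
    # scores: (num_docs,) similarities from the model; relevance: (num_docs,) graded labels
    order = scores.argsort(descending=True)[:k]
    discounts = 1.0 / torch.log2(torch.arange(2, k + 2, dtype=torch.float))
    dcg = ((2.0 ** relevance[order] - 1) * discounts[: len(order)]).sum()
    ideal = relevance.argsort(descending=True)[:k]
    idcg = ((2.0 ** relevance[ideal] - 1) * discounts[: len(ideal)]).sum()
    return (dcg / idcg).item() if idcg > 0 else 0.0

# Example: one query against four documents, only the second one relevant
print(ndcg_at_k(torch.tensor([0.2, 0.9, 0.1, 0.4]), torch.tensor([0.0, 1.0, 0.0, 0.0])))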

Citation

@article{EagerEmbed,
  title={Eager Embed V1: Multimodal Dense Embeddings for Retrieval},
  author={Juan Pablo Balarini},
  year={2025},
  publisher={Eagerworks},
  url={https://github.com/eagerworks/eager-embed}
}