Eager Embed V1
eager-embed-v1 is a multimodal dense embedding model built upon a Vision-Language Model (VLM). It is designed to efficiently index documents using both their visual and textual features.
Compared to multi-vector (ColBERT-like) architectures, eager-embed-v1 produces a single dense vector per input, offering a strong balance between embedding dimensionality and retrieval accuracy while remaining efficient. Unlike those approaches, it does not require MaxSim (late-interaction) scoring at query time, which further simplifies retrieval.
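To make the contrast concrete, the sketch below compares single-vector scoring with ColBERT-style MaxSim scoring on random tensors (the shapes are illustrative only, not the model's actual dimensions):

```python
import torch

# Illustrative shapes only (not the model's actual dimensions).
dim = 2560

# Dense single-vector retrieval: one embedding per query/document,
# score = a single dot product (cosine similarity on normalized vectors).
q = torch.nn.functional.normalize(torch.randn(dim), dim=-1)
d = torch.nn.functional.normalize(torch.randn(dim), dim=-1)
dense_score = q @ d

# ColBERT-style multi-vector retrieval: one embedding per token,
# score = MaxSim, i.e. the sum over query tokens of the maximum similarity
# against all document tokens. More storage and compute per document.
Q = torch.nn.functional.normalize(torch.randn(32, dim), dim=-1)   # query token embeddings
D = torch.nn.functional.normalize(torch.randn(512, dim), dim=-1)  # document token embeddings
maxsim_score = (Q @ D.T).max(dim=1).values.sum()

print(dense_score.item(), maxsim_score.item())
```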
Model Details
Model Description
- Developed by: Juan Pablo Balarini
- Funded by: Eagerworks
- Model type: Embedding model
- License: Apache 2.0
- Finetuned from model: Qwen3-VL-4B-Instruct
Model Sources
- Repository: eager-embed
How to Get Started with the Model
Load the model and define a helper function to encode messages:
```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from transformers.utils.import_utils import is_flash_attn_2_available
from qwen_vl_utils import process_vision_info

MODEL_NAME = "eagerworks/eager-embed-v1"

# Pick the best available device: CUDA, then Apple MPS, then CPU.
DEVICE = torch.device("cpu")
if torch.cuda.is_available():
    DEVICE = torch.device("cuda:0")
elif torch.backends.mps.is_available():
    DEVICE = torch.device("mps")

DTYPE = torch.bfloat16

processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    MODEL_NAME,
    attn_implementation=(
        "flash_attention_2" if is_flash_attn_2_available() else None
    ),
    dtype=DTYPE,
).to(DEVICE).eval()


# Encode a chat-style message (text and/or images) into one normalized embedding.
def encode_message(message):
    with torch.no_grad():
        texts = processor.apply_chat_template(
            message, tokenize=False, add_generation_prompt=True
        ) + "<|endoftext|>"
        image_inputs, video_inputs = process_vision_info(message)
        inputs = processor(
            text=texts,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding="longest",
        ).to(DEVICE)
        model_outputs = model(**inputs, return_dict=True, output_hidden_states=True)
        last_hidden_state = model_outputs.hidden_states[-1]
        # Use the last token's hidden state as the dense embedding, then L2-normalize it.
        embeddings = last_hidden_state[:, -1]
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=-1)
        return embeddings
```
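With the helper in place, a small corpus can be indexed by stacking document embeddings and scoring a query with a single matrix multiplication; since the embeddings are L2-normalized, the dot product equals cosine similarity. The documents below are placeholders, sketching one possible usage:

```python
# Minimal retrieval sketch: index a few text documents and rank them for a query.
# The document contents are placeholders.
docs = [
    [{'role': 'user', 'content': [{'type': 'text', 'text': "First example document."}]}],
    [{'role': 'user', 'content': [{'type': 'text', 'text': "Second example document."}]}],
]
doc_embeddings = torch.cat([encode_message(d) for d in docs], dim=0)  # (num_docs, dim)

query = [{'role': 'user', 'content': [{'type': 'text', 'text': "Query: example question"}]}]
query_embedding = encode_message(query)  # (1, dim)

# Embeddings are L2-normalized, so the dot product equals cosine similarity.
scores = query_embedding @ doc_embeddings.T  # (1, num_docs)
ranking = scores.squeeze(0).argsort(descending=True)
print("Ranked document indices:", ranking.tolist())
```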
🌍 Multilingual Text Retrieval
```python
example_query = "Query: What is the capital city of Uruguay?"
example_text_1 = "Montevideo es la capital y la ciudad más poblada de la República Oriental del Uruguay, así como la capital del departamento homónimo"
example_text_2 = "El río Uruguay es un río internacional que forma parte de la cuenca del Plata. Nace en Brasil, recorre unos 1.800 km y desemboca en el Río de la Plata"

query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
text_1 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_1}]}]
text_2 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(text_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(text_2))
print("Similarities:", sim1.item(), sim2.item())
```
📈 Image Document Retrieval (Image, Chart, PDF)
```python
MAX_IMAGE_SIZE = 784

example_query = 'Query: Where can we find the animal llama?'
example_image_1 = "https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/animal-llama.png"
example_image_2 = "https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/meta-llama.png"

query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
image_1 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_1, 'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}]
image_2 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_2, 'resized_height': MAX_IMAGE_SIZE, 'resized_width': MAX_IMAGE_SIZE}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(image_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(image_2))
print("Similarities:", sim1.item(), sim2.item())
```
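For PDF retrieval, one common approach is to render each page to an image and encode it like any other document image. The sketch below assumes the third-party pdf2image package (which needs poppler installed) and a placeholder file path; neither is part of this model's requirements:

```python
from pdf2image import convert_from_path  # assumption: pdf2image + poppler are installed

# Render each page of a (placeholder) PDF to a PIL image and encode it.
pages = convert_from_path("document.pdf", dpi=150)
page_messages = [
    [{'role': 'user', 'content': [{'type': 'image', 'image': page,
                                   'resized_height': MAX_IMAGE_SIZE,
                                   'resized_width': MAX_IMAGE_SIZE}]}]
    for page in pages
]
page_embeddings = torch.cat([encode_message(m) for m in page_messages], dim=0)

query = [{'role': 'user', 'content': [{'type': 'text', 'text': "Query: example question about the PDF"}]}]
scores = encode_message(query) @ page_embeddings.T
best_page = scores.squeeze(0).argmax().item()
print("Best matching page:", best_page)
```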
Training Details
Training Hardware
Training was done on a machine with the following specifications:
- 8x RTX 5090 for a total of 256 GB of VRAM
- AMD EPYC 9534 64-Core CPU (128 threads)
- 256 GB of RAM
- 2 TB SSD
Training Procedure
Training was done using the Tevatron framework, with DeepSpeed for multi-GPU parallel training.
Training Hyperparameters
More information on the training hyperparameters can be found here.
Evaluation
The model was evaluated on the ViDoRe 1, 2, and 3 benchmarks. More information can be found here.
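ViDoRe results are typically reported as nDCG@5. For reference, a minimal, self-contained version of that metric (independent of the benchmarks' own tooling) looks like this:

```python
import math

def ndcg_at_k(ranked_relevances, k=5):
    """nDCG@k for one query: ranked_relevances lists graded relevance in ranked order."""
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Example: the only relevant document was retrieved at rank 2.
print(ndcg_at_k([0, 1, 0, 0, 0]))  # ~0.63
```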
Citation
```bibtex
@article{EagerEmbed,
  title={Eager Embed V1: Multimodal Dense Embeddings for Retrieval},
  author={Juan Pablo Balarini},
  year={2025},
  publisher={Eagerworks},
  url={https://github.com/eagerworks/eager-embed}
}
```