📋 Model Description


pipeline_tag: sentence-similarity
tags:
  • gguf
  • embedding
  • qwen3
  • llama-cpp
  • jina-embeddings-v5
  • feature-extraction
  • mteb
  • vllm
  • sentence-transformers
language:
  • multilingual
base_model: jinaai/jina-embeddings-v5-text-small
base_model_relation: quantized
inference: false
license: cc-by-nc-4.0
library_name: llama.cpp



Jina AI: Your Search Foundation, Supercharged!

jina-embeddings-v5-text-small-retrieval: Retrieval-Targeted Embedding Distillation

Blog | Elastic Inference Service | ArXiv | Blog

Model Overview


jina-embeddings-v5-text Architecture


jina-embeddings-v5-text-small-retrieval is a compact, high-performance text embedding model designed for information retrieval.

It is part of the jina-embeddings-v5-text model family, which also includes jina-embeddings-v5-text-nano, a smaller model for more resource-constrained use cases.

Trained using a novel approach that combines distillation with task-specific contrastive losses, jina-embeddings-v5-text-small-retrieval outperforms existing state-of-the-art models of similar size across diverse embedding benchmarks.




| Feature | Value |
|---|---|
| Parameters | 677M |
| Supported Tasks | retrieval |
| Max Sequence Length | 32768 |
| Embedding Dimension | 1024 |
| Matryoshka Dimensions | 32, 64, 128, 256, 512, 768, 1024 |
| Pooling Strategy | Last-token pooling |
| Base Model | jinaai/jina-embeddings-v5-text-small |
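
As a rough illustration of what the Matryoshka dimensions imply, the sketch below keeps only the leading components of a full 1024-dimensional vector and re-normalizes the result. The helper name `truncate_embedding` is hypothetical, the input vector is a toy stand-in rather than a real embedding, and re-normalizing after truncation is an assumption based on common practice for Matryoshka-style embeddings, not something this model card states.

```python
import math

# Dimensions listed in the feature table of this model card.
MATRYOSHKA_DIMS = (32, 64, 128, 256, 512, 768, 1024)


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and L2-normalize them (hypothetical helper)."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    if len(vector) < dim:
        raise ValueError("vector is shorter than the requested dimension")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)
    return [x / norm for x in head]


# Toy stand-in for a real 1024-dimensional embedding.
full = [math.sin(i + 1) for i in range(1024)]
small = truncate_embedding(full, 256)
print(len(small))  # 256
```

Smaller dimensions reduce storage and speed up similarity search, usually at some cost in retrieval quality; which size is acceptable depends on the application.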


MMTEB Multilingual Benchmark


MTEB English Benchmark


Retrieval Benchmark Results

Training and Evaluation

For training details and evaluation results, see our technical report (arXiv:2602.15547).

Usage


Requirements

The following Python packages are required:

  • transformers>=5.1.0
  • torch>=2.8.0
  • peft>=0.15.2
  • vllm>=0.15.1

Optional / Recommended

  • flash-attention: Installing flash-attention is recommended for improved inference speed and efficiency, but not mandatory.
  • sentence-transformers: If you want to use the model via the sentence-transformers interface, install this package as well.


via Elastic Inference Service

The fastest way to use v5-text in production. Elastic Inference Service (EIS) provides managed embedding inference with built-in scaling, so you can generate embeddings directly within your Elastic deployment.

PUT _inference/text_embedding/jina-v5
{
  "service": "elastic",
  "service_settings": {
    "model_id": "jina-embeddings-v5-text-small"
  }
}

See the Elastic Inference Service documentation for setup details.


via sentence-transformers

from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-text-small-retrieval",
    model_kwargs={"dtype": torch.bfloat16},  # Recommended for GPUs
    config_kwargs={"attn_implementation": "flash_attention_2"},  # Recommended but optional
)
# Optional: set truncate_dim in encode() to control embedding size

query = "Which planet is known as the Red Planet?"
documents = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet.",
]

# Encode query and documents
query_embeddings = model.encode(sentences=query, prompt_name="query")
document_embeddings = model.encode(sentences=documents, prompt_name="document")
print(query_embeddings.shape, document_embeddings.shape)
# (1024,) (4, 1024)

similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
# tensor([[0.4860, 0.7611, 0.5914, 0.6188]])
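
For retrieval, the similarity scores are typically used to rank the documents. The sketch below does this in plain Python with the scores copied from the printed tensor above; `rank_documents` is a hypothetical helper, not part of sentence-transformers.

```python
# Scores copied from the printed tensor above, in the same order as `documents`.
scores = [0.4860, 0.7611, 0.5914, 0.6188]
documents = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet.",
]


def rank_documents(documents, scores):
    """Return (score, document) pairs sorted from most to least similar."""
    return sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)


for score, doc in rank_documents(documents, scores):
    print(f"{score:.4f}  {doc}")
```

With these scores, the Mars document ranks first, which matches the intent of the query.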



via vLLM

from vllm import LLM
from vllm.config.pooler import PoolerConfig

# Initialize model
name = "jinaai/jina-embeddings-v5-text-small-retrieval"
model = LLM(
    model=name,
    dtype="float16",
    runner="pooling",
    pooler_config=PoolerConfig(seq_pooling_type="LAST", normalize=True),
)

# Create text prompts
query = "Overview of climate change impacts on coastal cities"
query_prompt = f"Query: {query}"

document = "The impacts of climate change on coastal cities are significant..."
document_prompt = f"Document: {document}"

# Encode all prompts
prompts = [query_prompt, document_prompt]
outputs = model.encode(prompts, pooling_task="embed")
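
To make the pooling settings more concrete, here is a conceptual sketch of last-token pooling followed by L2 normalization, written in plain Python on a toy sequence. It is not vLLM's implementation: the helper names are hypothetical, and it assumes right-padded inputs with an attention mask of 1s for real tokens and 0s for padding.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden state of the last non-padding token (right padding assumed)."""
    last_index = sum(attention_mask) - 1
    if last_index < 0:
        raise ValueError("attention_mask contains no real tokens")
    return hidden_states[last_index]


def l2_normalize(vector):
    """Scale the vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Toy sequence: 3 real tokens followed by 1 padding token, hidden size 2.
hidden_states = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

embedding = l2_normalize(last_token_pool(hidden_states, attention_mask))
print(embedding)  # [0.6, 0.8]
```

Because the vectors are normalized to unit length, the dot product of two embeddings equals their cosine similarity.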


via Text Embeddings Inference

  • Via Docker on CPU:
docker run -p 8080:80 \
    ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
    --model-id jinaai/jina-embeddings-v5-text-small-retrieval \
    --dtype float32 --pooling last-token
  • Via Docker on NVIDIA GPU (Turing, Ampere, Ada Lovelace, Hopper or Blackwell):
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 \
    --model-id jinaai/jina-embeddings-v5-text-small-retrieval \
    --dtype float16 --pooling last-token

Alternatively, you can run with cargo; see the Text Embeddings Inference documentation for details.

Send a request to /v1/embeddings to generate embeddings via the OpenAI Embeddings API:

curl -X POST http://127.0.0.1:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jinaai/jina-embeddings-v5-text-small-retrieval",
    "input": [
      "Query: Overview of climate change impacts on coastal cities",
      "Document: The impacts of climate change on coastal cities are significant..."
    ]
  }'

Alternatively, use the Text Embeddings Inference /embed endpoint with a prompt_name, which avoids formatting the inputs manually:

curl -X POST http://127.0.0.1:8080/embed \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Overview of climate change impacts on coastal cities",
    "prompt_name": "query"
  }'
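
The same two requests can be issued from Python. The sketch below uses only the standard library and assumes a Text Embeddings Inference server listening on 127.0.0.1:8080 as in the curl examples; the helper names (`openai_payload`, `tei_payload`, `post_json`) are hypothetical. Building the body with `json.dumps` guarantees syntactically valid JSON.

```python
import json
import urllib.request

MODEL_ID = "jinaai/jina-embeddings-v5-text-small-retrieval"


def openai_payload(queries, documents):
    """Body for /v1/embeddings, with the Query:/Document: prefixes added manually."""
    inputs = [f"Query: {q}" for q in queries] + [f"Document: {d}" for d in documents]
    return {"model": MODEL_ID, "input": inputs}


def tei_payload(text, prompt_name):
    """Body for /embed, where the server applies the named prompt."""
    return {"inputs": text, "prompt_name": prompt_name}


def post_json(url, payload):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


body = openai_payload(
    ["Overview of climate change impacts on coastal cities"],
    ["The impacts of climate change on coastal cities are significant..."],
)
print(json.dumps(body, indent=2))

# With a server running (assumed address), the requests would be sent like this:
# post_json("http://127.0.0.1:8080/v1/embeddings", body)
# post_json("http://127.0.0.1:8080/embed", tei_payload("Overview of ...", "query"))
```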


via llama.cpp (GGUF)
After installing llama.cpp, run llama-server to host the embedding model as an OpenAI-API-compatible HTTP server, selecting the desired model version (here F16):

llama-server -hf jinaai/jina-embeddings-v5-text-small-retrieval:F16 --embedding --pooling last -ub 32768

Client:

curl -X POST "http://127.0.0.1:8080/v1/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      "Query: A beautiful sunset over the beach",
      "Query: Un beau coucher de soleil sur la plage",
      "Document: 海滩上美丽的日落",
      "Document: 浜辺に沈む美しい夕日",
      "Document: Golden sunlight melts into the horizon, painting waves in warm amber and rose, while the sky whispers goodnight to the quiet, endless sea."
    ]
  }'
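
Once the server responds, the embeddings can be compared on the client side. The sketch below assumes the response follows the OpenAI embeddings format (a `data` list whose items carry `index` and `embedding` fields), which is what an OpenAI-compatible server is expected to return. The response here is a hand-made stand-in with tiny 2-dimensional vectors, and the helper names are hypothetical.

```python
import math


def embeddings_from_response(response):
    """Return the embeddings ordered by their `index` field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made stand-in for a server response (not real model output).
sample_response = {
    "data": [
        {"index": 1, "embedding": [0.0, 1.0]},
        {"index": 0, "embedding": [1.0, 0.0]},
        {"index": 2, "embedding": [1.0, 1.0]},
    ]
}

vectors = embeddings_from_response(sample_response)
print(cosine_similarity(vectors[0], vectors[1]))  # 0.0
```

With real output, the first two vectors would correspond to the two queries and the remaining ones to the documents, so each query can be scored against each document.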

License

The model is licensed under CC BY-NC 4.0. For commercial use, please contact us.

Citation

If you find jina-embeddings-v5-text-small-retrieval useful in your research, please cite the following paper:

@misc{akram2026jinaembeddingsv5texttasktargetedembeddingdistillation,
      title={jina-embeddings-v5-text: Task-Targeted Embedding Distillation}, 
      author={Mohammad Kalim Akram and Saba Sturua and Nastia Havriushenko and Quentin Herreros and Michael Günther and Maximilian Werk and Han Xiao},
      year={2026},
      eprint={2602.15547},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.15547}, 
}

📂 GGUF File List

| 📁 Filename | Precision | 📦 Size |
|---|---|---|
| v5-small-retrieval-F16.gguf | FP16 | 1.12 GB |
| v5-small-retrieval-IQ1_M.gguf | | 206.04 MB |
| v5-small-retrieval-IQ1_S.gguf | | 198.38 MB |
| v5-small-retrieval-IQ2_M.gguf | Q2 | 252.64 MB |
| v5-small-retrieval-IQ2_XXS.gguf | Q2 | 218.82 MB |
| v5-small-retrieval-IQ4_NL.gguf | Q4 | 363.89 MB |
| v5-small-retrieval-IQ4_XS.gguf | Q4 | 350.76 MB |
| v5-small-retrieval-Q2_K.gguf | Q2 | 282.51 MB |
| v5-small-retrieval-Q3_K_M.gguf | Q3 | 331.05 MB |
| v5-small-retrieval-Q4_K_M.gguf | Q4 (Recommended) | 378.33 MB |
| v5-small-retrieval-Q5_K_M.gguf | Q5 | 423.83 MB |
| v5-small-retrieval-Q5_K_S.gguf | Q5 | 416.39 MB |
| v5-small-retrieval-Q6_K.gguf | Q6 | 472.17 MB |
| v5-small-retrieval-Q8_0.gguf | Q8 | 609.82 MB |