Add ESM2 model for protein embeddings (WIP) #598

nleroy917 wants to merge 1 commit into qdrant:main from
Conversation
📝 Walkthrough

This pull request introduces a new protein embedding feature to the fastembed library. A new module, fastembed/bio, adds a ProteinEmbedding class for generating protein sequence embeddings with the ESM-2 model.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 2 passed
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @fastembed/bio/protein_embedding.py:
- Around line 65-66: Truncation currently slices input_ids to self.max_length
and can drop the EOS token; change the logic so when len(input_ids) >
self.max_length you preserve the EOS: if you have a known EOS id (use
self.eos_token_id) truncate to self.max_length - 1 and set input_ids[-1] =
self.eos_token_id, or if the original input_ids contained an EOS token, move or
copy that EOS into the final position instead of simply slicing; update the
block around input_ids and self.max_length in protein_embedding.py to implement
this EOS-preserving truncation.
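For reference, a minimal sketch of the EOS-preserving truncation described above, written as a standalone helper (the function name and the `eos_token_id is None` fallback are assumptions, not the PR's actual code):

```python
def truncate_with_eos(input_ids: list[int], max_length: int, eos_token_id: int | None) -> list[int]:
    """Truncate a token sequence while keeping the EOS token at the end.

    Sketch only: names mirror the review comment (max_length, eos_token_id);
    the PR's attribute handling may differ.
    """
    if len(input_ids) <= max_length:
        return input_ids
    if eos_token_id is not None:
        # Keep max_length - 1 tokens, then terminate with EOS.
        return input_ids[: max_length - 1] + [eos_token_id]
    # No known EOS id: fall back to a plain slice.
    return input_ids[:max_length]
```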
🧹 Nitpick comments (2)
fastembed/bio/protein_embedding.py (2)
204-204: Consider replacing `assert` with an explicit exception.

`assert` statements are stripped in optimized mode (`python -O`). While this is internal code, a proper check prevents silent failures if the method is called before `_load_onnx_model`.

Suggested fix:

```diff
- assert self.tokenizer is not None
+ if self.tokenizer is None:
+     raise RuntimeError("Tokenizer not initialized. Call _load_onnx_model first.")
```
367-367: Annotate mutable class attribute with `ClassVar`.

Per static analysis, mutable class attributes should use `ClassVar` to clarify they are not instance attributes.

Suggested fix:

```diff
+from typing import ClassVar
+
 class ProteinEmbedding(ProteinEmbeddingBase):
-    EMBEDDINGS_REGISTRY: list[Type[ProteinEmbeddingBase]] = [OnnxProteinEmbedding]
+    EMBEDDINGS_REGISTRY: ClassVar[list[Type[ProteinEmbeddingBase]]] = [OnnxProteinEmbedding]
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- fastembed/__init__.py
- fastembed/bio/__init__.py
- fastembed/bio/protein_embedding.py
🧰 Additional context used
🧬 Code graph analysis (2)
fastembed/__init__.py (1)
fastembed/bio/protein_embedding.py (1)
ProteinEmbedding (355-447)
fastembed/bio/__init__.py (1)
fastembed/bio/protein_embedding.py (1)
ProteinEmbedding (355-447)
🪛 Ruff (0.14.10)
fastembed/bio/protein_embedding.py
148-148: Avoid specifying long messages outside the exception class
(TRY003)
192-192: Avoid specifying long messages outside the exception class
(TRY003)
195-195: Unused method argument: kwargs
(ARG002)
224-224: Unused method argument: kwargs
(ARG002)
231-231: Avoid specifying long messages outside the exception class
(TRY003)
305-305: Unused method argument: parallel
(ARG002)
367-367: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
425-428: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.10.x on ubuntu-latest test
- GitHub Check: Python 3.11.x on ubuntu-latest test
- GitHub Check: Python 3.12.x on ubuntu-latest test
- GitHub Check: Python 3.13.x on ubuntu-latest test
- GitHub Check: Python 3.13.x on windows-latest test
🔇 Additional comments (8)
fastembed/bio/protein_embedding.py (6)
14-25: LGTM! Model metadata is well-defined with appropriate fields for ESM-2: dimension (480), source, license (MIT), and required files.
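For illustration, a metadata entry along these lines might look roughly like the following; the field names and the specific ESM-2 checkpoint are assumptions, not copied from the PR:

```python
# Illustrative metadata entry for a 480-dimensional ESM-2 model.
# Field names and the checkpoint name are assumptions, not the PR's schema.
ESM2_MODEL_INFO = {
    "model": "facebook/esm2_t12_35M_UR50D",  # assumed checkpoint with hidden size 480
    "dim": 480,
    "license": "mit",
    "description": "ESM-2 protein language model for protein sequence embeddings",
    "sources": {"hf": "facebook/esm2_t12_35M_UR50D"},
    "model_files": ["model.onnx", "tokenizer.json"],  # "required files" per the comment
}
```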
101-158: LGTM! Base class properly defines the interface, with appropriate caching for `embedding_size` and case-insensitive model lookup.
223-239: LGTM! Mean pooling implementation is correct: expands the attention mask for broadcasting, computes a masked sum, clips to avoid division by zero, and normalizes the output.
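The steps listed here correspond roughly to the following NumPy sketch (illustrative names, not the PR's exact code):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Masked mean pooling followed by L2 normalization.

    token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len).
    Sketch of the steps described in the review, not the PR's implementation.
    """
    # Expand the mask so it broadcasts over the embedding dimension.
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # masked sum over tokens
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    pooled = summed / counts
    # Normalize each pooled vector to unit length.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```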
301-327: LGTM! The `embed` method correctly handles single/iterable inputs, lazy model loading, and batch processing. The unused `parallel` parameter is appropriately documented as "not yet supported" for this WIP.
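As a rough illustration of the single-vs-iterable handling and batching praised here, a standalone helper might look like this (names are hypothetical; the PR's `embed` additionally performs lazy ONNX model loading before running batches):

```python
from typing import Iterable, Iterator, Union

def iter_batches(sequences: Union[str, Iterable[str]], batch_size: int = 256) -> Iterator[list[str]]:
    """Yield fixed-size batches, accepting either one sequence or an iterable.

    Hypothetical helper mirroring the pattern described in the review.
    """
    if isinstance(sequences, str):
        sequences = [sequences]  # a single sequence becomes a one-element source
    batch: list[str] = []
    for seq in sequences:
        batch.append(seq)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```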
333-352: LGTM! The worker correctly initializes with `threads=1` for parallel worker processes and properly yields batch results with indices.
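A skeletal version of that worker pattern, with hypothetical names (the indexed results let the parent process restore the original order; a real worker would also create its ONNX session with a single intra-op thread so parallel worker processes do not oversubscribe CPU cores):

```python
from typing import Callable, Iterable, Iterator, Tuple

def worker_loop(
    indexed_batches: Iterable[Tuple[int, list[str]]],
    embed_batch: Callable[[list[str]], list],
) -> Iterator[Tuple[int, list]]:
    """Embed batches in a worker process, tagging each result with its batch index.

    Hypothetical sketch; `embed_batch` stands in for the model's per-batch call.
    """
    for idx, batch in indexed_batches:
        yield idx, embed_batch(batch)
```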
385-428: LGTM! The initialization properly iterates through the registry with case-insensitive matching and provides a helpful error message with guidance on listing supported models.
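The lookup pattern referred to here could be sketched as follows (class, attribute, and error text are illustrative; only the registry-plus-case-insensitive-match idea is taken from the review):

```python
from typing import Any, ClassVar, Type

class ProteinEmbeddingDispatcher:
    """Sketch of registry-based construction with case-insensitive model lookup.

    Names are illustrative, not copied from the PR.
    """

    # Concrete implementations (e.g. an ONNX-backed class) would be listed here.
    EMBEDDINGS_REGISTRY: ClassVar[list[Type[Any]]] = []

    def __init__(self, model_name: str, **kwargs: Any) -> None:
        for embedding_cls in self.EMBEDDINGS_REGISTRY:
            # Each implementation is assumed to expose the model names it supports.
            supported = getattr(embedding_cls, "SUPPORTED_MODELS", [])
            if model_name.lower() in (name.lower() for name in supported):
                self.model = embedding_cls(model_name, **kwargs)
                return
        raise ValueError(
            f"Model {model_name!r} is not supported. "
            "Use list_supported_models() to see the available models."
        )
```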
fastembed/bio/__init__.py (1)
1-3: LGTM! Clean re-export aligns with the usage example in ProteinEmbedding's docstring.

fastembed/__init__.py (1)
3-3: LGTM! The import and `__all__` export follow the established pattern for other embedding types in this package.

Also applies to: 23-23
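For context, the end-user call path these exports enable would look roughly like the snippet below; the constructor arguments, model name, and return types are assumptions rather than details confirmed by the PR:

```python
# Hypothetical usage; the actual model name and constructor signature may differ.
from fastembed import ProteinEmbedding

model = ProteinEmbedding(model_name="facebook/esm2_t12_35M_UR50D")
embeddings = list(model.embed(["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"]))
print(len(embeddings), embeddings[0].shape)  # expected: 1 sequence, a 480-dim vector
```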
This directly addresses #597
All Submissions:
New Feature Submissions:
Have you installed pre-commit with `pip3 install pre-commit` and set up hooks with `pre-commit install`?

New models submission: