Add ESM2 model for protein embeddings (WIP) #598

nleroy917 wants to merge 1 commit into qdrant:main from
Conversation
📝 Walkthrough

This pull request introduces a new protein embedding feature to the fastembed library. A new module, fastembed/bio, adds a ProteinEmbedding class for generating protein sequence embeddings with the ESM-2 model.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 2 passed
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @fastembed/bio/protein_embedding.py:
- Around line 65-66: Truncation currently slices input_ids to self.max_length
and can drop the EOS token; change the logic so when len(input_ids) >
self.max_length you preserve the EOS: if you have a known EOS id (use
self.eos_token_id) truncate to self.max_length - 1 and set input_ids[-1] =
self.eos_token_id, or if the original input_ids contained an EOS token, move or
copy that EOS into the final position instead of simply slicing; update the
block around input_ids and self.max_length in protein_embedding.py to implement
this EOS-preserving truncation.
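For reference, a minimal sketch of the EOS-preserving truncation described above, written as a standalone helper (the function name and the `eos_token_id is None` fallback are assumptions, not the PR's actual code):

```python
def truncate_with_eos(input_ids: list[int], max_length: int, eos_token_id: int | None) -> list[int]:
    """Truncate a token sequence while keeping the EOS token at the end.

    Sketch only: names mirror the review comment (max_length, eos_token_id);
    the PR's attribute handling may differ.
    """
    if len(input_ids) <= max_length:
        return input_ids
    if eos_token_id is not None:
        # Keep max_length - 1 tokens, then terminate with EOS.
        return input_ids[: max_length - 1] + [eos_token_id]
    # No known EOS id: fall back to a plain slice.
    return input_ids[:max_length]
```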
🧹 Nitpick comments (2)
fastembed/bio/protein_embedding.py (2)
204-204: Consider replacing `assert` with an explicit exception.

`assert` statements are stripped in optimized mode (`python -O`). While this is internal code, a proper check prevents silent failures if the method is called before `_load_onnx_model`.

Suggested fix:

```diff
- assert self.tokenizer is not None
+ if self.tokenizer is None:
+     raise RuntimeError("Tokenizer not initialized. Call _load_onnx_model first.")
```
367-367: Annotate mutable class attribute with `ClassVar`.

Per static analysis, mutable class attributes should use `ClassVar` to clarify they are not instance attributes.

Suggested fix:

```diff
+from typing import ClassVar
+
 class ProteinEmbedding(ProteinEmbeddingBase):
-    EMBEDDINGS_REGISTRY: list[Type[ProteinEmbeddingBase]] = [OnnxProteinEmbedding]
+    EMBEDDINGS_REGISTRY: ClassVar[list[Type[ProteinEmbeddingBase]]] = [OnnxProteinEmbedding]
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- fastembed/__init__.py
- fastembed/bio/__init__.py
- fastembed/bio/protein_embedding.py
🧰 Additional context used
🧬 Code graph analysis (2)
fastembed/__init__.py (1)
fastembed/bio/protein_embedding.py (1)
ProteinEmbedding (355-447)
fastembed/bio/__init__.py (1)
fastembed/bio/protein_embedding.py (1)
ProteinEmbedding (355-447)
🪛 Ruff (0.14.10)
fastembed/bio/protein_embedding.py
148-148: Avoid specifying long messages outside the exception class
(TRY003)
192-192: Avoid specifying long messages outside the exception class
(TRY003)
195-195: Unused method argument: kwargs
(ARG002)
224-224: Unused method argument: kwargs
(ARG002)
231-231: Avoid specifying long messages outside the exception class
(TRY003)
305-305: Unused method argument: parallel
(ARG002)
367-367: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
425-428: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.10.x on ubuntu-latest test
- GitHub Check: Python 3.11.x on ubuntu-latest test
- GitHub Check: Python 3.12.x on ubuntu-latest test
- GitHub Check: Python 3.13.x on ubuntu-latest test
- GitHub Check: Python 3.13.x on windows-latest test
🔇 Additional comments (8)
fastembed/bio/protein_embedding.py (6)
14-25: LGTM! Model metadata is well-defined with appropriate fields for ESM-2: dimension (480), source, license (MIT), and required files.
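For illustration, a metadata entry along these lines might look roughly like the following; the field names and the specific ESM-2 checkpoint are assumptions, not copied from the PR:

```python
# Illustrative metadata entry for a 480-dimensional ESM-2 model.
# Field names and the checkpoint name are assumptions, not the PR's schema.
ESM2_MODEL_INFO = {
    "model": "facebook/esm2_t12_35M_UR50D",  # assumed checkpoint with hidden size 480
    "dim": 480,
    "license": "mit",
    "description": "ESM-2 protein language model for protein sequence embeddings",
    "sources": {"hf": "facebook/esm2_t12_35M_UR50D"},
    "model_files": ["model.onnx", "tokenizer.json"],  # "required files" per the comment
}
```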
101-158: LGTM! Base class properly defines the interface, with appropriate caching for `embedding_size` and case-insensitive model lookup.
223-239: LGTM! Mean pooling implementation is correct: expands the attention mask for broadcasting, computes a masked sum, clips to avoid division by zero, and normalizes the output.
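The steps listed here correspond roughly to the following NumPy sketch (illustrative names, not the PR's exact code):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Masked mean pooling followed by L2 normalization.

    token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len).
    Sketch of the steps described in the review, not the PR's implementation.
    """
    # Expand the mask so it broadcasts over the embedding dimension.
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # masked sum over tokens
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    pooled = summed / counts
    # Normalize each pooled vector to unit length.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```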
301-327: LGTM! The `embed` method correctly handles single/iterable inputs, lazy model loading, and batch processing. The unused `parallel` parameter is appropriately documented as "not yet supported" for this WIP.
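As a rough illustration of the single-vs-iterable handling and batching praised here, a standalone helper might look like this (names are hypothetical; the PR's `embed` additionally performs lazy ONNX model loading before running batches):

```python
from typing import Iterable, Iterator, Union

def iter_batches(sequences: Union[str, Iterable[str]], batch_size: int = 256) -> Iterator[list[str]]:
    """Yield fixed-size batches, accepting either one sequence or an iterable.

    Hypothetical helper mirroring the pattern described in the review.
    """
    if isinstance(sequences, str):
        sequences = [sequences]  # a single sequence becomes a one-element source
    batch: list[str] = []
    for seq in sequences:
        batch.append(seq)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```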
333-352: LGTM! The worker correctly initializes with `threads=1` for parallel worker processes and properly yields batch results with indices.
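A skeletal version of that worker pattern, with hypothetical names (the indexed results let the parent process restore the original order; a real worker would also create its ONNX session with a single intra-op thread so parallel worker processes do not oversubscribe CPU cores):

```python
from typing import Callable, Iterable, Iterator, Tuple

def worker_loop(
    indexed_batches: Iterable[Tuple[int, list[str]]],
    embed_batch: Callable[[list[str]], list],
) -> Iterator[Tuple[int, list]]:
    """Embed batches in a worker process, tagging each result with its batch index.

    Hypothetical sketch; `embed_batch` stands in for the model's per-batch call.
    """
    for idx, batch in indexed_batches:
        yield idx, embed_batch(batch)
```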
385-428: LGTM! The initialization properly iterates through the registry with case-insensitive matching and provides a helpful error message with guidance on listing supported models.
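The lookup pattern referred to here could be sketched as follows (class, attribute, and error text are illustrative; only the registry-plus-case-insensitive-match idea is taken from the review):

```python
from typing import Any, ClassVar, Type

class ProteinEmbeddingDispatcher:
    """Sketch of registry-based construction with case-insensitive model lookup.

    Names are illustrative, not copied from the PR.
    """

    # Concrete implementations (e.g. an ONNX-backed class) would be listed here.
    EMBEDDINGS_REGISTRY: ClassVar[list[Type[Any]]] = []

    def __init__(self, model_name: str, **kwargs: Any) -> None:
        for embedding_cls in self.EMBEDDINGS_REGISTRY:
            # Each implementation is assumed to expose the model names it supports.
            supported = getattr(embedding_cls, "SUPPORTED_MODELS", [])
            if model_name.lower() in (name.lower() for name in supported):
                self.model = embedding_cls(model_name, **kwargs)
                return
        raise ValueError(
            f"Model {model_name!r} is not supported. "
            "Use list_supported_models() to see the available models."
        )
```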
fastembed/bio/__init__.py (1)
1-3: LGTM! Clean re-export aligns with the usage example in ProteinEmbedding's docstring.

fastembed/__init__.py (1)
3-3: LGTM! The import and `__all__` export follow the established pattern for other embedding types in this package.

Also applies to: 23-23
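For context, the end-user call path these exports enable would look roughly like the snippet below; the constructor arguments, model name, and return types are assumptions rather than details confirmed by the PR:

```python
# Hypothetical usage; the actual model name and constructor signature may differ.
from fastembed import ProteinEmbedding

model = ProteinEmbedding(model_name="facebook/esm2_t12_35M_UR50D")
embeddings = list(model.embed(["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"]))
print(len(embeddings), embeddings[0].shape)  # expected: 1 sequence, a 480-dim vector
```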
This directly addresses #597
All Submissions:
New Feature Submissions:
Have you installed pre-commit with `pip3 install pre-commit` and set up hooks with `pre-commit install`?

New models submission: