feat: Add the ESM2 protein embedding model #600
nleroy917 wants to merge 2 commits into qdrant:main from
Caution: Review failed. The pull request is closed.
Walkthrough
This PR introduces protein embedding functionality to the fastembed library. A new ProteinEmbedding class is added that computes embeddings for amino acid sequences using ONNX-based models. The implementation includes tokenizer loading from model files (with fallback support), ONNX model integration, mean-pooling post-processing with attention masking, batching, and lazy loading. The feature is exposed through the public API via module exports and is accompanied by test coverage and documentation examples.
Estimated code review effort: 4 (Complex), ~50 minutes
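The mean-pooling step with attention masking mentioned in the walkthrough can be illustrated with a small NumPy sketch. This is not the code from the PR: the function name `mean_pool`, the array shapes, and the toy values are assumptions chosen for illustration only.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over non-padding positions (hypothetical helper).

    token_embeddings: shape (batch, seq_len, dim)
    attention_mask:   shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    # Add a trailing axis so the mask broadcasts across the embedding dimension.
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    # Zero out padded positions, then sum over the sequence axis.
    summed = (token_embeddings * mask).sum(axis=1)
    # Count real tokens per sequence; clip to avoid division by zero.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Toy batch: two sequences, three token positions, two embedding dimensions.
# The last position of the first sequence is padding and must be ignored.
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(tokens, mask)
print(pooled)
```

Without the mask, the padded position in the first sequence would pull its average toward 100, which is why pooled embeddings for variable-length sequences in a batch are usually computed this way.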
Summary
Added a new `bio` module that introduces embedding models for biological sequence data (proteins, DNA, etc.). The plan is to eventually get to some advanced models like Tahoe-x1, but starting simple for now.
- Added `ProteinEmbedding` class for protein sequence embeddings using ESM-2 models
- `tokenizers` library for tokenization (consistent with other models)