Add Snowflake Arctic Embed L v2.0 model #562
Conversation
Add support for Snowflake/snowflake-arctic-embed-l-v2.0, a multilingual embedding model with the following features:
- 1024 dimensions
- 74 languages supported
- 8192 token context length
- Based on XLM-RoBERTa architecture
- Supports Matryoshka learning for dimension truncation
- Apache 2.0 license

Changes:
- Added model configuration to supported_onnx_models list
- Added canonical test vectors for validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
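Matryoshka learning trains the model so that the leading dimensions of a vector carry most of the signal, which is what makes dimension truncation possible. A minimal sketch of the idea (`truncate_matryoshka` is a hypothetical helper, not part of the fastembed API, and the random vector stands in for a real 1024-dim model output):

```python
import numpy as np

def truncate_matryoshka(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep only the leading `dim` dimensions of a Matryoshka-trained
    embedding and re-normalize so cosine similarity still behaves."""
    truncated = embedding[:dim]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

# Stand-in for a 1024-dim embedding produced by the model
full = np.random.default_rng(0).normal(size=1024)
small = truncate_matryoshka(full, 256)
print(small.shape)  # (256,)
```

Truncating to 256 of the 1024 dimensions trades a little retrieval quality for a 4x smaller index, which is the usual reason to use this feature.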
📝 Walkthrough

A new DenseModelDescription for Snowflake/snowflake-arctic-embed-l-v2.0 (dim=1024, embedding task with query_prefix "query: " and passage_prefix "", ONNX file paths, license, size, and source) was added to the supported models list.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Implement automatic prefix handling for models with task-specific prefixes:
- Added tasks field to model configuration with query_prefix and passage_prefix
- Implemented query_embed() method to automatically prepend "query: " prefix
- Implemented passage_embed() method (no prefix for this model)
- Both methods check for tasks configuration and apply prefixes dynamically

This enables optimal retrieval performance as recommended in the model documentation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
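The prefix handling described above can be sketched as follows. Note this is a simplified illustration, not the actual fastembed implementation: `ModelTasks` and `apply_prefix` are hypothetical names, and the real methods delegate to the model's `embed`:

```python
from dataclasses import dataclass
from typing import Iterable, Union

@dataclass
class ModelTasks:
    """Task-specific prefixes as described in the commit above."""
    query_prefix: str = "query: "
    passage_prefix: str = ""  # this model uses no passage prefix

def apply_prefix(texts: Union[str, Iterable[str]], prefix: str) -> list:
    """Prepend a task-specific prefix to one or many input texts."""
    if isinstance(texts, str):
        texts = [texts]
    return [prefix + t for t in texts]

tasks = ModelTasks()
print(apply_prefix("what is fastembed?", tasks.query_prefix))
# ['query: what is fastembed?']
print(apply_prefix(["doc one", "doc two"], tasks.passage_prefix))
# ['doc one', 'doc two']
```

Applying the prefix inside `query_embed`/`passage_embed` keeps callers' code unchanged while still following the model card's retrieval recommendation.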
Actionable comments posted: 1
🧹 Nitpick comments (1)
fastembed/text/onnx_embedding.py (1)
331-333: Simplify the condition by removing the redundant `hasattr` check.

Since `tasks` is defined in `DenseModelDescription` with a default factory (see relevant code snippets), `hasattr(self.model_description, "tasks")` will always return `True`. The check `self.model_description.tasks` alone is sufficient to determine if tasks are configured.

Apply this diff:

```diff
- if hasattr(self.model_description, "tasks") and self.model_description.tasks:
+ if self.model_description.tasks:
```

The same simplification applies to line 360 in `passage_embed`.
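The reasoning behind this nitpick can be demonstrated with a simplified stand-in for `DenseModelDescription` (field names here are illustrative, assuming the `tasks` field uses a default factory as the review states):

```python
from dataclasses import dataclass, field

@dataclass
class DenseModelDescriptionSketch:
    """Simplified stand-in; the real class lives in
    fastembed/common/model_description.py."""
    model: str
    dim: int
    # default_factory guarantees the attribute always exists
    tasks: dict = field(default_factory=dict)

desc = DenseModelDescriptionSketch(
    model="snowflake/snowflake-arctic-embed-l-v2.0", dim=1024
)
print(hasattr(desc, "tasks"))  # True even when no tasks were passed
print(bool(desc.tasks))        # False: the truthiness check alone suffices
```

Because the dataclass field can never be absent, `hasattr` can only ever return `True`, so the truthiness check on `tasks` is the whole condition.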
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
fastembed/text/onnx_embedding.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/text/onnx_embedding.py (2)
fastembed/common/model_description.py (1)
- DenseModelDescription (35-40)

fastembed/text/text_embedding_base.py (3)
- query_embed (46-61)
- embed (22-29)
- passage_embed (31-44)
🔇 Additional comments (3)
fastembed/text/onnx_embedding.py (3)
317-346: Implementation correctly handles prefix application and input types.

The method properly:
- Applies the query prefix when configured in the model's tasks
- Handles both single string and iterable inputs correctly
- Maintains backward compatibility for models without task-specific prefixes
- Delegates to the existing `embed` method appropriately
348-369: Implementation correctly handles passage prefix application.

The method properly:
- Applies the passage prefix when configured in the model's tasks
- Converts the input iterable to a list when applying prefixes (acceptable for this use case)
- Maintains backward compatibility for models without task-specific prefixes
- Delegates to the existing `embed` method appropriately

Note: The same `hasattr` simplification mentioned in the previous comment applies to line 360.
171-190: Standardize model identifier casing for consistency.

The new model uses `Snowflake/` (capital S) while all existing Snowflake Arctic Embed models in this file use `snowflake/` (lowercase). HuggingFace Hub treats repository identifiers case-insensitively and resolves the canonical repo regardless of casing, so both forms work in practice. However, for code consistency, either:
- Change this model to `snowflake/snowflake-arctic-embed-l-v2.0` (lowercase) to match lines 112, 124, 136, 148, 160, or
- Document why capital S is used if it's intentional (capital S is the canonical HuggingFace identifier)
Fixes based on CodeRabbit review:
1. Remove redundant hasattr checks - tasks field has default factory
2. Fix model identifier casing from Snowflake/ to snowflake/ for consistency
3. Add comprehensive tests for prefix functionality:
   - test_query_passage_prefix: Verifies query prefix is applied correctly
   - test_prefix_backward_compatibility: Ensures models without prefix config work

All tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
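The two tests named in the commit can be sketched in pytest style. This is a hedged illustration of what such tests check, not the repository's actual test code; `FakeModel` is a hypothetical stand-in that echoes its inputs so prefixes are observable:

```python
class FakeModel:
    """Minimal stand-in for a text embedding model."""

    def __init__(self, query_prefix=None):
        self.query_prefix = query_prefix

    def embed(self, texts):
        # Echo inputs instead of embedding, so tests can inspect prefixes
        return list(texts)

    def query_embed(self, query):
        if self.query_prefix:
            query = self.query_prefix + query
        return self.embed([query])

def test_query_passage_prefix():
    model = FakeModel(query_prefix="query: ")
    assert model.query_embed("hello") == ["query: hello"]

def test_prefix_backward_compatibility():
    model = FakeModel()  # no prefix configured
    assert model.query_embed("hello") == ["hello"]

test_query_passage_prefix()
test_prefix_backward_compatibility()
print("ok")
```

The backward-compatibility test matters because most models in the registry define no task prefixes, and their behavior must be unchanged.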
Summary
Add support for `Snowflake/snowflake-arctic-embed-l-v2.0`, a multilingual embedding model that improves upon the original Arctic Embed L model with expanded language support and longer context.

Model Features

- 1024 dimensions
- 74 languages supported
- 8192 token context length
- Uses the "query: " prefix for queries

Changes

- Added the model configuration to the `supported_onnx_models` list in `fastembed/text/onnx_embedding.py`
- Added canonical test vectors in `tests/test_text_onnx_embeddings.py`
- Uses the `onnx/model.onnx` and `onnx/model.onnx_data` files

Test Plan
References
🤖 Generated with Claude Code