feat(test-emitter): Add evaluation performance testing framework #182
Merged
zhirafovod merged 1 commit into main on Feb 2, 2026
Conversation
Features:
- Test Emitter captures all telemetry (spans, metrics, events, evaluation results) in memory
- CLI tool for testing evaluation framework performance and validating metrics
- 120 test samples across 6 categories (neutral, subtle_bias, subtle_toxicity, hallucination, irrelevant, negative_sentiment)
- Trace-based sampling with configurable sample rate
- Concurrent evaluation mode with configurable workers
- Threshold-based validation with score deviation metrics (MAE, RMSE; see the sketch after this list)
- Idle timeout (60s) to prevent hanging on stalled evaluations
- JSON export of results and failures for debugging

CLI Options:
- --samples N: Number of test samples to use (default: 20)
- --concurrent: Enable concurrent evaluation mode
- --workers N: Number of concurrent workers (default: 4)
- --sample-rate N: Evaluate every Nth trace (default: 2 = 50%)

Environment Variables:
- …_CONCURRENT: Enable …
- …SE_URL: Custom LLM endpoint
- DEEPEVAL_LLM_MODEL: Model name for DeepEval
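For illustration, a minimal Python sketch of the sampling and validation semantics described above; `should_sample_trace` and `score_deviation` are hypothetical helper names, not the package's API:

```python
# Illustrative sketch only: these helpers are assumptions, not the
# actual interface of the emitters-test package.
import math

def should_sample_trace(trace_index: int, sample_rate: int = 2) -> bool:
    # --sample-rate N evaluates every Nth trace; the default of 2 = 50%.
    return trace_index % sample_rate == 0

def score_deviation(expected, observed):
    # Score deviation metrics used for threshold-based validation.
    errors = [o - e for e, o in zip(expected, observed)]
    mae = sum(abs(err) for err in errors) / len(errors)
    rmse = math.sqrt(sum(err * err for err in errors) / len(errors))
    return mae, rmse

mae, rmse = score_deviation([0.9, 0.1, 0.5], [0.8, 0.2, 0.5])
print(f"MAE={mae:.3f} RMSE={rmse:.3f}")  # MAE=0.067 RMSE=0.082
```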
adityamehra approved these changes on Feb 2, 2026
Adds a package with a perf test script and a test emitter, which samples and concurrently enqueues invocations and helps validate the concurrent performance of the evaluator. The evaluator was run with a tiny 1.2B local model, mostly focused on producing correct metrics for every sampled invocation. It takes about 250s for 120 spans with 50% sampling on an M3 Mac.
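As a rough sketch of the concurrent mode and the 60s idle timeout (the real worker loop lives in the new package; `evaluate` is a hypothetical stand-in for the evaluator call):

```python
# Sketch of concurrent evaluation with an idle timeout, assuming a
# hypothetical evaluate(sample) callable; not the package's actual code.
import concurrent.futures

IDLE_TIMEOUT_S = 60  # abort if no evaluation completes within 60s

def run_concurrent(samples, evaluate, workers: int = 4):
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(evaluate, s) for s in samples}
        while pending:
            done, pending = concurrent.futures.wait(
                pending,
                timeout=IDLE_TIMEOUT_S,
                return_when=concurrent.futures.FIRST_COMPLETED,
            )
            if not done:
                # Nothing finished within the idle window: stop waiting
                # instead of hanging on stalled evaluations (Python 3.9+).
                pool.shutdown(wait=False, cancel_futures=True)
                raise TimeoutError(f"no progress for {IDLE_TIMEOUT_S}s")
            results.extend(f.result() for f in done)
    return results
```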
Core changes:
- util/opentelemetry-util-genai-emitters-test: new package with the perf testing program and the test emitter (see the sketch below). Make sure to reinstall util/opentelemetry-util-genai-emitters* for proper registration.
- util/opentelemetry-util-genai-evals/src/opentelemetry/util/genai/evals/manager.py: fixes/additions of helper functions to the eval manager.
- docs/feat-evals-perf.md: feature design and implementation details.
- JSON results and failures are exported to /var/tmp by default.
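A hypothetical shape for the in-memory test emitter, assuming simple per-signal append methods; the actual class in util/opentelemetry-util-genai-emitters-test may differ:

```python
# Hypothetical sketch; field names and methods are assumptions, not the
# actual interface of the test emitter.
from dataclasses import dataclass, field

@dataclass
class InMemoryTestEmitter:
    """Captures all telemetry in memory so a test can assert on it."""
    spans: list = field(default_factory=list)
    metrics: list = field(default_factory=list)
    events: list = field(default_factory=list)
    evaluation_results: list = field(default_factory=list)

    def emit_span(self, span):
        self.spans.append(span)

    def emit_metric(self, metric):
        self.metrics.append(metric)

    def emit_event(self, event):
        self.events.append(event)

    def emit_evaluation_result(self, result):
        self.evaluation_results.append(result)
```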