feat(test-emitter): Add evaluation performance testing framework #182

Merged
zhirafovod merged 1 commit into main from feature/eval-perf on Feb 2, 2026

Conversation


@zhirafovod zhirafovod commented Jan 31, 2026

Adds a package with a perf test script and a test emitter, which samples invocations and concurrently enqueues them to validate the concurrent performance of the evaluator. The evaluator was run with a tiny 1.2B local model, with the focus mostly on getting correct metrics for every sampled invocation. A run takes about 250s for 120 spans with 50% sampling on an M3 Mac.
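
To make the in-memory capture concrete, here is a minimal sketch of what such a test emitter can look like; the class and method names are illustrative assumptions, not the actual API of the new package:

```python
# Illustrative sketch only; class and method names are hypothetical,
# not the real API of opentelemetry-util-genai-emitters-test.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class InMemoryTestEmitter:
    """Collects emitted telemetry in memory so a test run can assert on it later."""

    spans: list[Any] = field(default_factory=list)
    metrics: list[Any] = field(default_factory=list)
    events: list[Any] = field(default_factory=list)
    evaluation_results: list[Any] = field(default_factory=list)

    def on_span(self, span: Any) -> None:
        self.spans.append(span)

    def on_metric(self, metric: Any) -> None:
        self.metrics.append(metric)

    def on_event(self, event: Any) -> None:
        self.events.append(event)

    def on_evaluation_result(self, result: Any) -> None:
        self.evaluation_results.append(result)
```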

Core changes:

  • util/opentelemetry-util-genai-emitters-test - new package with the perf testing program and the test emitter. Make sure to reinstall util/opentelemetry-util-genai-emitters* for proper registration.
  • util/opentelemetry-util-genai-evals/src/opentelemetry/util/genai/evals/manager.py - fixes and additions to helper functions in the eval manager.
  • docs/feat-evals-perf.md - feature design and implementation details.
  • Dumps captured invocations and evaluation results, as well as a detailed failure list (both bad evals and failed evals), to /var/tmp by default (see the sketch after this list).
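
The dump of results and failures can be as simple as writing JSON files; the following is a minimal sketch under the assumption of plain JSON-serializable records, with file names chosen for illustration rather than taken from the script:

```python
# Illustrative sketch; file names and record layout are assumptions.
# Only the /var/tmp default location comes from this PR.
import json
from pathlib import Path

DUMP_DIR = Path("/var/tmp")


def dump_run_artifacts(invocations, evaluations, failures, dump_dir: Path = DUMP_DIR) -> None:
    """Write captured invocations, evaluation results, and failure details as JSON."""
    dump_dir.mkdir(parents=True, exist_ok=True)
    (dump_dir / "invocations.json").write_text(json.dumps(invocations, indent=2, default=str))
    (dump_dir / "evaluations.json").write_text(json.dumps(evaluations, indent=2, default=str))
    (dump_dir / "failures.json").write_text(json.dumps(failures, indent=2, default=str))
```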

Features:

  • Test Emitter captures all telemetry (spans, metrics, events, evaluation results) in memory
  • CLI tool for testing evaluation framework performance and validating metrics
  • 120 test samples across 6 categories (neutral, subtle_bias, subtle_toxicity, hallucination, irrelevant, negative_sentiment)
  • Trace-based sampling with configurable sample rate
  • Concurrent evaluation mode with configurable workers
  • Threshold-based validation with score deviation metrics (MAE, RMSE); see the sketch after this list
  • Idle timeout (60s) to prevent hanging on stalled evaluations
  • JSON export of results and failures for debugging
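
The trace-based sampling and the MAE/RMSE deviation check can be summarized in a few lines; this is a minimal sketch assuming scores are plain floats, with function names invented for illustration rather than taken from the framework:

```python
# Illustrative sketch; function names are assumptions, not the framework's API.
import math


def should_evaluate(trace_index: int, sample_rate: int = 2) -> bool:
    """Evaluate every Nth trace; sample_rate=2 means roughly 50% of traces."""
    return trace_index % sample_rate == 0


def score_deviation(expected: list[float], actual: list[float]) -> tuple[float, float]:
    """Return (MAE, RMSE) between expected and observed evaluation scores."""
    errors = [a - e for e, a in zip(expected, actual)]
    mae = sum(abs(err) for err in errors) / len(errors)
    rmse = math.sqrt(sum(err * err for err in errors) / len(errors))
    return mae, rmse
```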

CLI Options:

  • --samples N: Number of test samples to use (default: 20)
  • --concurrent: Enable concurrent evaluation mode
  • --workers N: Number of concurrent workers (default: 4)
  • --sample-rate N: Evaluate every Nth trace (default: 2 = 50%)

Environment variables:

  • DEEPEVAL_LLM_BASE_URL: Model URL for DeepEval
  • DEEPEVAL_LLM_MODEL: Model name for DeepEval
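
A hedged sketch of how these options can be wired up with argparse, using only the flags and defaults documented above; the parser itself is an assumption about how the perf test script is structured:

```python
# Illustrative argparse sketch based on the options documented above;
# not necessarily how the actual perf test script builds its CLI.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Evaluation performance test")
    parser.add_argument("--samples", type=int, default=20,
                        help="Number of test samples to use")
    parser.add_argument("--concurrent", action="store_true",
                        help="Enable concurrent evaluation mode")
    parser.add_argument("--workers", type=int, default=4,
                        help="Number of concurrent workers")
    parser.add_argument("--sample-rate", type=int, default=2,
                        help="Evaluate every Nth trace (2 = 50%%)")
    return parser
```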

@zhirafovod zhirafovod merged commit 53bc8e3 into main Feb 2, 2026
14 checks passed
@zhirafovod zhirafovod deleted the feature/eval-perf branch February 2, 2026 18:25
@github-actions github-actions bot locked and limited conversation to collaborators Feb 2, 2026