feat(test-emitter): Add evaluation performance testing framework #182

Merged
zhirafovod merged 1 commit into main from feature/eval-perf on Feb 2, 2026

Conversation


@zhirafovod zhirafovod commented Jan 31, 2026

Adds a package with a perf test script and a test emitter, which samples invocations and concurrently enqueues them to validate the concurrent performance of the evaluator. The evaluator was run with a tiny 1.2B local model, with the focus mostly on getting correct metrics for every sampled invocation. A run takes about 250s for 120 spans with 50% sampling on an M3 Mac.
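
To make the in-memory capture concrete, here is a minimal sketch of what such a test emitter can look like; the class and method names are illustrative assumptions, not the actual API of the new package:

```python
# Illustrative sketch only; class and method names are hypothetical,
# not the real API of opentelemetry-util-genai-emitters-test.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class InMemoryTestEmitter:
    """Collects emitted telemetry in memory so a test run can assert on it later."""

    spans: list[Any] = field(default_factory=list)
    metrics: list[Any] = field(default_factory=list)
    events: list[Any] = field(default_factory=list)
    evaluation_results: list[Any] = field(default_factory=list)

    def on_span(self, span: Any) -> None:
        self.spans.append(span)

    def on_metric(self, metric: Any) -> None:
        self.metrics.append(metric)

    def on_event(self, event: Any) -> None:
        self.events.append(event)

    def on_evaluation_result(self, result: Any) -> None:
        self.evaluation_results.append(result)
```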

Core changes:

  • util/opentelemetry-util-genai-emitters-test - new package with the perf testing program and the test emitter. Make sure to reinstall util/opentelemetry-util-genai-emitters* for proper registration.
  • util/opentelemetry-util-genai-evals/src/opentelemetry/util/genai/evals/manager.py - fixes and additions to helper functions in the eval manager.
  • docs/feat-evals-perf.md - feature design and implementation details.
  • Dumps captured invocations and evaluation results, as well as a detailed failure list (both bad evals and failed evals), to /var/tmp by default (see the sketch after this list).
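
The dump of results and failures can be as simple as writing JSON files; the following is a minimal sketch under the assumption of plain JSON-serializable records, with file names chosen for illustration rather than taken from the script:

```python
# Illustrative sketch; file names and record layout are assumptions.
# Only the /var/tmp default location comes from this PR.
import json
from pathlib import Path

DUMP_DIR = Path("/var/tmp")


def dump_run_artifacts(invocations, evaluations, failures, dump_dir: Path = DUMP_DIR) -> None:
    """Write captured invocations, evaluation results, and failure details as JSON."""
    dump_dir.mkdir(parents=True, exist_ok=True)
    (dump_dir / "invocations.json").write_text(json.dumps(invocations, indent=2, default=str))
    (dump_dir / "evaluations.json").write_text(json.dumps(evaluations, indent=2, default=str))
    (dump_dir / "failures.json").write_text(json.dumps(failures, indent=2, default=str))
```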

Features:

  • Test Emitter captures all telemetry (spans, metrics, events, evaluation results) in memory
  • CLI tool for testing evaluation framework performance and validating metrics
  • 120 test samples across 6 categories (neutral, subtle_bias, subtle_toxicity, hallucination, irrelevant, negative_sentiment)
  • Trace-based sampling with configurable sample rate
  • Concurrent evaluation mode with configurable workers
  • Threshold-based validation with score deviation metrics (MAE, RMSE); see the sketch after this list
  • Idle timeout (60s) to prevent hanging on stalled evaluations
  • JSON export of results and failures for debugging
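
The trace-based sampling and the MAE/RMSE deviation check can be summarized in a few lines; this is a minimal sketch assuming scores are plain floats, with function names invented for illustration rather than taken from the framework:

```python
# Illustrative sketch; function names are assumptions, not the framework's API.
import math


def should_evaluate(trace_index: int, sample_rate: int = 2) -> bool:
    """Evaluate every Nth trace; sample_rate=2 means roughly 50% of traces."""
    return trace_index % sample_rate == 0


def score_deviation(expected: list[float], actual: list[float]) -> tuple[float, float]:
    """Return (MAE, RMSE) between expected and observed evaluation scores."""
    errors = [a - e for e, a in zip(expected, actual)]
    mae = sum(abs(err) for err in errors) / len(errors)
    rmse = math.sqrt(sum(err * err for err in errors) / len(errors))
    return mae, rmse
```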

CLI Options:

  • --samples N: Number of test samples to use (default: 20)
  • --concurrent: Enable concurrent evaluation mode
  • --workers N: Number of concurrent workers (default: 4)
  • --sample-rate N: Evaluate every Nth trace (default: 2 = 50%)

Environment variables:

  • DEEPEVAL_LLM_BASE_URL: Model URL for DeepEval
  • DEEPEVAL_LLM_MODEL: Model name for DeepEval
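
A hedged sketch of how these options can be wired up with argparse, using only the flags and defaults documented above; the parser itself is an assumption about how the perf test script is structured:

```python
# Illustrative argparse sketch based on the options documented above;
# not necessarily how the actual perf test script builds its CLI.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Evaluation performance test")
    parser.add_argument("--samples", type=int, default=20,
                        help="Number of test samples to use")
    parser.add_argument("--concurrent", action="store_true",
                        help="Enable concurrent evaluation mode")
    parser.add_argument("--workers", type=int, default=4,
                        help="Number of concurrent workers")
    parser.add_argument("--sample-rate", type=int, default=2,
                        help="Evaluate every Nth trace (2 = 50%%)")
    return parser
```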

@zhirafovod zhirafovod merged commit 53bc8e3 into main Feb 2, 2026
14 checks passed
@zhirafovod zhirafovod deleted the feature/eval-perf branch February 2, 2026 18:25
@github-actions github-actions bot locked and limited conversation to collaborators Feb 2, 2026