A/B Testing Calculator & Statistical Significance Analysis for Python
Try the Live Calculator: pyexpstats.vercel.app
pyexpstats is a Python library and web-based A/B testing calculator for experiment analysis, sample size calculation, and statistical significance testing. Whether you're running conversion rate optimization (CRO) experiments, analyzing split tests, or calculating statistical power, pyexpstats provides the tools you need.
- A/B Test Significance Calculator: Analyze experiments with Z-tests, t-tests, and chi-square tests
- Sample Size Calculator: Plan experiments with proper statistical power (80%, 90%, etc.)
- Multi-Variant Testing (A/B/n): Compare multiple variants with automatic Bonferroni correction
- Conversion Rate Analysis: Binary outcome testing for signups, purchases, clicks
- Revenue & Magnitude Testing: Continuous metrics like AOV, time on site, order value
- Survival Analysis: Time-to-event analysis with Kaplan-Meier curves and log-rank tests
- Difference-in-Differences: Causal inference for quasi-experimental designs
- Confidence Intervals: Visualize uncertainty in your experiment results
- Stakeholder Reports: Generate plain-language markdown summaries
No installation needed! Use our free online A/B testing calculator at pyexpstats.vercel.app.
Calculate sample sizes, analyze experiment results, and determine statistical significance, all in your browser.
- Installation
- Quick Start
- Conversion Effects: Binary outcomes (signup, purchase, click)
- Magnitude Effects: Continuous metrics (revenue, time)
- Timing Effects: Survival analysis, event rates
- Sequential Testing: Early stopping with valid statistics
- Bayesian A/B Testing: Probability-based decisions
- Diagnostics: SRM detection, test health, novelty effects
- Planning: MDE calculator, duration recommendations
- Business Impact: Revenue projections, guardrails
- Segment Analysis: Analyze effects by user segment
- Generate Reports
- Web Interface
- API Reference
- Understanding Results: P-values, confidence intervals explained
- Best Practices
- License
| Traditional Tools | pyexpstats |
|---|---|
| "Which statistical test?" | "What changed in user behavior?" |
| Test-centric | Effect-centric |
| Complex statistics | Plain-language results |
pyexpstats models experimental impact across three fundamental outcome dimensions:
| Effect Type | Question Answered | Examples |
|---|---|---|
| Conversion | Whether something happens | Signup, purchase, click, trial start |
| Magnitude | How much it happens | Revenue, time spent, order value |
| Timing | When it happens | Time to purchase, time to churn |
pip install pyexpstats
Requirements: Python 3.8+
from pyexpstats import conversion, magnitude, timing
# Conversion: Did the treatment change whether users purchase?
result = conversion.analyze(
control_visitors=10000,
control_conversions=500,
variant_visitors=10000,
variant_conversions=600,
)
print(f"Conversion lift: {result.lift_percent:+.1f}%")
# Magnitude: Did the treatment change how much users spend?
result = magnitude.analyze(
control_visitors=5000,
control_mean=50.00,
control_std=25.00,
variant_visitors=5000,
variant_mean=52.50,
variant_std=25.00,
)
print(f"Revenue lift: ${result.lift_absolute:+.2f}")
# Timing: Did the treatment change when users convert?
result = timing.analyze(
control_times=[5, 8, 12, 15, 20],
control_events=[1, 1, 1, 0, 1],
treatment_times=[3, 6, 9, 12, 16],
treatment_events=[1, 1, 1, 1, 1],
)
print(f"Hazard ratio: {result.hazard_ratio:.2f}")
Use for binary outcomes: did the user convert or not? Perfect for analyzing signup rates, purchase rates, click-through rates, and trial conversions.
from pyexpstats import conversion
result = conversion.analyze(
control_visitors=10000,
control_conversions=500, # 5.0% conversion
variant_visitors=10000,
variant_conversions=600, # 6.0% conversion
)
print(f"Control: {result.control_rate:.2%}")
print(f"Variant: {result.variant_rate:.2%}")
print(f"Lift: {result.lift_percent:+.1f}%")
print(f"Significant: {result.is_significant}")
print(f"Winner: {result.winner}")
How many visitors do you need to detect a statistically significant difference?
plan = conversion.sample_size(
current_rate=5, # 5% baseline conversion rate
lift_percent=10, # detect 10% relative lift
confidence=95, # 95% confidence level
power=80, # 80% statistical power
)
print(f"Need {plan.visitors_per_variant:,} per variant")
plan.with_daily_traffic(10000)
print(f"Duration: {plan.test_duration_days} days")
result = conversion.analyze_multi(
variants=[
{"name": "control", "visitors": 10000, "conversions": 500},
{"name": "variant_a", "visitors": 10000, "conversions": 550},
{"name": "variant_b", "visitors": 10000, "conversions": 600},
]
)
print(f"Best: {result.best_variant}")
print(f"P-value: {result.p_value:.4f}")
Note: Variant names must be unique. Duplicate names will raise a ValueError.
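To make the Bonferroni idea behind multi-variant comparisons concrete, here is a minimal standard-library sketch that compares each variant to the control with a pooled two-proportion z-test and divides the significance level by the number of comparisons. It is an illustration only and is not claimed to be how `analyze_multi` works internally; the helper name `two_proportion_p_value` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(n1, x1, n2, x2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x2 / n2 - x1 / n1) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same counts as the multi-variant example above: (visitors, conversions)
control = (10000, 500)
variants = {"variant_a": (10000, 550), "variant_b": (10000, 600)}

alpha = 0.05
adjusted_alpha = alpha / len(variants)  # Bonferroni: split alpha across comparisons

significant = {}
for name, (visitors, conversions) in variants.items():
    p = two_proportion_p_value(*control, visitors, conversions)
    significant[name] = p < adjusted_alpha
    print(f"{name}: p={p:.4f}, significant at adjusted alpha {adjusted_alpha}: {significant[name]}")
```

Bonferroni is conservative: it controls the chance of any false positive across all comparisons, at the cost of some power per comparison.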
result = conversion.diff_in_diff(
control_pre_visitors=5000, control_pre_conversions=250,
control_post_visitors=5000, control_post_conversions=275,
treatment_pre_visitors=5000, treatment_pre_conversions=250,
treatment_post_visitors=5000, treatment_post_conversions=350,
)
print(f"DiD effect: {result.diff_in_diff:+.2%}")
Use for continuous metrics: revenue per user, average order value, time on site, pages per session.
from pyexpstats import magnitude
result = magnitude.analyze(
control_visitors=5000,
control_mean=50.00,
control_std=25.00,
variant_visitors=5000,
variant_mean=52.50,
variant_std=25.00,
)
print(f"Control: ${result.control_mean:.2f}")
print(f"Variant: ${result.variant_mean:.2f}")
print(f"Lift: ${result.lift_absolute:+.2f} ({result.lift_percent:+.1f}%)")
print(f"Significant: {result.is_significant}")
plan = magnitude.sample_size(
current_mean=50, # $50 average order value
current_std=25, # $25 standard deviation
lift_percent=5, # detect 5% lift in AOV
)
print(f"Need {plan.visitors_per_variant:,} per variant")
result = magnitude.analyze_multi(
variants=[
{"name": "control", "visitors": 1000, "mean": 50, "std": 25},
{"name": "new_layout", "visitors": 1000, "mean": 52, "std": 25},
{"name": "premium_upsell", "visitors": 1000, "mean": 55, "std": 25},
]
)
print(f"Best: {result.best_variant}")
print(f"F-statistic: {result.f_statistic:.2f}")
Note: Variant names must be unique. Duplicate names will raise a ValueError.
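For readers curious where an F-statistic like the one above comes from, the following sketch computes a one-way ANOVA F-statistic directly from per-variant summary statistics. It assumes the standard deviations are sample standard deviations (n - 1 denominator); the function name `anova_f_statistic` is hypothetical, and the library's own computation may differ in detail.

```python
def anova_f_statistic(groups):
    """One-way ANOVA F-statistic from (n, mean, std) summaries.

    Assumes each std is a sample standard deviation (n - 1 denominator).
    """
    k = len(groups)
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n

    # Variation of group means around the grand mean
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    # Variation of observations around their own group mean
    ss_within = sum((n - 1) * std ** 2 for n, _, std in groups)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (total_n - k)
    return ms_between / ms_within


# Same summaries as the multi-variant example above
f_stat = anova_f_statistic([(1000, 50, 25), (1000, 52, 25), (1000, 55, 25)])
print(f"F-statistic: {f_stat:.2f}")
```

A large F means the variant means differ by more than within-variant noise would explain; turning F into a p-value requires the F distribution, which this sketch leaves out.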
result = magnitude.diff_in_diff(
control_pre_n=1000, control_pre_mean=50, control_pre_std=25,
control_post_n=1000, control_post_mean=51, control_post_std=25,
treatment_pre_n=1000, treatment_pre_mean=50, treatment_pre_std=25,
treatment_post_n=1000, treatment_post_mean=55, treatment_post_std=26,
)
print(f"DiD effect: ${result.diff_in_diff:+.2f}")
Use for time-to-event analysis: time to purchase, time to churn, subscription duration, support ticket rates.
from pyexpstats import timing
result = timing.analyze(
control_times=[5, 8, 12, 15, 18, 22, 25, 30],
control_events=[1, 1, 1, 0, 1, 1, 0, 1], # 1=event, 0=censored
treatment_times=[3, 6, 9, 12, 14, 16, 20, 24],
treatment_events=[1, 1, 1, 1, 0, 1, 1, 1],
)
print(f"Control median time: {result.control_median_time}")
print(f"Treatment median time: {result.treatment_median_time}")
print(f"Hazard ratio: {result.hazard_ratio:.3f}")
print(f"Time saved: {result.time_saved:.1f} ({result.time_saved_percent:.1f}%)")
print(f"Significant: {result.is_significant}")
curve = timing.survival_curve(
times=[5, 10, 15, 20, 25, 30],
events=[1, 1, 0, 1, 1, 0],
confidence=95,
)
print(f"Median survival time: {curve.median_time}")
print(f"Survival probabilities: {curve.survival_probabilities}")
Compare event rates between groups (e.g., support tickets per day, errors per hour):
result = timing.analyze_rates(
control_events=45,
control_exposure=100, # 100 days of observation
treatment_events=38,
treatment_exposure=100,
)
print(f"Control rate: {result.control_rate:.4f} events/day")
print(f"Treatment rate: {result.treatment_rate:.4f} events/day")
print(f"Rate ratio: {result.rate_ratio:.3f}")
print(f"Rate change: {result.rate_difference_percent:+.1f}%")
print(f"Significant: {result.is_significant}")
plan = timing.sample_size(
control_median=30, # Expected median for control
treatment_median=24, # Expected median for treatment
confidence=95,
power=80,
dropout_rate=0.1, # 10% expected dropout
)
print(f"Need {plan.subjects_per_group:,} per group")
print(f"Expected events: {plan.total_expected_events:,}")
Stop your A/B tests early with valid statistics using a Sequential Probability Ratio Test (SPRT) with O'Brien-Fleming boundaries.
from pyexpstats.methods import sequential
result = sequential.analyze(
control_visitors=2500,
control_conversions=125,
variant_visitors=2500,
variant_conversions=175,
expected_visitors_per_variant=5000, # Your planned sample size
)
print(f"Can stop: {result.can_stop}")
print(f"Decision: {result.decision}") # 'variant_wins', 'control_wins', 'no_difference', 'keep_running'
print(f"Progress: {result.information_fraction:.0%} through test")
print(f"Confidence: {result.confidence_variant_better:.1f}%")
Why Sequential Testing?
- No peeking penalty: Check results as often as you want without inflating false positives
- Stop early for clear winners: Save time and traffic when effects are obvious
- Valid confidence intervals: Always maintain proper statistical guarantees
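To give a feel for how O'Brien-Fleming-style boundaries ration the false-positive budget, here is a small sketch of the Lan-DeMets O'Brien-Fleming-type alpha spending function, which reports how much of the overall two-sided alpha has been "spent" by a given information fraction. This is a textbook formula shown for intuition; it is an assumption that it resembles what the library does, and `obf_alpha_spent` is a hypothetical name.

```python
from statistics import NormalDist


def obf_alpha_spent(information_fraction, alpha=0.05):
    """Cumulative two-sided alpha spent at a given information fraction
    under the Lan-DeMets O'Brien-Fleming-type spending function."""
    if not 0 < information_fraction <= 1:
        raise ValueError("information_fraction must be in (0, 1]")
    norm = NormalDist()
    z_final = norm.inv_cdf(1 - alpha / 2)
    return 2 * (1 - norm.cdf(z_final / information_fraction ** 0.5))


for fraction in (0.25, 0.5, 0.75, 1.0):
    spent = obf_alpha_spent(fraction)
    print(f"{fraction:.0%} of planned sample: cumulative alpha spent = {spent:.5f}")
```

Very little alpha is available early on, so only large effects can stop a test at the first looks, while the final analysis keeps nearly the full significance level.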
Get intuitive probability-based results instead of confusing p-values.
from pyexpstats.methods import bayesian
result = bayesian.analyze(
control_visitors=1000,
control_conversions=50,
variant_visitors=1000,
variant_conversions=65,
)
print(f"Probability variant is better: {result.probability_variant_better:.1f}%")
print(f"Expected loss if choosing variant: {result.expected_loss_choosing_variant:.4f}")
print(f"Lift credible interval: {result.lift_credible_interval}")
print(f"Winner: {result.winner}")
Why Bayesian Testing?
- Intuitive results: "94% probability variant is better" vs "p < 0.05"
- No fixed sample size: Can check results anytime
- Risk quantification: Expected loss tells you the cost of being wrong
- Credible intervals: Direct probability statements about the true effect
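As a rough illustration of where a "probability variant is better" figure can come from, the sketch below uses a Beta-Binomial model with uniform Beta(1, 1) priors and Monte Carlo draws from each posterior. The prior choice, the draw count, and the function name are assumptions made for this example, not a description of the library's internals.

```python
import random


def probability_variant_better(control_visitors, control_conversions,
                               variant_visitors, variant_conversions,
                               draws=50_000, seed=0):
    """Estimate P(variant rate > control rate) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a rate is Beta(1 + successes, 1 + failures)
        control_rate = rng.betavariate(
            1 + control_conversions, 1 + control_visitors - control_conversions)
        variant_rate = rng.betavariate(
            1 + variant_conversions, 1 + variant_visitors - variant_conversions)
        wins += variant_rate > control_rate
    return wins / draws


# Same counts as the Bayesian example above
prob = probability_variant_better(1000, 50, 1000, 65)
print(f"Probability variant is better: {prob:.1%}")
```

The estimate carries Monte Carlo noise, so increase `draws` if you need more stable digits.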
Validate your A/B test before trusting the results.
Sample Ratio Mismatch (SRM) indicates bugs in your experiment setup that can invalidate results:
from pyexpstats.diagnostics import check_sample_ratio
result = check_sample_ratio(
control_visitors=10500,
variant_visitors=9500,
expected_ratio=0.5, # Expected 50/50 split
)
print(f"Valid: {result.is_valid}")
print(f"Severity: {result.severity}") # 'ok', 'warning', 'critical'
print(f"Deviation: {result.deviation_percent:.1f}%")
Comprehensive health check for your experiment:
from pyexpstats.diagnostics import check_health
health = check_health(
control_visitors=5000,
control_conversions=250,
variant_visitors=5000,
variant_conversions=275,
)
print(f"Status: {health.overall_status}") # 'healthy', 'warning', 'unhealthy'
print(f"Score: {health.score}/100")
print(f"Can trust results: {health.can_trust_results}")
for check in health.checks:
    print(f" {check.name}: {check.status}")
Detect if your experiment effect is fading over time:
from pyexpstats.diagnostics import detect_novelty_effect
daily_results = [
{"day": 1, "control_visitors": 1000, "control_conversions": 50,
"variant_visitors": 1000, "variant_conversions": 70},
{"day": 2, "control_visitors": 1000, "control_conversions": 50,
"variant_visitors": 1000, "variant_conversions": 65},
# ... more days
]
result = detect_novelty_effect(daily_results)
print(f"Effect type: {result.effect_type}") # 'novelty', 'primacy', 'stable'
print(f"Initial lift: {result.initial_lift:+.1f}%")
print(f"Current lift: {result.current_lift:+.1f}%")
if result.projected_steady_state_lift:
    print(f"Projected steady state: {result.projected_steady_state_lift:+.1f}%")
Plan your A/B tests before running them.
Understand what effects you can detect with your traffic:
from pyexpstats.planning import minimum_detectable_effect
result = minimum_detectable_effect(
daily_traffic=5000,
test_duration_days=14,
baseline_rate=0.05,
)
print(f"MDE: {result.minimum_detectable_effect:.1f}% lift")
print(f"Can detect variant rate: {result.detectable_variant_rate:.2%}")
print(f"Is practically useful: {result.is_practically_useful}")
Get recommendations for how long to run your test:
from pyexpstats.planning import recommend_duration
result = recommend_duration(
baseline_rate=0.05,
minimum_detectable_effect=0.10, # 10% lift
daily_traffic=5000,
business_type="ecommerce",
)
print(f"Recommended: {result.recommended_days} days")
print(f"Minimum: {result.minimum_days} days")
print(f"Ideal: {result.ideal_days} days")
print(f"Sample needed: {result.required_sample_per_variant:,} per variant")
Translate A/B test results into business value.
from pyexpstats.business import project_impact
projection = project_impact(
control_rate=0.05,
variant_rate=0.055,
lift_percent=10.0,
lift_ci_lower=2.0,
lift_ci_upper=18.0,
monthly_visitors=100000,
revenue_per_conversion=50.0,
)
print(f"Monthly revenue lift: ${projection.monthly_revenue_lift:,.0f}")
print(f"Annual revenue lift: ${projection.annual_revenue_lift:,.0f}")
print(f"Probability of positive impact: {projection.probability_positive_impact:.1%}")
Monitor metrics you want to protect during experiments:
from pyexpstats.business import check_guardrails
report = check_guardrails([
{
"name": "Page Load Time",
"metric_type": "mean",
"direction": "increase_is_bad",
"threshold_percent": 10,
"control_data": [100, 110, 95, 105] * 100,
"variant_data": [105, 115, 100, 108] * 100,
},
{
"name": "Error Rate",
"metric_type": "proportion",
"direction": "increase_is_bad",
"threshold_percent": 20,
"control_data": {"count": 50, "total": 10000},
"variant_data": {"count": 55, "total": 10000},
},
])
print(f"Can ship: {report.can_ship}")
print(f"Passed: {report.passed}")
print(f"Warnings: {report.warnings}")
print(f"Failures: {report.failures}")
Analyze how your A/B test performs across different user segments.
from pyexpstats.segments import analyze_segments
report = analyze_segments([
{
"segment_name": "device",
"segment_value": "mobile",
"control_visitors": 5000,
"control_conversions": 250,
"variant_visitors": 5000,
"variant_conversions": 350,
},
{
"segment_name": "device",
"segment_value": "desktop",
"control_visitors": 3000,
"control_conversions": 180,
"variant_visitors": 3000,
"variant_conversions": 190,
},
])
print(f"Overall lift: {report.overall_lift:+.1f}%")
print(f"Best segment: {report.best_segment}")
print(f"Heterogeneity detected: {report.heterogeneity_detected}")
print(f"Simpson's paradox risk: {report.simpsons_paradox_risk}")
for segment in report.segments:
    print(f" {segment.segment_value}: {segment.lift_percent:+.1f}% (sig: {segment.is_significant})")
Features:
- Bonferroni/Holm correction: Automatic correction for multiple comparisons
- Heterogeneity detection: Find when effects vary significantly by segment
- Simpson's Paradox warnings: Detect when overall results mislead
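The difference between the two corrections named above can be sketched in a few lines. The p-values here are made up for illustration, and the helper names are hypothetical rather than part of the pyexpstats API.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply each by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjusted p-values (never larger than Bonferroni's)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


segment_p_values = [0.01, 0.04, 0.03]  # hypothetical per-segment p-values
print("Bonferroni:", bonferroni(segment_p_values))
print("Holm:      ", holm(segment_p_values))
```

Both methods control the family-wise error rate; Holm is uniformly at least as powerful, which is why it is often preferred when many segments are tested.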
Every effect type includes summarize() to generate plain-language markdown reports for stakeholders:
result = conversion.analyze(...)
report = conversion.summarize(result, test_name="Signup Button Test")
print(report)
Output:
## Signup Button Test Results
### Significant Result
**The test variant performed significantly higher than the control.**
- **Control conversion rate:** 5.00% (500 / 10,000)
- **Variant conversion rate:** 6.00% (600 / 10,000)
- **Relative lift:** +20.0% increase
- **P-value:** 0.0003
### What This Means
With 95% confidence, the variant shows a **20.0%** improvement.
pyexpstats includes a beautiful web UI for interactive experiment analysis:
pyexpstats-server
# Open http://localhost:8000
Or use the hosted version at pyexpstats.vercel.app
Configure the API server using environment variables:
| Variable | Default | Description |
|---|---|---|
| `CORS_ORIGINS` | `http://localhost:3000,http://localhost:5173` | Comma-separated allowed origins |

For production, set appropriate CORS origins:
CORS_ORIGINS="https://yourdomain.com" pyexpstats-server

| Tool | Description |
|---|---|
| Sample Size Calculator | Plan A/B tests with proper statistical power |
| A/B Test Significance Calculator | Analyze 2-variant and multi-variant experiments |
| Timing & Rate Analysis | Survival analysis and Poisson rate comparisons |
| Diff-in-Diff Calculator | Quasi-experimental causal inference |
| Confidence Interval Calculator | Estimate precision of your metrics |
The web interface includes:
- Visual metric type selection with examples (Conversion Rate vs Revenue)
- Helpful hints explaining statistical concepts
- Plain-language interpretations of p-values and confidence intervals
- Multi-variant testing with automatic Bonferroni correction
- Interactive visualizations of experiment results
| Function | Purpose |
|---|---|
| `sample_size(current_rate, lift_percent, ...)` | Sample size calculation for conversion tests |
| `analyze(control_visitors, control_conversions, ...)` | 2-variant A/B test (Z-test) |
| `analyze_multi(variants, ...)` | Multi-variant test (Chi-square) |
| `diff_in_diff(...)` | Difference-in-Differences analysis |
| `confidence_interval(visitors, conversions, ...)` | Confidence interval for a conversion rate |
| `summarize(result, test_name)` | Generate markdown report |

| Function | Purpose |
|---|---|
| `sample_size(current_mean, current_std, lift_percent, ...)` | Sample size for continuous metrics |
| `analyze(control_visitors, control_mean, control_std, ...)` | 2-variant test (Welch's t-test) |
| `analyze_multi(variants, ...)` | Multi-variant test (ANOVA) |
| `diff_in_diff(...)` | Difference-in-Differences analysis |
| `confidence_interval(visitors, mean, std, ...)` | Confidence interval for a mean |
| `summarize(result, test_name, metric_name, currency)` | Generate markdown report |

| Function | Purpose |
|---|---|
| `analyze(control_times, control_events, ...)` | Survival analysis (log-rank test) |
| `survival_curve(times, events, ...)` | Kaplan-Meier survival curve |
| `analyze_rates(control_events, control_exposure, ...)` | Poisson rate comparison |
| `sample_size(control_median, treatment_median, ...)` | Sample size for survival studies |
| `summarize(result, test_name)` | Generate markdown report |
| `summarize_rates(result, test_name, unit)` | Rate analysis report |

| Function | Purpose |
|---|---|
| `analyze(control_visitors, control_conversions, ..., expected_visitors_per_variant)` | Sequential test with early stopping |
| `summarize(result)` | Generate markdown report |

| Function | Purpose |
|---|---|
| `analyze(control_visitors, control_conversions, ...)` | Bayesian A/B test analysis |
| `summarize(result)` | Generate markdown report |

| Function | Purpose |
|---|---|
| `check_sample_ratio(control_visitors, variant_visitors, ...)` | SRM detection |
| `check_health(control_visitors, control_conversions, ...)` | Comprehensive test health check |
| `detect_novelty_effect(daily_results, ...)` | Detect fading/growing effects |

| Function | Purpose |
|---|---|
| `minimum_detectable_effect(sample_size_per_variant, ...)` | Calculate MDE |
| `recommend_duration(baseline_rate, minimum_detectable_effect, daily_traffic, ...)` | Duration recommendations |

| Function | Purpose |
|---|---|
| `project_impact(control_rate, variant_rate, lift_percent, ...)` | Revenue impact projection |
| `check_guardrails(guardrails)` | Monitor guardrail metrics |

| Function | Purpose |
|---|---|
| `analyze_segments(segments_data, ...)` | Segment-level analysis with correction |
pyexpstats/
    effects/
        outcome/
            conversion.py    # Binary outcomes (signup, purchase, click)
            magnitude.py     # Continuous metrics (revenue, time, value)
            timing.py        # Time-to-event (survival, rates)
    methods/
        sequential.py        # Sequential testing with early stopping
        bayesian.py          # Bayesian A/B testing
    diagnostics/
        srm.py               # Sample Ratio Mismatch detection
        health.py            # Test health dashboard
        novelty.py           # Novelty effect detection
    planning/
        mde.py               # Minimum Detectable Effect calculator
        duration.py          # Test duration recommendations
    business/
        impact.py            # Revenue impact projections
        guardrails.py        # Guardrail metrics monitoring
    segments/
        analysis.py          # Segment-level analysis
| P-value | Interpretation |
|---|---|
| < 0.01 | Very strong evidence (highly significant) |
| 0.01 - 0.05 | Strong evidence (statistically significant at 95%) |
| 0.05 - 0.10 | Weak evidence (marginally significant) |
| > 0.10 | Not enough evidence (not significant) |
A 95% confidence interval means: if you ran this experiment 100 times, about 95 of those intervals would contain the true effect.
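The repeated-experiment reading of a confidence interval can be checked with a quick simulation. The sketch below assumes a true conversion rate of 5%, simulates many experiments, builds a normal-approximation (Wald) interval for each, and counts how often the interval contains the true rate. All names and parameter values are illustrative choices, not library defaults.

```python
import random
from math import sqrt
from statistics import NormalDist


def wald_interval(conversions, visitors, confidence=0.95):
    """Normal-approximation confidence interval for a conversion rate."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin


def coverage(true_rate=0.05, visitors=1000, experiments=1000, seed=0):
    """Fraction of simulated experiments whose interval contains true_rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(experiments):
        conversions = sum(rng.random() < true_rate for _ in range(visitors))
        low, high = wald_interval(conversions, visitors)
        hits += low <= true_rate <= high
    return hits / experiments


observed = coverage()
print(f"Intervals containing the true rate: {observed:.1%}")
```

The observed share should land near 95%, though the Wald interval tends to under-cover slightly for small samples or rates close to 0 or 1.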
| Hazard Ratio | Interpretation |
|---|---|
| HR < 1 | Treatment slows events (protective effect) |
| HR = 1 | No effect on timing |
| HR > 1 | Treatment speeds up events |
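A hazard ratio summarizes how two survival curves differ. For reference, this is a minimal sketch of the Kaplan-Meier estimator that produces such a curve from event times and censoring flags (1 = event, 0 = censored). It is a plain-Python illustration with hypothetical function names and omits confidence bands.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        observed = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)  # events plus censored at t
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def median_time(curve):
    """First time at which survival drops to 50% or below, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Same data as the survival curve example earlier in this document
curve = kaplan_meier([5, 10, 15, 20, 25, 30], [1, 1, 0, 1, 1, 0])
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
print(f"Median survival time: {median_time(curve)}")
```

Censored subjects still count in the at-risk denominator up to the time they drop out, which is what lets the estimator use partial follow-up instead of discarding it.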
| Rate Ratio | Interpretation |
|---|---|
| RR < 1 | Treatment reduces event rate |
| RR = 1 | No effect on rate |
| RR > 1 | Treatment increases event rate |
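A rate ratio and its uncertainty can be approximated by hand. The sketch below uses the common log-scale normal approximation for the ratio of two Poisson rates; this is one of several valid methods and is an assumption here, not necessarily what `analyze_rates` uses. The function name is hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def rate_ratio_test(control_events, control_exposure,
                    treatment_events, treatment_exposure, confidence=0.95):
    """Rate ratio, two-sided p-value, and CI via a log-normal approximation."""
    ratio = (treatment_events / treatment_exposure) / (control_events / control_exposure)
    se_log = sqrt(1 / control_events + 1 / treatment_events)
    z = log(ratio) / se_log
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    ci = (exp(log(ratio) - z_crit * se_log), exp(log(ratio) + z_crit * se_log))
    return ratio, p_value, ci


# Same counts as the event-rate example earlier: 45 vs 38 events over 100 days each
ratio, p_value, ci = rate_ratio_test(45, 100, 38, 100)
print(f"Rate ratio: {ratio:.3f}")
print(f"P-value: {p_value:.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

When the interval includes 1, the data are compatible with no change in the event rate, matching the "RR = 1" row of the table.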
- Calculate sample size BEFORE starting: Don't peek and stop early on a fixed-sample test (p-hacking)
- Run for at least 1-2 full weeks: Capture day-of-week and seasonal patterns
- Look at confidence intervals: Not just p-values
- Statistical significance is not business significance: A 0.1% lift might be "significant" but not worth implementing
- Use Bonferroni correction: For multi-variant tests (automatic in `analyze_multi`)
- Consider timing effects: A treatment might speed up conversion without changing the overall rate
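To show what "calculate sample size before starting" involves, here is a sketch of the standard normal-approximation formula for comparing two proportions. Note that it takes rates and lifts as fractions (0.05, 0.10), unlike the percent-style arguments in the `conversion.sample_size` example; the function name is hypothetical, and the library's result may differ slightly depending on the exact formula it uses.

```python
from math import ceil, sqrt
from statistics import NormalDist


def visitors_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# 5% baseline, 10% relative lift, 95% confidence, 80% power
n = visitors_per_variant(0.05, 0.10)
print(f"Need roughly {n:,} visitors per variant")
```

Because the required sample grows with the inverse square of the absolute difference, halving the detectable lift roughly quadruples the traffic needed.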
pyexpstats is used for:
- Conversion Rate Optimization (CRO): Optimize landing pages, signup flows, checkout
- Product Experimentation: Test new features, UI changes, pricing
- Growth Hacking: Validate acquisition and retention strategies
- Marketing Analytics: Email campaigns, ad creative testing
- E-commerce Optimization: Product recommendations, pricing tests
- SaaS Metrics: Trial conversion, churn reduction, upsell tests
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License: free for commercial and personal use.
Inspired by Evan Miller's A/B Testing Tools.
A/B testing Python, A/B test calculator Python, split testing library, Python experiment analysis, statistical significance calculator Python, sample size calculator Python, conversion rate optimization Python, CRO tool Python, hypothesis testing Python library, p-value calculator, confidence interval calculator, statistical power analysis Python, experiment design tool, Bayesian A/B testing Python, sequential testing Python, chi-square test Python, Welch's t-test Python, Z-test Python, ANOVA Python, survival analysis Python, Kaplan-Meier Python, log-rank test, Poisson test, difference-in-differences Python, causal inference Python, Python statistics library, product analytics Python, experimentation platform Python, growth experimentation, marketing A/B test, e-commerce A/B testing, web analytics Python, multi-armed bandit alternative, online controlled experiment, randomized controlled trial analysis, uplift modeling, treatment effect estimation, experiment significance, Python data science A/B test, conversion rate calculator, revenue impact analysis, SRM detection, sample ratio mismatch, novelty effect detection, guardrail metrics.

