Claude Prompt for A/B Test Result Interpretation and Next Steps

Q: When should I kill a test vs. extend it?

Kill it if it's been running four or more weeks without reaching statistical significance. At that point, the variant probably isn't meaningfully different from the control. Extend it if you're at week two or three with directional results and the traffic is insufficient to call significance. Don't run tests indefinitely chasing significance.

Q: What's the secondary metric check and why does it matter?

It checks whether the test variant improved the primary metric at the cost of a secondary one. The most common failure mode: form completions go up but downstream MQL conversion goes down — the variant attracted more leads but lower-quality leads. That's not a real win; it's a CPL optimization that destroys pipeline economics. The secondary metric check flags this before you deploy a winner that makes your CPL look better while making your pipeline worse.

Q: What are autogrowth mutations and how do I use the next-test recommendations?

Autogrowth mutations are minor variations on the winning variant — single sub-element adjustments designed to incrementally push the conversion rate higher over successive weekly test cycles. The prompt gives you three to five specific mutations when the variant wins. Each mutation tests one change against the new control. The winning mutation becomes the new control, and the cycle repeats — producing compounding gains over four to six weeks rather than one big lift.

A/B test result interpretation: Interpret A/B test results by pasting variant traffic, conversion events, duration, and test context into Claude with this prompt. It delivers a statistical verdict, a practical business-terms verdict, a secondary metric check for inverse effects, and a next-step roadmap — including three to five weekly mutation recommendations when the variant wins, or a hypothesis audit when it loses.

The Prompt

Production Prompt — Copy and use verbatim

You are a senior B2B SaaS conversion optimization specialist interpreting A/B test results to decide what to do next. You know that the most common mistakes in A/B testing are calling winners too early on noisy data, dismissing wins as noise when they're real, and failing to learn from losers — and that all three lead to wasted test cycles.

INPUTS

I will paste the test results below. Required fields: variant name, traffic / sessions, conversion events, conversion rate, test duration, primary metric.

{PASTE_TEST_RESULTS_HERE}

{PASTE_TEST_CONTEXT_HERE}
(Example: "Test ran on our primary landing page hero copy. Variant B tested a pain-led headline vs control's generic value prop. Goal: increase form completion rate.")

{OPTIONAL_PASTE_SECONDARY_METRICS_HERE}
(Bounce rate, scroll depth, time on page, downstream MQL conversion — leave blank if not available.)

WHAT I NEED FROM YOU

Interpret the results and produce the output in this exact order:

1. Statistical Verdict
- Winner / Loser / Inconclusive
- Lift (or drop): variant vs. control, in absolute and percentage terms
- Significance: whether the result is statistically meaningful given the sample size. Flag if traffic is insufficient to draw a conclusion.

2. Practical Verdict
A statistical winner is not always a practical winner. State whether the lift is meaningful in business terms — a 3% lift on a 1% conversion rate is only 0.03 absolute points, which may not justify the deployment effort.

3. Secondary Metric Check
If secondary metrics were provided: did the test variant improve the primary metric at the cost of a secondary one? (Example: form completions up, but downstream MQL conversion down — the test variant attracted lower-quality leads.) Flag any inverse relationship.

4. What to Test Next
Based on the result:
- If variant won: recommend 3-5 mutations of the winner to test next week. Each mutation should adjust a single sub-element (specific word, layout shift, CTA tweak) — not a wholesale rewrite. This is the autogrowth phase: the winner becomes the new control, and weekly mutations compound the gains.
- If variant lost: recommend what the loss tells us about the hypothesis. Was the hypothesis wrong, or was the execution wrong? A "wrong execution" loss is different from a "wrong hypothesis" loss — the first calls for re-testing; the second calls for a different hypothesis.
- If inconclusive: recommend whether to extend the test (more traffic), kill it (move to a higher-priority test), or rework the variant (the difference between control and treatment may have been too subtle to produce a detectable effect).

5. Plateau Check
If you have access to recent test history (provided in context or known from prior runs), state whether the current conversion rate is approaching a plateau. Autogrowth compounding produces diminishing returns over time; recognize when to stop iterating on a page and move attention to a different one.

JUDGMENT RULES

- Statistical significance is a threshold, not a verdict. A result that reaches 95% confidence on a small sample size is often a result that won't replicate. Treat sample size and confidence together.
- Practical significance is harder to call than statistical. Build the practical case in business terms, not statistical terms — "this lift would generate an additional 30 MQLs per month at current traffic" is a clearer call than "this lift is statistically significant at p=0.04."
- A "winner" that improves the primary metric but degrades a secondary metric is often not a real winner. Always check the secondary metric direction.
- A "loser" is information, not a failure. Document what the loss teaches about the hypothesis before moving on.
- Do not recommend continuing a test indefinitely chasing significance. If a test has run 4+ weeks without reaching significance, the variant probably isn't meaningfully different from the control — kill it and move on.
- If you don't have enough data to call a verdict, say so. "Sample size is too small to call this test — recommend extending another 2 weeks" is the right answer when the data is thin.

OUTPUT FORMAT

Return as {OUTPUT_FORMAT}.

If "markdown": verdict, practical assessment, secondary check, next-test recommendations, plateau check.
If "html": styled report with verdict and next-test recommendations clearly delineated.

Begin.

How to Use It

This prompt solves three recurring mistakes in A/B testing: calling winners too early on noisy data, dismissing real wins as noise, and failing to learn from losers. Claude handles the statistical and practical verdict distinction better than ChatGPT in production — Claude is more precise about when sample size is too small to call a test ("this result is not statistically meaningful — extend two weeks or kill and redirect traffic") and more careful about the secondary metric check (a form completion lift that comes with a downstream MQL degradation is not a real win). ChatGPT tends toward false confidence when data is thin.

The autogrowth framing is the most distinctive part of this prompt. When a variant wins, the output doesn't just say "deploy the winner" — it recommends three to five mutations of the winner to test the following week. Each mutation adjusts a single sub-element (a specific word, a layout shift, a CTA tweak) rather than a wholesale rewrite. That's the autogrowth phase: the winner becomes the new control, and weekly mutations compound the gains over time. The benchmark from the prompt's PRD source is pages moving from 60→85 quickly via initial testing, then from 85→95 slowly via autogrowth compounding over weeks.

The plateau check at the end is the safeguard against over-optimizing a single page. When the autogrowth compound gains start producing smaller lifts, the model flags that the page may be approaching its optimization ceiling — and recommends shifting attention to a higher-leverage page rather than continuing to test micro-variants that won't compound meaningfully. That plateau recognition is what distinguishes a testing program that learns from one that just runs tests.

Example Output

Live Example

Test: Enterprise LP — Pain-led headline vs. value-prop control
Duration: 16 days
Variant A (control): "The Website Experience Platform" — 4,140 sessions, 78 form completions (1.88% CVR)
Variant B (treatment): "Launch enterprise landing pages 4x faster — without writing code." — 4,108 sessions, 102 form completions (2.48% CVR)

Statistical Verdict
Winner: Variant B.
Lift: +0.60 absolute conversion points, +32% relative lift.
Significance: Reaches 95% confidence at p=0.02. Sample size is adequate (>3,000 sessions per variant) — this is not a small-sample artifact.

Practical Verdict
Strong practical winner. The 0.60 point lift, projected against current monthly enterprise LP traffic of ~17,000 sessions, would generate approximately 100 additional form completions per month. At current MQL pull-through, that's roughly 40-50 incremental MQLs/month. Worth deploying immediately.

Secondary Metric Check
- Bounce rate: dropped from 62% to 51% on Variant B (significant improvement, supports the headline win)
- Downstream MQL pull-through: held at 41% across both variants (treatment did not attract lower-quality leads)
- Time on page (above the fold): up 22 seconds on Variant B (visitors engaged longer with the pain-led message)

No inverse relationships flagged. Variant B is a clean win across primary and secondary metrics.

What to Test Next (Autogrowth Mutations)

1. Number specificity: Replace "4x faster" with "from 6 weeks to 6 days" (autoresearch panel scored this variant comparably; worth a live read)
2. CTA tweak: Test "Book My Enterprise Demo" vs. current "Get Started"
3. Social proof position: Add the existing logo bar directly under the headline vs. its current position
4. Subhead refinement: Test current subhead vs. one that names a specific competitor frustration

Each mutation should adjust a single element. Variant B is the new control; mutations are tested against it weekly.

Plateau Check
Not yet at plateau. Enterprise LP has only had one major copy test in the last 90 days. Recommend running autogrowth weekly for the next 6-8 weeks before evaluating whether the page has plateaued at its new conversion rate.

Illustrative example based on the PRD's autoresearch/autogrowth framework. Will be replaced with a redacted live result interpretation from an upcoming test cycle.

Common Failure Modes

Calling a winner on thin data. First version declared a winner on 600 sessions because the variant hit 95% statistical confidence. 600 sessions at sub-2% CVR is not adequate — small-sample wins often fail to replicate. The judgment rule now treats sample size and confidence together; the model is forced to flag when the sample is too thin to trust the verdict.
Missing the secondary-metric degradation. A test "win" that lifts form completions by 25% but tanks downstream MQL conversion by 30% is not a real win — you just attracted lower-quality leads. Without the secondary metric check, the model would have shipped the variant with confidence and broken the funnel downstream. Always paste secondary metrics in the inputs when you have them.
Mistaking "no significance" for "no signal." A test that runs 4 weeks and reaches only p=0.08 is not a "loser" — it's possibly real, just not yet measurable at current traffic. But the judgment rule also enforces the opposite: if a test has been running 4+ weeks with no significance, the variant is probably not meaningfully different from control. Kill it and test something more aggressive.

Variations

Two variations of this prompt are worth knowing.

Variation 1: Multivariate Test Version

Adapted for interpreting multivariate test results where more than two elements were tested simultaneously — producing interaction effect analysis alongside the individual element verdicts and a recommendation for which winning combinations to carry forward.

Coming soon


[PROMPT GOES HERE]

Variation 2: Early Stop Decision Version

Scoped specifically to the early-stop decision — when a test is mid-flight and showing strong directional results, should you call it early or run to statistical significance? Frames the business risk of calling early vs. the cost of continuing a test that appears to have a clear winner.

Coming soon


[PROMPT GOES HERE]

Get one new prompt every Monday.

Plus the system behind it. Free. Built for in-house demand gen managers at B2B SaaS companies.

Frequently Asked Questions

What's the difference between statistical significance and practical significance?

Statistical significance means the result is unlikely to be explained by chance — typically at 95% confidence (p < 0.05). Practical significance means the lift is meaningful in business terms. A test can be statistically significant with a lift that's too small to matter: a 3% lift on a 1% conversion rate is 0.03 absolute points — statistically real but possibly not worth the deployment effort. The prompt evaluates both and flags cases where statistical significance doesn't translate into a meaningful business outcome. The key question is: "What does this lift mean in MQLs or pipeline per month at current traffic?"

When should I kill a test vs. extend it?

Kill it if it's been running four or more weeks without reaching statistical significance. At that point, the variant probably isn't meaningfully different from the control — the effect size is too small to detect at your traffic volumes, which usually means the effect size is too small to matter. Extend it if you're at week two or three with directional results and the traffic is insufficient to call significance — more time is likely to produce a clear result. The prompt makes this call explicitly; don't run tests indefinitely chasing significance.

What's the "secondary metric check" and why does it matter?

It checks whether the test variant improved the primary metric at the cost of a secondary one. The most common failure mode: form completions go up but downstream MQL conversion goes down — the variant attracted more leads but lower-quality leads. That's not a real win; it's a CPL optimization that destroys pipeline economics. The secondary metric check flags this before you deploy a "winner" that makes your CPL look better while making your pipeline worse.

What are "autogrowth mutations" and how do I use the next-test recommendations?

Autogrowth mutations are minor variations on the winning variant — single sub-element adjustments designed to incrementally push the conversion rate higher over successive weekly test cycles. The prompt gives you three to five specific mutations when the variant wins (e.g., "change 'Book a Demo' to 'See It Live' in the CTA," or "move the social proof block above the form"). Each mutation tests one change against the new control. The winning mutation becomes the new control, and the cycle repeats — producing compounding gains over four to six weeks rather than one big lift.