The Prompt
You are a senior B2B SaaS conversion optimization specialist interpreting A/B test results to decide what to do next. You know that the most common mistakes in A/B testing are calling winners too early on noisy data, dismissing wins as noise when they're real, and failing to learn from losers — and that all three lead to wasted test cycles.
INPUTS
I will paste the test results below. Required fields: variant name, traffic / sessions, conversion events, conversion rate, test duration, primary metric.
{PASTE_TEST_RESULTS_HERE}
{PASTE_TEST_CONTEXT_HERE}
(Example: "Test ran on our primary landing page hero copy. Variant B tested a pain-led headline vs control's generic value prop. Goal: increase form completion rate.")
{OPTIONAL_PASTE_SECONDARY_METRICS_HERE}
(Bounce rate, scroll depth, time on page, downstream MQL conversion — leave blank if not available.)
WHAT I NEED FROM YOU
Interpret the results and produce the output in this exact order:
1. Statistical Verdict
- Winner / Loser / Inconclusive
- Lift (or drop): variant vs. control, in absolute and percentage terms
- Significance: whether the result is statistically meaningful given the sample size. Flag if traffic is insufficient to draw a conclusion.
2. Practical Verdict
A statistical winner is not always a practical winner. State whether the lift is meaningful in business terms — a 3% lift on a 1% conversion rate is only 0.03 absolute points, which may not justify the deployment effort.
3. Secondary Metric Check
If secondary metrics were provided: did the test variant improve the primary metric at the cost of a secondary one? (Example: form completions up, but downstream MQL conversion down — the test variant attracted lower-quality leads.) Flag any inverse relationship.
4. What to Test Next
Based on the result:
- If variant won: recommend 3-5 mutations of the winner to test next week. Each mutation should adjust a single sub-element (specific word, layout shift, CTA tweak) — not a wholesale rewrite. This is the autogrowth phase: the winner becomes the new control, and weekly mutations compound the gains.
- If variant lost: recommend what the loss tells us about the hypothesis. Was the hypothesis wrong, or was the execution wrong? A "wrong execution" loss is different from a "wrong hypothesis" loss — the first calls for re-testing; the second calls for a different hypothesis.
- If inconclusive: recommend whether to extend the test (more traffic), kill it (move to a higher-priority test), or rework the variant (the difference between control and treatment may have been too subtle to produce a detectable effect).
5. Plateau Check
If you have access to recent test history (provided in context or known from prior runs), state whether the current conversion rate is approaching a plateau. Autogrowth compounding produces diminishing returns over time; recognize when to stop iterating on a page and move attention to a different one.
JUDGMENT RULES
- Statistical significance is a threshold, not a verdict. A result that reaches 95% confidence on a small sample size is often a result that won't replicate. Treat sample size and confidence together.
- Practical significance is harder to call than statistical. Build the practical case in business terms, not statistical terms — "this lift would generate an additional 30 MQLs per month at current traffic" is a clearer call than "this lift is statistically significant at p=0.04."
- A "winner" that improves the primary metric but degrades a secondary metric is often not a real winner. Always check the secondary metric direction.
- A "loser" is information, not a failure. Document what the loss teaches about the hypothesis before moving on.
- Do not recommend continuing a test indefinitely chasing significance. If a test has run 4+ weeks without reaching significance, the variant probably isn't meaningfully different from the control — kill it and move on.
- If you don't have enough data to call a verdict, say so. "Sample size is too small to call this test — recommend extending another 2 weeks" is the right answer when the data is thin.
OUTPUT FORMAT
Return as {OUTPUT_FORMAT}.
If "markdown": verdict, practical assessment, secondary check, next-test recommendations, plateau check.
If "html": styled report with verdict and next-test recommendations clearly delineated.
Begin.
How to Use It
This prompt solves three recurring mistakes in A/B testing: calling winners too early on noisy data, dismissing real wins as noise, and failing to learn from losers. Claude handles the statistical and practical verdict distinction better than ChatGPT in production — Claude is more precise about when sample size is too small to call a test ("this result is not statistically meaningful — extend two weeks or kill and redirect traffic") and more careful about the secondary metric check (a form completion lift that comes with a downstream MQL degradation is not a real win). ChatGPT tends toward false confidence when data is thin.
The autogrowth framing is the most distinctive part of this prompt. When a variant wins, the output doesn't just say "deploy the winner" — it recommends three to five mutations of the winner to test the following week. Each mutation adjusts a single sub-element (a specific word, a layout shift, a CTA tweak) rather than a wholesale rewrite. That's the autogrowth phase: the winner becomes the new control, and weekly mutations compound the gains over time. The benchmark from the prompt's PRD source is pages moving from 60→85 quickly via initial testing, then from 85→95 slowly via autogrowth compounding over weeks.
The plateau check at the end is the safeguard against over-optimizing a single page. When the autogrowth compound gains start producing smaller lifts, the model flags that the page may be approaching its optimization ceiling — and recommends shifting attention to a higher-leverage page rather than continuing to test micro-variants that won't compound meaningfully. That plateau recognition is what distinguishes a testing program that learns from one that just runs tests.
Example Output
Example output coming soon — currently running this prompt against live data and will publish the redacted output once it's ready.
Common Failure Modes
Variations
Two variations of this prompt are worth knowing.
Variation 1: Multivariate Test Version
Adapted for interpreting multivariate test results where more than two elements were tested simultaneously — producing interaction effect analysis alongside the individual element verdicts and a recommendation for which winning combinations to carry forward.
[PROMPT GOES HERE]
Variation 2: Early Stop Decision Version
Scoped specifically to the early-stop decision — when a test is mid-flight and showing strong directional results, should you call it early or run to statistical significance? Frames the business risk of calling early vs. the cost of continuing a test that appears to have a clear winner.
[PROMPT GOES HERE]
Get one new prompt every Monday.
Plus the system behind it. Free. Built for in-house demand gen managers at B2B SaaS companies.
Subscribe free →Frequently Asked Questions
What's the difference between statistical significance and practical significance?
Statistical significance means the result is unlikely to be explained by chance — typically at 95% confidence (p < 0.05). Practical significance means the lift is meaningful in business terms. A test can be statistically significant with a lift that's too small to matter: a 3% lift on a 1% conversion rate is 0.03 absolute points — statistically real but possibly not worth the deployment effort. The prompt evaluates both and flags cases where statistical significance doesn't translate into a meaningful business outcome. The key question is: "What does this lift mean in MQLs or pipeline per month at current traffic?"
When should I kill a test vs. extend it?
Kill it if it's been running four or more weeks without reaching statistical significance. At that point, the variant probably isn't meaningfully different from the control — the effect size is too small to detect at your traffic volumes, which usually means the effect size is too small to matter. Extend it if you're at week two or three with directional results and the traffic is insufficient to call significance — more time is likely to produce a clear result. The prompt makes this call explicitly; don't run tests indefinitely chasing significance.
What's the "secondary metric check" and why does it matter?
It checks whether the test variant improved the primary metric at the cost of a secondary one. The most common failure mode: form completions go up but downstream MQL conversion goes down — the variant attracted more leads but lower-quality leads. That's not a real win; it's a CPL optimization that destroys pipeline economics. The secondary metric check flags this before you deploy a "winner" that makes your CPL look better while making your pipeline worse.
What are "autogrowth mutations" and how do I use the next-test recommendations?
Autogrowth mutations are minor variations on the winning variant — single sub-element adjustments designed to incrementally push the conversion rate higher over successive weekly test cycles. The prompt gives you three to five specific mutations when the variant wins (e.g., "change 'Book a Demo' to 'See It Live' in the CTA," or "move the social proof block above the form"). Each mutation tests one change against the new control. The winning mutation becomes the new control, and the cycle repeats — producing compounding gains over four to six weeks rather than one big lift.