Claude Prompt for A/B Test Hypothesis Generation

A/B test hypothesis generation: Describe your current page state, the observed problem, and any constraints, and this prompt has Claude generate five structured A/B test hypotheses. Each hypothesis targets a different conversion lever and is scored on Impact × Confidence × Ease. The output also includes a complete test design for the top-ranked hypothesis, including success metric, sample size estimate, and pre-launch validation recommendation.

The Prompt

Production Prompt — Copy and use verbatim

You are a senior B2B SaaS conversion optimization specialist generating A/B test hypotheses for paid landing pages and ad creative. You know that hypothesis quality determines test outcomes: vague hypotheses produce inconclusive results, while specific hypotheses produce wins or losses you can build on.

INPUTS

I will paste the current state below — what's running, what the performance is, and what change is being considered.

{PASTE_CURRENT_STATE_HERE}
(Example: "Enterprise landing page is converting at 1.8%. Hero headline reads 'The Website Experience Platform.' Bounce rate is 62%.")

{PASTE_OBSERVED_PROBLEM_HERE}
(Example: "Senior visitors are bouncing without scrolling past the hero. The headline tested poorly with our panel.")

{OPTIONAL_PASTE_CONSTRAINTS_HERE}
(Examples: "Cannot change form fields," "Brand team approval required for hero imagery." Leave blank if no constraints.)

WHAT I NEED FROM YOU

Generate 5 prioritized A/B test hypotheses. Produce the output in this exact order:

1. Hypothesis List
5 distinct hypotheses, each in the format: "If we change [variable], we expect [outcome] because [reasoning]." Each hypothesis should target a different lever:

- Hero headline / value prop
- CTA copy or placement
- Social proof / trust signals
- Form fields or form flow
- Visual hierarchy or above-the-fold layout

2. Prioritization
Score each hypothesis on three dimensions (1-5 scale):
- Impact: how much could this move the conversion rate if it wins?
- Confidence: how strong is the evidence that this is the right lever?
- Ease: how easy is this to test, given engineering effort, design effort, and approvals required? (5 = easiest)

Calculate the priority score: Impact × Confidence × Ease. Rank from highest to lowest.

3. Test Design for Top Hypothesis
For the highest-priority hypothesis:
- Variant A (control): describe
- Variant B (treatment): describe
- Primary success metric: state explicitly (form completion rate, MQL pull-through, etc.)
- Secondary metrics to monitor (bounce rate, scroll depth, etc.)
- Required sample size: estimate based on current traffic and expected lift
- Estimated test duration

4. Pre-Launch Validation Recommendation
Recommend whether this hypothesis is strong enough to deploy to live traffic, or whether it should first go through autoresearch panel scoring (see /prompts/ad-copy-variant-generator/) to validate the copy variants before allocating paid traffic.

JUDGMENT RULES

- A good hypothesis is specific. "Improve the headline" is not a hypothesis. "Replace the current generic value-prop headline with one that names a specific buyer pain to reduce senior-level bounce" is a hypothesis.
- Hypotheses must tie to a primary metric, not just a directional improvement. "Increase trust" is not a measurable outcome; "Increase form completion rate by 15%" is.
- Confidence comes from evidence: heat maps, session recordings, qualitative feedback, panel scoring results, competitor analysis. If a hypothesis has no evidence beyond opinion, mark its Confidence score low.
- Ease is not the same as engineering time. A copy change might be "easy" technically but require Brand team approval, which lowers Ease in practice. Account for soft constraints.
- Impact × Confidence × Ease prioritizes for compounding wins. A high-impact, high-confidence, hard-to-build test still might not be the right next test if a high-impact, high-confidence, easy-to-build test is also on the list.
- If you don't have enough context to evaluate a hypothesis (especially Impact and Confidence), say so. Do not assign arbitrary scores to fill in the table.

OUTPUT FORMAT

Return as {OUTPUT_FORMAT}.

If "markdown": hypothesis list, prioritization table, test design for top hypothesis, recommendation.
If "html": styled report with the prioritization table prominently displayed and the top hypothesis detailed.

Begin.

How to Use It

This prompt is built for the question that follows every conversion rate problem: "what should we test first?" The Impact × Confidence × Ease prioritization framework comes from real experimentation program PRDs — it's designed to surface the tests that have the highest likelihood of producing a meaningful, replicable result, not just the tests that are easiest to run or feel intuitively right. Claude applies this framework more precisely than ChatGPT in production use — Claude is more disciplined about flagging low-confidence hypotheses ("there's no data to support this beyond opinion") rather than assigning high confidence scores across the board.

Each hypothesis follows the format "If we change [variable], we expect [outcome] because [reasoning]" — the "because" is the critical part. A hypothesis without a causal mechanism isn't a hypothesis; it's a guess. The five hypotheses in the output each target a different conversion lever: hero headline, CTA copy or placement, social proof and trust signals, form fields or form flow, and visual hierarchy. Covering all five levers with one run of the prompt gives you a ranked backlog, not just one test idea.

The pre-launch validation recommendation at the end is the most-overlooked part of the output. If the top hypothesis involves copy variants, the prompt recommends running those variants through the Ad Copy Variant Generator's expert panel before allocating live traffic — that pre-validation step can eliminate weak variants in minutes rather than after 14 days of test traffic. If the hypothesis involves layout or flow changes, it goes directly to live testing since panel scoring isn't calibrated for structural changes.
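
The routing rule described above can be sketched as a small lookup. This is a minimal illustration, not part of the prompt: the change-type labels and the function name `pre_launch_route` are hypothetical, chosen to mirror the copy and structural categories this page describes.

```python
# Hypothetical labels for the kinds of change a top hypothesis can involve.
COPY_CHANGES = {"headline", "cta_copy", "value_prop_framing"}
STRUCTURAL_CHANGES = {"form_placement", "layout", "visual_hierarchy"}


def pre_launch_route(change_type: str) -> str:
    """Return the recommended pre-launch step for a given change type."""
    if change_type in COPY_CHANGES:
        # Copy variants go through panel scoring before paid traffic.
        return "panel scoring first"
    if change_type in STRUCTURAL_CHANGES:
        # Layout and flow changes go straight to a live split test.
        return "live test directly"
    # Anything else needs a human decision rather than a guess.
    return "unclassified: decide manually"


for change in ("headline", "form_placement", "pricing_table"):
    print(change, "->", pre_launch_route(change))
```

The explicit fallback is deliberate: a change that fits neither category should prompt a manual call rather than silently defaulting to live traffic.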

Example Output

Live Example

Page: Enterprise landing page
Current state: 1.8% form completion rate, 62% bounce rate, hero headline "The Website Experience Platform"
Observed problem: Senior visitors bouncing without scrolling past the hero; headline scored 64 on autoresearch panel (below 85 deploy threshold)

5 Hypotheses

1. (Hero headline) If we replace the generic value-prop headline with a pain-led headline that names enterprise marketing bottlenecks, we expect form completion to increase by 20-30% because the panel scoring showed pain-led variants outperformed value-prop variants by an average of 18 points.

2. (CTA copy) If we change the CTA from "Get Started" to "Book My Enterprise Demo," we expect form completion to increase by 10-15% because the CTA specificity effect is approximately 14 points on the panel scale.

3. (Social proof) If we add a logo bar of 6 named enterprise customers above the fold, we expect bounce rate to decrease by 10-15% because heat-map data on similar pages shows logo bars retain senior visitors during the first scroll.

4. (Form fields) If we move the phone number field from position 2 to position 5 in the form, we expect form completion to increase by 20-30% because high-friction fields early in a form depress completion by a measured 20-30 points.

5. (Above-the-fold layout) If we tighten the hero to a single CTA above the fold (currently has two competing CTAs), we expect form completion to increase by 5-10% because dual-CTA designs split visitor attention.

Prioritization

| # | Hypothesis                  | Impact | Conf | Ease | Score |
|---|-----------------------------|--------|------|------|-------|
| 4 | Move phone to position 5    | 4      | 5    | 5    | 100   |
| 1 | Hero headline (pain-led)    | 5      | 4    | 4    | 80    |
| 2 | CTA specificity change      | 3      | 5    | 5    | 75    |
| 3 | Logo bar above fold         | 3      | 3    | 4    | 36    |
| 5 | Single CTA above fold       | 2      | 4    | 4    | 32    |

Ranked top to bottom by Impact × Confidence × Ease.

Test Design for Top Hypothesis (#4 — Form Field Order)

- Variant A (control): Phone number at position 2 of form
- Variant B (treatment): Phone number at position 5 of form
- Primary success metric: Form completion rate
- Secondary metrics: Phone field abandonment rate, downstream MQL pull-through (verify treatment doesn't degrade lead quality by attracting form-completers who would have abandoned)
- Required sample size: ~23,500 visitors per variant (assumes 1.8% baseline, +20% relative lift, 95% confidence, 80% power)
- Estimated test duration: 14 days at current traffic

Pre-Launch Validation Recommendation
This hypothesis does not require autoresearch panel scoring — the change is a structural/layout test, not a copy test. Deploy directly to a 50/50 split. If the test wins, follow up with autoresearch on the hero headline (hypothesis #1) which is copy-dependent.

Illustrative example based on the PRD's documented field-order friction effect. Will be replaced with a redacted live hypothesis brief.
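
Sample size estimates are sensitive to the assumptions behind them, so it is worth recomputing the figure for your own page before committing traffic. The sketch below uses the standard normal-approximation formula for comparing two proportions. It assumes a two-sided test, treats the expected lift as relative to the baseline, and assumes 80% power, which the example above does not state. The function name `sample_size_per_variant` is illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)  # treatment rate if the lift is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Baseline and lift taken from the example; power is an assumption.
n = sample_size_per_variant(baseline=0.018, relative_lift=0.20)
print(f"Visitors per variant: {n:,}")
```

Low baseline rates and modest relative lifts push the requirement up quickly. Dividing the result by actual daily traffic per variant gives a more defensible test duration than a fixed window.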

Common Failure Modes

Vague hypotheses dressed up as specific. The first version of the prompt produced "If we improve the headline, we expect higher conversion because the headline is the most important element on the page." That's a guess, not a hypothesis. The "Replace [specific thing] with [specific thing] to drive [specific outcome] because [specific evidence]" structure is enforced now, but if your input is vague, the output will drift back to vague.
Confidence inflation without evidence. The model wanted to assign 5/5 confidence to every hypothesis based on "best practices." Best practices aren't evidence. The judgment rule requires explicit evidence (heat maps, session recordings, panel scoring, competitor analysis) before a 4 or 5 confidence score. If the hypothesis is pure opinion, the score caps at 2.
Ignoring soft constraints. A "hero imagery" hypothesis might be 5/5 on impact and confidence but 1/5 on ease if Brand team approval is required for any hero change. The model defaults to engineering time as the only ease dimension; the judgment rule now forces it to account for approvals, dependencies, and stakeholder politics.
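
The confidence cap described above can be expressed as a simple check when reviewing the model's scores. This is a sketch under stated assumptions: the evidence labels and the function name `capped_confidence` are hypothetical, and treating any single piece of listed evidence as enough to lift the cap is a simplification.

```python
# Hypothetical labels for the evidence types named in the judgment rules.
EVIDENCE_TYPES = {
    "heat_map",
    "session_recording",
    "qualitative_feedback",
    "panel_scoring",
    "competitor_analysis",
}


def capped_confidence(proposed: int, evidence: set) -> int:
    """Cap an opinion-only Confidence score at 2; otherwise keep it."""
    if not 1 <= proposed <= 5:
        raise ValueError("Confidence must be on the 1-5 scale")
    if evidence & EVIDENCE_TYPES:
        return proposed  # at least one recognized evidence type backs it
    return min(proposed, 2)  # pure opinion: the score caps at 2


print(capped_confidence(5, set()))
print(capped_confidence(5, {"heat_map"}))
```

Running the model's scores through a check like this makes inflated Confidence values easy to spot before they distort the ranking.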

Variations

Two variations of this prompt are worth knowing.

Variation 1: Ad Creative Hypothesis Version

Adapted for generating hypotheses about ad creative variables rather than landing page elements — testing ad headline angles, visual formats, CTA specificity, and audience-signal alignment. Uses the same Impact × Confidence × Ease framework scoped to paid creative decisions.

Coming soon

Variation 2: Micro-Hypothesis Version (Single Element)

Narrowed to a single element (e.g., just the CTA button copy or just the form placement) when the team has already identified the lever and needs five distinct hypotheses for that specific variable rather than five hypotheses across different levers.

Coming soon

Frequently Asked Questions

What's the difference between Impact, Confidence, and Ease in the scoring?

Impact is how much the change could move the primary metric if the hypothesis is correct — a hero headline change on a high-traffic page has higher Impact than a CTA tweak on a low-traffic page. Confidence is how strong the evidence is that this is the right lever — a hypothesis backed by heat map data and session recordings has higher Confidence than one based on opinion. Ease is how hard the test is to implement in practice, including soft constraints like Brand approval or engineering dependency. The three scores multiply, so a high-impact, high-confidence, hard test (5×5×1) scores lower than a high-impact, moderate-confidence, easy test (5×3×4).
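
To make the multiplication concrete, here is a minimal sketch that computes and ranks priority scores using the numbers from the example prioritization table. The `Hypothesis` class is illustrative and not part of the prompt.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    label: str
    impact: int
    confidence: int
    ease: int

    @property
    def score(self) -> int:
        # Priority score is the product of the three 1-5 ratings.
        return self.impact * self.confidence * self.ease


backlog = [
    Hypothesis("Hero headline (pain-led)", 5, 4, 4),
    Hypothesis("CTA specificity change", 3, 5, 5),
    Hypothesis("Logo bar above fold", 3, 3, 4),
    Hypothesis("Move phone to position 5", 4, 5, 5),
    Hypothesis("Single CTA above fold", 2, 4, 4),
]

# Highest score first gives the ranked test backlog.
ranked = sorted(backlog, key=lambda h: h.score, reverse=True)
for h in ranked:
    print(f"{h.score:>3}  {h.label}")
```

Because the scores multiply rather than add, one low rating drags the whole score down: 5×5×1 gives 25, while 5×3×4 gives 60.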

How specific should my "current state" input be?

The more specific, the better the hypotheses. "Enterprise landing page is converting at 1.8%. Hero headline reads 'The Website Experience Platform.' Bounce rate is 62% for senior visitors from paid traffic" gives the model enough signal to generate hypotheses with real causal reasoning. "Landing page performance isn't great" produces generic hypotheses with low actionability. Include your current conversion rate, the specific copy or layout you're testing against, any behavioral data you have (heat maps, scroll depth, session recordings), and any qualitative feedback from the panel or from sales.
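
If you reuse the prompt often, filling the placeholders programmatically helps keep the inputs complete. The sketch below is one possible approach: the function name `fill_prompt` and the shortened `template` are stand-ins, and only the placeholder names come from the prompt itself.

```python
PLACEHOLDERS = (
    "{PASTE_CURRENT_STATE_HERE}",
    "{PASTE_OBSERVED_PROBLEM_HERE}",
    "{OPTIONAL_PASTE_CONSTRAINTS_HERE}",
    "{OUTPUT_FORMAT}",
)


def fill_prompt(template, current_state, observed_problem,
                constraints="", output_format="markdown"):
    """Substitute the prompt's placeholders, rejecting empty required inputs."""
    if not current_state.strip() or not observed_problem.strip():
        raise ValueError("Current state and observed problem are required")
    values = (current_state, observed_problem, constraints, output_format)
    filled = template
    for placeholder, value in zip(PLACEHOLDERS, values):
        # Plain replacement avoids str.format tripping on other braces.
        filled = filled.replace(placeholder, value)
    return filled


# Shortened stand-in for the full prompt text.
template = (
    "Current state: {PASTE_CURRENT_STATE_HERE}\n"
    "Observed problem: {PASTE_OBSERVED_PROBLEM_HERE}\n"
    "Constraints: {OPTIONAL_PASTE_CONSTRAINTS_HERE}\n"
    "Return as {OUTPUT_FORMAT}."
)

filled = fill_prompt(
    template,
    current_state="Enterprise landing page is converting at 1.8%. Bounce rate is 62%.",
    observed_problem="Senior visitors are bouncing without scrolling past the hero.",
)
print(filled)
```

Constraints default to an empty string, matching the prompt's instruction to leave that field blank when there are none.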

Can I use this prompt if I don't have heat map or session recording data?

Yes, but the Confidence scores will be lower for all hypotheses, which is the correct output. A hypothesis based only on intuition ("I think the headline is too generic") has lower Confidence than one backed by data ("72% of senior visitors from LinkedIn bounce without scrolling past the hero, and the panel scored the headline 52/100 on the Skeptical Buyer dimension"). The prompt marks low-confidence hypotheses explicitly — that's useful information, not a problem. It means you should consider a qualitative validation step before committing traffic.

What does "pre-launch validation" mean and when should I do it?

Pre-launch validation means running copy variants through the Ad Copy Variant Generator's expert panel before you allocate live paid traffic to a test. It takes 10 minutes and eliminates weak copy variants before the 14-day live test window starts. Do it whenever the top hypothesis involves copy changes — headline variants, CTA variants, value prop framing. Skip it for structural changes (form placement, layout, visual hierarchy) since the panel doesn't score layout decisions reliably. The prompt makes this recommendation explicitly for the top hypothesis.