The Prompt
You are a senior B2B SaaS conversion optimization specialist generating A/B test hypotheses for paid landing pages and ad creative. You know that hypothesis quality determines test outcomes: vague hypotheses produce inconclusive results, while specific hypotheses produce wins or losses you can build on.
INPUTS
I will paste the current state below — what's running, what the performance is, and what change is being considered.
{PASTE_CURRENT_STATE_HERE}
(Example: "Enterprise landing page is converting at 1.8%. Hero headline reads 'The Website Experience Platform.' Bounce rate is 62%.")
{PASTE_OBSERVED_PROBLEM_HERE}
(Example: "Senior visitors are bouncing without scrolling past the hero. The headline tested poorly with our panel.")
{OPTIONAL_PASTE_CONSTRAINTS_HERE}
(Examples: "Cannot change form fields," "Brand team approval required for hero imagery." Leave blank if no constraints.)
WHAT I NEED FROM YOU
Generate 5 prioritized A/B test hypotheses. Produce the output in this exact order:
1. Hypothesis List
5 distinct hypotheses, each in the format: "If we change [variable], we expect [outcome] because [reasoning]." Each hypothesis should target a different lever:
- Hero headline / value prop
- CTA copy or placement
- Social proof / trust signals
- Form fields or form flow
- Visual hierarchy or above-the-fold layout
2. Prioritization
Score each hypothesis on three dimensions (1-5 scale):
- Impact: how much could this move the conversion rate if it wins?
- Confidence: how strong is the evidence that this is the right lever?
- Ease: how easy is this to test, counting engineering effort, design effort, and approvals required? (5 = easiest to run, 1 = hardest)
Calculate the priority score: Impact × Confidence × Ease. Rank from highest to lowest.
3. Test Design for Top Hypothesis
For the highest-priority hypothesis:
- Variant A (control): describe
- Variant B (treatment): describe
- Primary success metric: state explicitly (form completion rate, MQL pull-through, etc.)
- Secondary metrics to monitor (bounce rate, scroll depth, etc.)
- Required sample size: estimate based on current traffic and expected lift
- Estimated test duration
4. Pre-Launch Validation Recommendation
Recommend whether this hypothesis is strong enough to deploy to live traffic, or whether it should first go through autoresearch panel scoring (see /prompts/ad-copy-variant-generator/) to validate the copy variants before allocating paid traffic.
JUDGMENT RULES
- A good hypothesis is specific. "Improve the headline" is not a hypothesis. "Replace the current generic value-prop headline with one that names a specific buyer pain to reduce senior-level bounce" is a hypothesis.
- Hypotheses must tie to a primary metric, not just a directional improvement. "Increase trust" is not a measurable outcome; "Increase form completion rate by 15%" is.
- Confidence comes from evidence: heat maps, session recordings, qualitative feedback, panel scoring results, competitor analysis. If a hypothesis has no evidence beyond opinion, mark its Confidence score low.
- Ease is not the same as engineering time. A copy change might be "easy" technically but require Brand team approval, which lowers Ease in practice. Account for soft constraints.
- Impact × Confidence × Ease prioritizes for compounding wins. A high-impact, high-confidence, hard-to-build test still might not be the right next test if a high-impact, high-confidence, easy-to-build test is also on the list.
- If you don't have enough context to evaluate a hypothesis (especially Impact and Confidence), say so. Do not assign arbitrary scores to fill in the table.
OUTPUT FORMAT
Return as {OUTPUT_FORMAT} (either "markdown" or "html").
If "markdown": hypothesis list, prioritization table, test design for top hypothesis, recommendation.
If "html": styled report with the prioritization table prominently displayed and the top hypothesis detailed.
Begin.
How to Use It
This prompt is built for the question that follows every conversion rate problem: "what should we test first?" The Impact × Confidence × Ease prioritization framework comes from real experimentation program PRDs — it's designed to surface the tests that have the highest likelihood of producing a meaningful, replicable result, not just the tests that are easiest to run or feel intuitively right. Claude applies this framework more precisely than ChatGPT in production use — Claude is more disciplined about flagging low-confidence hypotheses ("there's no data to support this beyond opinion") rather than assigning high confidence scores across the board.
Each hypothesis follows the format "If we change [variable], we expect [outcome] because [reasoning]" — the "because" is the critical part. A hypothesis without a causal mechanism isn't a hypothesis; it's a guess. The five hypotheses in the output each target a different conversion lever: hero headline, CTA copy or placement, social proof and trust signals, form fields or form flow, and visual hierarchy. Covering all five levers with one run of the prompt gives you a ranked backlog, not just one test idea.
The pre-launch validation recommendation at the end is the most-overlooked part of the output. If the top hypothesis involves copy variants, the prompt recommends running those variants through the Ad Copy Variant Generator's expert panel before allocating live traffic — that pre-validation step can eliminate weak variants in minutes rather than after 14 days of test traffic. If the hypothesis involves layout or flow changes, it goes directly to live testing since panel scoring isn't calibrated for structural changes.
Example Output
Example output coming soon — currently running this prompt against live data and will publish the redacted output once it's ready.
Common Failure Modes
Variations
Two variations of this prompt are worth knowing.
Variation 1: Ad Creative Hypothesis Version
Adapted for generating hypotheses about ad creative variables rather than landing page elements — testing ad headline angles, visual formats, CTA specificity, and audience-signal alignment. Uses the same Impact × Confidence × Ease framework scoped to paid creative decisions.
[PROMPT GOES HERE]
Variation 2: Micro-Hypothesis Version (Single Element)
Narrowed to a single element (e.g., just the CTA button copy or just the form placement) when the team has already identified the lever and needs five distinct hypotheses for that specific variable rather than five hypotheses across different levers.
[PROMPT GOES HERE]
Frequently Asked Questions
What's the difference between Impact, Confidence, and Ease in the scoring?
Impact is how much the change could move the primary metric if the hypothesis is correct: a hero headline change on a high-traffic page has higher Impact than a CTA tweak on a low-traffic page. Confidence is how strong the evidence is that this is the right lever: a hypothesis backed by heat map data and session recordings has higher Confidence than one based on opinion. Ease is how easy the test is to implement in practice, including soft constraints like Brand approval or engineering dependency; a higher score means an easier test. The three scores multiply, so a high-impact, high-confidence, hard test (5×5×1 = 25) scores lower than a high-impact, moderate-confidence, easy test (5×3×4 = 60).
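The multiplication can be checked with a short sketch. The `Hypothesis` class and the two backlog entries below are hypothetical names used only to reproduce the two score combinations from the answer above; they are not part of the prompt's output.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    impact: int      # 1-5
    confidence: int  # 1-5
    ease: int        # 1-5, where 5 is the easiest to run

    def __post_init__(self) -> None:
        # Reject scores outside the 1-5 scale the prompt specifies.
        for label in ("impact", "confidence", "ease"):
            value = getattr(self, label)
            if not 1 <= value <= 5:
                raise ValueError(f"{label} must be between 1 and 5, got {value}")

    @property
    def priority(self) -> int:
        return self.impact * self.confidence * self.ease


backlog = [
    Hypothesis("High impact, high confidence, hard to build", 5, 5, 1),
    Hypothesis("High impact, moderate confidence, easy to build", 5, 3, 4),
]

# Rank from highest to lowest priority score.
ranked = sorted(backlog, key=lambda h: h.priority, reverse=True)
for h in ranked:
    print(f"{h.priority:>3}  {h.name}")
```

Because the scores multiply rather than add, a single low score pulls the whole product down, which is why the hard-to-build test ranks second here despite its stronger evidence.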
How specific should my "current state" input be?
The more specific, the better the hypotheses. "Enterprise landing page is converting at 1.8%. Hero headline reads 'The Website Experience Platform.' Bounce rate is 62% for senior visitors from paid traffic" gives the model enough signal to generate hypotheses with real causal reasoning. "Landing page performance isn't great" produces generic hypotheses with low actionability. Include your current conversion rate, the specific copy or layout you're testing against, any behavioral data you have (heat maps, scroll depth, session recordings), and any qualitative feedback from the panel or from sales.
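If you fill the prompt programmatically rather than by hand, the substitution might look like the sketch below. The placeholder names are the ones in the prompt; the `fill_prompt` function, its validation rules, and the shortened template string are assumptions for illustration only.

```python
def fill_prompt(template: str, current_state: str, observed_problem: str,
                constraints: str = "", output_format: str = "markdown") -> str:
    """Substitute the prompt's placeholders with concrete inputs."""
    if not current_state.strip() or not observed_problem.strip():
        raise ValueError("current state and observed problem are both required")
    if output_format not in ("markdown", "html"):
        raise ValueError('output_format must be "markdown" or "html"')

    values = {
        "{PASTE_CURRENT_STATE_HERE}": current_state,
        "{PASTE_OBSERVED_PROBLEM_HERE}": observed_problem,
        "{OPTIONAL_PASTE_CONSTRAINTS_HERE}": constraints,
        "{OUTPUT_FORMAT}": output_format,
    }
    # str.replace is used instead of str.format so that any other braces
    # in the prompt text are left untouched.
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template


# Shortened stand-in for the full prompt text.
template = (
    "{PASTE_CURRENT_STATE_HERE}\n"
    "{PASTE_OBSERVED_PROBLEM_HERE}\n"
    "{OPTIONAL_PASTE_CONSTRAINTS_HERE}\n"
    "Return as {OUTPUT_FORMAT}."
)

filled = fill_prompt(
    template,
    current_state="Enterprise landing page is converting at 1.8%. "
                  "Hero headline reads 'The Website Experience Platform.' "
                  "Bounce rate is 62%.",
    observed_problem="Senior visitors are bouncing without scrolling past the hero.",
    constraints="Cannot change form fields.",
)
print(filled)
```

Rejecting empty inputs up front mirrors the point above: a vague or missing current state produces generic hypotheses, so it is better to fail early than to send the prompt without it.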
Can I use this prompt if I don't have heat map or session recording data?
Yes, but the Confidence scores will be lower for all hypotheses, which is the correct output. A hypothesis based only on intuition ("I think the headline is too generic") has lower Confidence than one backed by data ("72% of senior visitors from LinkedIn bounce without scrolling past the hero, and the panel scored the headline 52/100 on the Skeptical Buyer dimension"). The prompt marks low-confidence hypotheses explicitly — that's useful information, not a problem. It means you should consider a qualitative validation step before committing traffic.
What does "pre-launch validation" mean and when should I do it?
Pre-launch validation means running copy variants through the Ad Copy Variant Generator's expert panel before you allocate live paid traffic to a test. It takes 10 minutes and eliminates weak copy variants before the 14-day live test window starts. Do it whenever the top hypothesis involves copy changes — headline variants, CTA variants, value prop framing. Skip it for structural changes (form placement, layout, visual hierarchy) since the panel doesn't score layout decisions reliably. The prompt makes this recommendation explicitly for the top hypothesis.