Claude Prompt for Ad Copy Variant Generation with Expert Panel Scoring

Ad copy variant generator: Generate and pre-validate B2B ad copy variants by feeding Claude a copy element, target audience, and channel. The prompt produces 10 variants across 10 distinct angles, scores each against a 5-persona expert panel (CMO, Skeptical Buyer, CRO Specialist, Senior Copywriter, ROI-Focused CEO), and returns the top 3 winners with a recommendation, so a 10-minute autoresearch loop narrows the field before you commit weeks of live A/B testing to it.

The Prompt

Production Prompt — Copy and use verbatim

You are an autoresearch system for B2B SaaS ad copy. You generate copy variants and score them with a simulated 5-person expert panel before any traffic hits the page. You know that traditional A/B testing limits a team to 20-30 variants per year because of the 14-day wait period — and that pre-launch panel scoring breaks that limit by generating and validating 50+ variants in under 10 minutes.

INPUTS

I will paste the copy element to test below. Could be: a landing page headline, a CTA button, ad copy for a specific channel, a form field label, a thank-you page message. Also include the target audience and the channel.

{PASTE_COPY_ELEMENT_HERE}

{PASTE_TARGET_AUDIENCE_HERE}
(Example: "Senior decision-makers in [function] at [company stage / size] companies evaluating [product category].")

{PASTE_CHANNEL_HERE}
(Example: "LinkedIn Sponsored Content," "Google RSA headline," "Enterprise landing page hero.")

WHAT I NEED FROM YOU

Run the autoresearch loop and produce the output in this exact order:

1. Generate 10 Variants
Produce 10 distinct variants of the copy element. Each variant should take a different angle:
- Pain-agitate-solve
- Before / after
- Provocative question
- Social proof
- Specific outcome / number
- Authority / credibility
- Contrarian / counter-conventional
- Urgency / scarcity (only if appropriate to the brand)
- Direct / no-frills
- Problem-first

Each variant: 1 line of copy.

2. Expert Panel Scoring
Score every variant 0-100 against each of the five panel personas. Each persona scores against their specific question:

- CMO / VP Marketing (senior marketing leader at a mid-market or enterprise B2B company): "Would this make me stop scrolling?"
- Skeptical Buyer (budget owner who has seen every pitch): "Do I believe this claim?"
- CRO Specialist (conversion expert reviewing the page): "Is this clear and action-driving?"
- Senior Copywriter (B2B SaaS copywriting expert): "Is this compelling and differentiated?"
- ROI-Focused CEO (enterprise decision-maker evaluating vendors): "Would I put this on my site?"

Output as a table: Variant | CMO | Skeptical Buyer | CRO | Copywriter | CEO | Average.

3. Top 3 Winners
The three highest-scoring variants by average. For each: state the score, the strongest panel score (which persona scored it highest and why), and the weakest panel score (which persona scored it lowest and why).

4. Cross-Bred Combinations (optional, second round)
If multiple elements are being tested in parallel (e.g., headline AND CTA), generate 6-9 cross-bred combinations from the top 3 winners of each element. Score the combinations as complete units.

5. Recommendation
Single recommended variant to deploy as the autoresearch winner. State why this one over the second-place finisher.

JUDGMENT RULES

- The five-persona panel is calibrated to surface different failure modes. A variant that scores 90 from the CMO but 50 from the Skeptical Buyer is too punchy and not credible. A variant that scores 90 from the Copywriter but 50 from the CRO is clever but not conversion-driving. The average is informative; the spread is more informative.
- Generic CTAs ("Get Started," "Learn More") score systematically lower than specific CTAs ("Book My Enterprise Demo," "See Pricing for 500+ Seats"). The CTA specificity effect is approximately 14 points on the panel scale.
- High-friction form fields (phone, company size) score lower when placed early in a form. The CRO panelist will mark these down 20-30 points relative to the same fields placed later.
- "Human language" in thank-you copy outperforms corporate language. "Our team reviews every submission personally — expect a direct reply within one business day, not a nurture sequence" outscores "Your request has been received."
- Do not score every variant in the 80-90 range. The panel is a forcing function; if every variant scores high, the panel is broken. Real distributions show 40-90 ranges with clear winners.
- If you don't have enough context to score a variant honestly, say so. "Insufficient context on target audience pain points to score this variant against the Skeptical Buyer persona" is the right answer when you're not sure.

OUTPUT FORMAT

Return as {OUTPUT_FORMAT}.

If "markdown": variants list, scoring table, top 3 detail, recommendation.
If "html": styled report with the scoring leaderboard prominently displayed.

Begin.

How to Use It

Traditional A/B testing limits a team to 20–30 variants per year because of minimum traffic requirements and 14-day wait periods per test. The autoresearch approach in this prompt breaks that limit by using a simulated expert panel to pre-validate variants before any traffic hits the page. In practice, the workflow is: run the autoresearch prompt to identify the top 3 variants from 10 generated options, then run those 3 as live tests. You're testing validated winners against each other rather than testing blind.
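If you run the prompt repeatedly, filling its bracketed placeholders can be scripted. The following is a minimal sketch, not part of the original workflow: `fill_prompt` and `demo_template` are hypothetical names, and the demo template is a shortened stand-in for the full prompt text. Plain string replacement is used instead of `str.format` so that any other braces in the prompt are left alone.

```python
# Placeholders exactly as they appear in the prompt text.
PLACEHOLDERS = (
    "{PASTE_COPY_ELEMENT_HERE}",
    "{PASTE_TARGET_AUDIENCE_HERE}",
    "{PASTE_CHANNEL_HERE}",
    "{OUTPUT_FORMAT}",
)


def fill_prompt(template: str, copy_element: str, audience: str,
                channel: str, output_format: str = "markdown") -> str:
    """Substitute the four placeholders; fail loudly if any is absent."""
    values = dict(zip(PLACEHOLDERS,
                      (copy_element, audience, channel, output_format)))
    missing = [p for p in values if p not in template]
    if missing:
        raise ValueError(f"template is missing: {', '.join(missing)}")
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template


# Shortened stand-in for the full prompt, for demonstration only.
demo_template = (
    "{PASTE_COPY_ELEMENT_HERE}\n"
    "{PASTE_TARGET_AUDIENCE_HERE}\n"
    "{PASTE_CHANNEL_HERE}\n"
    "Return as {OUTPUT_FORMAT}."
)

filled = fill_prompt(
    demo_template,
    copy_element="Get Started",
    audience="Senior marketing and web ops leaders at 500+ employee B2B SaaS companies",
    channel="Enterprise landing page hero",
)
print(filled)
```

The missing-placeholder check is there because a silently unfilled `{PASTE_CHANNEL_HERE}` would still produce output, just against the wrong brief.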

Claude (Sonnet or Opus) is the right model for this task. The 5-persona scoring requires holding distinct critical perspectives simultaneously: the CMO scores for attention-stopping, the Skeptical Buyer for believability, the CRO for action orientation, the Copywriter for craft, and the ROI-Focused CEO for enterprise credibility. Claude maintains those distinct voices more reliably than GPT-4 class models, which tend to score most variants in the 80–90 range. The "if every variant scores high, the panel is broken" rule in the prompt exists to catch exactly that compression.

The cross-bred combinations section (Step 4) is the highest-leverage feature. When you're testing multiple elements simultaneously — headline AND CTA — generate variants of each separately, identify the top 3 winners per element, then generate cross-bred combinations. A top-3 headline × top-3 CTA grid produces 9 combinations to test, all of which have already been panel-validated. This is where the compound efficiency of the autoresearch approach really shows.
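The 3 × 3 grid described above can be enumerated mechanically before you paste the combinations back into the prompt for scoring. This is a small sketch using the standard library; `cross_breed` is a hypothetical helper name and the labels are placeholders rather than real copy.

```python
from itertools import product


def cross_breed(headlines: list[str], ctas: list[str]) -> list[tuple[str, str]]:
    """Pair every winning headline with every winning CTA."""
    return list(product(headlines, ctas))


# Placeholder labels: substitute your own top-3 winners per element.
top_headlines = ["headline A", "headline B", "headline C"]
top_ctas = ["CTA A", "CTA B", "CTA C"]

combos = cross_breed(top_headlines, top_ctas)
print(len(combos))  # 9
for headline, cta in combos:
    print(f"{headline} + {cta}")
```

Each pair should still be scored as a complete unit, since a headline and CTA that each won separately may not read well together.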

Example Output

Illustrative Example

Element tested: Enterprise landing page hero headline
Audience: Senior marketing and web ops leaders at 500+ employee B2B SaaS companies
Channel: Enterprise landing page (direct + paid)

10 Variants Generated

1. (Pain-agitate-solve) Tired of waiting six weeks for every landing page launch?
2. (Before/after) Your enterprise sites, built without the engineering bottleneck.
3. (Provocative question) Why is your marketing team still blocked by your codebase?
4. (Social proof) The website platform 14,000+ enterprise teams trust.
5. (Specific outcome) Launch enterprise landing pages 4x faster — without writing code.
6. (Authority) Built for marketing teams that ship as fast as product teams.
7. (Contrarian) Most "no-code" platforms can't handle enterprise. We can.
8. (Direct) Enterprise-grade websites. Marketing-owned.
9. (Problem-first) Engineering shouldn't be your bottleneck for every landing page.
10. (Specific outcome) Cut your time-to-launch from 6 weeks to 6 days.

Expert Panel Scoring

| #  | CMO | Skeptical | CRO | Copywriter | CEO | Avg  |
|----|-----|-----------|-----|------------|-----|------|
| 1  | 78  | 52        | 71  | 64         | 58  | 64.6 |
| 2  | 82  | 71        | 84  | 79         | 81  | 79.4 |
| 3  | 88  | 49        | 76  | 81         | 62  | 71.2 |
| 4  | 71  | 78        | 68  | 54         | 76  | 69.4 |
| 5  | 84  | 81        | 89  | 86         | 88  | 85.6 |
| 6  | 74  | 68        | 71  | 78         | 72  | 72.6 |
| 7  | 81  | 64        | 79  | 84         | 71  | 75.8 |
| 8  | 69  | 78        | 81  | 67         | 79  | 74.8 |
| 9  | 79  | 72        | 82  | 81         | 76  | 78.0 |
| 10 | 86  | 82        | 88  | 84         | 87  | 85.4 |

Top 3 Winners

1. Variant 5 — "Launch enterprise landing pages 4x faster — without writing code." (85.6)
   Strongest: CRO (89) — clearest action and outcome in the set.
   Weakest: Skeptical Buyer (81) — still believes it, just wants proof. Pair with specific case study below the fold.

2. Variant 10 — "Cut your time-to-launch from 6 weeks to 6 days." (85.4)
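   Strongest: CRO (88), with the CEO close behind (87): concrete business impact, easy to defend to a board.
   Weakest: Skeptical Buyer (82): the six-week and six-day figures need substantiating. The Copywriter (84) finds the "6 to 6" phrasing slightly twee. Still high overall.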

3. Variant 2 — "Your enterprise sites, built without the engineering bottleneck." (79.4)
   Strongest: CRO (84) — clean and direct.
   Weakest: Skeptical Buyer (71), then Copywriter (79): solid but lacks a specific outcome. Better as subhead than headline.

Recommendation
Deploy Variant 5: "Launch enterprise landing pages 4x faster — without writing code."

It edges Variant 10 on panel average (85.6 vs 85.4) but more importantly has the highest CRO score in the set (89). The "4x faster" framing is tied to a measurable benefit that the Skeptical Buyer can pressure-test against a case study. Variant 10's "6 to 6" is sharper rhetorically but slightly harder to substantiate. Recommend pairing Variant 5 as hero headline with Variant 2 as subheadline.

Illustrative example based on the PRD's documented autoresearch methodology (62→87 score benchmark, CTA specificity effect). Will be replaced with a redacted live autoresearch run.
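Because the model does its own arithmetic, it is worth recomputing the averages and the ranking from the scoring table before acting on them. The sketch below copies the numbers from the example table; the names `TABLE` and `average` are illustrative only.

```python
# Persona scores (CMO, Skeptical Buyer, CRO, Copywriter, CEO) and the
# average listed in the example table, keyed by variant number.
TABLE = {
    1: ((78, 52, 71, 64, 58), 64.6),
    2: ((82, 71, 84, 79, 81), 79.4),
    3: ((88, 49, 76, 81, 62), 71.2),
    4: ((71, 78, 68, 54, 76), 69.4),
    5: ((84, 81, 89, 86, 88), 85.6),
    6: ((74, 68, 71, 78, 72), 72.6),
    7: ((81, 64, 79, 84, 71), 75.8),
    8: ((69, 78, 81, 67, 79), 74.8),
    9: ((79, 72, 82, 81, 76), 78.0),
    10: ((86, 82, 88, 84, 87), 85.4),
}


def average(scores: tuple[int, ...]) -> float:
    """Mean of the five persona scores, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


# Variants whose listed average disagrees with the recomputed one.
mismatches = [v for v, (scores, listed) in TABLE.items()
              if average(scores) != listed]

# Variant numbers ordered by total score, highest first.
top3 = sorted(TABLE, key=lambda v: sum(TABLE[v][0]), reverse=True)[:3]

print(mismatches)  # []
print(top3)        # [5, 10, 2]
```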

Common Failure Modes

Compressing all scores into 80-90. First version was too generous — every variant scored 80+, which defeats the point of the panel as a forcing function. The "do not score every variant in the 80-90 range" rule pulled distributions back to realistic 40-90 spreads. If you see all your scores cluster high, re-run with the constraint emphasized.
Treating average as the only signal. Two variants can both average 80, but one scores (90, 70, 80, 80, 80) in panel order and the other scores (80, 80, 80, 80, 80). The first one has a real weakness (Skeptical Buyer at 70 = credibility gap) that needs fixing. The spread is at least as informative as the average, so the prompt now surfaces both.
Defaulting to generic CTAs. When asked to generate CTA variants, the model still drifts toward "Get Started" and "Learn More" if you don't explicitly ban them. The CTA specificity effect is documented as ~14 points on the panel scale — naming the principle in JUDGMENT RULES helps, but you may need to add "do not include generic CTAs" to the input itself.
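The average-versus-spread point can be made concrete with a few lines of arithmetic. This is a sketch under the assumption that scores are listed in the panel order used by the scoring table; `profile` is a hypothetical helper, and "spread" is taken here to mean highest score minus lowest.

```python
PERSONAS = ("CMO", "Skeptical Buyer", "CRO", "Copywriter", "CEO")


def profile(scores: tuple[int, ...]) -> dict:
    """Return the average, the spread (max minus min), and the weakest persona."""
    spread = max(scores) - min(scores)
    # With no spread there is no meaningful "weakest" persona to report.
    weakest = PERSONAS[scores.index(min(scores))] if spread else None
    return {
        "average": sum(scores) / len(scores),
        "spread": spread,
        "weakest": weakest,
    }


uneven = (90, 70, 80, 80, 80)  # same average, one credibility gap
flat = (80, 80, 80, 80, 80)    # same average, no weak spot

print(profile(uneven))
print(profile(flat))
```

Both tuples average 80, but only the first reports a 20-point spread with the Skeptical Buyer as the low score, which is the signal an average alone hides.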

Variations

Two variations of this prompt are worth knowing.

Landing Page Headline Focus

Scoped specifically to landing page hero headlines, with the expert panel calibrated for B2B SaaS landing page conversion rather than ad engagement. Includes additional CRO-specific scoring criteria. Use this version before running your hypotheses through the A/B Test Hypothesis Generator.

Coming soon



LinkedIn-Specific Ad Copy

Adapted for LinkedIn Sponsored Content ad copy, accounting for LinkedIn's specific engagement patterns — professional context, news feed placement, character limits, and the expectation that the reader is in work mode. Links to the Ad Creative Audit prompt for scoring the live results.

Coming soon




Frequently Asked Questions

Does this work with ChatGPT or only Claude?

Use Claude. The 5-persona panel scoring requires holding genuinely distinct critical perspectives — not just variations on "this is good." GPT-4 class models tend to score variants in a compressed range (80–90 for everything) because they default to positive framing. Claude scores more honestly, including low scores for weak variants, which means the rank order is actually informative. If every variant scores 85 or above, the panel is broken — check your model choice first.
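A quick way to spot a compressed panel is to test whether every score falls inside one narrow band. The sketch below assumes the 80 to 90 band named in the prompt's judgment rules as its default; `looks_compressed` is a hypothetical helper and the sample rows are made up for illustration, so adjust the band to your own runs.

```python
def looks_compressed(panel_scores: list[tuple[int, ...]],
                     low: int = 80, high: int = 90) -> bool:
    """True when every persona score for every variant sits in [low, high]."""
    flat = [score for row in panel_scores for score in row]
    return bool(flat) and all(low <= score <= high for score in flat)


# Made-up sample rows, one tuple of five persona scores per variant.
compressed_run = [(84, 82, 88, 85, 86), (81, 83, 87, 90, 80)]
healthy_run = [(78, 52, 71, 64, 58), (84, 81, 89, 86, 88)]

print(looks_compressed(compressed_run))  # True
print(looks_compressed(healthy_run))     # False
```

A `True` result is a prompt to re-run with the scoring constraint emphasized or to check the model choice, not proof that the variants are equally strong.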

How do I use the panel scores to pick what to test live?

Focus on the average score AND the spread. A variant that averages 85 with consistent scores across all five personas is a strong candidate. A variant that averages 85 because the CMO scored it 98 and the Skeptical Buyer scored it 65 is a polarizing variant — it might work if your audience skews toward senior marketers, but it may fail if the skeptical buyer persona is more representative. Use the spread to understand which audience risks each variant carries.

The prompt mentions a 14-point delta between "Get Started" and "Book My Enterprise Demo" — is that real?

Yes. Specific CTAs consistently outperform generic ones in the panel scoring because they score significantly higher with the CRO Specialist and ROI-Focused CEO personas. "Get Started" fails on "Is this clear and action-driving?" because it doesn't specify what starting entails. "Book My Enterprise Demo" passes that test. The specificity effect holds in real A/B tests as well — this isn't a panel artifact, it's a real conversion pattern.

Can I adapt this for B2C copy or email subject lines?

Yes for email subject lines — the panel scoring logic translates cleanly to email engagement prediction. For B2C copy, swap out the expert panel personas for ones calibrated to your B2C audience. The CMO and Skeptical Buyer personas can stay; replace the ROI-Focused CEO with a persona appropriate to your product category. The structural logic — generate variants, score against distinct perspectives, identify winners by average and spread — works for any copy optimization task.