Your CPL Is Down. Are You Buying the Right Leads? A Python + Claude Pipeline for ICP Scoring

Matt Danese

Demand gen leader. AI builder. 8+ years at Meta, Webflow, Medely, Octus, and Regal.ai.

Here is the most dangerous sentence in B2B marketing: "Our MQL volume is up 40% this quarter." It sounds good. It gets used in board slides. It is almost completely meaningless without one additional piece of information: what were you buying?

Most lead scoring models are proxies for engagement, not fit. A lead scores high because they opened four emails and downloaded a whitepaper. But nothing in that scoring model tells you whether they're a VP at a 500-person company in your ICP or a coordinator at a 10-person agency who found your content via Google. Both of them clicked. Only one of them will ever buy.

I built a scoring system to fix this. It doesn't replace engagement scoring — it runs alongside it, as a fit layer. And the results were not what I expected.

The four ICP buckets

Every lead that enters the system gets scored across three dimensions: Title/Level, Function, and Company Size. Each dimension gets a score from 0 to 3 based on how closely it matches our ICP definition. The composite score determines which of four buckets a lead lands in: True ICP, Near-Miss, Wrong Title, or Noise.

True ICP is what it sounds like — leads that match all three criteria. Near-Miss is leads that match two of three, typically the right function and size but a title that's one level below the decision-making threshold. Wrong Title is the largest bucket: companies that match on size but where the person who converted is clearly not the buyer (coordinator, intern, analyst at a non-ICP company). Noise is everything else.
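The bucketing logic above is simple enough to sketch in a few lines of Python. The cutoff of "2 or higher counts as a match" is my assumption for illustration; the PRD defines buckets by how many of the three dimensions a lead matches, not by a specific threshold.

```python
# Sketch of the four-bucket assignment. Treating a dimension score of >= 2
# as a "match" is an assumed cutoff, not the article's exact rule.

def assign_bucket(title_score: int, function_score: int, size_score: int) -> str:
    """Each dimension is scored 0-3 against the ICP definition."""
    scores = {"title": title_score, "function": function_score, "size": size_score}
    matches = {dim for dim, s in scores.items() if s >= 2}

    if len(matches) == 3:
        return "True ICP"
    if len(matches) == 2:
        # Typically right function and size, title one level too junior.
        return "Near-Miss"
    if "size" in matches and "title" not in matches:
        # Right-sized company, clearly not the buyer.
        return "Wrong Title"
    return "Noise"

print(assign_bucket(3, 3, 3))  # True ICP
print(assign_bucket(1, 3, 3))  # Near-Miss
print(assign_bucket(0, 1, 3))  # Wrong Title
```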

The critical design decision was how to define these buckets. I didn't want a scoring model I had to maintain manually — one where I'm updating "acceptable titles" in a spreadsheet every quarter. I wanted a system that could reason about fit. So I passed the scoring criteria and the raw lead data to Claude and asked it to score each record with a brief justification for why it was placed in its bucket.
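Here is roughly what that Claude call looks like using the Anthropic Python SDK. The model name, prompt wording, and batch shape are my assumptions for illustration; the actual scoring prompt ships with the PRD.

```python
import json

# Prompt wording below is a hypothetical stand-in for the PRD's scoring prompt.
SCORING_PROMPT = """You are scoring B2B leads against our ICP.
Score each lead 0-3 on Title/Level, Function, and Company Size, assign a
bucket (True ICP, Near-Miss, Wrong Title, or Noise), and give a one-sentence
justification for the bucket. Respond with a JSON array only.

ICP definition:
{icp_definition}

Leads:
{leads_json}"""


def build_scoring_prompt(icp_definition: str, leads: list[dict]) -> str:
    return SCORING_PROMPT.format(
        icp_definition=icp_definition, leads_json=json.dumps(leads)
    )


def score_leads(icp_definition: str, leads: list[dict]) -> list[dict]:
    import anthropic  # deferred so the prompt helper works without the SDK

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # model choice is an assumption
        max_tokens=4096,
        messages=[{"role": "user", "content": build_scoring_prompt(icp_definition, leads)}],
    )
    return json.loads(response.content[0].text)
```

Asking for the justification alongside the score matters: it is what lets you spot-check the model's reasoning instead of trusting a bare number.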

34% of "high-scoring" MQLs in one campaign turned out to be Wrong Title or Noise when run through ICP scoring — inflating CPL by more than 2x on a true-ICP basis.

What the data actually showed

The first time I ran this system against a live campaign dataset, the results were uncomfortable. One campaign that had been running for six weeks and had a CPL our team was proud of — comfortably below target — had a True ICP rate of 41%. That means 59% of the leads we'd been counting as MQLs were Near-Miss or worse. When I recalculated CPL using only True ICP leads, it was 2.3x the reported number and well above our target.
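The recalculation itself is one division once leads are bucketed: true-ICP CPL is just spend divided by True ICP count, so the inflation factor equals 1 / (True ICP rate). The spend and lead counts below are made-up figures for illustration, not the campaign's actual numbers.

```python
def cpl_metrics(total_spend: float, leads_by_bucket: dict[str, int]) -> dict[str, float]:
    """Reported CPL vs. CPL counting only True ICP leads."""
    total_leads = sum(leads_by_bucket.values())
    true_icp = leads_by_bucket["True ICP"]
    return {
        "reported_cpl": total_spend / total_leads,
        "true_icp_cpl": total_spend / true_icp,
        "true_icp_rate": true_icp / total_leads,
    }

# Hypothetical campaign: $5,000 spend, 100 MQLs, 41% True ICP.
m = cpl_metrics(5000, {"True ICP": 41, "Near-Miss": 24, "Wrong Title": 25, "Noise": 10})
print(round(m["true_icp_cpl"] / m["reported_cpl"], 2))  # 2.44 == 1 / 0.41
```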

This wasn't a surprise to anyone who thought hard about it. The campaign in question was driving a lot of volume by targeting broad job function parameters on LinkedIn. Broad targeting generates cheap leads. Cheap leads are rarely the right leads. But nobody had quantified it before because we were measuring CPL against total MQL volume, not against ICP-qualified lead volume.

The fix wasn't complicated: narrow the targeting, accept higher CPL on the raw count, watch the True ICP CPL drop. But you can't make that decision without the data to support it, and most teams don't have it.

How to build this without a data team

The system I built accepts CSV exports from LinkedIn Ads, Google Ads, Marketo, and Salesforce. You don't need an engineering team. You need Python to parse and join the files, Claude to run the ICP scoring logic, and about four hours to set up the first time. After that, it's on-demand — pull the exports, run the script, get the report.
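The parse-and-join step is standard pandas. The sketch below uses inline CSV text in place of the real exports, and the column names and email join key are assumptions; your LinkedIn and Salesforce exports will use their own schemas.

```python
import io
import pandas as pd

# Inline stand-ins for the real exports; in practice you'd pass file paths
# to pd.read_csv. Columns and the email join key are assumed for illustration.
ads_csv = io.StringIO(
    "campaign,ad_set,lead_email,spend\n"
    "Broad-LI,AS1,a@x.com,120\n"
    "Broad-LI,AS1,b@y.com,95\n"
)
crm_csv = io.StringIO(
    "email,title,company_size\n"
    "a@x.com,VP Marketing,500\n"
)

ads = pd.read_csv(ads_csv)
crm = pd.read_csv(crm_csv)

# Left join keeps every ad-platform lead, even ones the CRM hasn't enriched yet.
leads = ads.merge(crm, left_on="lead_email", right_on="email", how="left")
print(len(leads))  # 2 rows: one enriched, one with missing CRM fields
```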

The output is an interactive HTML report that shows ICP distribution by campaign, by ad set, by creative, and by landing page. You can see exactly which campaigns are buying True ICP leads and which ones are inflating your MQL count. You can sort by True ICP CPL and finally make budget decisions that reflect reality, not vanity metrics.
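The per-campaign rollup behind that report is a groupby over the scored leads. Everything below (column names, the spend attribution, the sample data) is a hypothetical sketch of that aggregation, not the report's actual code.

```python
import pandas as pd

# Hypothetical scored output: one row per lead with its bucket and
# attributed spend.
scored = pd.DataFrame({
    "campaign": ["A", "A", "A", "B", "B"],
    "bucket": ["True ICP", "Noise", "True ICP", "Wrong Title", "True ICP"],
    "spend": [50, 50, 50, 80, 80],
})

by_campaign = scored.groupby("campaign").agg(
    total_spend=("spend", "sum"),
    true_icp=("bucket", lambda b: (b == "True ICP").sum()),
)
by_campaign["true_icp_cpl"] = by_campaign["total_spend"] / by_campaign["true_icp"]

# Campaigns ranked by what a True ICP lead actually costs.
print(by_campaign.sort_values("true_icp_cpl"))
```

Campaign A looks twice as expensive as B on raw spend per lead until you divide by True ICP count, which is exactly the reversal the article describes.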

The full PRD, including the Python processing logic, the Claude ICP scoring prompt, and the report template, is available to newsletter subscribers.

Want the full system PRD?

Subscribe to The Demand Engine(er) — free — and get instant access to all 5 system PRDs.

Get the PRDs →