
Cited-paragraph anatomy: which 30% of your page the AI Overview composer actually extracts in 2026

Meta
Published: May 14, 2026
Author: J. Ho
Reading time: 8 min
Tag: #recent

**TL;DR** — Across 22 client sites in April 2026 we audited which specific paragraph of a cited page actually ends up extracted into the Google AI Overview answer card. The surprise: even on long pages, 94% of citations matched cleanly to a single paragraph, not a page-level synthesis — and three structural patterns explained almost everything the composer pulled. Tight claim-and-evidence prose inside the first 200 words of the page was extracted 58% of the time. The first paragraph immediately after an H2 whose text closely mirrored the implicit query was extracted 27% of the time. Compact 3–5 row tables answering a comparison query were extracted 9% of the time. Everything else combined — FAQ blocks, bulleted lists, image captions, conclusion paragraphs — accounted for less than 6%. Most teams write the cited paragraph by accident. The audit is about making it deliberate.

Why we ran the audit

Through 2025 we treated "the page is cited" as the unit of measurement, and treated the page as the unit of editing. By Q1 2026 the abstraction was too coarse to be useful. Two pages can both be cited on the same query and produce wildly different downstream CTR, citation duration and brand impact — partly because of the link-label and answer-completeness drivers we covered last week, but also because the composer is not extracting from the same part of both pages. One is being quoted from its first paragraph and rendered above the fold of the answer card; the other is being quoted from a sentence three sections in and rendered as a footnote at the bottom. Knowing which paragraph the composer is actually pulling lets you edit that paragraph, instead of rewriting the whole page and hoping something sticks.

There is a second motivation. Client-side editorial teams cannot read the composer's mind, but they can read a 30-day report that says "this paragraph from this URL was extracted N times" and use it as an editing checklist for next month's iteration. Cited-paragraph identification is the cheapest piece of intelligence we have added to an AI-search dashboard this year, and it has changed how the editorial team writes more than any other single number we have shipped to clients in 2026 — including the click-through and decay metrics that get more discussion.

How we ran the measurement

22 client sites — 8 SaaS, 7 publisher, 4 DTC, 3 B2B services — across April 2026. For every AI Overview citation captured by our standing 60-query basket per client, we did a substring-and-paraphrase match between the AI Overview prose and the cited source page, using a 30-token sliding window with a sentence-embedding fallback for paraphrases. Each match was assigned to the highest-scoring source paragraph or section. We then classified the source paragraph into one of six structural types: opening prose, post-H2 lead, in-body claim, bulleted list, table row, or FAQ answer. Pages where no paragraph matched at the 0.75 cosine-similarity threshold were re-reviewed by hand; around 3% of citations were genuinely synthesised across multiple paragraphs and we counted those separately rather than forcing them into the single-paragraph distribution.
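
To make the matching step concrete, here is a minimal sketch in Python. It collapses the exact-substring pass and the paraphrase fallback into a single embedding comparison over 30-token windows; only the window size and the 0.75 threshold come from the description above, while the model name and the helper itself are illustrative assumptions rather than our production pipeline.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model; swap in whatever your stack already uses.
model = SentenceTransformer("all-MiniLM-L6-v2")

def best_source_paragraph(overview_text: str, paragraphs: list[str],
                          window: int = 30, threshold: float = 0.75):
    """Return (paragraph_index, score) for the most likely extraction source,
    or None when nothing clears the threshold (the hand-review bucket)."""
    tokens = overview_text.split()
    # Slide a 30-token window over the AI Overview prose so that a single
    # extracted sentence still scores highly against its source paragraph.
    windows = [" ".join(tokens[i:i + window])
               for i in range(max(1, len(tokens) - window + 1))]
    sims = util.cos_sim(model.encode(windows, convert_to_tensor=True),
                        model.encode(paragraphs, convert_to_tensor=True))
    per_paragraph = sims.max(dim=0).values   # best window score per paragraph
    best_idx = int(per_paragraph.argmax())
    best_score = float(per_paragraph[best_idx])
    return (best_idx, best_score) if best_score >= threshold else None
```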

Two normalisation moves matter. We dropped citations where the cited URL was a programmatic listing or category page — the composer in those cases extracts page-level metadata rather than any single paragraph, and the classification breaks down. We also excluded any extraction where the matched paragraph had been edited in the prior 14 days, because the audit measures steady-state composer behaviour, not the transient post-edit period where the composer occasionally re-extracts from the old cached version for a week before the new version takes over. The reported numbers below are for the steady-state population, which is what your editorial team can actually shape.

The shape of the extraction distribution

The single-paragraph result was the genuine surprise. We expected a meaningful fraction of citations to be synthesised across two or three paragraphs — the composer reading the whole page and producing a paraphrase that wove together claims from several sections. In practice, 94% of extractions matched to one paragraph cleanly, with the second-best paragraph contributing essentially nothing. The composer is doing paragraph selection first and paraphrase-or-quote second, not page-level summarisation. That shifts the editing target dramatically: there is one paragraph on your cited page that matters, and the other 700–900 words exist mostly to convince the composer that the cited paragraph is trustworthy enough to extract from.

Three paragraph types dominate. Opening prose — the first 200 words of the page, regardless of heading structure — accounted for 58% of all extractions. Post-H2 leads — the first paragraph immediately following an H2 whose text closely mirrored the implicit query — accounted for 27%. Compact tables — 3–5 rows answering a comparison-style query — accounted for 9%. The remaining ~6% was split across bulleted lists (3.1%), FAQ answers (1.8%) and everything else. The implication is uncomfortable for a lot of 2024-era SEO advice: lists and FAQ blocks, despite years of guidance telling teams to write them for snippet capture, are rarely extracted from in the AI Overview era — they still serve some purpose for traditional SERP features, but they are not load-bearing for answer-card extraction.

Driver one: opening prose as the default extraction target

58% of citations come from somewhere inside the first 200 words of the page. The composer reads the lead first, decides whether it is a credible direct answer to the implicit query, and extracts there if it is. The mechanism mirrors what we saw on the CTR side last week: pages with tight claim-and-evidence leads get extracted from the lead and rendered above the fold of the answer card, while pages with generic introductory prose ("In this article, we will explore...") get extracted from later in the page, rendered lower in the answer card, and clicked less. The two patterns reinforce each other — lead-shaped paragraphs are both more likely to be the extraction target and more likely to convert when they are.

What "tight claim-and-evidence" actually looks like in practice: a single sentence stating the answer to the page's primary query, followed by two or three sentences of supporting evidence (a number, a date, a named source, a measured outcome), followed by one sentence framing the rest of the page for the reader who chooses to keep going. Total length: 60–90 words. This is the paragraph the composer extracts when the lead is doing its job, and the extraction rate from leads in this shape was 84% in our audit, versus 31% for pages whose lead was a longer narrative scene-setter. The editorial cost of converting an existing lead into this shape is roughly 20 minutes per page; the extraction-rate change shows up within a single re-crawl cycle.

Driver two: post-H2 paragraphs that match the query

The second-largest extraction source is the paragraph immediately following an H2 whose text closely mirrors the implicit query, accounting for 27% of citations. The mechanism is also straightforward: when the composer reads the page, it parses the H2 structure as a table of contents, and when an H2 matches the implicit query closely (semantic similarity above roughly 0.7 in our checks), the composer treats the paragraph directly below it as a query-targeted answer and extracts from there — even when the lead also contains a credible answer. The H2-driven extraction overrides the lead-driven one when both are present.

This is operationally important because most pages have multiple potential extraction surfaces. A page on "how to fix slow INP" might have a strong lead and also an H2 reading "what causes slow INP." Without the matching H2, the lead dominates. With the matching H2, the post-H2 paragraph can pull extraction away from the lead — and if the post-H2 paragraph is weaker prose than the lead, the citation lands on the weaker paragraph, which then renders as a weaker answer-card quote. We now audit every commercial-intent page for H2 matches against the primary query basket, and either tighten the post-H2 paragraph or remove the H2 entirely if the lead is the stronger extraction candidate. Both moves measurably improve extraction quality, and the second one is faster.
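
A sketch of that H2 audit is below, assuming BeautifulSoup for the parsing and the same sentence-embedding approach as the matching step. The ~0.7 similarity figure is the one from our checks above; the model, the selector logic and the output shape are illustrative assumptions.

```python
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model

def h2_extraction_candidates(html: str, query_basket: list[str],
                             threshold: float = 0.7) -> list[dict]:
    """Flag each H2 that closely matches a query in the basket, along with the
    post-H2 paragraph that becomes the likely extraction target."""
    soup = BeautifulSoup(html, "html.parser")
    h2s = soup.find_all("h2")
    if not h2s:
        return []
    h2_texts = [h2.get_text(" ", strip=True) for h2 in h2s]
    sims = util.cos_sim(model.encode(h2_texts, convert_to_tensor=True),
                        model.encode(query_basket, convert_to_tensor=True))
    candidates = []
    for i, h2 in enumerate(h2s):
        best = float(sims[i].max())
        if best >= threshold:
            next_p = h2.find_next("p")  # the paragraph to tighten, or the H2 to cut
            candidates.append({
                "h2": h2_texts[i],
                "similarity": round(best, 2),
                "post_h2_paragraph": next_p.get_text(" ", strip=True) if next_p else None,
            })
    return candidates
```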

Driver three: compact tables on comparison queries

Tables are a smaller share of total extractions (9%) but punch above their weight on the queries where they do get extracted — almost always comparison queries ("X vs Y", "best X for Y", "which X is best for Z"), where the composer treats the table as a structured answer it can paraphrase into a clean comparison sentence or, more often in 2026, render alongside its own mini-table inside the answer card. The tables that consistently get extracted share three properties: 3–5 rows (with fewer, the composer paraphrases the surrounding prose instead; with more, it extracts a single row and the citation feels arbitrary), a clear header on the leftmost column so the composer can align row labels with the query, and a prose sentence immediately above the table naming what the comparison is about.

Tables embedded inside tab-switchers, accordion panels, or any JavaScript widget were extracted at a fraction of the rate of plain `<table>` markup — even when the markup was technically present in the DOM after hydration. The composer is reading the server-rendered HTML, and an interactive widget that hides the table behind a click is functionally invisible to the extraction step. We dropped two patterns this quarter on the back of this finding: tabbed comparison widgets and accordion-style spec tables. The accessibility argument for them is small, the citation cost is large, and the plain `<table>` alternative renders perfectly well on every device the actual users care about.
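
A rough version of the table audit can run against the server-rendered HTML directly, as in the sketch below. The 3–5 row target is the one from the audit; the class-name heuristics for spotting tab and accordion wrappers are our own assumptions and will need adapting to each site's markup.

```python
from bs4 import BeautifulSoup

# Class-name fragments we treat as signs of an interactive wrapper (assumption).
WIDGET_HINTS = ("tab", "accordion", "collapse", "toggle")

def audit_tables(html: str) -> list[dict]:
    """Report, for each <table> in the server-rendered HTML, whether it sits in
    the 3-5 body-row range and whether an ancestor looks like a tab/accordion."""
    soup = BeautifulSoup(html, "html.parser")
    findings = []
    for table in soup.find_all("table"):
        body_rows = max(0, len(table.find_all("tr")) - 1)  # assume one header row
        ancestor_classes = [cls.lower()
                            for parent in table.parents
                            for cls in (parent.get("class") or [])]
        hidden = any(hint in cls for cls in ancestor_classes for hint in WIDGET_HINTS)
        findings.append({
            "body_rows": body_rows,
            "in_3_to_5_row_range": 3 <= body_rows <= 5,
            "inside_interactive_widget": hidden,
        })
    return findings
```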

What changed in our content checklist

Four additions. We now rewrite the first 200 words of every commercial-intent page into a tight claim-and-evidence answer, with the original narrative lead pushed down to the second paragraph. We audit every page's H2 structure against the primary query basket, tightening the post-H2 paragraph when the H2 matches a query and removing the H2 entirely when the lead is the stronger extraction candidate. We refactor every comparison table to 3–5 rows of plain `<table>` markup with a clear leftmost-column header and a prose sentence immediately above the table. And we run a monthly "extraction audit" — for every cited URL, we identify which paragraph the composer is currently extracting from, and we treat that paragraph as the explicit editorial target for next month's iteration.

We dropped one habit. Through 2024 and 2025 we coached writers to add rich, varied FAQ blocks at the bottom of every commercial-intent page on the assumption they would be picked up for snippets or answer cards. In our 2026 audit, FAQ blocks were the extraction source on less than 2% of citations. They still serve a purpose for some traditional SERP features and for accessibility, but they are not load-bearing for AI Overview extraction, and we no longer prioritise the editorial budget they used to consume. The hours go into the lead and the post-H2 paragraphs instead, where the extraction actually happens.

  • 01 · Audit which paragraph the composer actually extracts from for every cited URL. 94% of citations match cleanly to one paragraph; the rest of the page exists mostly to convince the composer that paragraph is trustworthy.
  • 02 · Rewrite the first 200 words of every commercial-intent page into a tight claim-and-evidence answer. Leads in this shape had an 84% extraction rate, versus 31% for narrative-style leads.
  • 03 · Match your H2s to your primary query basket and tighten the paragraph directly below each match. Post-H2 paragraphs account for 27% of all extractions and routinely override the lead.
  • 04 · Refactor comparison tables to 3–5 rows of plain `<table>` markup with a clear leftmost column. Tables hidden inside tab-switchers or accordions were extracted at a fraction of the plain-markup rate.

Where this argument breaks

For sites with fewer than about 30 cited URLs in a given month, the per-paragraph statistics are too sparse to attribute extraction patterns reliably, and the audit becomes a qualitative review of individual pages rather than a portfolio-level analysis. For programmatic sites where the entire body is templated, the extraction distribution is dominated by template-level choices rather than per-page editorial decisions, and the audit needs to target the template rather than the page. In Chinese-language search, 文心 (Baidu's ERNIE) and 通义 (Alibaba's Tongyi) treat structured data differently from Google — 文心 in particular extracts from tables at roughly twice the rate Google does, and the 9% headline figure does not transfer. Outside those carve-outs, knowing which paragraph the composer is actually extracting is the single most actionable piece of intelligence we have added to client dashboards in 2026, and most editorial teams are still operating as if every word on the page weighed the same.

Further reading
AI Overview citation click-through in 2026: when being cited actually produces a visit (May 13, 2026)
