
What the AI bots actually read: 30 days of GPTBot, ClaudeBot and PerplexityBot in our access logs

By J. Ho · Published Apr 29, 2026 · 8 min read · #GEO

**TL;DR** — We tailed 30 days of access logs across nine client sites in March-April 2026, isolating verified hits from GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot and Google-Extended. The five bots crawl very differently from Googlebot and from each other: PerplexityBot fires per-query in near-real time, ClaudeBot runs deep but sparse re-verification passes, and GPTBot does broad shallow sweeps. Three patterns dictate citation-readiness — sitemap completeness, internal-link reachability of fresh pages, and the absence of soft-403 walls on key URLs. Two of those we found broken on most audited sites.

Why we started reading the logs again

Most teams' AI-bot conversation in 2026 stops at "do we allow GPTBot in robots.txt?" That is the easy one-bit decision. The interesting question is what the bots actually fetch once you let them in, how often, and which URLs get re-fetched versus crawled once and forgotten. The answer is in the access logs, but most teams are looking at sampled CDN dashboards or Cloudflare bot reports, not the raw line-by-line. Sampled views miss the long-tail URL coverage problem entirely; the raw logs show it sharply.
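
For reference, "raw line-by-line" means parsing the log yourself rather than trusting a sampled view. A minimal sketch for nginx's default combined format (adjust the regex if your `log_format` differs):

```python
import re

# nginx "combined" log format: remote_addr - remote_user [time]
# "request" status bytes "referer" "user_agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def parse_lines(path):
    """Yield one dict per parseable log line; skip lines that don't match."""
    with open(path) as fh:
        for line in fh:
            m = LOG_RE.match(line)
            if m:
                yield m.groupdict()
```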

There is also a quieter motivation: the citation tools (Profound, Goodie, the open-source ones we run) tell you what answers you appear in, but they cannot tell you whether the underlying page was even fetched in the relevant window. A citation-tool gap and a crawl gap need different fixes, and you cannot tell them apart without the logs. Half the "we lost citations on Perplexity" tickets we triaged this quarter turned out to be PerplexityBot returning 403 from a Cloudflare rule the client did not know was on, not a content or competitive issue at all.

How we ran the audit

Nine client sites — three SaaS, three DTC, three publisher — full nginx / Cloudflare logs streamed through a small Athena schema. We verified bot identity with reverse DNS wherever the operator publishes a verification method. OpenAI, Anthropic and Perplexity all do; Google-Extended is not a separate crawler but a robots.txt token gating the same Googlebot infrastructure, so its reverse DNS resolves to googlebot.com regardless of training eligibility. User-Agent alone is not enough: about 8% of traffic claiming to be GPTBot reverse-resolved to non-OpenAI ranges, almost certainly LLM-scraping infrastructure impersonating the official bot. We dropped those rows from the analysis.
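
The verification itself is a reverse lookup plus a forward confirmation. A minimal sketch; the hostname suffixes below are illustrative, so check each operator's published verification docs before trusting them:

```python
import socket

# Illustrative suffixes -- confirm against each operator's own docs.
BOT_RDNS_SUFFIXES = {
    "GPTBot": (".openai.com",),
    "ClaudeBot": (".anthropic.com",),
    "PerplexityBot": (".perplexity.ai",),
    "Google-Extended": (".googlebot.com", ".google.com"),
}

def verify_bot_ip(ip: str, claimed_bot: str) -> bool:
    """Reverse-resolve the IP, check the hostname suffix, then
    forward-resolve the hostname and confirm it maps back to the IP."""
    suffixes = BOT_RDNS_SUFFIXES.get(claimed_bot)
    if not suffixes:
        return False
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)        # reverse lookup
    except OSError:
        return False                                     # no PTR record
    if not hostname.endswith(suffixes):
        return False
    try:
        _, _, addrs = socket.gethostbyname_ex(hostname)  # forward confirm
    except OSError:
        return False
    return ip in addrs
```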

The window choice matters. We picked 30 days because that is the shortest window in which ClaudeBot's re-verification cadence is fully observable; shorter windows under-count its activity and over-weight GPTBot. For a one-off site audit you can usually get away with 14 days for the GPTBot/PerplexityBot picture, but you will miss most of what ClaudeBot does. We now keep a rolling 90-day partition for clients who care about cross-engine citation work; the marginal storage cost is trivial and the analytical leverage is significant.

What the per-bot crawl pattern looks like

GPTBot did the broadest sweeps, averaging 19,000 distinct URLs per site per month with shallow re-fetch — most URLs hit only twice in 30 days. ClaudeBot was the opposite: roughly 3,200 distinct URLs per month, but the URLs it did fetch were re-fetched 8 to 12 times and consistently included structured-data endpoints, sitemap variants and llms.txt where it existed. PerplexityBot was the most surgical — about 1,400 distinct URLs per month, but with sub-minute re-fetch on URLs that had just been cited in a live answer. Google-Extended sat between GPTBot and ClaudeBot in volume, but its fetch pattern was indistinguishable from the regular Googlebot Smartphone profile, just with a different UA string.
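
Reproducing these per-bot numbers from your own verified rows is a small aggregation. A sketch; the row shape is an assumption, not a prescribed schema:

```python
from collections import defaultdict
from statistics import median

# rows: (bot, url, unix_ts) tuples that already passed the
# reverse-DNS verification in the sketch above.
def crawl_profile(rows):
    fetches = defaultdict(lambda: defaultdict(list))
    for bot, url, ts in rows:
        fetches[bot][url].append(ts)
    profile = {}
    for bot, urls in fetches.items():
        refetch_counts = [len(hits) for hits in urls.values()]
        intervals = []
        for hits in urls.values():
            hits.sort()
            intervals.extend(b - a for a, b in zip(hits, hits[1:]))
        profile[bot] = {
            "distinct_urls": len(urls),
            "median_fetches_per_url": median(refetch_counts),
            "median_refetch_interval_s": median(intervals) if intervals else None,
        }
    return profile
```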

The shape difference matters because it tells you what each bot is doing. GPTBot is building an offline corpus. ClaudeBot is maintaining a smaller curated index with re-verification. PerplexityBot is fetching live to compose this minute's answer. The optimisation move is different per bot, and not every site needs all three. A B2B SaaS site whose audience is on ChatGPT cares most about GPTBot coverage; a fast-publishing news site cares most about PerplexityBot reachability; a knowledge-base or docs site cares most about ClaudeBot re-verification windows. Treating "AI bots" as one bucket — the way most monitoring dashboards still do in 2026 — hides the only piece of information you actually need.

The three readiness patterns we keep finding broken

Sitemap completeness was broken first. Across the nine sites, 14% of URLs that ranked or were cited inside the audit window were not in any sitemap. The bots that lean on sitemaps (ClaudeBot, Google-Extended) consistently under-fetched those URLs; the bot that does broader spidering (GPTBot) caught them, but slowly. The fix is not exotic — regenerate sitemaps weekly, include `lastmod`, and split into shards under 50,000 URLs. Most sites we audit still have one monolithic sitemap stuck on a 2023 cron.

Internal-link reachability is the second. New posts published in the audit window took an average of 4.1 days to be fetched by ClaudeBot when they had at least 3 internal links from already-crawled pages, and 22.7 days when they had only the sitemap entry. The internal link is not optional — it is the load-bearing freshness signal for any bot that re-verifies rather than refreshing a full corpus.
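
To measure the first of those patterns on your own site, diff the URLs you want cited against everything reachable from the sitemap index. A stdlib sketch; `cited_urls` is whatever your rank or citation tracker exports:

```python
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_url: str) -> set[str]:
    """Collect <loc> entries, recursing into sitemap-index shards."""
    with urllib.request.urlopen(sitemap_url) as resp:
        root = ET.fromstring(resp.read())
    if root.tag.endswith("sitemapindex"):
        urls = set()
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls |= sitemap_urls(loc.text.strip())
        return urls
    return {loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)}

def sitemap_misses(cited_urls: set[str], sitemap_url: str) -> set[str]:
    """URLs that ranked or were cited but appear in no sitemap shard."""
    return cited_urls - sitemap_urls(sitemap_url)
```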

Soft-403 walls were the most painful and the easiest to fix. Cloudflare's "AI scraper challenge" rule and similar WAF features were silently 403-ing 6 to 11 percent of legitimate AI-bot traffic on three of the nine sites, despite the operators believing they had whitelisted the bots. The 403 returns to the bot, the bot drops the URL from its candidate set, and the page stops being citable. Auditing your WAF for AI-bot rules is the highest-leverage one-hour fix on most sites we look at; the test is straightforward — query your edge logs for `4xx` responses with bot-identified User-Agents and reverse-verified IPs, then walk the matched rule chain. We have not yet seen a site where that audit found nothing worth fixing.
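
In sketch form, the edge-log query is a filter over parsed rows, reusing `verify_bot_ip` from the verification sketch earlier in this post; anything it yields still has to be walked back to the matching WAF rule by hand:

```python
# rows: (ip, user_agent, status, url) tuples parsed from the edge logs,
# e.g. with the parser sketch at the top of this post.
BOT_UA_MARKERS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended")

def soft_403_candidates(rows):
    """Yield 4xx hits whose UA claims an AI bot and whose IP passes
    the reverse-DNS check (verify_bot_ip, defined above)."""
    for ip, ua, status, url in rows:
        bot = next((b for b in BOT_UA_MARKERS if b in ua), None)
        if bot and 400 <= int(status) < 500 and verify_bot_ip(ip, bot):
            yield bot, int(status), url
```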

What changes in the weekly review

We added an "AI-bot fetch coverage" row per content cluster: how many URLs in the cluster were fetched by each bot in the last 30 days, and the median re-fetch interval. The pattern we look for is "GPTBot covers it, ClaudeBot ignores it" — that is almost always a sitemap or internal-link problem, not a content problem. Where ClaudeBot covers it and GPTBot does not, the issue is usually canonical or `robots` meta-tag noise. The diagnostic is the bot, not the engine. We also flag any cluster where PerplexityBot has not fetched a single URL in 30 days; on a site where Perplexity citations matter, that is a four-alarm signal that needs investigating before the citations start dropping.
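
A sketch of that coverage row, assuming you keep per-bot fetch timestamps keyed by URL; the data shapes here are assumptions, not a prescribed schema:

```python
from statistics import median

BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot")

# clusters: {cluster_name: set of URLs in the cluster}
# fetches:  {(bot, url): sorted unix timestamps from the last 30 days}
def coverage_rows(clusters, fetches, bots=BOTS):
    for cluster, urls in clusters.items():
        row = {"cluster": cluster}
        for bot in bots:
            ts_lists = [fetches[(bot, u)] for u in urls if fetches.get((bot, u))]
            intervals = [b - a for ts in ts_lists for a, b in zip(ts, ts[1:])]
            row[bot] = {
                "urls_fetched": len(ts_lists),
                "median_refetch_s": median(intervals) if intervals else None,
            }
        # the four-alarm flag: PerplexityBot silent across the whole cluster
        row["perplexity_silent"] = row.get("PerplexityBot", {}).get("urls_fetched", 0) == 0
        yield row
```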

  1. Verify your sitemap is regenerated at least weekly and includes every URL you want cited. The sites we audit show a 14% miss rate; the fix takes one cron line.
  2. Make sure new posts get at least 3 internal links from already-crawled pages within 24 hours of publish. Without that, ClaudeBot and Google-Extended take ~22 days to discover them.
  3. Audit your WAF / Cloudflare rules for soft-403s of AI bots. We find this broken on roughly a third of sites despite the team believing it is fine.
  4. Reverse-DNS verify any "GPTBot" or "ClaudeBot" entries before treating them as real. About 8% of self-claimed GPTBot traffic in our sample was impersonators.

Where this argument breaks

For sites under ~5,000 pages, the long-tail sitemap problem is mostly self-correcting — GPTBot will sweep the whole site within a month regardless. For news and publisher sites with very high publish cadence, PerplexityBot's sub-minute re-fetch starts to dominate the conversation and the GPTBot/ClaudeBot patterns matter less. The Chinese-language picture is different again: Baidu's spider remains the load-bearer, and the AI-specific bots (Yuanbao 元宝, Wenxin 文心) crawl on patterns we are still mapping; we do not yet have enough sample size to publish on them. Treat the numbers above as a Western English-language SaaS / DTC / publisher baseline, not a universal rule.

