Web scraping prevention

On a content site in 2026, the bot population is dominated by AI crawlers, scraper-as-a-service operators, and AI agents acting on behalf of users, which makes scraping defense most of bot defense rather than the small slice it used to be. The hard part is keeping legitimate search and discovery working while making mass harvest uneconomic, and that turns on three practical questions: what robots.txt actually enforces, which defensive layers hold up against current scrapers, and how the architecture distinguishes friendly bots from harvesters.

For specific scraping cases, see price scraping and datacenter proxy detection. For framework-specific detection detail, see Puppeteer, Playwright, and anti-detect browser detection.

The shape of the problem

A few statistics that set the frame for 2026:

More than half of all internet traffic is automated (Imperva 2025 Bad Bot Report).
AI training and live-retrieval crawlers account for a fast-growing share of that automation. Cloudflare’s managed robots.txt feature has been turned on by more than 2.5 million sites to disallow AI training, which gives a sense of how widely the problem is felt (Cloudflare: Declaring your AIndependence).
1 in every 18 requests using a recognized AI crawler User-Agent is fake, according to HUMAN’s Satori threat intelligence team. Attackers impersonate ChatGPT-User, MistralAI-User, Perplexity-User and similar to dress up unauthorized scraping as legitimate AI traffic.

How well sites are actually protected against modern bots

Fully protected: 2.8%
Partially protected: 36%
Failed every test: 61%

Source: DataDome 2025 Global Bot Security Report (17,000 sites tested)

The share of fully-protected sites dropped from 8.4% in 2024 to 2.8% in 2025 as AI-powered scrapers outpaced static defenses. 61% of sites failed every test, leaving them vulnerable to both basic bots and modern AI-powered threats.

The combination means that “block scraping” is no longer a single policy. The defense has to handle:

Verified AI crawlers that you may want to allow under specific terms.
Unverified scrapers impersonating AI crawlers.
Headless browser farms harvesting content for resale.
AI agents acting on behalf of authorized users.
Search engine crawlers that you want to allow.
Adversarial scrapers using residential proxies and anti-detect browsers.

Each gets a different policy.

What robots.txt actually does

Robots.txt is a 1994 convention, formalized as RFC 9309 in 2022. In principle, a crawler reads /robots.txt, parses the Disallow rules for its User-Agent, and respects them. In practice, compliance is voluntary, the protocol has no enforcement mechanism, and roughly half of AI crawler traffic ignores it in 2026.

For the bots that do respect it, robots.txt is still useful. Googlebot, Bingbot, Applebot, and the major AI crawlers from OpenAI, Anthropic, Perplexity and Google all honor it when configured correctly.

The right posture for a content site:

Maintain a robots.txt that explicitly addresses the major AI crawlers by name (GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Google-Extended, Applebot-Extended, Amazonbot, CCBot, Bytespider).
Distinguish training from live retrieval. Disallowing GPTBot blocks training data collection but allows ChatGPT-User to retrieve articles when users ask ChatGPT about them; that distinction matters for many sites’ visibility strategy.
Treat robots.txt as policy, not enforcement. Enforce separately at the request layer.
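As a rough illustration of the training versus live-retrieval split, the sketch below renders a robots.txt from a small policy map. The `CrawlerPolicy` type and `renderRobotsTxt` function are hypothetical names invented for this example, and the choice of which crawlers to allow is only one possible policy.

```typescript
// Hypothetical policy shape: user-agent token -> whether crawling is allowed.
type CrawlerPolicy = Record<string, "allow" | "disallow">;

// Render one robots.txt group per named crawler, then a wildcard group.
function renderRobotsTxt(policy: CrawlerPolicy): string {
  const groups = Object.entries(policy).map(([agent, rule]) =>
    // "Disallow: /" covers the whole site; an empty "Disallow:" blocks nothing.
    [`User-agent: ${agent}`, rule === "disallow" ? "Disallow: /" : "Disallow:"].join("\n")
  );
  // Crawlers not named above fall through to the wildcard group.
  groups.push(["User-agent: *", "Disallow:"].join("\n"));
  return groups.join("\n\n") + "\n";
}

// Example policy: opt out of training crawlers, keep live-retrieval agents.
const robotsTxt = renderRobotsTxt({
  GPTBot: "disallow",
  ClaudeBot: "disallow",
  "Google-Extended": "disallow",
  CCBot: "disallow",
  "ChatGPT-User": "allow",
  "OAI-SearchBot": "allow",
  "Claude-User": "allow",
  PerplexityBot: "allow",
});
```

The output is still only a declaration of intent; a crawler that ignores robots.txt is unaffected by it.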

Cloudflare’s managed robots.txt feature does the maintenance automatically and adds enforcement on top via their AI Crawl Control product (Cloudflare: AI Crawl Control). For non-Cloudflare deployments, the maintenance is manual and the enforcement is your own.

llms.txt and the policy file ecosystem

Several proposals have emerged for richer policy files than robots.txt:

llms.txt. A proposed format for declaring how a site wants to be used by LLMs specifically. As of April 2026, publishing llms.txt is largely symbolic; very few crawlers implement support for parsing it.
ai.txt. A similar proposal focused on AI training opt-out.
/.well-known/agent.json. A proposal from the agentic-AI community for declaring how a site exposes capabilities to agents. Restricting access to this file is sometimes recommended as a way to make a site less discoverable to autonomous agents.

The mature posture is to treat these as forward-looking policy declarations rather than active controls. They communicate intent, but none of them will stop a determined scraper.

What enforcement actually looks like

Defending “public” pages is worth the effort even before any data-value argument, because scraping costs the victim real infrastructure money: bandwidth served to harvesters, rendering and pricing compute burned on synthetic requests, and CDN caches polluted by enumeration patterns no real user produces. For some content sites, scraper traffic is the single largest line item in serving costs.

Enforcement comes in five layers, ordered roughly from cheapest and broadest to most expensive and most targeted.

Layer 1: Cheap network signals

Block obvious automation at the network edge:

Datacenter and known proxy IPs for content endpoints that should be served only to consumer users. Unverified datacenter IPs hitting your homepage are almost certainly scrapers; on your search endpoint, they fall somewhere between scrapers and SEO tools. The right block-vs-throttle decision is per route.
Known abuse ASNs. A handful of hosting providers concentrate disproportionate scraping traffic.
TLS fingerprint signatures matching known scraping libraries (tls-client, curl-impersonate, requests/aiohttp builds).
HTTP/2 SETTINGS frames inconsistent with the claimed User-Agent.

This layer catches the bottom 40 to 60% of scraping traffic at almost no cost.
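A minimal sketch of the per-route datacenter check, assuming IPv4 only. The ranges below are RFC 5737 documentation blocks standing in for a real datacenter IP feed, and `edgeAction` and the route table are hypothetical names for illustration.

```typescript
type EdgeAction = "allow" | "throttle" | "block";

// Parse a dotted-quad IPv4 address into a non-negative integer.
function ipv4ToInt(ip: string): number {
  const parts = ip.split(".");
  const valid = parts.length === 4 && parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
  if (!valid) throw new Error(`invalid IPv4 address: ${ip}`);
  return parts.reduce((acc, p) => acc * 256 + Number(p), 0);
}

// True when `ip` falls inside the CIDR block, e.g. "192.0.2.0/24".
function inCidr(ip: string, cidr: string): boolean {
  const slash = cidr.indexOf("/");
  const bits = Number(cidr.slice(slash + 1));
  if (slash < 0 || !Number.isInteger(bits) || bits < 0 || bits > 32) {
    throw new Error(`invalid CIDR block: ${cidr}`);
  }
  const blockSize = 2 ** (32 - bits);
  const base = ipv4ToInt(cidr.slice(0, slash));
  return Math.floor(ipv4ToInt(ip) / blockSize) === Math.floor(base / blockSize);
}

// Placeholder ranges (documentation blocks), not a real datacenter list.
const DATACENTER_RANGES = ["192.0.2.0/24", "198.51.100.0/24"];

// Hypothetical per-route actions for unverified datacenter traffic.
const DATACENTER_ACTIONS = new Map<string, EdgeAction>([
  ["/", "block"],
  ["/search", "throttle"],
]);

// Verified friendly bots should be exempted before this check runs.
function edgeAction(ip: string, route: string): EdgeAction {
  const fromDatacenter = DATACENTER_RANGES.some((range) => inCidr(ip, range));
  if (!fromDatacenter) return "allow";
  return DATACENTER_ACTIONS.get(route) ?? "throttle";
}
```

Keeping the action table keyed by route is what allows the same IP signal to block on one endpoint and only throttle on another.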

Layer 2: User-Agent verification for claimed-friendly bots

For requests claiming to be Googlebot, GPTBot, ClaudeBot, etc., verify the claim:

Reverse-DNS verify against the operator’s published domain.
Forward-resolve and confirm the IP matches.
Or, where supported, check the source IP against the operator’s published IP-range JSON.
Or, where the operator participates in Web Bot Auth, verify the cryptographic signature.

If verification succeeds, apply the per-bot policy (often “allow, but rate-limit”). If verification fails, treat the request as an unverified scraper impersonating a friendly bot. HUMAN’s data showing 1-in-18 fake AI crawler requests is the reason this verification step is non-optional.
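The reverse-then-forward DNS check can be sketched roughly as follows, assuming IPv4 and Node's `node:dns/promises` module. The `Resolver` interface and `verifyCrawler` function are hypothetical names for this example, and the domain map is illustrative: confirm each operator's currently published verification domains before relying on it.

```typescript
import { reverse, resolve4 } from "node:dns/promises";

// DNS lookups are injectable so the check can be tested without network access.
interface Resolver {
  reverse(ip: string): Promise<string[]>;
  resolve4(hostname: string): Promise<string[]>;
}

const systemResolver: Resolver = {
  reverse: (ip) => reverse(ip),
  resolve4: (hostname) => resolve4(hostname),
};

// Illustrative map of claimed bot -> domains its reverse DNS should end in.
const CRAWLER_DOMAINS = new Map<string, string[]>([
  ["Googlebot", ["googlebot.com", "google.com"]],
  ["bingbot", ["search.msn.com"]],
]);

// Forward-confirmed reverse DNS: the PTR name must belong to the operator,
// and that name must resolve back to the connecting IP.
async function verifyCrawler(
  claimedBot: string,
  ip: string,
  dns: Resolver = systemResolver
): Promise<boolean> {
  const domains = CRAWLER_DOMAINS.get(claimedBot);
  if (!domains) return false;
  try {
    for (const hostname of await dns.reverse(ip)) {
      const host = hostname.toLowerCase().replace(/\.$/, "");
      const ownedByOperator = domains.some((d) => host === d || host.endsWith(`.${d}`));
      if (!ownedByOperator) continue;
      const addresses = await dns.resolve4(host);
      if (addresses.includes(ip)) return true;
    }
  } catch {
    // Lookup failures (no PTR record, timeout) count as unverified.
  }
  return false;
}
```

The forward lookup is what makes the check meaningful: anyone who controls the reverse zone for an IP can publish a PTR record naming any domain, but only the operator can make that hostname resolve back to the same address. Results are worth caching per IP so the lookups stay off the hot path.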

Layer 3: Browser-level signals

For requests that pass the network layer, run the JavaScript SDK and collect:

Fingerprint signals for headless browser detection of the major frameworks (Puppeteer, Playwright, Selenium).
Anti-detect browser detection.
Behavioral snapshot: time-on-page, mouse movement, scroll events.

Sessions that produce no behavioral signal are almost always automation regardless of how clean the fingerprint looks.

Layer 4: Rate and pattern controls

Per-device velocity limits that survive IP rotation. The operator can rotate IPs but the device fingerprint persists within a session.
Per-session resource limits. Total page requests, total bytes served, time on site. A session that fetches 1,000 pages in five minutes is harvesting.
Per-visitor limits. The visitor fingerprint catches an operator running many sessions in parallel from the same machine.
Sequence pattern detection. A scraper that paginates through search results, product listings, or article archives has a recognizable sequential pattern that real users do not produce at scale.
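Two of these controls can be sketched in a few lines: a sliding-window limiter keyed by device fingerprint rather than IP, and a detector for sequential pagination. Both `FingerprintRateLimiter` and `longestSequentialRun` are hypothetical names; this in-memory version is for illustration only, and a production deployment would likely need a shared store with expiry.

```typescript
// Sliding-window request counter keyed by device fingerprint, so the count
// survives IP rotation as long as the fingerprint stays the same.
class FingerprintRateLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Records one request and returns true when the fingerprint is over its limit.
  isOverLimit(fingerprintId: string, nowMs: number = Date.now()): boolean {
    const cutoff = nowMs - this.windowMs;
    const recent = (this.hits.get(fingerprintId) ?? []).filter((t) => t > cutoff);
    recent.push(nowMs);
    this.hits.set(fingerprintId, recent);
    return recent.length > this.limit;
  }
}

// Length of the longest run of consecutive page numbers (n, n+1, n+2, ...)
// in the order the session requested them.
function longestSequentialRun(pages: number[]): number {
  let best = 0;
  let current = 0;
  let previous: number | undefined;
  for (const page of pages) {
    current = previous !== undefined && page === previous + 1 ? current + 1 : 1;
    best = Math.max(best, current);
    previous = page;
  }
  return best;
}
```

A long sequential run is a signal rather than a verdict on its own; the threshold that separates a thorough reader from a paginating scraper depends on the route.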

Layer 5: Content-level tactics

For the operators who get past layers 1 through 4 (typically the most sophisticated ones running headed Chrome under Xvfb on residential proxies with anti-detect browsers and well-crafted behavioral emulation):

Honeypot links invisible to humans (display: none, visibility: hidden, off-screen positioning). A request that fetches one is automation. Cloudflare’s AI Labyrinth product builds a maze of these and traps scrapers in endless loops (Cloudflare: Block AI crawlers).
Watermarking and canary content. Insert per-session unique tokens in content so scraped corpora can be traced back to source sessions.
Stale or poisoned content. Serve cached, outdated, or deliberately incorrect content to suspected scrapers. The detection signal you give up is the cost; the impact on the scraping operation’s data quality is the benefit.
Proof-of-work challenges on suspected scraping sessions. Adds computational cost to every page fetch.
API-first strategy for valuable data. Serve the structured data via an authenticated API. Charge for access. The structured API is easier to defend than HTML.
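Two of the tactics above lend themselves to a short sketch: per-session canary tokens and a hashcash-style proof-of-work check. This assumes Node's `node:crypto` module; the function names, the token length, and the challenge format are illustrative choices rather than a reference to any particular product.

```typescript
import { createHash, createHmac } from "node:crypto";

// Canary: a short per-session token derived with HMAC. The server can
// recompute it for any known session ID to attribute leaked content,
// while the token reveals nothing about the secret.
function canaryToken(sessionId: string, secret: string): string {
  return createHmac("sha256", secret).update(sessionId).digest("hex").slice(0, 12);
}

// Proof of work: the client must find a nonce such that
// SHA-256("<challenge>:<nonce>") starts with `difficulty` zero hex digits.
// Verification costs the server a single hash.
function isValidProof(challenge: string, nonce: string, difficulty: number): boolean {
  const digest = createHash("sha256").update(`${challenge}:${nonce}`).digest("hex");
  return digest.startsWith("0".repeat(difficulty));
}

// Reference solver showing the work a client would have to do.
// Each extra hex digit of difficulty multiplies the expected work by 16.
function solveProof(challenge: string, difficulty: number): string {
  for (let n = 0; ; n++) {
    const nonce = String(n);
    if (isValidProof(challenge, nonce, difficulty)) return nonce;
  }
}
```

The asymmetry is the point of the proof-of-work design: one hash to verify against thousands to solve is negligible for a single human page view and adds up quickly across a million-page harvest. Challenges should be bound to the session and expire so that solved proofs cannot be replayed.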

Letting legitimate crawlers through

A scraping defense that catches Googlebot or breaks AI-assisted search will damage the business it is supposed to protect. The mature approach is to allow-list verified friendly bots explicitly:

Maintain a list of friendly bot identifiers (User-Agent strings, IP ranges, signed-agent identities).
Verify each request claiming to be on the list.
Apply per-bot policy: rate limit, but otherwise allow.
Log the verified-bot traffic separately so anomalies are visible.

The list to maintain at minimum:

Search: Googlebot, Bingbot, Applebot, DuckDuckBot, YandexBot.
AI search: ChatGPT-User, Claude-User, Perplexity-User, OAI-SearchBot, Claude-SearchBot.
AI training (decide per site): GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot, Amazonbot.
Link previews: Twitterbot, Slackbot-LinkExpanding, LinkedInBot, Discordbot, WhatsApp, iMessage.
Internal monitoring: Pingdom, Datadog Synthetics, UptimeRobot, your own first-party agents.

Cloudflare maintains a curated managed list and updates it as new bots are documented or as bots change behavior. For non-Cloudflare deployments, the No Hacks AI User-Agent Landscape reference is a useful starting point (No Hacks: The AI User-Agent Landscape in 2026).

AI agents on behalf of users

This is the category that has changed most in 2026. A user instructs their AI agent (Operator, Computer Use, Browser Use) to summarize an article, fill in a form, or buy a product. The agent shows up at your origin with the user’s session cookies but is driving Chromium programmatically.

The right framing is that this is authorized automation. The user wants the agent to act on their behalf. Blocking the agent is blocking the user.

The policy that works:

For authenticated sessions where the user has explicitly attached an agent (per the agentic identity model), allow the agent within the user’s normal rate limits.
For authenticated sessions where an agent is detected but not pre-authorized, allow with logging. Optionally, require step-up for high-trust actions.
For unauthenticated agentic traffic, treat as anonymous automation. Apply the normal scraper defense stack.
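The three rules above reduce to a small decision function. The sketch below is one possible encoding; the `AgentRequest` fields and action names are hypothetical, and it treats the optional step-up for high-trust actions as enabled.

```typescript
type AgentAction = "allow" | "allow_and_log" | "step_up" | "scraper_defense";

// Hypothetical inputs describing a request where an agent has been detected.
interface AgentRequest {
  authenticated: boolean;      // a logged-in user session is present
  agentPreAuthorized: boolean; // the user explicitly attached this agent
  highTrustAction: boolean;    // e.g. a payment or credential change
}

function agentPolicy(req: AgentRequest): AgentAction {
  // Unauthenticated agentic traffic is anonymous automation.
  if (!req.authenticated) return "scraper_defense";
  // Pre-authorized agents act within the user's normal rate limits.
  if (req.agentPreAuthorized) return "allow";
  // Detected but not pre-authorized: log, and step up for sensitive actions.
  return req.highTrustAction ? "step_up" : "allow_and_log";
}
```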

We cover this in detail in AI agent detection.

Per-route defense is the right granularity

A site-wide scraping policy is too coarse. Different routes have different threat models and different friction tolerances.

Homepage and product listings. Open to crawl. Friendly bots welcome. Anonymous bots throttled.
Article pages. Open to crawl. Friendly AI bots welcome under whatever your AI training policy is. Throttle aggressive paginators.
Search results. Throttle all bots aggressively. Real users do not paginate through 200 pages of search results.
Pricing API or product detail JSON. Strict authentication. Heavy throttling per session.
Account-related routes. Block all unidentified automation.
Checkout. Block all automation.

The Foil approach is to ship signals (verified bot, named framework, signed agent, anonymous automation, anti-detect browser, real human) and let the application enforce per-route policy. Static site-wide rules will miss the granularity that matters.
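One way to express the route list as application policy is a function from route class and classification signal to an action. This is a sketch of one reading of the list above, not Foil's API: the type names, signal identifiers, and the decision to treat signed agents as identified automation are all assumptions made for illustration.

```typescript
type Signal =
  | "verified_bot"
  | "named_framework"
  | "signed_agent"
  | "anonymous_automation"
  | "anti_detect_browser"
  | "real_human";

type RouteClass = "listing" | "article" | "search" | "pricing_api" | "account" | "checkout";
type RouteAction = "allow" | "throttle" | "require_auth" | "block";

function routeAction(route: RouteClass, signal: Signal): RouteAction {
  const automated = signal !== "real_human";
  // Assumption: verified bots and signed agents count as identified automation.
  const identified = signal === "verified_bot" || signal === "signed_agent";

  switch (route) {
    case "listing":
    case "article":
      // Open to crawl: humans and identified bots pass, the rest are throttled.
      return !automated || identified ? "allow" : "throttle";
    case "search":
      // Every bot is throttled here, friendly or not.
      return automated ? "throttle" : "allow";
    case "pricing_api":
      // Authentication is required for everyone; per-session throttling follows it.
      return "require_auth";
    case "account":
      return automated && !identified ? "block" : "allow";
    case "checkout":
      return automated ? "block" : "allow";
  }
}
```

Because the function is pure, the whole policy can be covered by a table of unit tests and changed without touching the detection layer.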

How Foil supports it

Foil’s role in scraping defense is the classification layer that lets the application apply per-route policy. The SDK collects fingerprint, network and behavioral signals; the decision carries a bot-or-human verdict, attribution labels that name the framework and the agent product behind the session, and a visitor fingerprint that identifies the same operator across rotating proxies and sessions.

A typical scraping-route middleware:

import express from "express";
import { Foil, safeVerifyFoilToken } from "@abxy/foil-server";

// Assumes an Express app. rateLimit and serveCachedOrThrottled are
// application-provided helpers (not shown); rateLimit.check(key, limit) is
// assumed to resolve to true once the key has exceeded its limit.
const app = express();
const client = new Foil({ secretKey: process.env.FOIL_SECRET_KEY });

app.get("/api/products/:id", async (req, res, next) => {
  try {
    // Verified search crawlers are allow-listed upstream (layer 2)
    const result = safeVerifyFoilToken(req.headers["x-foil-token"], process.env.FOIL_SECRET_KEY);
    // A missing or invalid token fails open here; tighten this on sensitive routes
    if (!result.ok) return next();

    const { decision, visitor_fingerprint, session_id } = result.data;
    if (decision.verdict !== "bot") {
      return next(); // humans and inconclusive sessions, allow
    }

    // Fetch attribution labels naming the framework or agent product
    const session = await client.sessions.get(session_id);
    const labels = session.attribution.labels;

    const isIdentifiedAgent = labels.some((l) => l.kind === "product" || l.kind === "provider");
    if (isIdentifiedAgent && req.session?.userId) {
      return next(); // authenticated user with attached agent, allow
    }

    // Named framework or unidentified automation: rate-limit by device, not IP
    if (visitor_fingerprint && (await rateLimit.check(visitor_fingerprint.id, 60))) {
      return serveCachedOrThrottled(req, res);
    }

    return next();
  } catch (err) {
    // Express 4 does not forward rejected promises from async handlers
    return next(err);
  }
});

The decision provides the evidence, and the policy itself lives in application code where it belongs.

For the price-scraping-specific patterns, see price scraping. For the network-layer detail, datacenter proxy detection.