Automated traffic crossed 50% of all internet traffic in 2024 and has not come down since. The composition has also changed: a much larger share of bots now run inside real Chromium browsers driven by AI agents or stealth automation, rather than the simple HTTP scripts that dominated the bot landscape for the previous decade. This article is a working overview of what bot detection looks like in that environment, and what still works.

If you want the practical version, the implementation-focused companions are how to detect bot traffic, headless browser detection for the umbrella picture, the framework-specific posts on Selenium, Puppeteer and Playwright, and AI agent detection for the agentic-browser case.

A definition

Bot detection is the set of techniques a server uses to decide whether a request was initiated by software acting on its own behalf, rather than by a human acting through a browser. The detection answers a binary question (bot or not) and, in modern systems, also tries to classify the bot (which framework, which intent, verified or unverified, friendly or hostile).

That definition cuts cleanly across the use cases. A bot detector at the login route is trying to keep credential-stuffing scripts out. A bot detector at the search endpoint is trying to keep scrapers out. A bot detector at checkout is trying to keep card-testing out. The fingerprints, behavior signals, and policies differ, but the underlying question is the same.

The shape of the problem in 2026

Three numbers from the 2025 Imperva Bad Bot Report are worth carrying around:

  • 51% of web traffic is automated for the first time on record, including both good and bad bots (Imperva, 2025 Bad Bot Report PDF).
  • 37% of all web traffic is bad bots, up from ~30% in 2023.
  • 45% of bot attacks are now classified as “simple” rather than advanced, because the floor for launching one has collapsed thanks to AI-powered automation tooling.

Figure: 2024 internet traffic composition. Automated 51% (bad bots 37%; good bots such as verified crawlers and monitors 14%), human 49%.

Sixth consecutive year of bad-bot growth. The 51% headline is the first time automated traffic outweighed human traffic on the public internet.

The composition story matters more than the headline number. The growth is coming not from old-school content scrapers running Python’s requests but from:

  • AI agents using a real browser via the Chrome DevTools Protocol (Browser Use, Anthropic Computer Use, OpenAI Operator).
  • Stealth-modded Puppeteer and Playwright running headed Chrome under Xvfb on Linux servers.
  • Anti-detect browsers like Multilogin, Octo, Dolphin and AdsPower, used to run many “real” browser profiles in parallel against a single target.
  • Residential and mobile proxy networks that put bot traffic behind genuine consumer IP addresses.

Every one of those is harder to detect than a 2018-era Selenium script.

Why user-agent rules and rate limits miss this

The default homegrown bot defense is some combination of these:

  • Block known-bad user agents.
  • Block requests with no User-Agent header.
  • Rate-limit IP addresses with too many requests per minute.
  • Geofence countries you do not serve.
  • Reject requests without a referer.

These rules are not useless. They catch the easiest 30 to 40% of bot traffic and they cost almost nothing. They miss everything sophisticated, for predictable reasons.

User-Agent is freely set. Any HTTP client can send any User-Agent string. Setting Chrome’s UA on a Python script takes one line of code.

No-UA is a 2010 defense. Modern stealth tooling sets a perfectly plausible Chrome UA by default. There is nothing left to block.

Rate limits assume one IP per actor. A residential proxy network rotates IPs every request. A single attacker can run 10,000 sessions per minute, each from a different residential IP, none of which exceeds a sensible per-IP rate limit.
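
The arithmetic is easy to check. The sketch below is a toy fixed-window per-IP limiter fed the 10,000-sessions-per-minute scenario above. The class name, the limit of 60 requests per minute, and the synthetic addresses are illustrative assumptions, not figures from any product.

```python
from collections import Counter


class PerIpLimiter:
    """Toy fixed-window limiter: at most `limit` requests per IP per window."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.counts: Counter = Counter()

    def allow(self, ip: str) -> bool:
        self.counts[ip] += 1
        return self.counts[ip] <= self.limit


def blocked_count(ips: list[str], limit: int) -> int:
    """Replay one window of traffic and count the rejected requests."""
    limiter = PerIpLimiter(limit)
    return sum(not limiter.allow(ip) for ip in ips)


# One attacker, 10,000 requests in a minute, each from a different proxy exit.
rotating = [f"10.{i // 65536}.{(i // 256) % 256}.{i % 256}" for i in range(10_000)]
# The same volume sent from a single address.
single = ["203.0.113.7"] * 10_000

print(blocked_count(rotating, limit=60))  # 0
print(blocked_count(single, limit=60))    # 9940
```

The limiter works exactly as designed in both cases. It simply never sees more than one request from any rotating address, so it has nothing to count.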

Geofencing punishes real users. Travelling customers, VPN users, and users with privacy-conscious DNS settings all show up in countries the business does not serve. A geofence that catches bots also catches them.

Referer is optional. Real browsers omit Referer in many legitimate contexts. Bots send whatever Referer the operator tells them to.

The deeper problem is that these rules all look at one signal in isolation. The economics of bot operations are such that any rule defined on a single signal is defeated as soon as bypassing it costs no more than a few minutes of automation work, which is nearly always the case.

What modern bot detection actually looks at

Production bot detection is layered. Every reputable vendor (Cloudflare, DataDome, HUMAN, Foil) collects from the same four broad categories, and the differences between them are about depth, accuracy and how the verdict is delivered.

1. Network and infrastructure signals

The earliest signals, available before any application code runs.

  • IP and ASN. Is the IP from a datacenter, a residential ISP, a mobile carrier, a known VPN, or a known proxy network? AWS, GCP, Azure, OVH, Hetzner, DigitalOcean and a long tail of hosting providers are easy to identify by ASN. We cover this in datacenter proxy detection.
  • TLS fingerprint (JA3, JA4, JA4+). The ClientHello shape exposes the underlying TLS library. Real Chrome looks different from Python’s requests, which looks different from Go’s net/http, which looks different from a stealth-patched curl-impersonate build. JA4 in particular has become the standard since 2023, when Chrome started randomising TLS extension order and naive JA3 hashes of Chrome stopped being stable (Cloudflare: JA3/JA4 fingerprint).
  • HTTP/2 and HTTP/3 frame settings. The order and contents of the SETTINGS frame, the priority signaling, and the way the client handles server-pushed streams differ between client libraries.

2. Browser-level signals

What the browser exposes through HTTP headers and JavaScript APIs.

  • HTTP header set, ordering and casing. Real Chrome sends headers in a specific order. Python requests sends them differently. curl does its own thing. Header order is a separate signal from header contents.
  • Sec-Fetch headers. Sec-Fetch-Site, Sec-Fetch-Mode, Sec-Fetch-Dest. Most low-effort automation does not set these correctly. A real POST /login request from a form submission sends Sec-Fetch-Site: same-origin and Sec-Fetch-User: ?1.
  • Client Hints. Sec-CH-UA, Sec-CH-UA-Platform, the high-entropy hints when requested. Inconsistencies between Client Hints and the legacy User-Agent string catch sloppy spoofing.
  • navigator.webdriver. Set to true automatically by Selenium and by default Puppeteer configurations. Setting it back to false requires a patch the stealth plugins apply, but the patch itself is detectable.
  • Headless artefacts. Default Puppeteer ships with navigator.userAgent containing “HeadlessChrome”, navigator.plugins.length === 0, navigator.languages.length === 0, and chrome.runtime absent. Patched stealth Puppeteer fixes most of these; the patches themselves leave traces.
  • Canvas, WebGL, AudioContext fingerprints. Covered in canvas fingerprinting. Bot operators either run on real GPUs (expensive to scale) or spoof the output (detectable through cross-checks).
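
Several of the header-level checks in this list can be run server-side with no SDK at all. The following is a minimal sketch, assuming an HTTPS request whose headers arrive as a plain dict. The function name and the finding strings are hypothetical, and the rules cover only a Chrome-claiming User-Agent; a production check would handle more browsers and more edge cases.

```python
import re


def header_inconsistencies(headers: dict[str, str]) -> list[str]:
    """Return human-readable contradictions found in one request's headers."""
    h = {k.lower(): v for k, v in headers.items()}
    ua = h.get("user-agent", "")
    findings: list[str] = []

    claims_chrome = "Chrome/" in ua and "Edg/" not in ua
    if claims_chrome:
        # Chromium sends Sec-CH-UA and Sec-Fetch-* on requests over HTTPS.
        if "sec-ch-ua" not in h:
            findings.append("Chrome UA without Sec-CH-UA")
        if "sec-fetch-site" not in h:
            findings.append("Chrome UA without Sec-Fetch-Site")

        # The UA major version should appear in the Client Hints brand list.
        match = re.search(r"Chrome/(\d+)", ua)
        if match and "sec-ch-ua" in h and f'v="{match.group(1)}"' not in h["sec-ch-ua"]:
            findings.append("UA major version disagrees with Sec-CH-UA")

    # The platform named in the hint should match the platform token in the UA.
    platform = h.get("sec-ch-ua-platform", "").strip('"')
    ua_tokens = {"Windows": "Windows NT", "macOS": "Macintosh", "Linux": "Linux"}
    if platform in ua_tokens and ua_tokens[platform] not in ua:
        findings.append("Sec-CH-UA-Platform disagrees with UA platform")

    if "HeadlessChrome" in ua:
        findings.append("UA advertises HeadlessChrome")
    return findings
```

An empty list means only that these particular checks found nothing. It is one input to scoring, not a verdict.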

3. Behavioral signals

What only a human-driven session produces consistently.

  • Pointer dynamics. Mouse-move trajectories with characteristic noise. Headless scripts produce straight lines or perfectly Bezier-smooth arcs.
  • Scroll patterns. Variable velocity with deceleration, occasional reverse scrolls, scroll-then-pause sequences. Bots scroll mechanically or do not scroll at all.
  • Keystroke timing. Inter-key intervals follow a distribution with characteristic dwell-and-flight ratios per language. Synthetic keystroke events do not.
  • Touch dynamics on mobile. Pressure curves, touch area, the geometry of multi-touch gestures.
  • Idle and focus transitions. Real users tab away, come back, click somewhere, type a bit, scroll. Bots stay on task.
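
To make the pointer-dynamics idea concrete, here is a rough sketch of two features a scorer might compute from a stream of mouse-move events: how straight the path is, and how regular the event timing is. The function names and both thresholds are assumptions chosen for illustration. Real systems use far richer features and learned models rather than two hard cut-offs.

```python
import math
from statistics import pstdev

Point = tuple[float, float, float]  # (x, y, timestamp in milliseconds)


def path_straightness(points: list[Point]) -> float:
    """Straight-line distance divided by travelled distance; 1.0 is a ruler line."""
    if len(points) < 2:
        return 1.0
    travelled = sum(math.dist(a[:2], b[:2]) for a, b in zip(points, points[1:]))
    if travelled == 0:
        return 1.0
    return math.dist(points[0][:2], points[-1][:2]) / travelled


def interval_jitter(points: list[Point]) -> float:
    """Standard deviation of the gaps between consecutive events, in ms."""
    gaps = [b[2] - a[2] for a, b in zip(points, points[1:])]
    return pstdev(gaps) if len(gaps) > 1 else 0.0


def looks_scripted(
    points: list[Point],
    max_straightness: float = 0.995,  # illustrative threshold
    min_jitter_ms: float = 1.0,       # illustrative threshold
) -> bool:
    """Flag paths that are both near-perfectly straight and metronome-timed."""
    if len(points) < 3:
        return False  # too little evidence to say anything
    return (
        path_straightness(points) > max_straightness
        and interval_jitter(points) < min_jitter_ms
    )
```

A path generated by linear interpolation at a fixed frame interval trips both conditions. A hand-driven path usually fails at least one, though a short deliberate drag can look straight, which is why this kind of feature feeds a score rather than a block rule.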

4. Account and reputation signals

When you have history.

  • Has this device been seen before, and how? The device intelligence layer covered in what is device fingerprinting.
  • Cluster behavior. Is this device part of a cluster of devices all hitting the same endpoint with similar timing? Cluster signals catch farms that look fine individually.
  • Account and IP graph. What other accounts has this device touched? What other devices have touched this IP?
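
A cluster signal can be sketched as a grouping problem: bucket events by endpoint and time window, then look for windows where many distinct devices arrive together. The event shape, the function name, and the default thresholds below are hypothetical. A real system would compare each bucket against the endpoint's normal baseline instead of using a fixed device count.

```python
from collections import defaultdict
from typing import Iterable

Event = tuple[str, str, float]  # (device_id, endpoint, unix timestamp)


def suspicious_clusters(
    events: Iterable[Event],
    bucket_seconds: int = 10,  # illustrative window size
    min_devices: int = 5,      # illustrative threshold
) -> dict[tuple[str, int], set[str]]:
    """Return {(endpoint, bucket): device ids} for buckets with many devices."""
    buckets: dict[tuple[str, int], set[str]] = defaultdict(set)
    for device_id, endpoint, ts in events:
        buckets[(endpoint, int(ts // bucket_seconds))].add(device_id)
    return {key: ids for key, ids in buckets.items() if len(ids) >= min_devices}
```

Each device in a flagged bucket may look unremarkable on its own. The grouping is what surfaces the farm.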

How to actually detect bot traffic

A practical detection pipeline combines those signals in order of cost and signal-to-noise ratio.

Step 1: Cheap network signals at the edge. Before any application code runs, evaluate the IP, ASN, TLS fingerprint and HTTP/2 settings. If the request claims to be Chrome but the TLS fingerprint says it is Python’s urllib3, you are done. Block at the edge.
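
A minimal sketch of that edge check follows, assuming the edge already exposes a JA4 string for each connection. The lookup table is hypothetical and its keys are placeholders, not real JA4 values. In practice the table comes from observed traffic or a vendor feed.

```python
# Hypothetical lookup from TLS fingerprint to the client library that
# produces it. The keys are placeholders, NOT real JA4 values.
KNOWN_TLS_CLIENTS = {
    "ja4-placeholder-chrome": "chrome",
    "ja4-placeholder-firefox": "firefox",
    "ja4-placeholder-python-urllib3": "python-urllib3",
    "ja4-placeholder-go-net-http": "go-net-http",
}

BROWSER_FAMILIES = {"chrome", "firefox", "safari"}


def claimed_family(user_agent: str) -> str:
    """Very rough browser family from the User-Agent string."""
    if "Firefox/" in user_agent:
        return "firefox"
    if "Chrome/" in user_agent:  # checked before Safari: Chrome UAs contain both
        return "chrome"
    if "Safari/" in user_agent:
        return "safari"
    return "other"


def edge_decision(user_agent: str, ja4: str) -> str:
    """Return "block" on a browser claim contradicted by TLS, else "continue"."""
    tls_client = KNOWN_TLS_CLIENTS.get(ja4)
    if tls_client is None:
        return "continue"  # unknown fingerprint: let later layers decide
    claimed = claimed_family(user_agent)
    if claimed in BROWSER_FAMILIES and tls_client != claimed:
        return "block"
    return "continue"
```

Unknown fingerprints fall through rather than block, since new browser versions produce new fingerprints before any table catches up.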

Step 2: Browser-level signals via SDK. A small JavaScript SDK collects the fingerprint, the Client Hints, the navigator checks, and the active probes (canvas, WebGL, audio). Streamed to the server in real time.

Step 3: Behavioral signals over the session window. The SDK keeps collecting pointer, scroll, touch and timing data while the page is open. Server-side scoring updates as evidence accumulates.

Step 4: Cross-checks across layers. Every layer produces evidence. The scoring system checks for consistency: TLS vs User-Agent, Client Hints vs JavaScript environment, time zone vs IP geography, claimed GPU vs canvas output. Each contradiction is a signal in its own right.
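
Two of those cross-checks can be sketched in a few lines. The dataclass fields, the signal names, and the three-entry time zone table are all illustrative assumptions. A real deployment would use a complete time-zone-to-country mapping and many more comparisons.

```python
from dataclasses import dataclass

# Illustrative subset; a real deployment needs a full tz-to-country table.
TIMEZONE_COUNTRY = {
    "Europe/Berlin": "DE",
    "America/New_York": "US",
    "Asia/Tokyo": "JP",
}

# Expected navigator.platform prefix for each Sec-CH-UA-Platform value.
PLATFORM_PREFIX = {"Windows": "Win", "macOS": "Mac", "Linux": "Linux"}


@dataclass
class SessionEvidence:
    ip_country: str        # from IP geolocation at the edge
    browser_timezone: str  # time zone reported by the browser-side SDK
    ch_platform: str       # Sec-CH-UA-Platform header, quotes removed
    js_platform: str       # navigator.platform reported by the SDK


def cross_layer_contradictions(ev: SessionEvidence) -> list[str]:
    """Name each pair of layers that disagree for this session."""
    found: list[str] = []

    tz_country = TIMEZONE_COUNTRY.get(ev.browser_timezone)
    if tz_country is not None and tz_country != ev.ip_country:
        found.append("timezone_vs_ip_geo")

    prefix = PLATFORM_PREFIX.get(ev.ch_platform)
    if prefix is not None and not ev.js_platform.startswith(prefix):
        found.append("client_hints_vs_js_platform")

    return found
```

A time zone mismatch on its own is weak evidence, because travellers and VPN users produce it too. It becomes interesting when it lines up with other contradictions.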

Step 5: Verdict. The combined score, with explanations of which signals fired and why, drives an action: allow, throttle, challenge, log, block.
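
One simple way to combine fired signals into a score with explanations is a noisy-OR, which treats each signal as independent evidence. The sketch below assumes that model. The signal names, weights, and thresholds are made up for illustration and are not taken from any vendor. The "log" action is left out because logging can accompany any of the other four.

```python
# Illustrative weights: the probability each signal alone indicates a bot.
WEIGHTS = {
    "tls_ua_mismatch": 0.7,
    "webdriver_true": 0.6,
    "datacenter_asn": 0.3,
    "no_pointer_events": 0.3,
    "timezone_vs_ip_geo": 0.1,
}


def verdict(fired: set[str]) -> dict:
    """Combine fired signals with a noisy-OR and map the score to an action."""
    remaining = 1.0  # probability that no fired signal indicates a bot
    reasons: list[str] = []
    for name in sorted(fired):
        weight = WEIGHTS.get(name, 0.0)
        if weight > 0:
            remaining *= 1.0 - weight
            reasons.append(name)
    score = 1.0 - remaining

    # Illustrative thresholds, strictest first.
    if score >= 0.85:
        action = "block"
    elif score >= 0.5:
        action = "challenge"
    elif score >= 0.2:
        action = "throttle"
    else:
        action = "allow"
    return {"score": round(score, 3), "action": action, "reasons": reasons}
```

Returning the reasons alongside the action is the point of the exercise: an operator can see which signals drove the decision and tune the policy without guessing.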

The reason this pipeline beats a one-signal rule is that an attacker who can defeat one layer still has to defeat the others consistently. Modern stealth tooling can patch navigator.webdriver, but it does not also patch the TLS fingerprint, the Sec-Fetch headers, the canvas output, the mouse trajectories, and the IP reputation all at once. That requirement for consistency across every layer is the real moat.

What “bot detection software” usually means

The bot detection vendor landscape splits into three broad categories.

Edge bot detection. Cloudflare Bot Management, AWS WAF Bot Control, Akamai Bot Manager. Sits at the CDN layer, has the freshest network-level data, blocks at the edge. Weak on use-case-specific behavioral signals because it does not run inside your app.

Bot management platforms. DataDome, HUMAN, Kasada, Arkose. Mix of edge and in-app signals. Heavy on managed rulesets for account takeover, scraping, ad fraud. Designed for security teams.

Device intelligence platforms. Sardine, SEON, Foil. Built around a JavaScript or native SDK that collects rich device evidence; bot detection is one verdict among several (the others being device identity, risk scoring, account abuse classification). Designed for developers.

These categories overlap. The right choice depends on how much detection logic you want to own versus rent, and on which other use cases (fake accounts, payment fraud, scraping) you need to address from the same SDK.

Verified bots, good bots, and the search problem

Not every bot is bad. Googlebot, Bingbot, Applebot, GPTBot, ChatGPT-User, Claude, Perplexity-User and a long list of AI crawlers exist with legitimate reasons to fetch your content. A bot detection system that blocks them indiscriminately will damage search visibility and AI discoverability.

Two mechanisms exist for verifying that a bot is who it claims to be.

Reverse DNS verification. The traditional approach. Googlebot’s IP reverse-resolves to a *.googlebot.com hostname; that hostname’s forward resolution returns the IP. Implementing this correctly avoids spoofed Googlebot traffic but requires per-vendor implementation and breaks when vendors rotate IP ranges.
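
The reverse-then-forward check can be sketched with the Python standard library. The function name is hypothetical, and the suffix list covers only the Googlebot case; other crawlers need their own entries. The resolver functions are injectable so the logic can be exercised without network access. Note that socket.gethostbyname_ex only returns IPv4 addresses, so an IPv6 version would need socket.getaddrinfo instead.

```python
import socket
from typing import Callable

# Hostname suffixes for Googlebot; other vendors need their own entries.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_crawler(
    ip: str,
    allowed_suffixes: tuple[str, ...] = GOOGLEBOT_SUFFIXES,
    reverse: Callable = socket.gethostbyaddr,
    forward: Callable = socket.gethostbyname_ex,
) -> bool:
    """Forward-confirmed reverse DNS: PTR lookup, suffix check, then A lookup."""
    try:
        hostname = reverse(ip)[0]  # (hostname, aliases, addresses)
    except OSError:
        return False  # no PTR record or resolver failure

    if not hostname.lower().rstrip(".").endswith(allowed_suffixes):
        return False  # reverse name is not under the vendor's domain

    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False

    # The forward lookup must lead back to the original address; otherwise
    # anyone controlling their own PTR record could claim a Google hostname.
    return ip in addresses
```

Both lookups add latency, so results are usually cached per IP for some hours rather than resolved on every request.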

Web Bot Auth. A newer standard, currently in draft at the IETF, that allows bots to cryptographically sign their requests. Cloudflare and others are implementing it. Each bot operator publishes its public key, and verified bots sign each request using Signature-Input and Signature HTTP headers (Cloudflare: Web Bot Auth). For sites that want to allow AI agents but verify their identity, this is the direction the standard is moving.

A useful bot detection product distinguishes verified-good bots, unverified-but-classified bots, and unverified-malicious bots. Treating them all the same is the most common reason a real bot management deployment hurts the business it is supposed to protect.

What changes with AI agents

AI agents are a different category from traditional bots. They are software, they are automated, but they are usually:

  • Initiated by a logged-in human user authorizing the agent to act on their behalf.
  • Running in a real Chromium browser, with a real fingerprint, against your real product.
  • Performing high-trust actions: signups, purchases, customer support interactions.

A bot detection system designed around “block all automation” will block authorized AI agents along with malicious ones. The mature posture is to classify and attribute rather than block. Is this agent identified (signed Web Bot Auth, known driver, known headless config)? Is it authorized by the user (linked to a known account, with a recent step-up)? Is it doing what the user would have done?

We cover this case in detail in AI agent detection.

How Foil approaches it

Foil’s SDK collects across five layers (HTTP, TLS, JS environment, active probes, interaction) and streams the signals to a sealed server-side scorer. Each session produces three outputs: a visitor fingerprint (visitor_fingerprint.id) that persists across resets, a verdict (bot, human, or inconclusive) with a risk score, and attribution labels naming the headless framework, anti-detect browser, or AI agent when one is identified.

The attribution is named. “Bot” by itself is not a useful answer for an operations team. “Headless Chrome via stealth-patched Puppeteer, from Hetzner, with a TLS fingerprint that does not match the claimed browser” is a useful answer, and it is the level of attribution a modern bot detection product should be delivering.

If you want to go deeper, the next reads are bot management for the policy layer above detection and bot mitigation for the action layer.

Further reading