Webhook Retries: A Practical Guide to Backoff, Jitter, and When to Stop
Webhook retries sound simple until you ship them. Here's how to think about backoff, jitter, transient vs permanent failures, and when to stop retrying.
Every webhook sender hits the same problem on day one: what do you do when a delivery fails? The answer is retries, but designed badly, retries cause more outages than they fix. This is a practical guide to getting them right.
Why retries matter
No webhook receiver is up all the time. Their server is restarting. Their network had a hiccup. They deployed a bad version and rolled it back two minutes later. If your sender treats every failure as final, your customers lose data every time one of these things happens. If your sender retries forever, you effectively DDoS the receiver as soon as they have a real outage.
The job of a retry policy is to thread that needle: recover from transient failures cheaply, give up on permanent failures fast, and never make a receiver's bad day worse.
Step one: classify failures
Not every failure deserves a retry. Before you write retry logic, classify failures into transient and permanent.
Transient failures are likely to succeed on a later attempt:
- 5xx server errors (the receiver crashed or is restarting)
- Connection timeouts (network blip, slow process)
- Connection refused / reset (receiver process is down)
- 408 Request Timeout
- 429 Too Many Requests (you got rate-limited)
Permanent failures will not succeed without intervention:
- 4xx errors other than 408 and 429 (the request itself is wrong: bad auth, bad payload, or wrong route)
- TLS handshake failures (cert is bad)
- DNS resolution failures (the hostname doesn't exist)
For permanent failures, retrying is wasted work and can mask real problems. Send them straight to a dead-letter queue (DLQ) for the customer to inspect.
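The classification above can be sketched as a single function. This is a minimal illustration, not a reference implementation: the Outcome enum, the classify function, and the error label strings are all hypothetical names, and treating any other non-2xx status (including 3xx redirects) as permanent is an assumption the article does not cover.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    DELIVERED = "delivered"
    TRANSIENT = "transient"   # worth retrying later
    PERMANENT = "permanent"   # send straight to the dead-letter queue


# Hypothetical labels for failures that never produced an HTTP response.
TRANSIENT_ERRORS = {"timeout", "connection_refused", "connection_reset"}
PERMANENT_ERRORS = {"tls_handshake", "dns_resolution"}


def classify(status: Optional[int] = None, error: Optional[str] = None) -> Outcome:
    """Map an HTTP status code or a network error label to a retry decision."""
    if error is not None:
        if error in TRANSIENT_ERRORS:
            return Outcome.TRANSIENT
        if error in PERMANENT_ERRORS:
            return Outcome.PERMANENT
        raise ValueError(f"unknown error label: {error!r}")
    if status is None:
        raise ValueError("need either a status code or an error label")
    if 200 <= status < 300:
        return Outcome.DELIVERED
    if status in (408, 429) or 500 <= status < 600:
        return Outcome.TRANSIENT
    # Assumption of this sketch: everything else is permanent.
    return Outcome.PERMANENT
```

Raising on an unknown error label is deliberate here: a new failure mode should force someone to decide which bucket it belongs in rather than being silently retried or silently dropped.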
Step two: exponential backoff with jitter
Once you know a failure is transient, the question is when to retry. The answer is "later, with a delay that grows exponentially, plus randomness."
Exponential growth means each attempt waits roughly twice as long as the last. Typical sequence: 1s, 2s, 4s, 8s, 16s, 32s, 1m, 2m, 5m, 15m, 30m, 1h, 2h, 4h, 8h, 16h. (The middle steps grow a little faster than doubling so they land on round numbers.) The delays through 8h add up to about 16 hours; the final 16h step would land roughly 32 hours after the first failure, so under a 24-hour retry window it has to be shortened or dropped. The pattern lets you retry quickly when it might be a brief blip, and back off gracefully when it's a real outage.
Jitter adds random variation (typically ±25%) so a thousand events that failed at the same time don't all retry at the same time. Without jitter, every failure cluster becomes a thundering herd that hits the receiver the moment it comes back up. With jitter, the retries arrive smeared across a window.
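Here is one way the delay calculation might look. It is a sketch under a few assumptions: it uses the example sequence as a lookup table rather than computing powers of two, it applies uniform ±25% jitter, and the names BASE_DELAYS and retry_delay are made up for illustration.

```python
import random
from typing import Optional

# Base delays in seconds, taken from the example sequence in this section.
BASE_DELAYS = [1, 2, 4, 8, 16, 32, 60, 120, 300, 900, 1800,
               3600, 7200, 14400, 28800, 57600]

JITTER = 0.25  # +/-25% random variation

_DEFAULT_RNG = random.Random()


def retry_delay(attempt: int, rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (1-based), with jitter."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if rng is None:
        rng = _DEFAULT_RNG
    # Past the end of the table, keep using the largest delay.
    base = BASE_DELAYS[min(attempt, len(BASE_DELAYS)) - 1]
    return base * rng.uniform(1 - JITTER, 1 + JITTER)
```

Passing in a seeded random.Random makes the schedule reproducible in tests, while production callers can rely on the default generator.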
Step three: cap the retry window
Retries can't go forever. After some maximum window, typically 24 hours, the delivery is dead. Send it to a DLQ for the customer to inspect and replay manually.
24 hours is the right default because it covers an overnight outage on the receiver's side: their team rolls in the next morning, sees the failure in their log, fixes the endpoint, and triggers a replay. Going much longer adds little value (a receiver that's been down for two days has a bigger problem than retries can solve) and risks delivering very stale events that downstream systems might not expect.
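A small sketch of how a scheduler might enforce the window. The function name and parameters are hypothetical, and clamping the last retry to the deadline (so a long delay near the end still gets one final attempt) is a design choice of this example rather than something the policy requires.

```python
from typing import Optional

MAX_RETRY_WINDOW = 24 * 60 * 60  # seconds: the 24-hour default


def next_attempt_at(first_attempt_at: float, now: float, delay: float,
                    window: float = MAX_RETRY_WINDOW) -> Optional[float]:
    """Return the timestamp of the next retry, or None if the delivery is dead."""
    deadline = first_attempt_at + window
    if now >= deadline:
        return None  # window exhausted: route the delivery to the DLQ
    # Clamp so the schedule never runs past the deadline.
    return min(now + delay, deadline)
```

For example, with the first attempt at t=0, a failure at t=80000 and a 16-hour delay would be scheduled at t=86400 instead of t=137600, and a failure at or after t=86400 returns None.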
Step four: rate limits are not failures, they're back-pressure
A 429 from the receiver is a signal, not an error. The receiver is telling you "slow down." Respect it: parse the Retry-After header if present, otherwise back off more aggressively than the normal exponential schedule.
For destinations where you control the send rate yourself (warehouses, queues), batch and limit concurrency at the adapter level so you don't trigger 429s in the first place.
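Retry-After can carry either a number of seconds or an HTTP date, so the parsing is worth sketching. The helper names below are hypothetical, and two choices are assumptions of this example: waiting for whichever is longer of the receiver's hint and the normal backoff, and doubling the backoff when no header is present.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str],
                      now: Optional[datetime] = None) -> Optional[float]:
    """Seconds to wait per a Retry-After header, or None if absent or unparseable."""
    if value is None:
        return None
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def delay_after_429(retry_after: Optional[str], backoff_delay: float,
                    penalty: float = 2.0) -> float:
    """Pick the wait after a 429: honor the hint, else back off harder."""
    hinted = parse_retry_after(retry_after)
    if hinted is not None:
        return max(hinted, backoff_delay)
    return backoff_delay * penalty  # assumed multiplier, tune to taste
```

A malformed header falls through to the penalty path, so a receiver that sends garbage still gets more breathing room rather than less.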
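As a rough illustration of limiting concurrency in an adapter, the sketch below caps the number of in-flight sends with a semaphore. It assumes an asyncio-based worker; send_with_limit and the shape of the jobs (zero-argument callables returning awaitables) are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def send_with_limit(jobs: Iterable[Callable[[], Awaitable[T]]],
                          limit: int) -> List[T]:
    """Run the given send jobs with at most `limit` in flight at once."""
    gate = asyncio.Semaphore(limit)

    async def guarded(job: Callable[[], Awaitable[T]]) -> T:
        async with gate:  # wait here while `limit` sends are already running
            return await job()

    # gather preserves the order of the input jobs in its results.
    return list(await asyncio.gather(*(guarded(job) for job in jobs)))
```

Batching would sit one level up: group events into batches first, then pass one job per batch to a limiter like this.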
Step five: surface the state
Retries are invisible by default because they happen inside your worker. That's a problem for your customers. They want to know if event X reached destination Y, and if not, why. Every retry attempt should be logged with: timestamp, attempt number, response code or error type, and latency. The log feeds the customer-facing delivery view.
If a delivery sits in retry purgatory for an hour, the customer should be able to see that and decide whether to wait or intervene. Hiding the retry state is worse than not retrying at all.
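To make the logged fields concrete, here is a minimal sketch of a per-attempt record and a summary a delivery view could render. The record and function names, the field names, and the three state strings are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class AttemptRecord:
    timestamp: float        # Unix time the attempt started
    attempt: int            # 1-based attempt number
    status: Optional[int]   # HTTP response code, if a response arrived
    error: Optional[str]    # error type when there was no response
    latency_ms: float       # how long the attempt took


def delivery_summary(records: List[AttemptRecord]) -> dict:
    """Condense an attempt log into what a customer-facing view might show."""
    if not records:
        return {"state": "pending", "attempts": 0, "last": None}
    last = records[-1]
    delivered = last.status is not None and 200 <= last.status < 300
    return {
        "state": "delivered" if delivered else "retrying",
        "attempts": len(records),
        "last": asdict(last),
    }
```

A real view would also want a terminal state for deliveries that exhausted the retry window and moved to the DLQ; that is left out here to keep the sketch short.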
What to avoid
- Constant-delay retries. Retrying every 5 minutes for 6 hours hammers the receiver during an outage. Use exponential backoff.
- No jitter. Thundering-herd retries take down receivers that were about to recover.
- Retrying 4xx (other than 408 and 429). Auth failures don't fix themselves. Sending the same bad request 10 times helps nobody.
- Infinite retries. Some delivery has to be the last delivery. After 24 hours, escalate to a DLQ.
- Silent failures. If retries happen but the customer can't see them, you've built a system that quietly loses data.
The summary
Classify failures, retry transients with exponential backoff and jitter, give up on permanents fast, cap the retry window at ~24 hours, respect 429s as back-pressure, and surface the state so the customer can see what's happening.
That's the pattern. Pushrail implements it as the baseline. See the event delivery layer for how it shows up across all 18 destination types, or read webhook replay for what happens when retries do eventually exhaust.