Dead-Letter Queues for Webhooks: Design and Recovery Patterns
DLQs in HTTP delivery are different from DLQs in queue infrastructure. Here's what they should hold, how to inspect them, and what to do when they fill up.
Dead-letter queues are a queue-infrastructure idea borrowed wholesale into webhook delivery, and the borrowing isn't perfect. A webhook DLQ holds deliveries that failed for HTTP reasons, not messages that failed to enqueue. The semantics differ in ways that matter for design and recovery.
What belongs in a webhook DLQ
A delivery lands in the DLQ when its retries are exhausted or when a failure is classified as terminal. That can mean:
- Every retry got a 5xx (the receiver was down for the entire retry window).
- The very first attempt got a permanent 4xx and was classified as terminal (auth, route, payload).
- The TLS handshake failed every time (cert is broken).
- The destination hostname doesn't resolve (DNS issue or typo).
In each case the delivery is real and useful: the event happened, it was accepted from the producer at ingest, and it routed to a destination. The only thing missing is a successful receiver response. That makes the DLQ a recovery surface, not a garbage can.
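The landing conditions above can be sketched as a small decision function. This is a minimal sketch under stated assumptions, not Pushrail's implementation: the `Outcome` and `Attempt` names, the `network_error` labels, and the choice to keep retrying 408 and 429 are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    DELIVERED = "delivered"
    RETRY = "retry"
    DEAD_LETTER = "dead_letter"


@dataclass
class Attempt:
    # None when no HTTP response arrived (TLS or DNS failure, timeout).
    status_code: Optional[int] = None
    # Hypothetical labels such as "tls_handshake" or "dns".
    network_error: Optional[str] = None


# Assumption: these 4xx codes are transient; every other 4xx is terminal.
RETRYABLE_4XX = {408, 429}


def next_step(attempt: Attempt, attempts_made: int, max_attempts: int) -> Outcome:
    """Decide what happens to a delivery after one attempt."""
    code = attempt.status_code
    if code is not None and 200 <= code < 300:
        return Outcome.DELIVERED
    # A permanent 4xx goes straight to the DLQ, even on the first attempt.
    if code is not None and 400 <= code < 500 and code not in RETRYABLE_4XX:
        return Outcome.DEAD_LETTER
    # Everything else (5xx, TLS, DNS, timeouts) is retried until the budget runs out.
    if attempts_made >= max_attempts:
        return Outcome.DEAD_LETTER
    return Outcome.RETRY
```

With a budget of five attempts, a 401 on the first try is dead-lettered immediately, while a 503 is retried four more times before it joins it.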
What doesn't belong
Things that should never reach the DLQ:
- Schema-rejected events at ingest. If the producer sent a malformed event, reject it at ingest with a 4xx. The producer fixes their code; nothing lands in the queue.
- Events with no matching destination. If routing yields zero destinations, that's a config issue, not a delivery issue. Surface it as a config warning instead.
- Duplicates caught by idempotency. Duplicates are discarded at ingest with a "duplicate" status. They never enter the delivery pipeline.
A clean DLQ contains only deliveries that genuinely tried and failed. Mixing in producer errors or config errors makes the DLQ noisy and hides the real failures.
What the DLQ should expose
Operationally, the DLQ is a surface the customer (and your team) needs to inspect. The minimum it should expose per entry:
- The original event (eventType, occurredAt, customerExternalId, full payload).
- The destination it was trying to reach.
- The attempt history: how many times it was tried, status codes, response bodies, latency per attempt.
- The classification of the final failure (transient with retries exhausted, or permanent).
- The time it landed in the DLQ.
- A replay action.
The attempt history is the part that's easy to skip and shouldn't be. When a customer asks "why did this fail?" the answer is usually in the response bodies of the last few attempts. If the DLQ only shows the final error, the customer is stuck filing a ticket.
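One way to picture the per-entry record is as a pair of dataclasses. This is a sketch only: the event fields mirror the names listed above in snake_case, while `AttemptRecord`, `DlqEntry`, `last_responses`, and the classification strings are hypothetical and not a documented schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class AttemptRecord:
    attempted_at: datetime
    status_code: Optional[int]  # None when no HTTP response was received
    response_body: str
    latency_ms: int


@dataclass
class DlqEntry:
    event_type: str
    occurred_at: datetime
    customer_external_id: str
    payload: dict[str, Any]
    destination: str
    # Assumed values: "retries_exhausted" or "permanent".
    final_classification: str
    dead_lettered_at: datetime
    attempts: list[AttemptRecord] = field(default_factory=list)

    def last_responses(self, n: int = 3) -> list[str]:
        """Summarize the last n attempts, oldest first, for a 'why did this fail?' view."""
        lines = []
        for attempt in self.attempts[-n:]:
            code = attempt.status_code if attempt.status_code is not None else "no response"
            lines.append(f"{code}: {attempt.response_body}")
        return lines
```

Keeping every attempt on the entry, rather than only the final error, is what lets the inspection view answer the question without a support ticket.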
Recovery patterns
The DLQ is where you go to recover from problems. Common patterns:
Customer fixed their endpoint. They redeployed, rotated the cert, fixed the auth, whatever. They want everything in the DLQ for their destination to replay. The action: select all entries for the destination, trigger replay, watch the progress.
Producer fixed their payload. They were sending a malformed field, the receiver was 400ing every time, the producer fixed the bug. The action: replay everything from a time window, with the option to apply current transforms (so the fix lands).
Selective replay. They fixed one bug but not another. They want to replay only events of type X. The action: filter the DLQ by event type, then replay the filter result.
Manual triage. They look at the DLQ and decide some events aren't worth replaying: they're stale, the underlying resource has been deleted, or the business case has moved on. The action: bulk-delete entries (with audit trail) instead of replaying.
Each of these works against the same DLQ surface. The replay primitive handles the actual re-delivery; the DLQ provides the inspection surface.
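All four patterns start with the same step: pick a subset of entries, then replay or delete it. A minimal sketch of that selection step follows. The `Entry` shape and `select_entries` function are hypothetical, and filtering the time window on `dead_lettered_at` (rather than on the event's own timestamp) is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Entry:
    entry_id: str
    destination: str
    event_type: str
    dead_lettered_at: datetime


def select_entries(
    entries: Iterable[Entry],
    *,
    destination: Optional[str] = None,
    event_type: Optional[str] = None,
    since: Optional[datetime] = None,
    until: Optional[datetime] = None,
) -> list[Entry]:
    """Return the entries that match every filter that was supplied."""
    selected = []
    for entry in entries:
        if destination is not None and entry.destination != destination:
            continue
        if event_type is not None and entry.event_type != event_type:
            continue
        if since is not None and entry.dead_lettered_at < since:
            continue
        if until is not None and entry.dead_lettered_at > until:
            continue
        selected.append(entry)
    return selected
```

Passing only `destination` covers the "customer fixed their endpoint" case, `since` and `until` cover the time-window replay, and `event_type` covers selective replay. The same result list can feed a bulk-delete during manual triage.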
When the DLQ fills up
A DLQ should have a retention window. Pushrail's default is the customer's plan-tier retention (Free: 24h, Base: 7d, Growth: 30d, Scale: 90d, Enterprise: configurable). Entries older than the window are automatically purged.
That's a feature, not a problem. A DLQ that grows forever becomes a graveyard. The retention window forces resolution: either replay or accept the loss.
When entries approach the end of the retention window, the customer should get an alert. Not "the DLQ has N entries" (which is noise), but "you have entries in the DLQ that will expire in N days." That's an action signal.
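The expiry check behind that alert is simple date arithmetic. Here is a rough sketch, assuming each entry records when it was dead-lettered. The `RETENTION` table copies the plan tiers listed above (Enterprise is omitted because it is configurable), and the `expiring_within` helper and its warning window are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Plan-tier retention windows as listed above; Enterprise is configurable.
RETENTION = {
    "free": timedelta(hours=24),
    "base": timedelta(days=7),
    "growth": timedelta(days=30),
    "scale": timedelta(days=90),
}


def expiring_within(
    dead_lettered_at: list[datetime],
    retention: timedelta,
    warn_window: timedelta,
    now: datetime,
) -> list[datetime]:
    """Return expiry times of entries that are still live but expire within warn_window."""
    expiries = [landed + retention for landed in dead_lettered_at]
    return sorted(expiry for expiry in expiries if now < expiry <= now + warn_window)
```

The alert text can then be built from the earliest returned expiry, which gives the customer a deadline instead of a count.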
Alerts vs noise
DLQ alerting is tricky. Every webhook destination has some baseline failure rate: some receivers are flaky, some retries are normal, and some 4xx errors are intentional (the customer is testing). Alerting on every DLQ entry buries the real signal.
The right alerts are:
- Sustained DLQ entry rate. More than N entries for the same destination within M minutes means something's wrong with the destination.
- Approaching retention expiry. Entries about to age out.
- First DLQ entry after a quiet period. The destination was healthy and now isn't.
What not to alert on: every entry, every transient failure that resolved itself on retry, every 4xx that was the customer's intentional rejection.
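Two of these rules, the sustained entry rate and the first entry after a quiet period, reduce to checks over the DLQ arrival times for one destination. The functions below are an illustrative sketch with hypothetical names and thresholds. They assume the timestamps passed in belong to a single destination and cover at least the quiet period.

```python
from datetime import datetime, timedelta, timezone


def sustained_failure(timestamps: list[datetime], threshold: int, window: timedelta) -> bool:
    """True if more than `threshold` entries fall inside any span of length `window`."""
    ordered = sorted(timestamps)
    start = 0
    for end in range(len(ordered)):
        # Shrink the sliding window until it spans at most `window`.
        while ordered[end] - ordered[start] > window:
            start += 1
        if end - start + 1 > threshold:
            return True
    return False


def first_after_quiet(timestamps: list[datetime], quiet_period: timedelta) -> bool:
    """True if the newest entry follows at least `quiet_period` with no entries."""
    ordered = sorted(timestamps)
    if not ordered:
        return False
    if len(ordered) == 1:
        # Assumption: the history covers the quiet period, so a lone entry counts.
        return True
    return ordered[-1] - ordered[-2] >= quiet_period
```

Four entries within three minutes trip a "more than 3 in 5 minutes" rule, while the same four spread over half an hour do not.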
Building it vs using it
A production-grade DLQ is more than a list of failed deliveries. It needs the attempt-history view, the filter UI, the bulk-replay action, the retention controls, the alerting hooks, and the integration with the rest of the audit trail. Plan a few engineering weeks for a usable v1.
Pushrail ships the DLQ as part of the event delivery layer, with the same surface across all 18 destination types. The customer sees their failures, triages them, and triggers replay without filing a ticket.
That closes out the Reliability cluster of this series. Next we'll move beyond webhooks to look at delivering events to data warehouses, queues, and the rest of the destination types your customers actually want.