
Webhook Observability: What 'Did This Event Get Delivered?' Actually Requires

By Aylon · May 29, 2026 · 6 min read

Observability is the difference between a webhook platform you can operate and one you can't. Here's the queryable surface it needs.

Customers ask one question over and over: did event X get delivered? Every minute spent answering that question is a minute not spent building. A webhook platform's observability surface is the difference between a system you can operate and one that operates you.

The questions that need answers

For every event in the system, customers (and your team) need to answer:

Did this event get accepted at ingest?
Which destinations was it routed to?
For each destination, what was the latest delivery attempt's status?
How many attempts did each delivery require?
If it failed, what was the response: status code, body, latency?
If it succeeded, when?
Was it part of a replay, or an original delivery?

Each of these has a hard version and an easy version. Hard: grep through application logs to piece the answer together. Easy: query a record per attempt with structured fields.

What a queryable surface looks like

Per attempt, the platform should expose:

Event ID, event type, customer external ID
Destination ID, destination type
Attempt number (1, 2, 3, …)
Timestamp (start, end)
Outcome (success, transient_failure, permanent_failure)
Status code (for HTTP destinations)
Response body excerpt (truncated, for diagnosis)
Latency milliseconds
Error category if applicable (timeout, connection_refused, dns_failure, auth_failure, payload_rejected)
Whether this was an original delivery or part of a replay job

That's enough to answer almost every question without diving into application logs. The records should be queryable by event ID, by destination, by customer, by time window, and by outcome.
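As a rough illustration of the record and query dimensions listed above, here is a minimal in-memory sketch. The class name `DeliveryAttempt`, the function `query_attempts`, and every field name are assumptions made for this example, not a real platform's schema; a production store would sit on an indexed database rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class DeliveryAttempt:
    """One record per delivery attempt. Field names are illustrative."""
    event_id: str
    event_type: str
    customer_external_id: str
    destination_id: str
    destination_type: str
    attempt_number: int
    started_at: datetime
    ended_at: datetime
    outcome: str  # "success", "transient_failure", or "permanent_failure"
    status_code: Optional[int] = None       # HTTP destinations only
    response_excerpt: Optional[str] = None  # truncated body, for diagnosis
    error_category: Optional[str] = None    # e.g. "timeout", "dns_failure"
    is_replay: bool = False

    @property
    def latency_ms(self) -> float:
        # Derived from the timestamps here; a real store might persist it.
        return (self.ended_at - self.started_at).total_seconds() * 1000


def query_attempts(
    attempts: Iterable[DeliveryAttempt],
    *,
    event_id: Optional[str] = None,
    destination_id: Optional[str] = None,
    customer_external_id: Optional[str] = None,
    outcome: Optional[str] = None,
    since: Optional[datetime] = None,
    until: Optional[datetime] = None,
) -> List[DeliveryAttempt]:
    """Filter by any combination of event, destination, customer, outcome, time window."""
    matches = []
    for a in attempts:
        if event_id is not None and a.event_id != event_id:
            continue
        if destination_id is not None and a.destination_id != destination_id:
            continue
        if customer_external_id is not None and a.customer_external_id != customer_external_id:
            continue
        if outcome is not None and a.outcome != outcome:
            continue
        if since is not None and a.started_at < since:
            continue
        if until is not None and a.started_at >= until:
            continue
        matches.append(a)
    # Oldest first, so an event's attempt history reads top to bottom.
    return sorted(matches, key=lambda a: (a.started_at, a.attempt_number))


# Example: one event that failed once, then succeeded on retry.
t0 = datetime(2026, 5, 29, 12, 0, tzinfo=timezone.utc)
common = dict(event_id="evt_1", event_type="invoice.paid",
              customer_external_id="cust_1", destination_id="dest_1",
              destination_type="http")
log = [
    DeliveryAttempt(**common, attempt_number=1, started_at=t0,
                    ended_at=t0 + timedelta(milliseconds=250),
                    outcome="transient_failure", status_code=503),
    DeliveryAttempt(**common, attempt_number=2,
                    started_at=t0 + timedelta(seconds=30),
                    ended_at=t0 + timedelta(seconds=30, milliseconds=80),
                    outcome="success", status_code=200),
]
history = query_attempts(log, event_id="evt_1")
failures = query_attempts(log, destination_id="dest_1", outcome="transient_failure")
```

With records shaped like this, "did event X get delivered?" becomes a lookup by event ID followed by a glance at the last attempt's outcome.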

Per-destination health rollups

Per-attempt records are essential but verbose. For day-to-day operations, the customer needs a rolled-up view per destination:

Current health (healthy, degraded, failing) based on recent attempts
Last successful delivery timestamp
Last failure timestamp
Failure rate over last 1h / 24h / 7d
Average latency
Active retry count (how many deliveries are waiting on a retry right now)

Health is a function of recent attempts: if 95% or more succeeded in the last hour, the destination is healthy; from 50% up to (but not including) 95%, degraded; below 50%, failing. The exact thresholds are tunable, but the principle holds: customers shouldn't have to compute their own destination health.
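The threshold rule can be sketched in a few lines. This is only an illustration under stated assumptions: the function names (`window_counts`, `classify_health`, `rollup`) are hypothetical, attempts are reduced to `(finished_at, succeeded)` pairs for a single destination, and the `"unknown"` state for a destination with no attempts in the last hour is an assumption, since the rule above does not cover that case.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple

# One destination's attempts as (finished_at, succeeded) pairs.
Attempt = Tuple[datetime, bool]


def window_counts(attempts: List[Attempt], now: datetime,
                  window: timedelta) -> Tuple[int, int]:
    """Return (successes, total) for attempts that finished inside the window."""
    recent = [ok for finished_at, ok in attempts
              if now - window <= finished_at <= now]
    return sum(recent), len(recent)


def classify_health(successes: int, total: int,
                    healthy_at: float = 0.95,
                    degraded_at: float = 0.50) -> str:
    """Apply the tunable thresholds to a success count."""
    if total == 0:
        return "unknown"  # assumption: no recent traffic is its own state
    rate = successes / total
    if rate >= healthy_at:
        return "healthy"
    if rate >= degraded_at:
        return "degraded"
    return "failing"


def rollup(attempts: List[Attempt], now: datetime) -> Dict[str, object]:
    """Build the per-destination summary from raw attempts."""
    windows = {"1h": timedelta(hours=1),
               "24h": timedelta(hours=24),
               "7d": timedelta(days=7)}
    failure_rate = {}
    for label, window in windows.items():
        ok, n = window_counts(attempts, now, window)
        failure_rate[label] = (n - ok) / n if n else None
    successes, total = window_counts(attempts, now, timedelta(hours=1))
    return {
        "health": classify_health(successes, total),
        "last_success_at": max((t for t, ok in attempts if ok), default=None),
        "last_failure_at": max((t for t, ok in attempts if not ok), default=None),
        "failure_rate": failure_rate,
    }
```

Recomputing from raw attempts on every read is fine for a sketch; at volume, a rollup pipeline would more plausibly maintain these counters incrementally.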

Customer-visible vs internal

Some observability is for your customers (their deliveries, their destinations). Some is for your team (system-wide latency, queue depth, worker health). The same underlying data feeds both, but the UI surfaces them differently.

A customer should see: their events, their destinations, their delivery logs, their replays. Scoped queries enforced by the API layer.

Your team should see: aggregate latency, queue depth, worker utilization, error rates by destination type, top failing destinations by tenant, and per-tenant volume to spot anomalies.

The mistake to avoid: surfacing internal metrics to customers (confusing) or building two separate logging pipelines (expensive). One pipeline, two surfaces.
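One way to picture "one pipeline, two surfaces" is a single read path whose tenant filter is decided by the caller's identity rather than by the request. The sketch below is hypothetical throughout: `Caller`, `scoped_attempts`, and the `tenant_id` key are names invented for illustration, and records are plain dictionaries standing in for the shared attempt store.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Caller:
    """Who is asking. Customers carry a tenant_id; operators do not need one."""
    is_operator: bool
    tenant_id: Optional[str] = None


def scoped_attempts(store: Iterable[Dict[str, object]],
                    caller: Caller,
                    tenant_id: Optional[str] = None) -> List[Dict[str, object]]:
    """Read from the shared store, with the tenant scope enforced here.

    Operators may pass any tenant_id, or None for a system-wide view.
    Customers are always pinned to their own tenant.
    """
    if caller.is_operator:
        effective = tenant_id
    else:
        if caller.tenant_id is None:
            raise PermissionError("customer caller has no tenant")
        if tenant_id is not None and tenant_id != caller.tenant_id:
            raise PermissionError("cross-tenant query rejected")
        effective = caller.tenant_id
    return [r for r in store
            if effective is None or r["tenant_id"] == effective]


# Both surfaces read the same records.
store = [
    {"tenant_id": "tenant_a", "event_id": "evt_1", "outcome": "success"},
    {"tenant_id": "tenant_b", "event_id": "evt_2", "outcome": "permanent_failure"},
]
customer_view = scoped_attempts(store, Caller(is_operator=False, tenant_id="tenant_a"))
operator_view = scoped_attempts(store, Caller(is_operator=True))
```

Rejecting a cross-tenant request outright, instead of silently rewriting the filter, is a design choice made here so that misconfigured clients fail loudly.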

What good operator tooling looks like

For your team, the operator surface needs to handle the common questions in seconds:

"Customer A says deliveries are failing." Search by customer ID, filter to last hour, group by destination, look for outliers.
"Destination type X seems slow." Group by destination type, plot p50/p95/p99 latency over time, look for regression.
"Was a specific delivery processed?" Search by event ID or idempotency key, see the full attempt history.
"Why did event Y land in the DLQ?" Click the event, see attempt history with response bodies, find the final-attempt error.

Each of these should take less than a minute, not hours. If your operator tooling makes routine questions hard, your team spends its time on support tickets.
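The first two triage questions reduce to a group-by over attempt records. The following is a sketch only, assuming records are dictionaries with hypothetical keys `destination_id`, `destination_type`, `outcome`, and `latency_ms`, and that the caller has already filtered to the customer and time window of interest. The percentile helper uses the nearest-rank method, which is one of several common definitions.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def failure_rate_by_destination(
    attempts: Iterable[Dict[str, object]],
) -> List[Tuple[str, float, int]]:
    """Return (destination_id, failure_rate, total_attempts), worst first."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for a in attempts:
        dest = a["destination_id"]
        totals[dest] += 1
        if a["outcome"] != "success":
            failures[dest] += 1
    rows = [(dest, failures[dest] / n, n) for dest, n in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_percentiles_by_type(
    attempts: Iterable[Dict[str, object]],
) -> Dict[str, Dict[str, float]]:
    """Group latencies by destination type and report p50/p95/p99."""
    by_type: Dict[str, List[float]] = defaultdict(list)
    for a in attempts:
        by_type[a["destination_type"]].append(a["latency_ms"])
    return {
        dest_type: {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}
        for dest_type, latencies in by_type.items()
    }
```

Sorting by failure rate puts the outlier destination at the top of the list, which is usually the first thing an operator wants to see when a customer reports failures.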

Building it vs using it

Building the observability surface is roughly a quarter of building a webhook platform. The per-attempt log store (durable, queryable, retention-bounded), the rollup pipelines, the UI for customer-visible logs, the UI for operator triage, the alerting on degraded destinations, the export-for-compliance feature: each is a few engineering weeks.

Pushrail's delivery layer ships this as the baseline, not the upsell. Customer-facing delivery logs are visible in the embedded portal; operator triage is in the admin UI; per-attempt records are queryable via API.

Next in the Reliability cluster: webhook signing. Done right, the receiver can trust your events. Done wrong, you have a security incident waiting.
