Webhook Observability: What 'Did This Event Get Delivered?' Actually Requires
Observability is the difference between a webhook platform you can operate and one you can't. Here's the queryable surface it needs.
Customers ask one question over and over: did event X get delivered? Every minute your team spends answering it by hand is a minute not spent building. A queryable observability surface turns that question into a lookup instead of an investigation.
The questions that need answers
For every event in the system, customers (and your team) need to answer:
- Did this event get accepted at ingest?
- Which destinations was it routed to?
- For each destination, what was the latest delivery attempt's status?
- How many attempts did each delivery require?
- If it failed, what were the response code, body, and latency?
- If it succeeded, when?
- Was it part of a replay, or an original delivery?
Each of these questions has a hard version and an easy version. Hard: grep through application logs to piece the answer together. Easy: query one record per attempt with structured fields.
What a queryable surface looks like
Per attempt, the platform should expose:
- Event ID, event type, customer external ID
- Destination ID, destination type
- Attempt number (1, 2, 3, …)
- Timestamps (start, end)
- Outcome (success, transient_failure, permanent_failure)
- Status code (for HTTP destinations)
- Response body excerpt (truncated, for diagnosis)
- Latency in milliseconds
- Error category if applicable (timeout, connection_refused, dns_failure, auth_failure, payload_rejected)
- Whether this was an original delivery or part of a replay job
That's enough to answer almost every question without diving into application logs. The records should be queryable by event ID, by destination, by customer, by time window, and by outcome.
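As a rough illustration of the record and query dimensions listed above, here is a minimal in-memory sketch. The `DeliveryAttempt` class, its field names, and the `query_attempts` helper are assumptions made for this example, not any platform's actual schema or API; a real store would push these filters down into indexed queries.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class DeliveryAttempt:
    """One record per delivery attempt. Field names are illustrative."""
    event_id: str
    event_type: str
    customer_external_id: str
    destination_id: str
    destination_type: str
    attempt_number: int                      # 1, 2, 3, ...
    started_at: datetime
    ended_at: datetime
    outcome: str                             # "success" | "transient_failure" | "permanent_failure"
    latency_ms: int
    status_code: Optional[int] = None        # HTTP destinations only
    response_excerpt: Optional[str] = None   # truncated body, for diagnosis
    error_category: Optional[str] = None     # e.g. "timeout", "dns_failure"
    is_replay: bool = False                  # False means an original delivery


def query_attempts(
    attempts: Iterable[DeliveryAttempt],
    *,
    event_id: Optional[str] = None,
    destination_id: Optional[str] = None,
    customer_external_id: Optional[str] = None,
    since: Optional[datetime] = None,
    until: Optional[datetime] = None,
    outcome: Optional[str] = None,
) -> List[DeliveryAttempt]:
    """Filter by any combination of dimensions; None means "no filter"."""
    matches = [
        a for a in attempts
        if (event_id is None or a.event_id == event_id)
        and (destination_id is None or a.destination_id == destination_id)
        and (customer_external_id is None
             or a.customer_external_id == customer_external_id)
        and (since is None or a.started_at >= since)   # window start, inclusive
        and (until is None or a.started_at < until)    # window end, exclusive
        and (outcome is None or a.outcome == outcome)
    ]
    # Oldest first, so an event's attempt history reads top to bottom.
    return sorted(matches, key=lambda a: (a.started_at, a.attempt_number))
```

With records shaped like this, "did event X get delivered?" becomes `query_attempts(store, event_id="evt_...")` followed by a look at the last attempt per destination.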
Per-destination health rollups
Per-attempt records are essential but verbose. For day-to-day operations, the customer needs a rolled-up view per destination:
- Current health (healthy, degraded, failing) based on recent attempts
- Last successful delivery timestamp
- Last failure timestamp
- Failure rate over last 1h / 24h / 7d
- Average latency
- Active retry count (how many deliveries are mid-retry right now)
Health is a function of recent attempts: if 95% or more of the attempts in the last hour succeeded, the destination is healthy. If at least 50% but fewer than 95% succeeded, it is degraded. If fewer than 50% succeeded, it is failing. The exact thresholds are tunable, but the principle holds: customers shouldn't have to compute their own destination health.
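The threshold rule can be sketched in a few lines. The function name `classify_health`, its parameter names, and the `"unknown"` state for a window with no attempts are assumptions for this example; the 95% and 50% cutoffs are the illustrative defaults from the text and would be tunable in practice.

```python
from typing import Sequence


def classify_health(
    outcomes: Sequence[str],
    healthy_at: float = 0.95,
    degraded_at: float = 0.50,
) -> str:
    """Classify a destination from the outcomes of its recent attempts."""
    if not outcomes:
        # Assumption: a window with no attempts is reported as "unknown"
        # rather than guessed at. The text does not define this case.
        return "unknown"
    rate = sum(1 for o in outcomes if o == "success") / len(outcomes)
    if rate >= healthy_at:
        return "healthy"
    if rate >= degraded_at:
        return "degraded"
    return "failing"
```

The caller is expected to pass only the outcomes from the window of interest (for example, the last hour), so the same function works for any rollup period.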
Customer-visible vs internal
Some observability is for your customers (their deliveries, their destinations). Some is for your team (system-wide latency, queue depth, worker health). The same underlying data feeds both, but the UI surfaces them differently.
A customer should see: their events, their destinations, their delivery logs, their replays. Scoped queries enforced by the API layer.
Your team should see: aggregate latency, queue depth, worker utilization, error rates by destination type, top failing destinations by tenant, and per-tenant volume to spot anomalies.
The mistake to avoid: surfacing internal metrics to customers (confusing) or building two separate logging pipelines (expensive). One pipeline, two surfaces.
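A minimal sketch of "one pipeline, two surfaces": both views read the same rows, and the customer view is scoped by a tenant ID that the API layer takes from the authenticated credential, never from a request parameter. The `AttemptRow` type and both function names are hypothetical and exist only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class AttemptRow:
    tenant_id: str
    destination_id: str
    destination_type: str
    outcome: str          # "success" or a failure outcome
    latency_ms: int


def customer_delivery_log(
    rows: Iterable[AttemptRow], authenticated_tenant_id: str
) -> List[AttemptRow]:
    """Customer surface: only the caller's own attempts are ever returned."""
    return [r for r in rows if r.tenant_id == authenticated_tenant_id]


def operator_failures_by_tenant(
    rows: Iterable[AttemptRow],
) -> List[Tuple[str, int]]:
    """Operator surface: failed-attempt counts across all tenants, worst first."""
    counts = Counter(r.tenant_id for r in rows if r.outcome != "success")
    return counts.most_common()
```

Because the scoping lives in one function at the API boundary, adding a new customer-facing view does not require a second logging pipeline, only another scoped read over the same data.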
What good operator tooling looks like
For your team, the operator surface needs to handle the common questions in seconds:
- "Customer A says deliveries are failing." Search by customer ID, filter to last hour, group by destination, look for outliers.
- "Destination type X seems slow." Group by destination type, plot p50/p95/p99 latency over time, look for regression.
- "Was this specific delivery processed?" Search by event ID or idempotency key, see the full attempt history.
- "Why did event Y land in the DLQ?" Click the event, see attempt history with response bodies, find the final-attempt error.
Each of these should take less than a minute, not hours. If your operator tooling makes routine questions hard, your team spends its time on support tickets.
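The triage questions above reduce to a handful of group-by queries over per-attempt records. The sketch below runs them in memory; the `Attempt` tuple and the function names are assumptions for illustration, time-window filtering is left to the caller, and `percentile` uses the simple nearest-rank method rather than any particular monitoring tool's definition.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Attempt(NamedTuple):
    event_id: str
    customer_id: str
    destination_id: str
    destination_type: str
    attempt_number: int
    outcome: str
    latency_ms: int


def percentile(values: Iterable[int], p: int) -> int:
    """Nearest-rank percentile for an integer p in 0..100."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of empty data")
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def failure_rate_by_destination(
    attempts: Iterable[Attempt], customer_id: str
) -> Dict[str, float]:
    """'Customer A says deliveries are failing': group by destination."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for a in attempts:
        if a.customer_id != customer_id:
            continue
        totals[a.destination_id] += 1
        if a.outcome != "success":
            failures[a.destination_id] += 1
    return {dest: failures[dest] / totals[dest] for dest in totals}


def latency_by_destination_type(
    attempts: Iterable[Attempt],
) -> Dict[str, Dict[str, int]]:
    """'Destination type X seems slow': p50/p95/p99 per destination type."""
    buckets: Dict[str, List[int]] = defaultdict(list)
    for a in attempts:
        buckets[a.destination_type].append(a.latency_ms)
    return {
        dtype: {f"p{p}": percentile(vals, p) for p in (50, 95, 99)}
        for dtype, vals in buckets.items()
    }


def attempt_history(attempts: Iterable[Attempt], event_id: str) -> List[Attempt]:
    """'Was this delivery processed?': every attempt for one event, in order."""
    return sorted(
        (a for a in attempts if a.event_id == event_id),
        key=lambda a: a.attempt_number,
    )
```

An outlier in `failure_rate_by_destination` points at the destination to inspect, and the last element of `attempt_history` is the final-attempt error for an event that ended up in the DLQ.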
Building it vs using it
Building the observability surface is roughly a quarter of building a webhook platform. The per-attempt log store (durable, queryable, retention-bounded), the rollup pipelines, the UI for customer-visible logs, the UI for operator triage, the alerting on degraded destinations, the export-for-compliance feature: each is a few engineering weeks.
Pushrail's delivery layer ships this as the baseline, not the upsell. Customer-facing delivery logs are visible in the embedded portal; operator triage is in the admin UI; per-attempt records are queryable via API.
Next in the Reliability cluster: webhook signing. Done right, the receiver can trust your events. Done wrong, you have a security incident waiting.