Most developers assume this flow:

User → Pay with Stripe
Stripe → Webhook → Your backend
Backend → Confirm booking

But webhooks are best-effort notifications sent over the network, not a guaranteed delivery channel. They can arrive late, arrive twice, or not arrive at all. So what happens if the webhook never arrives?

Normal Payment Flow

User
  |
Checkout
  |
Stripe Payment
  |
Webhook: payment_intent.succeeded
  |
Booking Service
  |
DB → booking confirmed

Everything works. But now imagine a failure.

Failure Scenario

Stripe Payment  ✓ SUCCESS
Webhook         ✗ DELIVERY FAILED

Now the system state becomes inconsistent.

Component     State
Stripe        Payment succeeded
Redis lock    Seat reserved
Database      Booking still pending

The user paid. But your system doesn’t know yet.

Why Webhooks Fail

Common reasons:

Server downtime
Deployment restart
Network timeout
Wrong webhook URL
Slow processing
Signature verification failure

A single HTTP 500 is not fatal, because Stripe retries failed deliveries. In live mode it keeps retrying for up to 3 days with exponential backoff. But retries mean the confirmation can arrive long after the payment, and the same event can be delivered more than once.

The Resilience Pattern

Good payment systems never rely only on webhooks. They use two layers.

1. Fast Path: Webhook

The webhook confirms the booking as soon as Stripe reports the payment.

Stripe → webhook → confirm booking

Fast and efficient.
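
A minimal sketch of the fast path, assuming the event has already been signature-verified and parsed into a dict (for example with the Stripe library's stripe.Webhook.construct_event), and assuming the PaymentIntent carries a booking_id in its metadata. The handler name and the in-memory bookings store are hypothetical stand-ins for your own code and database.

```python
# Hypothetical in-memory store; a real system would use its database.
bookings = {"bk_abc123": {"status": "pending"}}


def handle_stripe_event(event: dict) -> bool:
    """Confirm the booking referenced by a verified Stripe event.

    Returns True if the event pointed at a known booking that is now confirmed.
    """
    if event.get("type") != "payment_intent.succeeded":
        return False  # not an event this handler cares about

    # For payment_intent.* events, data.object is the PaymentIntent.
    intent = event["data"]["object"]
    booking_id = intent.get("metadata", {}).get("booking_id")
    booking = bookings.get(booking_id)
    if booking is None:
        return False  # unknown or missing booking_id; log and investigate

    # Writing the same final state twice is harmless, so redelivery is safe.
    booking["status"] = "confirmed"
    return True
```

Whatever the handler decides, the HTTP endpoint around it should answer with a 2xx quickly once the event is safely processed or stored, since a slow or failing response is what triggers a retry.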

2. Recovery Path: Reconciliation Job

A background job periodically looks at bookings that are still pending and asks Stripe directly for the payment status.

pending bookings
     |
check Stripe API
     |
update DB

Example logic:

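import stripe  # assumes stripe.api_key is already configured

# pending_bookings and confirm_booking are your own code (hypothetical here).
for booking in pending_bookings:
    # Ask Stripe directly instead of waiting for the webhook.
    payment = stripe.PaymentIntent.retrieve(booking.payment_intent_id)

    if payment.status == "succeeded":
        confirm_booking(booking.id)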

Even if the webhook is lost, the booking is confirmed on the next run, so the database eventually matches Stripe.
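
The loop above can be fleshed out into a testable function. This is a sketch under assumptions: the Booking dataclass, the reconcile function, and the 10-minute grace period are all hypothetical, and retrieve_status stands in for a thin wrapper around the Stripe API call that returns the PaymentIntent status string. Injecting it keeps the job runnable without network access.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List


@dataclass
class Booking:
    booking_id: str
    payment_intent_id: str
    created_at: datetime
    status: str = "pending"


def reconcile(
    bookings: Iterable[Booking],
    retrieve_status: Callable[[str], str],
    now: datetime,
    grace: timedelta = timedelta(minutes=10),
) -> List[str]:
    """Confirm pending bookings whose payment succeeded; return their IDs."""
    confirmed = []
    for booking in bookings:
        if booking.status != "pending":
            continue
        # Give the webhook a chance to arrive before polling Stripe.
        if now - booking.created_at < grace:
            continue
        if retrieve_status(booking.payment_intent_id) == "succeeded":
            booking.status = "confirmed"
            confirmed.append(booking.booking_id)
    return confirmed
```

The grace period keeps the job from racing the webhook on fresh bookings. How often to run it is a trade-off between API usage and how long a paid user may wait for confirmation.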

Additional Safeguards

Idempotent Booking

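The webhook and the reconciliation job may both try to confirm the same booking, and Stripe may deliver the same event more than once. Confirmation must be safe to repeat.

booking_id unique
confirm only if status is still pending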

Metadata Linking

When creating the Stripe Checkout session, attach the booking ID as metadata:

metadata:
  booking_id: "bk_abc123"

This allows the reconciliation job to match payments back to bookings without ambiguity.
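
A sketch of how the metadata might be attached, written as a function that only builds the arguments for stripe.checkout.Session.create so it runs without the Stripe library. The function name, the price ID, and the URL paths are hypothetical. Note that metadata set on the Checkout Session is not automatically present on the PaymentIntent it creates, so the booking ID is set in both places.

```python
def build_checkout_params(booking_id: str, price_id: str, base_url: str) -> dict:
    """Build keyword arguments for stripe.checkout.Session.create."""
    return {
        "mode": "payment",
        "line_items": [{"price": price_id, "quantity": 1}],
        # Hypothetical URL layout; use whatever routes your app exposes.
        "success_url": f"{base_url}/bookings/{booking_id}/success",
        "cancel_url": f"{base_url}/bookings/{booking_id}/cancel",
        # Visible on the Checkout Session and checkout.session.* events.
        "metadata": {"booking_id": booking_id},
        # Copied onto the PaymentIntent, so payment_intent.* events and
        # PaymentIntents retrieved by the reconciliation job carry it too.
        "payment_intent_data": {"metadata": {"booking_id": booking_id}},
    }
```

With the Stripe library installed and configured, the call would look like stripe.checkout.Session.create(**build_checkout_params("bk_abc123", price_id, base_url)).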

Monitoring

Alert when webhook failures spike.

if webhook_error_rate > threshold:
    alert(oncall)

Otherwise failures stay silent.
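
The check above needs an error rate to compare against. A minimal sliding-window sketch follows. The class name, the 5-minute window, the 5% threshold, and the minimum event count are all assumptions, and in practice this usually lives in a metrics system rather than in application code.

```python
import time
from collections import deque
from typing import Optional


class WebhookErrorMonitor:
    """Track webhook handler outcomes over a sliding time window."""

    def __init__(self, window_seconds: float = 300.0, threshold: float = 0.05,
                 min_events: int = 20) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_events = min_events  # avoid alerting on one or two events
        self._events = deque()  # (timestamp, ok) pairs, oldest first

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()

    def record(self, ok: bool, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._events.append((now, ok))
        self._evict(now)

    def error_rate(self, now: Optional[float] = None) -> float:
        now = time.time() if now is None else now
        self._evict(now)
        if not self._events:
            return 0.0
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)

    def should_alert(self, now: Optional[float] = None) -> bool:
        rate = self.error_rate(now)
        return len(self._events) >= self.min_events and rate > self.threshold
```

This only sees requests that reach your handler. A wrong URL or a server that is down produces no events at all, so it is also worth alerting when the reconciliation job confirms bookings, since each one is a payment the webhook path missed.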

The Real Lesson

Webhooks are notifications, not guarantees. A resilient system always assumes:

events may be delayed
events may be duplicated
events may be lost

Your architecture must recover from all three.

System Design Takeaway

Every external integration needs:

Fast path
+
Recovery path

For payments:

Webhook
+
Reconciliation job

Without the second layer, your system will eventually lose money or users.