What Happens When Your Stripe Payment Webhook Fails
Most developers assume this flow:
User → Pay with Stripe
Stripe → Webhook → Your backend
Backend → Confirm booking
But in distributed systems, webhooks are best-effort event notifications sent over the network. They can arrive late, arrive more than once, or not arrive at all. So what happens if the webhook never arrives?
Normal Payment Flow
User
|
Checkout
|
Stripe Payment
|
Webhook: payment_intent.succeeded
|
Booking Service
|
DB → booking confirmed
Everything works. But now imagine a failure.
Failure Scenario
Stripe Payment ✓ SUCCESS
Webhook ✗ DELIVERY FAILED
Now the system state becomes inconsistent.
| Component | State |
|---|---|
| Stripe | Payment succeeded |
| Redis lock | Seat reserved |
| Database | Booking still pending |
The user paid. But your system doesn’t know yet.
Why Webhooks Fail
Common reasons:
Server downtime
Deployment restart
Network timeout
Wrong webhook URL
Slow processing
Signature verification failure
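A single failure is not fatal: if your endpoint returns a non-2xx response (for example HTTP 500) or times out, Stripe retries later. In live mode, Stripe keeps retrying for up to 3 days with exponential backoff. But retries introduce delays and edge cases: the seat lock may expire before the confirmation arrives, or the same event may be delivered twice.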
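Signature verification failures are often self-inflicted: the signature covers the raw request bytes, so verifying a re-serialized JSON body fails even when the secret is correct. The sketch below shows the check using only the standard library, following Stripe's documented `Stripe-Signature` scheme (`t=<timestamp>,v1=<HMAC-SHA256 hex digest>`). The function name, the `now` parameter, and the 300-second tolerance are assumptions for illustration. In production, the official library's `stripe.Webhook.construct_event` is the safer choice.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(payload: bytes, header: str, secret: str,
                            tolerance: int = 300,
                            now: Optional[float] = None) -> bool:
    """Return True if the Stripe-Signature header matches the raw payload."""
    # Header format: "t=<unix timestamp>,v1=<hex digest>[,v1=<hex digest>...]"
    timestamp = None
    candidates = []
    for part in header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False

    # Reject stale timestamps to limit replay of captured requests.
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance:
        return False

    # The signed payload is "<timestamp>.<raw body>", keyed with the endpoint secret.
    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return any(
        hmac.compare_digest(expected.encode(), c.encode()) for c in candidates
    )
```

Pass the body exactly as it was received, before any JSON parsing or middleware touches it.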
The Resilience Pattern
Good payment systems never rely only on webhooks. They use two layers.
1. Fast Path: Webhook
In the normal case, the webhook confirms the payment within seconds.
Stripe → webhook → confirm booking
Fast and efficient.
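A minimal, framework-free sketch of the fast path might look like this. It assumes the signature has already been verified and that the PaymentIntent carries a `booking_id` in its metadata. `handle_webhook` and the injected `confirm_booking` callable are hypothetical names, not part of any Stripe library.

```python
import json
from typing import Callable


def handle_webhook(raw_body: bytes, confirm_booking: Callable[[str], None]) -> int:
    """Handle one signature-verified webhook body; return the HTTP status to send."""
    try:
        event = json.loads(raw_body)
        event_type = event["type"]
    except (ValueError, KeyError, TypeError):
        return 400  # not a well-formed event

    if event_type != "payment_intent.succeeded":
        return 200  # acknowledge event types this handler does not act on

    try:
        booking_id = event["data"]["object"]["metadata"]["booking_id"]
    except (KeyError, TypeError):
        return 200  # payment is not linked to a booking; nothing to confirm here

    try:
        confirm_booking(booking_id)
    except Exception:
        return 500  # a non-2xx response tells Stripe to retry the delivery later
    return 200
```

Returning 500 when the confirmation fails is deliberate: it hands the retry back to Stripe instead of silently dropping the event.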
2. Recovery Path: Reconciliation Job
A background job periodically looks at bookings that are still pending and asks Stripe directly for the payment status.
pending bookings
|
check Stripe API
|
update DB
Example logic:
for booking in pending_bookings:
    # Ask Stripe for the authoritative payment state
    payment = stripe.PaymentIntent.retrieve(booking.payment_intent_id)
    if payment.status == "succeeded":
        confirm_booking(booking.id)
This ensures eventual consistency.
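A slightly fuller sketch of the reconciliation loop, written so it can run without network access. The `Booking` dataclass and the injected `fetch_status` callable are assumptions for illustration. In a real job, `fetch_status` would wrap something like `stripe.PaymentIntent.retrieve(payment_intent_id).status`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Booking:
    booking_id: str
    payment_intent_id: str
    status: str = "pending"


def reconcile(pending: Iterable[Booking],
              fetch_status: Callable[[str], str]) -> List[str]:
    """Confirm every pending booking whose payment Stripe reports as succeeded."""
    confirmed = []
    for booking in pending:
        try:
            status = fetch_status(booking.payment_intent_id)
        except Exception:
            # Stripe was unreachable for this booking; the next run retries it.
            continue
        if status == "succeeded":
            booking.status = "confirmed"
            confirmed.append(booking.booking_id)
        # Any other status is left pending for a later run to re-check.
    return confirmed
```

One failed lookup does not abort the run, so a transient API error only delays that single booking until the next pass.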
Additional Safeguards
Idempotent Booking
Prevent duplicate booking updates.
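Keep booking_id unique and make confirmation a no-op for a booking that is already confirmed, so a retried webhook and the reconciliation job can both run safely.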
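One way to get this behavior is a conditional update: the row changes only if it is still pending, so the second caller finds nothing to do. The sketch below uses an in-memory SQLite database. The table layout and the `confirm_booking` helper are assumptions for illustration.

```python
import sqlite3


def confirm_booking(conn: sqlite3.Connection, booking_id: str) -> bool:
    """Move a booking from pending to confirmed; return True only if it changed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE bookings SET status = 'confirmed' "
            "WHERE booking_id = ? AND status = 'pending'",
            (booking_id,),
        )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (booking_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
)
conn.execute("INSERT INTO bookings VALUES ('bk_abc123', 'pending')")

print(confirm_booking(conn, "bk_abc123"))  # True: first delivery confirms
print(confirm_booking(conn, "bk_abc123"))  # False: duplicate is a no-op
```

The return value tells the caller whether side effects such as sending a confirmation email should run, so they also happen only once.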
Metadata Linking
Stripe checkout includes:
metadata:
  booking_id: "bk_abc123"
This allows the reconciliation job to match payments back to bookings without ambiguity.
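With Stripe Checkout, metadata set on the Checkout Session is not copied onto the PaymentIntent automatically, so a `payment_intent.succeeded` event would arrive without the `booking_id`. The sketch below builds the session parameters with the metadata in both places. `build_checkout_params` is a hypothetical helper, and the product name and URL are placeholders. The resulting dict is meant to be passed as `stripe.checkout.Session.create(**params)`.

```python
from typing import Any, Dict


def build_checkout_params(booking_id: str, amount_cents: int,
                          currency: str, success_url: str) -> Dict[str, Any]:
    """Build Checkout Session parameters that carry the booking_id as metadata."""
    metadata = {"booking_id": booking_id}
    return {
        "mode": "payment",
        "line_items": [{
            "price_data": {
                "currency": currency,
                "unit_amount": amount_cents,
                "product_data": {"name": f"Booking {booking_id}"},
            },
            "quantity": 1,
        }],
        "success_url": success_url,
        # Metadata on the Checkout Session itself (checkout.session.* events).
        "metadata": metadata,
        # Metadata on the PaymentIntent the session creates (payment_intent.* events).
        "payment_intent_data": {"metadata": dict(metadata)},
    }


params = build_checkout_params(
    "bk_abc123", 4500, "usd", "https://example.com/booking/success"
)
```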
Monitoring
Alert when webhook failures spike.
if webhook_error_rate > threshold:
    alert(oncall)
Otherwise failures stay silent.
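The error-rate check can be sketched as a sliding window over recent webhook outcomes. The class name, the five-minute window, the 5% threshold, and the minimum event count are all assumptions for illustration. The alerting call itself is left to whatever paging tool is in use.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class WebhookErrorMonitor:
    """Tracks webhook outcomes over a sliding time window."""

    def __init__(self, window_seconds: float = 300.0,
                 threshold: float = 0.05, min_events: int = 20) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_events = min_events  # avoids alerting on a handful of events
        self._events: Deque[Tuple[float, bool]] = deque()

    def _evict(self, now: float) -> None:
        # Drop outcomes that have fallen out of the window.
        while self._events and self._events[0][0] < now - self.window_seconds:
            self._events.popleft()

    def record(self, ok: bool, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._events.append((now, ok))
        self._evict(now)

    def error_rate(self, now: Optional[float] = None) -> float:
        now = time.time() if now is None else now
        self._evict(now)
        if not self._events:
            return 0.0
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)

    def should_alert(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        rate = self.error_rate(now)
        return len(self._events) >= self.min_events and rate > self.threshold
```

This only sees deliveries that reach the server. A wrong webhook URL produces no errors on your side, so it is worth also watching for an unusual drop in event volume or a growing count of bookings fixed by reconciliation.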
The Real Lesson
Webhooks are notifications, not guarantees. A resilient system always assumes:
events may be delayed
events may be duplicated
events may be lost
Your architecture must recover from all three.
System Design Takeaway
Every external integration needs:
Fast path
+
Recovery path
For payments:
Webhook
+
Reconciliation job
Without the second layer, your system will eventually lose money or users.