Designing multi-channel failover

May 2026 · 1 min

The retry, circuit-breaker and bulkhead policies that route around a 30-second OpenAI outage without page errors.

A multi-channel gateway lives or dies on its failover logic. Here is the design that keeps Orux AI online when an upstream provider isn't.

Three layers

Layer 1: retries. Transient failures (429, 5xx, IOException) get up to 2 retries with exponential backoff (200ms × 2.5^attempt + jitter). Non-retryable client errors (400) fail fast, since resending the same bad request cannot succeed.
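
As a rough illustration of the backoff arithmetic, here is a minimal stdlib-only sketch. The class name RetryPolicy, the 0-based attempt index, and the uniform jitter range are assumptions for the example, not details of the Orux AI gateway. An IOException would be treated as retryable in the calling code's catch block, which is not shown.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: names and the jitter range are illustrative assumptions.
final class RetryPolicy {
    static final int MAX_RETRIES = 2;        // "up to 2 retries"
    static final long BASE_DELAY_MS = 200;   // 200ms base
    static final double MULTIPLIER = 2.5;    // 2.5^attempt

    /** True for the HTTP statuses treated as transient: 429 and any 5xx. */
    static boolean isRetryable(int status) {
        return status == 429 || (status >= 500 && status <= 599);
    }

    /** Backoff before retry number `attempt` (assumed 0-based), without jitter. */
    static long baseDelayMs(int attempt) {
        return Math.round(BASE_DELAY_MS * Math.pow(MULTIPLIER, attempt));
    }

    /** Adds uniform random jitter in [0, maxJitterMs] to the base delay. */
    static long delayWithJitterMs(int attempt, long maxJitterMs) {
        long jitter = ThreadLocalRandom.current().nextLong(maxJitterMs + 1);
        return baseDelayMs(attempt) + jitter;
    }
}
```

Under the 0-based assumption, the two retries wait roughly 200ms and 500ms before jitter, so a request gives up on a channel after well under a second of backoff.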

Layer 2: circuit breaker. A per-channel Resilience4j breaker opens after 5 failures in 30s, halting traffic to that channel for 30s. While it is open, requests skip to the next channel in the model ID's priority list.
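
The skip-to-next-channel behaviour can be sketched without any library. The code below is not the Resilience4j API. It is a simplified stand-in with hypothetical names (ChannelBreaker, ChannelRouter) that counts failures in a 30s window and, unlike a real breaker, has no half-open probing state: it simply closes again once 30s have passed. Time is passed in explicitly so the logic is easy to follow and test.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Simplified stand-in for a per-channel breaker (hypothetical, not Resilience4j).
final class ChannelBreaker {
    static final int FAILURE_THRESHOLD = 5;   // failures needed to open
    static final long WINDOW_MS = 30_000;     // failure-counting window
    static final long OPEN_MS = 30_000;       // how long traffic is halted

    private final Deque<Long> failures = new ArrayDeque<>();
    private long openedAt = -1;               // -1 means the breaker is closed

    synchronized void recordFailure(long nowMs) {
        failures.addLast(nowMs);
        // Drop failures that have aged out of the window.
        while (!failures.isEmpty() && nowMs - failures.peekFirst() >= WINDOW_MS) {
            failures.removeFirst();
        }
        if (failures.size() >= FAILURE_THRESHOLD) {
            openedAt = nowMs;
            failures.clear();
        }
    }

    synchronized boolean allows(long nowMs) {
        if (openedAt < 0) return true;
        if (nowMs - openedAt >= OPEN_MS) {    // open period elapsed: close again
            openedAt = -1;
            return true;
        }
        return false;
    }
}

final class ChannelRouter {
    /** Returns the first channel in priority order whose breaker allows traffic. */
    static Optional<String> pick(List<String> priority,
                                 Map<String, ChannelBreaker> breakers,
                                 long nowMs) {
        for (String channel : priority) {
            ChannelBreaker breaker = breakers.get(channel);
            if (breaker == null || breaker.allows(nowMs)) return Optional.of(channel);
        }
        return Optional.empty();              // every channel is currently open
    }
}
```

An empty result means every channel for that model is tripped. How the gateway responds in that case is not covered by this sketch.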

Layer 3: bulkhead. Each channel has its own WebClient pool and concurrency limit, so a slow provider doesn't starve the others.
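
The concurrency-limit half of the bulkhead can be approximated with one semaphore per channel. This is a sketch under assumptions: the class name ChannelBulkheads, the reject-immediately behaviour, and the single shared limit are illustrative, and the separate WebClient pool per channel is not shown.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical per-channel concurrency limiter (illustrative only).
final class ChannelBulkheads {
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();
    private final int maxConcurrentPerChannel;

    ChannelBulkheads(int maxConcurrentPerChannel) {
        this.maxConcurrentPerChannel = maxConcurrentPerChannel;
    }

    /**
     * Runs the call if the channel has a free permit. If the channel is
     * saturated, throws IllegalStateException immediately instead of queueing,
     * so a slow channel cannot tie up threads needed by the others.
     */
    <T> T run(String channel, Supplier<T> call) {
        Semaphore semaphore = permits.computeIfAbsent(
            channel, c -> new Semaphore(maxConcurrentPerChannel));
        if (!semaphore.tryAcquire()) {
            throw new IllegalStateException("bulkhead full: " + channel);
        }
        try {
            return call.get();
        } finally {
            semaphore.release();   // always return the permit
        }
    }
}
```

Because each channel owns its permits, exhausting one channel's limit leaves every other channel's capacity untouched.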

Why this is enough

Most provider outages are short. The combination of retries (which absorb a flap of a few seconds) and the circuit breaker (which handles outages lasting minutes) covers >99% of real-world events. The bulkhead ensures one bad provider can't bring down the whole gateway.

© 2026 Orux AI