Fallback Routing for GPS Dropouts

Real-time geofencing pipelines in mobility and logistics operate under uncompromising p99 latency budgets, typically sub-50ms for trigger evaluation. When coordinate streams fracture due to urban-canyon multipath, tunnel ingress, or cellular handovers, the continuous spatial evaluation model collapses: time-windowed ENTER/EXIT/DWELL triggers degrade into false negatives or stale state propagation, and the gap is invisible to downstream consumers until tail latency spikes or a billing event misfires. This page expands the degradation and async-routing model introduced in Core Architecture & Latency Constraints, focusing on the exact failure mode where the physical world interrupts the digital stream. Maintaining SLA compliance during signal loss requires a tightly coupled fallback architecture that detects degradation, isolates processing, reconstructs plausible trajectories, and reconciles them on recovery — all without blocking the primary latency budget allocation that healthy events depend on.

Fallback State Machine

The router is governed by an explicit state machine. Each transition is keyed off a measurable signal-quality metric, not a wall-clock heuristic, so behavior is reproducible under replay and auditable after the fact.

The invariant the machine protects is simple: a session is always in exactly one state, transitions are monotonic within a dropout episode, and every synthetic event emitted while outside Active carries provenance metadata so it can be retracted or compensated during Reconciling. The remaining sections quantify each transition’s cost.
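The invariant can be sketched as a small transition table. The state names Active, Degraded, Reconciling, and CircuitBroken come from this page; the exact edge set, the `Session` class, and the `provenance()` helper are illustrative assumptions, not the router's actual implementation.

```python
from __future__ import annotations

from enum import Enum


class SessionState(Enum):
    ACTIVE = "Active"
    DEGRADED = "Degraded"
    RECONCILING = "Reconciling"
    CIRCUIT_BROKEN = "CircuitBroken"


# Assumed edge set: a dropout episode only moves forward and returns to
# Active solely through Reconciling.
ALLOWED: dict[SessionState, frozenset[SessionState]] = {
    SessionState.ACTIVE: frozenset({SessionState.DEGRADED}),
    SessionState.DEGRADED: frozenset(
        {SessionState.RECONCILING, SessionState.CIRCUIT_BROKEN}
    ),
    SessionState.CIRCUIT_BROKEN: frozenset({SessionState.RECONCILING}),
    SessionState.RECONCILING: frozenset({SessionState.ACTIVE}),
}


class Session:
    """Holds exactly one state and rejects transitions outside ALLOWED."""

    def __init__(self) -> None:
        self.state = SessionState.ACTIVE

    def transition(self, target: SessionState) -> None:
        if target not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {target.value}"
            )
        self.state = target

    def provenance(self) -> dict[str, object]:
        """Metadata stamped on any event emitted in the current state."""
        return {
            "is_synthetic": self.state is not SessionState.ACTIVE,
            "state": self.state.value,
        }
```

Keeping the edges in a lookup table rather than scattered conditionals makes the monotonicity rule easy to audit: a Degraded session cannot jump straight back to Active without passing through Reconciling.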

Algorithmic Divergence & Latency Profiles

Not all fallback strategies are equal. The four production candidates trade reconstruction accuracy against per-point CPU cost, and that trade-off determines whether the fallback path fits inside the same budget as the live path. The figures below are measured on a single worker (Python 3.11, NumPy 1.26 + OpenBLAS, 1-vCPU pin) reconstructing a 30-second dropout at 1 Hz, against ground-truth telemetry replayed from urban-canyon drives.

Strategy	Per-point P50	Per-point P95	Mean drift @ 30 s	Throughput (single worker)	When it wins
Last-known-position hold	0.02 ms	0.05 ms	180 m	~400k pts/sec	Stationary assets, dwell zones
Linear interpolation (gap-fill)	0.04 ms	0.10 ms	95 m	~250k pts/sec	Short gaps (<5 s) between two valid fixes
Constant-velocity dead-reckoning	0.11 ms	0.30 ms	42 m	~90k pts/sec	Mid gaps, valid last velocity/heading
Constrained Kalman filter	0.85 ms	2.0 ms	18 m	~12k pts/sec	Long gaps, map-matched road topology

The cliff is clear: constant-velocity dead reckoning is roughly 8x cheaper per point than the Kalman path (0.11 ms vs 0.85 ms at P50) but accumulates more than double the drift over a 30-second window (42 m vs 18 m). Production routers therefore tier the strategy by dropout duration: linear gap-fill for sub-5-second outages, constant-velocity dead reckoning for 5–20 seconds, and a constrained Kalman filter only when the gap is long enough that drift would otherwise breach the 25 m reconciliation threshold. Tiering keeps the common case (short, frequent micro-dropouts at handover boundaries) on the cheap path and reserves the expensive filter for the rare long outage, so the fallback subsystem holds per-point P95 at or below 2 ms even at 50k concurrent sessions.
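The duration tiering described above reduces to a threshold lookup. This is a minimal sketch: the `Tier` enum and `select_tier` function are hypothetical names, and the 5 s and 20 s defaults are the boundaries quoted in the text.

```python
from __future__ import annotations

from enum import Enum


class Tier(Enum):
    LINEAR = "linear_interpolation"
    DEAD_RECKONING = "constant_velocity_dead_reckoning"
    KALMAN = "constrained_kalman"


def select_tier(gap_s: float, short_s: float = 5.0, mid_s: float = 20.0) -> Tier:
    """Pick the reconstruction tier from the dropout duration in seconds."""
    if gap_s < 0:
        raise ValueError("gap_s must be non-negative")
    if gap_s < short_s:
        return Tier.LINEAR          # sub-5-second outage
    if gap_s <= mid_s:
        return Tier.DEAD_RECKONING  # 5-20 second outage
    return Tier.KALMAN              # long outage
```

The thresholds are parameters so they can be recalibrated per workload without touching the selection logic.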

Critically, the synthetic point still has to be evaluated against geofence polygons, and that evaluation runs on the same hot path as live traffic. Reconstructed waypoints feed a streaming evaluator that maintains incremental polygon state rather than recomputing per event; this is where the point-in-polygon evaluation benchmarks govern the marginal cost. With precomputed bounding boxes and a warm spatial index, per-point containment stays under 2 ms, so a Kalman-reconstructed point costs roughly 2.0 ms (filter) + 1.8 ms (PIP) ≈ 3.8 ms end to end — comfortably inside the evaluation slice of the budget.
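The bounding-box pre-check mentioned above can be illustrated with a plain ray-casting test. This sketch shows only the pre-filter plus containment step; it does not model the incremental polygon state or the spatial index, and the function names are assumptions made for illustration.

```python
from __future__ import annotations

Point = tuple[float, float]  # (lat, lon)
Box = tuple[float, float, float, float]  # (min_lat, min_lon, max_lat, max_lon)


def bbox(polygon: list[Point]) -> Box:
    """Precompute once per geofence, not once per evaluated point."""
    lats = [p[0] for p in polygon]
    lons = [p[1] for p in polygon]
    return min(lats), min(lons), max(lats), max(lons)


def contains(polygon: list[Point], box: Box, pt: Point) -> bool:
    """Bounding-box rejection followed by even-odd ray casting."""
    lat, lon = pt
    min_lat, min_lon, max_lat, max_lon = box
    if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
        return False  # cheap reject: most points never reach the edge loop
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        lat_i, lon_i = polygon[i]
        lat_j, lon_j = polygon[j]
        # Does edge (j -> i) straddle the point's latitude?
        if (lat_i > lat) != (lat_j > lat):
            lon_cross = lon_i + (lat - lat_i) * (lon_j - lon_i) / (lat_j - lat_i)
            if lon < lon_cross:
                inside = not inside
        j = i
    return inside
```

Because the box is computed once per fence, a synthetic point that falls outside it pays only four comparisons before being rejected.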

Signal Degradation Detection & Bounded Buffering

Heartbeat timeouts alone are insufficient for dropout detection; they conflate network latency with actual signal loss. Production systems deploy a sliding-window variance monitor over the incoming coordinate stream that tracks positional jitter (estimated from the spread of heading and speed deltas across the window) and inter-arrival time. When estimated jitter exceeds a calibrated threshold (>15 m RMS over three samples) or the inter-point delta surpasses 2.5 seconds, the router flags the session Degraded and begins shadowing its trajectory.

Crucially, raw telemetry is never discarded. Events are serialized into a fixed-capacity ring buffer that acts as a temporal bridge. This design follows the same discipline as memory-constrained spatial processing: per-session overhead is capped at 64 KB, which preserves enough historical trajectory for interpolation while preventing unbounded heap growth. Eviction follows a strict ring-overwrite policy synchronized to session TTL, so the memory footprint scales linearly with active device count rather than historical event volume. In fleet-scale deployments (100k+ concurrent sessions), this constraint is what prevents OOM cascades during a regional cellular outage, when a large fraction of sessions enter fallback simultaneously.
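A fixed-capacity, overwrite-in-place buffer of this kind might look like the sketch below. Only the 64 KB cap and the overwrite policy come from the text; the 48-byte packed record layout and the `FixRing` class are assumptions chosen for illustration.

```python
from __future__ import annotations

import struct
from collections.abc import Iterator

# Assumed record layout: lat, lon, speed, heading, event_ts as float64,
# ingest_ns as int64 -> 48 bytes, little-endian, no padding.
_RECORD = struct.Struct("<5dq")


class FixRing:
    """Pre-allocated ring of packed fixes; the oldest entry is overwritten."""

    def __init__(self, capacity_bytes: int = 64 * 1024) -> None:
        self._slots = capacity_bytes // _RECORD.size
        if self._slots < 1:
            raise ValueError("capacity too small for a single record")
        self._buf = bytearray(self._slots * _RECORD.size)  # allocated once
        self._head = 0   # next slot to write
        self._count = 0  # number of valid records

    def push(self, lat: float, lon: float, speed_mps: float,
             heading_deg: float, event_ts: float, ingest_ns: int) -> None:
        _RECORD.pack_into(self._buf, self._head * _RECORD.size,
                          lat, lon, speed_mps, heading_deg, event_ts, ingest_ns)
        self._head = (self._head + 1) % self._slots
        self._count = min(self._count + 1, self._slots)

    def __len__(self) -> int:
        return self._count

    def oldest_first(self) -> Iterator[tuple]:
        """Yield retained records in arrival order."""
        start = (self._head - self._count) % self._slots
        for k in range(self._count):
            offset = ((start + k) % self._slots) * _RECORD.size
            yield _RECORD.unpack_from(self._buf, offset)
```

With this layout a 64 KB buffer holds 1,365 fixes, and `push` never allocates: the footprint is set at construction and stays flat for the life of the session.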

python

from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass(slots=True, frozen=True)
class Fix:
    """A single positional sample. `slots=True` keeps per-fix overhead small."""
    lat: float
    lon: float
    speed_mps: float
    heading_deg: float
    event_ts: float       # device-reported event time
    ingest_ns: int        # monotonic ingest clock for SLA measurement


class SignalMonitor:
    """Sliding-window degradation detector. ~96 bytes/sample, O(1) per update."""

    def __init__(self, jitter_m: float = 15.0, gap_s: float = 2.5, window: int = 3) -> None:
        if window < 2:
            raise ValueError("window must hold at least two fixes")
        self._jitter_m = jitter_m
        self._gap_s = gap_s
        self._window = window
        # deque(maxlen=...) evicts the oldest fix in O(1); list.pop(0) is O(n)
        self._recent: deque[Fix] = deque(maxlen=window)

    def observe(self, fix: Fix) -> bool:
        """Return True if the session should transition to Degraded."""
        self._recent.append(fix)
        if len(self._recent) < self._window:
            return False
        gap = self._recent[-1].event_ts - self._recent[-2].event_ts
        speeds = [f.speed_mps for f in self._recent]
        spread = max(speeds) - min(speeds)
        # speed spread (m/s) x gap (s) is a rough positional-jitter proxy in
        # metres; heading deltas are not used in this simplified check
        return gap > self._gap_s or (spread * gap) > self._jitter_m

Implementation Trade-offs: Async Isolation & Queue Semantics

Python’s asyncio ecosystem excels at I/O multiplexing but is vulnerable to event-loop starvation when CPU-bound spatial math executes inline. The fallback router therefore separates the routing decision (I/O-bound, on the loop) from trajectory reconstruction (CPU-bound, off the loop) — the same boundary established in async Python execution patterns for spatial math. Telemetry flagged for dropout is enqueued into a bounded asyncio.Queue with a hard capacity (typically 10,000 items per worker); the bound is the backpressure signal, not an afterthought. When a dropout event is dequeued, matrix operations for reconstruction are offloaded via loop.run_in_executor() onto a pre-sized ThreadPoolExecutor, releasing the loop while NumPy/OpenBLAS does the work outside the GIL.

python

import asyncio
from concurrent.futures import ThreadPoolExecutor


class FallbackRouter:
    """Routes dropout events to reconstruction without blocking the I/O loop."""

    def __init__(self, max_workers: int, queue_cap: int = 10_000) -> None:
        self._queue: asyncio.Queue[Fix] = asyncio.Queue(maxsize=queue_cap)
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._step_s: float = 0.5  # interpolation resolution, widened under load

    async def submit(self, fix: Fix) -> bool:
        """Non-blocking enqueue. Returns False if the queue is saturated."""
        try:
            self._queue.put_nowait(fix)
            return True
        except asyncio.QueueFull:
            return False  # caller routes to dead-letter; never blocks ingest

    async def run(self) -> None:
        # One consumer awaits each reconstruction in turn; start several
        # run() tasks to keep more than one pool thread busy.
        loop = asyncio.get_running_loop()
        while True:
            fix = await self._queue.get()
            # Adaptive backpressure: trade spatial resolution for throughput.
            if self._queue.qsize() > 500:
                self._step_s = min(2.0, self._step_s * 2)
            else:
                self._step_s = max(0.5, self._step_s / 2)
            try:
                # CPU-bound reconstruction runs off the event loop.
                await loop.run_in_executor(self._pool, self._reconstruct, fix, self._step_s)
            finally:
                # Always balance get() so queue.join() cannot hang on an error.
                self._queue.task_done()

    def _reconstruct(self, fix: Fix, step_s: float) -> None:
        ...  # dead-reckoning or Kalman step; emits synthetic, is_synthetic=True

The adaptive step size is the load valve: while queue depth exceeds 500 events per worker, the interpolation interval doubles on each dequeue from 0.5 s toward a 2.0 s ceiling, cutting per-event work at the cost of spatial resolution, and it halves back toward 0.5 s once depth falls below the mark. This explicit trade keeps the ingestion pipeline from stalling during a burst: the system degrades resolution gracefully rather than degrading availability catastrophically. Sustained throughput on this path holds at roughly 30k synthetic events/sec per worker for the dead-reckoning tier before backpressure engages.
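To make the step-size trade concrete, here is a minimal sketch of what a constant-velocity dead-reckoning step could compute for a given `step_s`. The `dead_reckon` function, its flat-earth approximation, and the mean Earth radius constant are assumptions for illustration; the body of `_reconstruct` is not shown on this page.

```python
from __future__ import annotations

import math

EARTH_RADIUS_M = 6_371_000.0  # mean radius; adequate for gaps of tens of seconds


def dead_reckon(
    lat: float,
    lon: float,
    speed_mps: float,
    heading_deg: float,
    gap_s: float,
    step_s: float,
) -> list[tuple[float, float]]:
    """Project the last valid fix forward at constant speed and heading.

    Heading is degrees clockwise from true north. Emits one synthetic
    (lat, lon) per step_s, so widening step_s reduces the point count.
    """
    if step_s <= 0:
        raise ValueError("step_s must be positive")
    heading = math.radians(heading_deg)
    north_mps = speed_mps * math.cos(heading)
    east_mps = speed_mps * math.sin(heading)
    cos_lat = math.cos(math.radians(lat))
    points: list[tuple[float, float]] = []
    for k in range(1, int(gap_s // step_s) + 1):
        t = k * step_s
        d_lat = math.degrees(north_mps * t / EARTH_RADIUS_M)
        d_lon = math.degrees(east_mps * t / (EARTH_RADIUS_M * cos_lat))
        points.append((lat + d_lat, lon + d_lon))
    return points
```

For a 2-second gap this yields four points at `step_s=0.5` and one point at `step_s=2.0`, which is the per-event work reduction the backpressure valve relies on.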

Memory Footprint & Streaming Churn

Under a regional outage, the fallback subsystem’s heap behavior — not its CPU cost — is the first thing that breaks. Three allocation sources dominate: the per-session ring buffers, the transient NumPy state vectors per Kalman step, and the synthetic Fix objects awaiting emission. Each is bounded deliberately.

Ring buffers are pre-allocated at session start and overwritten in place, so they generate zero steady-state allocation churn; a 100k-session fleet holds a flat ~6.4 GB of buffer regardless of dropout rate. Kalman state vectors are the churn hotspot: a naive implementation allocates fresh (4,) and (4,4) arrays per step, producing millions of short-lived objects that pressure generation-0 GC. The fix is to pre-allocate the state and covariance matrices per worker and operate in place with out= parameters, which holds gen-0 collections flat under sustained load and keeps GC pauses under 2 ms. Synthetic Fix objects use slots=True (96 bytes vs ~330 bytes for a __dict__-backed instance) and are emitted immediately rather than batched, so they never accumulate.
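The in-place discipline for the state and covariance arrays can be sketched as follows. This shows only an unconstrained constant-velocity predict step with pre-allocated scratch buffers; the `CVPredictor` class, the state layout, and the process-noise value are assumptions, and the constrained (map-matched) update is omitted.

```python
import numpy as np


class CVPredictor:
    """Constant-velocity predict step that reuses pre-allocated arrays."""

    def __init__(self, q: float = 0.5) -> None:
        self.x = np.zeros(4)        # state: [east, north, v_east, v_north]
        self.P = np.eye(4)          # covariance
        self.F = np.eye(4)          # transition matrix; dt is written in place
        self.Q = np.eye(4) * q      # assumed diagonal process noise
        self._x_tmp = np.empty(4)          # scratch for F @ x
        self._P_tmp = np.empty((4, 4))     # scratch for F @ P

    def predict(self, dt: float) -> None:
        self.F[0, 2] = dt
        self.F[1, 3] = dt
        # x <- F x, via scratch because out= must not alias an input
        np.matmul(self.F, self.x, out=self._x_tmp)
        self.x[:] = self._x_tmp
        # P <- F P F^T + Q, written back into the existing covariance array
        np.matmul(self.F, self.P, out=self._P_tmp)
        np.matmul(self._P_tmp, self.F.T, out=self.P)
        np.add(self.P, self.Q, out=self.P)
```

Each call writes results into arrays created once in `__init__`, so no new result arrays are created at the Python level per step. One predictor per worker thread avoids sharing the scratch buffers across threads.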

tracemalloc snapshots diffed every 10k reconstructions confirm the ring buffer’s 64 KB-per-session cap holds at the 99.9th percentile, and RSS growth stays below 50 MB across a sustained 10-minute outage simulation. The eviction discipline — overwrite, never grow — is what makes the footprint a function of fleet size rather than outage duration.
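A snapshot diff of the kind described here can be taken with the standard-library `tracemalloc` module. The helper names below are hypothetical. Note that `tracemalloc` reports memory allocated through Python's allocators, not process RSS, so the two figures are best read side by side.

```python
from __future__ import annotations

import tracemalloc


def net_growth_bytes(
    before: tracemalloc.Snapshot, after: tracemalloc.Snapshot
) -> int:
    """Total change in traced allocation size between two snapshots."""
    return sum(stat.size_diff for stat in after.compare_to(before, "lineno"))


def top_growth(
    before: tracemalloc.Snapshot, after: tracemalloc.Snapshot, limit: int = 5
) -> list[tuple[str, int]]:
    """The source lines with the largest change, as (location, bytes)."""
    stats = after.compare_to(before, "lineno")
    return [(str(stat.traceback), stat.size_diff) for stat in stats[:limit]]
```

In a worker, one would call `tracemalloc.start()` at boot, keep the previous snapshot, and compare against a fresh `tracemalloc.take_snapshot()` every 10k reconstructions; a location that keeps appearing at the top of `top_growth` is the likely leak site.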

Async Mutation Boundaries & Reconciliation

GPS recovery introduces the most error-prone phase: merging the synthetic trajectory with resumed ground truth without discontinuities or duplicate triggers. The synthetic trajectory is an immutable snapshot once emitted; reconciliation never mutates past events, it emits compensating ones. A drift-correction module computes the delta between the last filter-estimated state and the first valid fix. If positional error exceeds 25 m, the system applies a linear ramp correction over the next 5–10 points rather than snapping abruptly, which prevents the phantom geofence crossing a hard snap would create.

python

from __future__ import annotations

import math


def ramp_correction(
    estimated: tuple[float, float],
    truth: tuple[float, float],
    n_points: int,
    threshold_m: float = 25.0,
) -> list[tuple[float, float]]:
    """Distribute reconciliation drift across n_points to avoid phantom crossings."""
    if n_points < 1:
        raise ValueError("n_points must be >= 1")
    err_lat = truth[0] - estimated[0]
    err_lon = truth[1] - estimated[1]
    # cheap equirectangular metric: scale longitude by cos(latitude) so
    # east-west error is not overstated away from the equator
    m_per_deg = 111_320
    north_m = err_lat * m_per_deg
    east_m = err_lon * m_per_deg * math.cos(math.radians(truth[0]))
    drift_m = math.hypot(north_m, east_m)
    if drift_m <= threshold_m:
        return [truth]  # within tolerance: accept the fix directly
    # linear ramp: the final point lands exactly on the verified fix
    return [
        (estimated[0] + err_lat * (i / n_points),
         estimated[1] + err_lon * (i / n_points))
        for i in range(1, n_points + 1)
    ]

Reconciliation also resolves pending triggers. Events generated during the dropout window are replayed against the corrected path; if a synthetic crossing conflicts with verified GPS data, the system emits a compensating event (e.g. geofence_exit_corrected) and logs the discrepancy for offline audit. This deterministic merge preserves the strict ordering guarantees that downstream billing and dispatch require — a synthetic ENTER that turns out to be spurious is never silently dropped, it is explicitly retracted, so the audit log reflects what the system believed and when. The replay drains the ring buffer; only once it is empty does the session return to Active.
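The retract-rather-than-drop rule can be sketched as a comparison between the triggers emitted during the dropout and the triggers produced by replaying against the corrected path. The `Trigger` dataclass, the `reconcile` function, and matching on `(fence_id, kind)` are simplifying assumptions; only the `geofence_exit_corrected` event name comes from the text.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    fence_id: str
    kind: str            # "ENTER" or "EXIT"
    event_ts: float
    is_synthetic: bool = True


def reconcile(synthetic: list[Trigger], verified: list[Trigger]) -> list[dict]:
    """Return compensating events for synthetic triggers the replay did not confirm.

    Past events are never mutated or deleted; each conflict produces a new,
    explicit retraction so the audit log keeps both records.
    """
    confirmed = {(t.fence_id, t.kind) for t in verified}
    compensations: list[dict] = []
    for trig in sorted(synthetic, key=lambda t: t.event_ts):
        if (trig.fence_id, trig.kind) in confirmed:
            continue  # the corrected path agrees; nothing to retract
        compensations.append(
            {
                "type": f"geofence_{trig.kind.lower()}_corrected",
                "fence_id": trig.fence_id,
                "retracts_ts": trig.event_ts,
            }
        )
    return compensations
```

Sorting by `event_ts` before emitting keeps the compensating events in the same order as the triggers they retract, which is what downstream ordering guarantees depend on.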

Operational Runbook & Failure Mitigation

Production deployment requires explicit failure modes and measurable observability. Track at minimum: gps_dropout_rate (percentage of sessions in fallback per minute), queue_depth_p99 (bounded-queue occupancy under load), interpolation_latency_ms (dequeue to synthetic emission), and reconciliation_drift_m (mean positional error at recovery).

When a node breaches its budget during a dropout storm, work the checklist in order:

1. Confirm the saturation source. Poll asyncio.Queue.qsize() each second. If depth exceeds 80% of maxsize for >60 s, the reconstruction stage is the bottleneck, not ingest.
2. Profile the hot path. Attach py-spy dump --pid <worker> for a stack sample, then py-spy record for a flame graph. The dominant cost is usually not the Kalman math but GIL contention as multiple spatial coroutines compete for executor threads; look for time parked in thread acquisition rather than in NumPy.
3. Quantify GC pressure. Read gc.get_stats(); if gen-2 collections correlate with interpolation_latency_ms spikes, the in-place state-vector discipline has regressed. Call gc.freeze() before an anticipated burst and confirm gen-0 counts flatten.
4. Isolate memory growth. Diff tracemalloc.take_snapshot() every 10k reconstructions; if RSS grows by more than 50 MB without matching event volume, the snapshot diff names the leak site (typically un-pooled Kalman arrays).
5. Apply the corrective lever. If queue_depth_p99 > 80%, scale fallback workers horizontally or raise ThreadPoolExecutor max_workers by 20%, but only while cores stay under 85% utilization; past that, reduce interpolation frequency instead. Cap max_workers at cpu_count * 1.5 to prevent context-switch thrashing.
6. Trip the breaker on regional failure. When gps_dropout_rate exceeds 30%, activate the circuit breaker that routes synthetic events to a dead-letter topic and notify downstream consumers via webhook to relax strict SLA enforcement for the duration.
7. Tune drift, not throughput, for accuracy faults. If reconciliation_drift_m > 50 m consistently, raise the Kalman process-noise bound or enable the map-matching fallback; persistent high drift signals poor initial heading estimates or severe multipath, not a queueing problem.
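The thresholds in the checklist can be encoded so that alerting and automation agree on them. This is a sketch under stated assumptions: the function names and return labels are hypothetical, and giving the regional breaker precedence over the scaling lever is an assumed ordering, not something the checklist specifies.

```python
from __future__ import annotations


def corrective_lever(queue_fill: float, cpu_util: float, dropout_rate: float) -> str:
    """Map runbook thresholds to an action label. Inputs are fractions in 0..1."""
    if dropout_rate > 0.30:
        return "trip_circuit_breaker"            # regional failure
    if queue_fill > 0.80:
        if cpu_util < 0.85:
            return "scale_workers"               # headroom left on the cores
        return "reduce_interpolation_frequency"  # cores saturated
    return "no_action"


def max_executor_workers(cpu_count: int) -> int:
    """Upper bound on executor threads: 1.5 x cores, at least one."""
    return max(1, int(cpu_count * 1.5))
```

Keeping the numbers in one function means a threshold change is a one-line diff rather than an edit across dashboards and scripts.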

Pin executor threads to NUMA-local cores, use NumPy with OpenBLAS for vectorized state updates, and keep the ring buffer pre-allocated. These three together are what hold the fallback path inside the live budget during the worst case.

Architectural Guidance: Choosing a Fallback Tier

The decision is not “which single strategy” but “which tier boundary.” Use this matrix to set the thresholds for your workload:

Condition	Strategy	Rationale
Gap < 5 s, two valid fixes bracket it	Linear interpolation	Cheapest path with bounded error; no state needed
Gap 5–20 s, valid last velocity	Constant-velocity dead reckoning	Holds drift under ~30 m without per-step matrix cost
Gap > 20 s, road topology available	Constrained Kalman + map matching	Only path that keeps long-gap drift under the 25 m reconcile threshold
Stationary asset / dwell zone	Last-known-position hold	Movement model adds error, not accuracy
Dropout rate > 30% (regional outage)	Dead-letter + relaxed SLA	Protect availability; reconstruction would amplify the storm

In production the tiers are layered, not exclusive: short micro-dropouts at every cellular handover ride the cheap interpolation path, the dead-reckoning tier covers tunnel transits, and the Kalman tier is reserved for sustained urban-canyon loss where it earns its 8x cost in halved drift. The same evaluator and reconciliation logic sit beneath all four tiers, so switching tiers changes only the reconstruction coroutine, never the trigger semantics. For systems that haven’t yet established their evaluation topology, settle the streaming vs batch geofence evaluation question first — fallback routing is a streaming concern and assumes incremental, snapshot-based index access rather than batch recomputation.

Fallback routing is not a replacement for high-precision GNSS; it is a deterministic bridge that preserves system integrity when satellites disappear from view. By enforcing bounded queues, isolated execution, and explicit reconciliation semantics, platforms maintain sub-50ms trigger evaluation through signal loss instead of failing open.

Operator FAQ

How long a dropout should we reconstruct before giving up?

Past roughly 60 seconds, dead-reckoning drift exceeds what any reconciliation ramp can absorb, and synthetic crossings become liabilities. Transition to CircuitBroken, stop emitting synthetic triggers, and force a full resync on recovery rather than trusting the reconstructed path.

Should synthetic events feed billing and dispatch directly?

No. Tag every synthetic point with is_synthetic=True and a confidence_score, and route them to lower-priority queues or relaxed trigger thresholds. Let reconciliation promote or compensate them once ground truth resumes — billing should act on confirmed crossings, not predicted ones.

Why is our P99 spiking even though the Kalman math is fast?

Almost always GIL contention in the executor, not the filter. Confirm with a py-spy flame graph: if workers park in thread acquisition, you have over-subscribed the pool. Cap max_workers at cpu_count * 1.5 and pin threads to NUMA-local cores.

How do we keep reconciliation from creating duplicate triggers?

Replay buffered dropout-window events against the corrected path and emit explicit compensating events for conflicts; never silently drop a synthetic trigger. The audit log must show what the system believed and when, which is also what keeps ordering guarantees intact for downstream consumers.
