Memory-Constrained Spatial Processing for Real-Time Mobility Telemetry

Real-time geofencing fails in one of two ways under load: it gets slow, or it runs out of memory and gets killed. The second failure is the one that wakes people up, because an OOM kill drops every in-flight evaluation at once and replays a cold cache into an already-saturated cluster. When a mobility platform ingests millions of concurrent GPS pings, the evaluation pipeline cannot afford to materialize full geometry graphs or grow an unbounded object graph on the heap — memory has to be treated as a hard budget that dictates algorithmic selection, queue depth, and inter-process boundaries. This page expands the deterministic-memory invariants introduced in Core Architecture & Latency Constraints: the goal is not just correct containment, but a flat resident-set size (RSS) and a predictable P99 while telemetry is bursty and the network is partitioning. The numbers throughout assume a single evaluation node holding a 1.5 GB RSS ceiling against a roughly 50 ms P95 budget at 40k–60k events/sec.

Algorithmic Divergence & Latency Profiles Under a Memory Ceiling

The data structure backing containment is the single largest lever over both memory and tail latency, and the two are coupled: a layout that chases pointers also fragments the heap. Traditional in-memory spatial indexes — pointer-linked R-trees or KD-trees — degrade sharply once the active geofence catalog passes roughly 10⁶ vertices. Recursive node traversal induces L3 cache thrashing, allocator fragmentation, and gen-2 garbage-collection pauses that punch straight through a real-time SLA. The production answer is a flat, contiguous index: hierarchical grid partitioning plus a space-filling curve (Hilbert or Z-order) mapped onto one mmap-backed memory block.

The table below is a head-to-head profile measured on a single node holding ~250k geofences (~3.1M total vertices), driven at three concurrency levels by a synthetic 1–10 Hz telemetry generator. “Resident” is steady-state RSS after a 10-minute soak; latencies are end-to-end spatial evaluation (grid resolve + candidate scan + exact point-in-polygon), excluding ingress.

Index layout	Resident RSS	P50	P95	P99	Throughput @ 32 workers	Notes
Pointer R-tree (`Rtree`/libspatialindex)	2.4 GB	9 ms	41 ms	138 ms	21k eval/s	Heap fragments; gen-2 GC pauses surface at P99
Shapely STRtree (rebuilt per push)	1.9 GB	7 ms	33 ms	96 ms	27k eval/s	Rebuild stalls query path
Flat grid + Z-order, `mmap` AABBs	1.3 GB	3 ms	11 ms	24 ms	58k eval/s	Cache-local; SIMD-friendly scan
Flat grid + Z-order + SoA quantized	1.1 GB	2 ms	8 ms	19 ms	71k eval/s	Best tail; needs fixed-point coords

The optimization starts with coordinate quantization. Converting floating-point lat/lon pairs to fixed-point 32-bit integers and packing axis-aligned bounding boxes (AABBs) into 64-byte cache-line-aligned blocks drops per-geometry overhead from roughly 128 bytes to about 32 bytes — a 4× reduction that is what lets the whole active catalog sit in one mapped region instead of spilling into swap. When an event arrives, the pipeline resolves its grid cell with bitwise masking, then iterates only the candidate polygons in that cell. This two-phase filter eliminates 85–95% of vertex evaluations before any exact test runs. The exact step itself is a separate trade-off: point-in-polygon algorithm benchmarks give the empirical baselines for ray-casting versus winding-number under exactly this kind of flat layout, and the structure-of-arrays (SoA) variant in the last row is what makes those inner loops vectorizable.
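The quantization and cell-resolution steps above can be sketched as follows. This is a minimal illustration, not the production layout: `CELL_BITS`, `cell_of`, `z_order`, and `cell_key` are hypothetical names, and the 16-bit-per-axis grid is an assumed resolution chosen so the interleaved key fits in 32 bits.

```python
from __future__ import annotations

SCALE = 10_000_000   # 1e-7 degree per unit, so +/-180 degrees fits in a signed 32-bit int
CELL_BITS = 16       # assumption: 2**16 cells per axis


def quantize(deg: float) -> int:
    """Convert decimal degrees to a fixed-point integer."""
    return round(deg * SCALE)


def cell_of(lon_q: int, lat_q: int) -> tuple[int, int]:
    """Resolve the grid cell with an offset, a shift, and a mask (no division)."""
    mask = (1 << CELL_BITS) - 1
    # Shift the origin so both axes are non-negative and fit in 32 unsigned bits.
    x = ((lon_q + 180 * SCALE) >> (32 - CELL_BITS)) & mask
    y = ((lat_q + 90 * SCALE) >> (32 - CELL_BITS)) & mask
    return x, y


def _spread(v: int) -> int:
    """Insert a zero bit between each of the low 16 bits of v."""
    v &= 0xFFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v


def z_order(x: int, y: int) -> int:
    """Interleave x (even bits) and y (odd bits) into one 32-bit Morton key."""
    return _spread(x) | (_spread(y) << 1)


def cell_key(lon: float, lat: float) -> int:
    """Degrees in, Z-order cell key out."""
    return z_order(*cell_of(quantize(lon), quantize(lat)))
```

With these assumed parameters a cell spans 65,536 fixed-point units (about 0.0066 degrees), so nearby pings resolve to the same key and share one candidate list.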

Why layout dominates both memory and tail latency: the pointer R-tree scatters nodes across the heap so each query chases pointers into cold cache lines, while the flat grid packs every geometry into equal 64-byte AABB blocks in one contiguous mmap region that a Z-order key resolves directly — roughly 4× smaller per geometry and ~5.7× lower P99.
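The two-phase filter over a structure-of-arrays layout can be illustrated with a small stdlib-only sketch. The names `SoAIndex`, `point_in_ring`, and `evaluate` are hypothetical, the `array` module stands in for an mmap-backed block, and ray casting is used for the exact step purely as an example of one of the two algorithms mentioned above.

```python
from __future__ import annotations

from array import array

Ring = list[tuple[int, int]]


class SoAIndex:
    """AABBs stored as four parallel integer columns instead of per-geometry objects."""

    def __init__(self) -> None:
        # "i" is a C int (4 bytes on mainstream platforms), matching int32 fixed-point.
        self.min_x, self.min_y = array("i"), array("i")
        self.max_x, self.max_y = array("i"), array("i")
        self.rings: list[Ring] = []
        self.cells: dict[int, array] = {}   # cell key -> candidate geometry ids

    def add(self, ring: Ring, cell_keys: list[int]) -> int:
        gid = len(self.rings)
        xs = [x for x, _ in ring]
        ys = [y for _, y in ring]
        self.min_x.append(min(xs))
        self.max_x.append(max(xs))
        self.min_y.append(min(ys))
        self.max_y.append(max(ys))
        self.rings.append(ring)
        for key in cell_keys:
            self.cells.setdefault(key, array("i")).append(gid)
        return gid


def point_in_ring(x: int, y: int, ring: Ring) -> bool:
    """Exact test: ray casting along +x, toggling on each edge crossing."""
    inside = False
    for i, (x1, y1) in enumerate(ring):
        x2, y2 = ring[(i + 1) % len(ring)]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def evaluate(index: SoAIndex, cell_key: int, x: int, y: int) -> list[int]:
    """Phase 1: cheap AABB reject per candidate. Phase 2: exact test on survivors."""
    hits: list[int] = []
    for gid in index.cells.get(cell_key, ()):
        if not (index.min_x[gid] <= x <= index.max_x[gid]
                and index.min_y[gid] <= y <= index.max_y[gid]):
            continue
        if point_in_ring(x, y, index.rings[gid]):
            hits.append(gid)
    return hits
```

The exact test only runs for geometries whose bounding box contains the point, which is where the bulk of vertex evaluations is avoided.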

Implementation Trade-offs: GIL, asyncio, and Heap Allocation

Python’s asyncio is excellent at I/O multiplexing and terrible at hiding CPU-bound spatial math. Run the geometry on the event loop and the Global Interpreter Lock (GIL) serializes everything; the loop starves and every coroutine’s latency inflates together during the exact ingestion peak you built the system for. The architectural contract has to keep evaluation off the loop. The companion page on async Python execution patterns for spatial math covers the offload model in depth; the memory-relevant half of it is that the boundary you choose determines how many copies of each event exist at once.

Telemetry arrives on an async consumer (aiokafka, aiohttp), is packed into fixed-size binary records, and is handed to workers through shared memory rather than pickled across a multiprocessing.Queue. Pickling is the silent memory and latency tax here: it allocates, copies, and frees on both sides of every hand-off, and under burst it is what pushes the heap into gen-2. A pre-allocated ring buffer over multiprocessing.shared_memory lets workers read records in place, with no pickling and no per-event copy. The critical path looks like this:

python

from __future__ import annotations

import struct
from multiprocessing import shared_memory
from typing import Final

RECORD_FMT: Final[str] = "<qiiQ"           # device_id, lon_q, lat_q, epoch_ms (64-bit: epoch ms overflows 32 bits)
RECORD_SIZE: Final[int] = struct.calcsize(RECORD_FMT)
SCALE: Final[int] = 10_000_000             # fixed-point: 1e-7 deg ~= 1.1 cm


class TelemetryRing:
    """Single-producer / multi-consumer ring over shared memory.

    Records are fixed-size, so a slot index is just arithmetic — no
    per-event allocation, no pickling, and the worker reads in place.
    """

    def __init__(self, name: str, capacity: int) -> None:
        self._cap: Final[int] = capacity
        self._shm: Final[shared_memory.SharedMemory] = shared_memory.SharedMemory(
            name=name, create=False
        )
        self._view: Final[memoryview] = self._shm.buf  # zero-copy window

    def read(self, slot: int) -> tuple[int, float, float, int]:
        base: int = (slot % self._cap) * RECORD_SIZE
        dev, lon_q, lat_q, ts = struct.unpack_from(RECORD_FMT, self._view, base)
        return dev, lon_q / SCALE, lat_q / SCALE, ts

    def close(self) -> None:
        self._view.release()
        self._shm.close()

Two non-obvious decisions are load-bearing. First, reading through a memoryview over the buffer means read produces no intermediate bytes object; the only allocations are the unpacked values and the result tuple, which die in gen-0. Second, fixed-point integers (SCALE = 1e7, ~1.1 cm resolution at the equator) keep coordinates exact and comparable across the process boundary without floating-point drift, and they are what the quantized index expects; read converts back to degrees only for callers that need them. Worker pools should be bounded to os.cpu_count() - 2 (with a floor of one worker on small hosts), leaving headroom for the event loop and kernel networking interrupts. Oversubscribing the pool buys a little throughput at the cost of a lot of context-switch jitter, and IPC plus buffer synchronization already claims roughly 40% of a 50 ms budget before any geometry runs.
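For completeness, the producer side of the ring might look like the sketch below. `write_record` is a hypothetical helper, and the format string uses a 64-bit timestamp field as an assumption; whatever layout is chosen must be identical to the reader's RECORD_FMT. The function accepts any writable buffer, so a `bytearray` stands in here for `SharedMemory.buf`, which exposes the same memoryview interface.

```python
from __future__ import annotations

import struct

RECORD_FMT = "<qiiQ"    # assumption: device_id i64, lon_q i32, lat_q i32, epoch_ms u64
RECORD_SIZE = struct.calcsize(RECORD_FMT)
SCALE = 10_000_000      # fixed-point: 1e-7 degree per unit


def write_record(buf, capacity: int, slot: int, device_id: int,
                 lon: float, lat: float, epoch_ms: int) -> None:
    """Pack one record directly into its slot; no intermediate bytes object."""
    base = (slot % capacity) * RECORD_SIZE
    struct.pack_into(
        RECORD_FMT, buf, base,
        device_id, round(lon * SCALE), round(lat * SCALE), epoch_ms,
    )


if __name__ == "__main__":
    capacity = 4
    ring = bytearray(RECORD_SIZE * capacity)   # stand-in for SharedMemory(...).buf
    write_record(ring, capacity, 5, 42, -122.4194, 37.7749, 1_700_000_000_000)
    # Slot 5 wraps to index 1 in a 4-slot ring.
    print(struct.unpack_from(RECORD_FMT, ring, 1 * RECORD_SIZE))
```

Because the record is fixed-size and little-endian with no padding, the slot offset is pure arithmetic on both sides of the boundary.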

Memory Footprint & Streaming Churn

Steady-state RSS is the easy part; the failure mode is churn. Under sustained 1–10 Hz ingestion the allocator sees a constant stream of short-lived event objects, and if any of them are accidentally retained — a closure capturing a buffer, a per-device dict that never evicts — RSS creeps until the gen-2 collector runs, and that collection is the P99 spike. The discipline has three rules.

Reuse buffers instead of allocating per event. The ring above is one instance; the other is the per-evaluation scratch space (candidate-index arrays, result masks), which should be slab-allocated once per worker and overwritten, never recreated. With numpy, that means pre-sizing arrays to the worst-case candidate count and slicing, so the hot path makes zero array allocations.
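A minimal sketch of the slab idea, using the stdlib `array` module in place of numpy so it stays dependency-free. `ScratchSlab` and `fill` are hypothetical names, and raising on overflow is one possible policy rather than a prescribed one.

```python
from __future__ import annotations

from array import array
from collections.abc import Iterable


class ScratchSlab:
    """Per-worker candidate-id buffer: sized once, overwritten on every evaluation."""

    def __init__(self, max_candidates: int) -> None:
        self._ids = array("i", [0]) * max_candidates   # the only data allocation
        self._view = memoryview(self._ids)

    def fill(self, candidates: Iterable[int]) -> memoryview:
        """Overwrite the slab in place and return a window over the filled prefix."""
        n = 0
        for gid in candidates:
            if n >= len(self._ids):
                raise OverflowError("candidate set exceeds the pre-sized slab")
            self._ids[n] = gid
            n += 1
        # Slicing a memoryview creates a small view object but copies no element data.
        return self._view[:n]
```

The returned window is only valid until the next `fill`, which is the trade made for never reallocating the underlying storage.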

Bound and evict per-device state explicitly. Hysteresis, dwell timers, and last-known-position all want per-device entries, and that map is the most common slow leak. Cap it with an LRU keyed on device id and an idle TTL (e.g., evict after 90 s of silence); a device that goes dark must not hold heap forever. At 2M tracked devices, an unbounded dwell map is hundreds of MB that never comes back.
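One way to combine the size cap and the idle TTL is an `OrderedDict` kept in last-seen order, sketched below. `DeviceLRU` and its parameters are hypothetical; the injectable `clock` exists only so the behaviour can be exercised deterministically.

```python
from __future__ import annotations

import time
from collections import OrderedDict
from typing import Any, Callable


class DeviceLRU:
    """Per-device state bounded by entry count and by idle time."""

    def __init__(self, max_entries: int, idle_ttl_s: float = 90.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_entries
        self._ttl = idle_ttl_s
        self._clock = clock
        self._data: OrderedDict[int, tuple[float, Any]] = OrderedDict()

    def __len__(self) -> int:
        return len(self._data)          # export this as device_lru_size

    def put(self, device_id: int, state: Any) -> None:
        now = self._clock()
        self._data[device_id] = (now, state)
        self._data.move_to_end(device_id)
        self._evict(now)

    def get(self, device_id: int) -> Any | None:
        now = self._clock()
        self._evict(now)
        entry = self._data.get(device_id)
        if entry is None:
            return None
        self._data[device_id] = (now, entry[1])   # a read counts as activity
        self._data.move_to_end(device_id)
        return entry[1]

    def _evict(self, now: float) -> None:
        # Entries are ordered by last touch, so the front is always the stalest.
        while self._data:
            _, (seen, _) = next(iter(self._data.items()))
            if len(self._data) > self._max or now - seen > self._ttl:
                self._data.popitem(last=False)
            else:
                break
```

Eviction work is amortized into each access, so there is no separate sweeper task that could itself fall behind under load.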

Control the collector rather than fighting it. The hot path should produce almost only gen-0 garbage; long-lived structures (the index snapshot, the device LRU) should be created once and then made invisible to the collector with gc.freeze() so gen-2 sweeps stay cheap. The pattern:

python

import gc


def arm_evaluation_loop() -> None:
    """Move stable long-lived objects out of the GC's scan set.

    Called once after the index snapshot and device LRU are built, before
    the node starts accepting traffic, so steady-state collections only
    walk short-lived per-event objects.
    """
    gc.collect()        # promote everything currently live
    gc.freeze()         # exclude it from future gen-2 scans
    gc.set_threshold(50_000, 500, 1_000)  # back off automatic gen-2 sweeps

The practical effect, measured on the flat-grid node above, is gen-2 pause time dropping from ~14 ms (correlating 1:1 with P99 spikes) to under 2 ms, with RSS staying within ±40 MB of baseline over a 10-minute soak instead of climbing toward the ceiling. Fragmentation is the residual risk: even with flat allocation, the small-object arena can fragment under varying candidate-set sizes, so the worst-case scratch slab is sized once at startup rather than grown on demand.
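Gen-2 pause time can be observed from inside the process with `gc.callbacks`, which the interpreter invokes at the start and stop of every collection. The sketch below is one way to feed a pause metric; `Gen2PauseRecorder`, `install`, and `uninstall` are hypothetical names, and wiring the samples into a histogram is left out.

```python
from __future__ import annotations

import gc
import time


class Gen2PauseRecorder:
    """Collects wall-clock durations of generation-2 collections, in milliseconds."""

    def __init__(self) -> None:
        self.pauses_ms: list[float] = []
        self._started: float | None = None

    def __call__(self, phase: str, info: dict[str, int]) -> None:
        if info.get("generation") != 2:
            return                      # ignore young-generation collections
        if phase == "start":
            self._started = time.perf_counter()
        elif phase == "stop" and self._started is not None:
            self.pauses_ms.append((time.perf_counter() - self._started) * 1000.0)
            self._started = None


def install(recorder: Gen2PauseRecorder) -> None:
    gc.callbacks.append(recorder)


def uninstall(recorder: Gen2PauseRecorder) -> None:
    gc.callbacks.remove(recorder)
```

Installing the recorder before warm-up and comparing samples taken before and after the freeze gives a direct check that the long-lived structures left the scan set.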

Async Mutation Boundaries & Queue Semantics

Geofence definitions change at runtime (surge zones open, road closures appear, compliance perimeters shift), and those mutations must never block the query path or leave the index’s memory doubled. The pattern is copy-on-write snapshots behind an atomic pointer swap. A background task drains a bounded update queue, validates topology, builds a new immutable flat index, and swaps a reference; query coroutines dereference the current snapshot with no lock. The read path pays no locking cost, and the memory cost is bounded: two snapshots coexist only during the swap window, and the old one is reclaimed once its in-flight readers drain. This is the same snapshot-swap model the spatial indexing reference specifies for the index subsystem; here the constraint is that the transient 2× memory of the swap must still fit under the RSS ceiling, which is why the quantized layout’s smaller footprint matters twice.
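Within a single process, the swap can be as small as rebinding one attribute, as in the sketch below. `IndexSnapshot`, `build_next`, and `SnapshotHolder` are hypothetical names, the dict-of-tuples index is a stand-in for the flat layout, and the single-rebind guarantee relied on here is a CPython behaviour; sharing a snapshot across worker processes would need a different mechanism.

```python
from __future__ import annotations

from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class IndexSnapshot:
    """Immutable view of the index; readers never see it change."""
    version: int
    cells: Mapping[int, tuple[int, ...]]


def build_next(prev: IndexSnapshot,
               updates: dict[int, tuple[int, ...]]) -> IndexSnapshot:
    """Copy-on-write: build a new snapshot off to the side, leaving prev untouched."""
    merged = dict(prev.cells)
    merged.update(updates)
    return IndexSnapshot(prev.version + 1, MappingProxyType(merged))


class SnapshotHolder:
    def __init__(self, initial: IndexSnapshot) -> None:
        self._current = initial

    def current(self) -> IndexSnapshot:
        return self._current            # one attribute read, no lock

    def publish(self, snapshot: IndexSnapshot) -> None:
        if snapshot.version <= self._current.version:
            raise ValueError("stale snapshot")
        # One reference rebind: a reader holds either the old or the new
        # snapshot, never a mixture. The old one is freed when readers drop it.
        self._current = snapshot
```

A reader that grabbed the snapshot before `publish` keeps evaluating against the old version until it finishes, which is what bounds the two-snapshot window to the length of one in-flight evaluation.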

The ingress queue must be bounded and backpressure-aware. An unbounded asyncio.Queue hides downstream degradation until the heap is gone, converting a latency problem into an OOM kill. Size the queue to the shared-memory pool, watch its depth, and shed deterministically:

python

import asyncio
from typing import Final

MAX_DEPTH: Final[int] = 8_192          # aligned to the ring capacity
SHED_AT: Final[int] = int(MAX_DEPTH * 0.75)


async def admit(queue: asyncio.Queue[bytes], event: bytes, *, priority: bool) -> bool:
    """Token-style admission with drop-to-log above 75% depth.

    Returns False when the event was shed so the caller can increment a
    shed_count metric instead of silently growing the heap.
    """
    if queue.qsize() >= SHED_AT and not priority:
        return False                    # drop low-priority, keep the budget
    try:
        queue.put_nowait(event)
        return True
    except asyncio.QueueFull:
        return False

For the multi-worker hand-off, a single-producer/multi-consumer ring with explicit watermark tracking outperforms a lock-guarded queue under the GIL because consumers advance independent read cursors and never contend on a single lock. The watermark — the lag between the producer’s write cursor and the slowest consumer’s read cursor — is the backpressure signal: once it exceeds 75% of ring capacity, the admission gate above starts shedding low-priority telemetry rather than letting lag turn into unbounded queueing. The distinction between this streaming topology and a disk-backed batch queue, which tolerates much larger allocation spikes, is laid out in streaming vs batch geofence evaluation.
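The watermark calculation reduces to a few lines if the cursors are kept as monotonically increasing counters rather than wrapped slot indices. That representation, and the names `watermark_pct` and `should_shed`, are assumptions made for this sketch.

```python
from __future__ import annotations

from collections.abc import Sequence


def watermark_pct(write_cursor: int, read_cursors: Sequence[int],
                  capacity: int) -> float:
    """Lag of the slowest consumer behind the producer, as a percentage of capacity."""
    if not read_cursors:
        raise ValueError("at least one consumer cursor is required")
    lag = write_cursor - min(read_cursors)
    return 100.0 * lag / capacity


def should_shed(write_cursor: int, read_cursors: Sequence[int], capacity: int,
                *, priority: bool, threshold_pct: float = 75.0) -> bool:
    """Shed low-priority events once the watermark exceeds the threshold."""
    if priority:
        return False
    return watermark_pct(write_cursor, read_cursors, capacity) > threshold_pct
```

Using the slowest cursor rather than the average is deliberate: a single stalled worker is enough to let the producer overwrite unread slots.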

Operational Runbook & Failure Mitigation

When a memory-constrained node breaches its budget, the first move is never to add replicas — horizontal scaling masks a leak and multiplies its cost. Work the node first, in order:

Confirm the symptom. Read gc.get_stats() and the exported rss_bytes gauge. If RSS is climbing monotonically, it is a retention leak; if RSS is flat but P99 spikes, it is GC pause or cache fragmentation, not memory volume.
Profile the hot path. Run py-spy dump --pid <pid> for a stack snapshot, then py-spy record -o flame.svg --pid <pid> --duration 30 to capture 30 s under live load. A flame graph dominated by struct.unpack or array allocation means the scratch slab is being recreated; time spent in pickle means an offload boundary is still serializing.
Localize the leak. Take tracemalloc snapshots every 10k evaluations and diff with snapshot.compare_to(prev, "lineno"). Growth above ~50 MB between snapshots without matching event volume names the retaining line — most often the per-device LRU or a logging handler buffering records.
Quantify GC pauses. If gc.get_stats() shows gen-2 collections correlating with the P99 timeline, confirm gc.freeze() ran after warm-up and that gc.get_threshold() reflects the backed-off values. Target gen-2 pauses under 2 ms.
Check cache behaviour. Run perf stat -e LLC-loads,LLC-load-misses -p <pid> -- sleep 10 to sample for 10 s. A last-level-cache (L3) miss rate above 15% indicates index fragmentation or pointer chasing; verify the mmap index is NUMA-local (numactl --membind) and that the worker is pinned to the same node.
Verify graceful degradation. Inject 0.5–3.0 s GPS dropouts and a 3× burst. Confirm the node sheds low-priority events at 75% queue depth, routes velocity spikes (>200 km/h, roughly 56 m/s) to the dead-letter path with a spatial_uncertain flag, and keeps the hot path inside budget.
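The leak-localization step in the runbook above can be scripted with the stdlib `tracemalloc` API. The helper name `growth_report` and the five-line limit are illustrative choices; the caller is assumed to take a snapshot every N evaluations and compare each one with its predecessor.

```python
from __future__ import annotations

import tracemalloc


def growth_report(prev: tracemalloc.Snapshot, curr: tracemalloc.Snapshot,
                  limit: int = 5) -> tuple[int, list[tuple[str, int]]]:
    """Return net growth in bytes and the source lines that grew the most."""
    diffs = curr.compare_to(prev, "lineno")    # largest absolute change first
    total = sum(d.size_diff for d in diffs)
    top = [(str(d.traceback[0]), d.size_diff) for d in diffs if d.size_diff > 0]
    return total, top[:limit]


if __name__ == "__main__":
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    retained = [bytes(1024) for _ in range(2048)]   # simulated retention leak
    after = tracemalloc.take_snapshot()
    total, top = growth_report(before, after)
    tracemalloc.stop()
    print(f"net growth: {total} bytes")
    for where, size in top:
        print(f"  {where}: +{size} bytes")
```

Tracing adds overhead to every allocation, so it is usually enabled on one canary node for the duration of the investigation rather than fleet-wide.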

Circuit-breaker triggers are concrete, not advisory: shared-memory utilization above 85%, ring watermark above 75% for three consecutive scrape intervals, or evaluation P99 above 120 ms each trip the breaker that guards the next stage and switch the fallback path on. The fallback for a saturated index is a coarse bounding-box or cached centroid-distance check — approximate, but bounded in both time and memory — held until backpressure resolves. GPS-specific degradation (dropouts, coordinate jitter, stale timestamps) is owned by fallback routing for GPS dropouts; the memory rule here is that a device that loses lock must not accumulate retried events on the heap — it goes to the dead-letter queue and reconciles off the hot path. Polygon boundary conditions — events landing exactly on shared edges, or micro-polygons created by quantization rounding — are a separate failure surface covered in handling polygon edge cases in high-frequency telemetry, which specifies the snapping tolerances and winding rules that keep trigger emission idempotent under memory pressure.
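The three trigger conditions translate directly into a small state machine evaluated once per scrape interval. In this sketch `Sample` and `Breaker` are hypothetical names, and closing the breaker as soon as no trigger holds is an assumed reset policy standing in for "held until backpressure resolves".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One scrape interval's worth of metrics."""
    shm_utilization_pct: float
    ring_watermark_pct: float
    eval_p99_ms: float


class Breaker:
    SHM_LIMIT_PCT = 85.0
    WATERMARK_LIMIT_PCT = 75.0
    WATERMARK_INTERVALS = 3
    P99_LIMIT_MS = 120.0

    def __init__(self) -> None:
        self._streak = 0        # consecutive intervals above the watermark limit
        self.open = False

    def observe(self, s: Sample) -> bool:
        """Update state from one sample; True means route to the fallback path."""
        if s.ring_watermark_pct > self.WATERMARK_LIMIT_PCT:
            self._streak += 1
        else:
            self._streak = 0
        self.open = (
            s.shm_utilization_pct > self.SHM_LIMIT_PCT
            or self._streak >= self.WATERMARK_INTERVALS
            or s.eval_p99_ms > self.P99_LIMIT_MS
        )
        return self.open
```

Requiring three consecutive intervals for the watermark condition keeps a single burst from flapping the fallback path, while the shared-memory and P99 limits trip immediately.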

Expose these as first-class metrics: spatial_eval_latency_ms (histogram), rss_bytes, shared_memory_utilization_pct, ring_watermark_pct, device_lru_size, gc_gen2_pause_ms, and shed_count. Without device_lru_size and shed_count on a dashboard, a slow leak and a silent shed both look like “everything is fine” right up until the OOM.

Architectural Guidance: When to Choose This Approach

Memory-constrained flat processing is the right default for high-concurrency streaming nodes, but it is not free — fixed-point quantization, slab allocation, and CoW snapshots add real implementation cost. Use this matrix to decide.

Situation	Recommended approach	Why
>50k concurrent devices, hard RSS ceiling	Flat grid + quantized SoA index, shared-memory ring	Only layout that holds flat RSS and a sub-25 ms P99 at this scale
<5k devices, latency-relaxed, dev velocity matters	Shapely `STRtree`, rebuild on push	Simpler code; rebuild stalls are tolerable below a few k devices
Large polygons, infrequent updates, batch reconciliation	Disk-backed segment index + batch workers	Tolerates allocation spikes; not on the real-time hot path
Bursty load, tight tail SLA, occasional updates	Flat index + CoW snapshot swap	Zero-lock reads; bounded 2× memory only during swap
Memory cheap, CPU the bottleneck	Pointer R-tree + Numba PiP	Spend RAM to buy code simplicity when OOM is not the risk

In production the common shape is a hybrid: the flat quantized index on the streaming hot path for immediate triggers, a disk-backed batch process for nightly trajectory reconciliation and drift correction, and the snapshot-swap channel connecting configuration changes to the live index without a restart. The latency budget allocation framework is what ties the two together — it assigns each phase its slice of the budget and defines the degradation thresholds the circuit breakers enforce.

Choosing the index by where you sit on two axes: as device concurrency and the memory ceiling both tighten (top-right), the flat quantized layout with copy-on-write swaps is the only option that holds flat RSS and a sub-25 ms P99. Relax either constraint and a simpler or disk-backed approach buys back developer velocity.

Operator FAQ

Why does my RSS climb for hours and then the node gets OOM-killed even though throughput is flat?

Almost always an unbounded per-device structure — the dwell/hysteresis map or a debug logging buffer. Cap it with an LRU plus idle TTL and put device_lru_size on a dashboard. Diff tracemalloc snapshots every 10k events to find the retaining line.

My P50 is great but P99 spikes every few minutes — is that the network?

Check gc.get_stats() first. Periodic P99 spikes that line up with gen-2 collections are GC pauses, not the network. Confirm gc.freeze() ran after warm-up and that the hot path allocates only short-lived objects; target gen-2 pauses under 2 ms.

Can I just raise the container memory limit instead of doing all this?

It postpones the kill, it does not prevent it — a leak fills any ceiling, and a larger heap makes each gen-2 sweep slower, which worsens P99. The fix is bounding allocation and eviction, not raising the cap.

Is shared memory worth the complexity over a multiprocessing.Queue?

Above ~30k events/sec, yes. Pickling across the queue allocates and copies on both sides of every hand-off and is a primary driver of gen-2 churn; a fixed-size ring over shared_memory removes serialization entirely and flattens the tail.

Related

Streaming vs batch geofence evaluation — when the memory model favours a streaming ring vs a disk-backed batch queue.
Point-in-polygon algorithm benchmarks — measured throughput of the exact test that runs over this flat layout.
Async Python execution patterns for spatial math — the GIL-free offload boundary that the shared-memory ring feeds.
Handling polygon edge cases in high-frequency telemetry — idempotent triggers and snapping tolerances under memory pressure.
Up one level: Core Architecture & Latency Constraints — the pipeline-wide latency and memory invariants this page implements.
