Memory Footprint of Streaming Polygon Indexes

Real-time geofencing at mobility scale requires sub-millisecond containment evaluations against boundaries that mutate continuously. When dispatch orchestrators, IoT telemetry gateways, or dynamic logistics planners push zone definitions into a live routing pipeline, the spatial index stops being a static lookup table and becomes a mutable, high-churn state machine. In CPython-backed services the resident memory of that index directly dictates cache residency, garbage-collection (GC) pressure, and the P99 tail of event routing. This page expands the index-structure model introduced in Spatial Indexing for Real-Time Checks, narrowing the focus to one question: how does the heap behave when polygons are inserted, split, and retired thousands of times a minute, and how do you keep that footprint bounded without sacrificing query latency?

The exact failure mode this page addresses is silent RSS creep under churn. Logical index size — the count of active polygons — stays flat, so dashboards look healthy, while the process resident set size (RSS) climbs steadily until the orchestrator OOM-kills the worker mid-shift. The cause is rarely a true leak; it is the interaction of CPython’s object model with high-frequency allocation across many size classes, compounded by tree rebalancing that strands old nodes faster than the collector reclaims them. Treating memory as a first-class design constraint — not a tuning afterthought — is what separates an index that holds a sub-8ms P99 at 50k events/sec from one that pages, stalls the event loop, and cascades backpressure into the latency budget allocation reserved upstream.

Allocation Divergence & Footprint Profiles

The first thing to measure is what a polygon actually costs once it lands in the interpreter. A naive index stores each geometry as a list of coordinate tuples. In CPython every tuple and every float is a full heap object carrying a reference count and a type pointer, and tuples add a length field. A single (lon, lat) pair stored as a tuple of two boxed floats consumes roughly 56 bytes for the tuple object (its header plus two element pointers) and 24 bytes per float: about 104 bytes for two numbers that hold 16 bytes of actual data. That is a 6.5x inflation before any tree metadata exists.
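The per-vertex arithmetic above can be checked on any interpreter with sys.getsizeof. The following is a minimal sketch, assuming a 64-bit CPython build; the helper names are illustrative, the list slot that points at each tuple is not counted, and exact sizes vary by version and platform.

```python
import array
import sys


def boxed_bytes_per_vertex() -> int:
    """Shallow cost of one (lon, lat) tuple plus its two boxed floats."""
    vertex = (13.4050, 52.5200)
    return sys.getsizeof(vertex) + sum(sys.getsizeof(c) for c in vertex)


def flat_bytes_per_vertex() -> int:
    """Cost of one vertex inside an array.array('d'): two raw doubles."""
    return 2 * array.array("d").itemsize


boxed = boxed_bytes_per_vertex()
flat = flat_bytes_per_vertex()
print(f"boxed: {boxed} B/vertex, flat: {flat} B/vertex, ratio: {boxed / flat:.1f}x")
```

Running this against the interpreter you deploy is a quick way to confirm how much of the inflation applies to your build before committing to a storage change.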

Scaled to a fleet, the divergence is stark. The table below shows measured per-polygon and aggregate resident cost for 15,000 active delivery zones at an average of 120 vertices each, comparing storage strategies on CPython 3.11 (x86-64, pymalloc default allocator):

Vertex storage strategy	Bytes / vertex	Bytes / 120-vertex polygon	RSS for 15k polygons	P99 lookup @ 50k ev/s
`list[tuple[float, float]]` (naive)	~104	~8.6 KB	~129 MB raw, ~340 MB with tree	11–14 ms
`__slots__` node + `list` vertices	~104	~8.4 KB	~210 MB with tree	9–11 ms
`array.array('d')` flat coordinates	~16	~2.0 KB	~62 MB with tree	7–8 ms
`numpy.ndarray` (N, 2) `float64`	~16	~2.0 KB + 96 B header	~60 MB with tree	6–8 ms

The headline figure: moving vertex storage from boxed tuples to a contiguous array.array('d') cuts polygon body memory by roughly 4x (about 8.6 KB down to 2.0 KB per polygon) and drops P99 by several milliseconds, because contiguous doubles fetch in far fewer cache lines during point-in-polygon descent. The tree scaffolding (bounding boxes, parent pointers, child arrays) is what turns a 129 MB raw figure into 340 MB; the storage primitive you choose for that scaffolding matters as much as the vertex encoding, which is the trade-off dissected in quadtree vs R-tree performance analysis.

Under streaming conditions the static table understates the problem, because RSS is not just steady-state size — it is steady-state size plus the high-water mark of transient allocations during churn. An R-tree that inserts and splits nodes every few seconds allocates and abandons node objects continuously; pymalloc reuses small-object arenas well but cannot return partially-occupied arenas to the OS, so RSS ratchets toward the high-water mark and stays there. The interpreter’s object model imposes a 30–40% memory tax over equivalent C-struct layouts even before fragmentation, which is why heavy churn workloads frequently see RSS settle 2–3x above the logical working set.
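The gap between live size and high-water mark can be observed at the Python-heap level with tracemalloc, which reports both values. This is a rough sketch only: tracemalloc tracks interpreter allocations rather than RSS, and churn_once is a hypothetical stand-in for a burst of node inserts and retirements.

```python
import tracemalloc


def churn_once(n_nodes: int) -> None:
    """Allocate a burst of short-lived node-like objects, then drop them."""
    burst = [[float(i), float(i + 1)] for i in range(n_nodes)]
    del burst  # logical size returns to baseline here


tracemalloc.start()
baseline, _ = tracemalloc.get_traced_memory()
churn_once(50_000)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"after churn: {current - baseline} B live, {peak - baseline} B high-water mark")
```

The live figure falls back toward the baseline while the peak records the transient burst; RSS tends to follow the peak rather than the live figure, for the arena reasons described above.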

Implementation Trade-offs: Beating the Object Tax

The mitigations are concrete and Python-specific. First, eliminate per-instance __dict__ allocations with __slots__ on every hot-path class; this removes the 100+ byte dictionary CPython otherwise attaches to each node. Second, store vertices in contiguous buffers (array.array or a numpy view) instead of lists of tuples. Third, pre-allocate a node pool so insertions rebind existing instances rather than minting new objects on the hot path. The following node and pool show all three on the critical path:

python

from __future__ import annotations

import array
import math
from typing import Final

# A 120-vertex polygon stored as flat doubles costs ~2 KB here versus
# ~8.6 KB as list[tuple[float, float]]: a 4x reduction before tree overhead.
_COORDS_PER_VERTEX: Final[int] = 2
_RETIRED: Final[int] = -1  # sentinel zone_id for nodes sitting in the pool


class PolygonNode:
    """Cache-friendly polygon node for a streaming spatial index.

    __slots__ removes the per-instance __dict__ (saves ~104 B/instance),
    and the flat coordinate buffer keeps vertices in one contiguous run so
    the point-in-polygon scan touches a handful of cache lines, not hundreds.
    """

    __slots__ = ("zone_id", "coords", "min_x", "min_y", "max_x", "max_y")

    def __init__(self, zone_id: int, coords: array.array) -> None:
        self.reset(zone_id, coords)

    def reset(self, zone_id: int, coords: array.array) -> None:
        """Bind or rebind this instance; pooled nodes are reused, not reallocated."""
        self.zone_id: int = zone_id
        self.coords: array.array = coords  # 'd' typecode, len == 2 * n_vertices
        # Bounding box cached once at insert so the prefilter never recomputes it.
        # The defaults yield an inverted (empty) box for the pool's blank nodes;
        # without them min()/max() raise ValueError on an empty buffer.
        xs = coords[0::_COORDS_PER_VERTEX]
        ys = coords[1::_COORDS_PER_VERTEX]
        self.min_x: float = min(xs, default=math.inf)
        self.min_y: float = min(ys, default=math.inf)
        self.max_x: float = max(xs, default=-math.inf)
        self.max_y: float = max(ys, default=-math.inf)


class NodePool:
    """Freelist of reusable nodes to amortise allocation under churn."""

    __slots__ = ("_free",)

    def __init__(self, capacity: int) -> None:
        # Pre-size the pool so steady-state churn never hits the allocator.
        self._free: list[PolygonNode] = [
            PolygonNode(_RETIRED, array.array("d")) for _ in range(capacity)
        ]

    def acquire(self, zone_id: int, coords: array.array) -> PolygonNode:
        if self._free:
            node = self._free.pop()
            node.reset(zone_id, coords)
            return node
        # Pool exhausted: fall back to a fresh allocation.
        return PolygonNode(zone_id, coords)

    def release(self, node: PolygonNode) -> None:
        if node.zone_id == _RETIRED:
            # Appending the same node twice would hand it to two zones later.
            raise ValueError("node released twice")
        node.zone_id = _RETIRED  # poison so a stale reference is detectable
        self._free.append(node)

The non-obvious constraints are GIL-shaped. array.array slicing (coords[0::2]) copies under the GIL, so the bounding-box computation at insert time is single-threaded but cheap. For very large polygons, prefer a numpy view and vectorised min/max: numpy can release the GIL during the reduction, which lets other threads progress. Coroutines on the same event loop still wait for the call to return, so very large reductions belong in an executor. The pool’s release deliberately poisons zone_id to -1 so a use-after-release shows up as an obvious sentinel rather than silently serving stale geometry, a defensive pattern that pays for itself the first time a retired zone is double-freed. Reducing vertex count before insertion via Douglas-Peucker polygon simplification compounds every one of these savings, because the cheapest vertex to store is the one you never inserted.
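Where the slice copy itself is the concern and numpy is not available, a standard-library alternative is a strided memoryview over the same buffer. This is a sketch of a technique the text above does not use, and bbox_zero_copy is a hypothetical helper name. It avoids the temporary arrays but still boxes each value while min and max iterate, so it saves memory traffic rather than interpreter time.

```python
import array


def bbox_zero_copy(coords: array.array) -> tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) without copying the coordinates.

    Slicing a memoryview with a step yields a strided view over the same
    buffer, whereas coords[0::2] on the array itself builds a new array.
    """
    view = memoryview(coords)
    xs, ys = view[0::2], view[1::2]
    return min(xs), min(ys), max(xs), max(ys)


rectangle = array.array("d", [0.0, 0.0, 4.0, 0.0, 4.0, 3.0, 0.0, 3.0])
print(bbox_zero_copy(rectangle))
```

While the view is alive the underlying array cannot be resized, which is harmless here because the view is dropped when the function returns.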

Memory Footprint & Streaming Churn

Steady-state size is the easy half. The hard half is the dynamic behavior under sustained insert/retire pressure, where three distinct effects stack.

Header tax is fixed and predictable: every Python object carries 16 bytes of refcount and type pointer, so an index of millions of small objects pays a floor you cannot tune away; you can only reduce the object count by flattening structure.

Fragmentation is the churn-driven contributor: pymalloc serves small allocations (512 bytes or less) from fixed-size arenas (1 MiB on current 64-bit builds, 256 KiB on 32-bit builds and older releases), subdivided into pools by size class, and an arena is only returned to the OS when every block in it is free. High-frequency churn across mixed size classes leaves arenas perpetually 60–90% occupied, so RSS pins near the high-water mark even as logical size falls.

Stranded generation-2 objects are the latency contributor: parent-pointer trees create reference cycles that the reference counter alone cannot reclaim, so retired nodes that have been promoted survive until a full generation-2 collection, and that collection’s cost scales with the number of tracked objects, producing the P99 spikes operators see.
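The cycle effect can be reproduced in a few lines. The sketch below uses two hypothetical node classes, one with a strong parent pointer and one with a weak reference, and checks whether a retired subtree is still alive before the cyclic collector runs. Using weakref for the parent link is one common way to avoid the cycle; it is shown here as an illustration, not as the method the index above uses.

```python
import gc
import weakref


class StrongNode:
    """Tree node with a strong parent pointer: parent <-> child is a cycle."""

    __slots__ = ("parent", "children", "__weakref__")

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


class WeakNode:
    """Same shape, but the parent link is a weak reference, so no cycle forms."""

    __slots__ = ("_parent", "children", "__weakref__")

    def __init__(self, parent=None):
        self._parent = weakref.ref(parent) if parent is not None else None
        self.children = []
        if parent is not None:
            parent.children.append(self)


def survives_refcounting(node_cls) -> bool:
    """True if a retired two-node subtree is still alive before the cyclic GC runs."""
    root = node_cls()
    node_cls(root)  # the child registers itself on root
    probe = weakref.ref(root)
    del root
    return probe() is not None


gc.disable()  # keep the cyclic collector out of the way for the demonstration
try:
    strong_stranded = survives_refcounting(StrongNode)
    weak_stranded = survives_refcounting(WeakNode)
finally:
    gc.enable()
gc.collect()  # reclaims the stranded StrongNode pair
print(strong_stranded, weak_stranded)
```

The strong-pointer subtree outlives its last external reference and waits for the collector, while the weak-pointer subtree is freed immediately by reference counting.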

A useful mental model: peak RSS approximates the logical working set times the object-model tax times a fragmentation factor that grows with churn rate and size-class spread. Driving each term down — fewer objects (flattening), tighter size classes (uniform pool buffers), and fewer tracked containers (__slots__ plus gc.freeze() on the static base) — is what converts a sawtooth-climbing RSS graph into a flat line. A pool-backed index that reuses fixed-size buffers collapses the fragmentation factor toward 1.0 because every allocation comes from the same size class, which is precisely why the NodePool above matters more under churn than at startup.
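As a small illustration of the gc.freeze() step, the sketch below builds a stand-in static base, freezes it, and reads back how many objects the collector will now skip. build_static_base and its list-of-lists shape are assumptions made for the example; a real service would freeze after loading its actual base index and would not unfreeze.

```python
import gc


def build_static_base(n_zones: int) -> list[list[int]]:
    """Stand-in for the long-lived part of the index (hypothetical shape)."""
    return [[i, i + 1] for i in range(n_zones)]


static_base = build_static_base(10_000)
gc.collect()   # clear garbage first so it is not frozen along with the base
gc.freeze()    # move every currently tracked object to the permanent generation
frozen = gc.get_freeze_count()
print(f"{frozen} objects excluded from future collections")
gc.unfreeze()  # only for this demo; a service would leave the base frozen
```

Objects allocated after the freeze are tracked as usual, so churned nodes are still collected while the static substrate is never re-scanned.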

Eviction policy interacts with all of this. A drop-oldest retirement keeps the working set bounded but must return buffers to the pool, not the allocator; releasing them with del instead of pooling re-opens the fragmentation hole the pool was built to close. For workloads where exact geometry is negotiable, fixed-resolution cells from Uber H3 hexagon indexing for mobility sidestep churn entirely: cell IDs are uniform 64-bit integers in dense arrays, so insert and retire are O(1) integer operations with zero pointer fragmentation and deterministic memory growth. The deeper memory-vs-precision dimension is treated in memory-constrained spatial processing.
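A drop-oldest store that hands evicted buffers straight to the next insert might look like the sketch below. BoundedZoneStore, its method names, and the fixed max_coords buffer size are all assumptions for illustration; the point is only that eviction and retirement recycle a same-sized buffer instead of freeing it.

```python
import array
from collections import OrderedDict


class BoundedZoneStore:
    """Holds at most max_zones polygons; evicts the oldest and reuses its buffer."""

    def __init__(self, max_zones: int, max_coords: int) -> None:
        self._max_coords = max_coords
        self._zones: OrderedDict[int, tuple[array.array, int]] = OrderedDict()
        # Every buffer has the same size, so churn stays inside one size class.
        self._free = [array.array("d", [0.0]) * max_coords for _ in range(max_zones)]

    def insert(self, zone_id: int, coords: list[float]) -> None:
        if len(coords) > self._max_coords:
            raise ValueError("polygon exceeds the fixed buffer size")
        if zone_id in self._zones:
            buf, _ = self._zones.pop(zone_id)              # update: keep its buffer
        elif self._free:
            buf = self._free.pop()
        else:
            _, (buf, _) = self._zones.popitem(last=False)  # drop-oldest
        for i, value in enumerate(coords):
            buf[i] = value
        self._zones[zone_id] = (buf, len(coords))

    def retire(self, zone_id: int) -> None:
        buf, _ = self._zones.pop(zone_id)
        self._free.append(buf)  # back to the pool, never to the allocator

    def coords(self, zone_id: int) -> list[float]:
        buf, used = self._zones[zone_id]
        return buf[:used].tolist()


store = BoundedZoneStore(max_zones=2, max_coords=8)
store.insert(1, [0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
store.insert(2, [5.0, 5.0, 6.0, 5.0, 6.0, 6.0])
store.insert(3, [9.0, 9.0, 9.5, 9.0, 9.5, 9.5])  # evicts zone 1, reuses its buffer
```

The fixed buffer size wastes space on small polygons; that is the price of keeping every allocation in a single size class.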

Async Mutation Boundaries & Queue Semantics

A streaming index cannot take a lock on the hot read path, so mutation has to be decoupled from traversal. The production shape is a bounded queue feeding a background compactor that builds the next version off-path, then promotes it with a single atomic reference swap — the lock-free pattern detailed in async index updates without locking. From a memory standpoint the queue is where footprint discipline lives or dies: an unbounded asyncio.Queue will happily buffer a zone-update burst into gigabytes of pending deltas, turning a traffic spike into an OOM.

python

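from __future__ import annotations

import array
import asyncio
from dataclasses import dataclass

# PolygonNode and NodePool are the classes defined in the previous listing.


@dataclass(slots=True, frozen=True)
class ZoneDelta:
    zone_id: int
    coords: tuple[float, ...]  # immutable so a queued delta cannot change under the compactor
    retire: bool = False


class StreamingIndex:
    __slots__ = ("_queue", "_active", "_pool", "_overflowed")

    def __init__(self, queue_capacity: int, pool_capacity: int) -> None:
        # Bounded queue is the backpressure mechanism AND the memory ceiling:
        # at most queue_capacity deltas can be in flight at once.
        self._queue: asyncio.Queue[ZoneDelta] = asyncio.Queue(maxsize=queue_capacity)
        self._active: tuple[PolygonNode, ...] = ()
        self._pool = NodePool(pool_capacity)
        self._overflowed = False

    @property
    def degraded(self) -> bool:
        """True while the breaker is tripped and readers should use the coarse check."""
        return self._overflowed

    def submit(self, delta: ZoneDelta) -> bool:
        """Non-blocking producer path. Returns False on overflow so the caller
        can shed load instead of stalling the telemetry ingest loop."""
        try:
            self._queue.put_nowait(delta)
            return True
        except asyncio.QueueFull:
            self._overflowed = True  # trips the coarse-grained fallback
            return False

    async def compact_forever(self, batch: int = 256) -> None:
        while True:
            first = await self._queue.get()
            deltas = [first]
            # Drain opportunistically so one rebuild absorbs a whole burst.
            while len(deltas) < batch:
                try:
                    deltas.append(self._queue.get_nowait())
                except asyncio.QueueEmpty:
                    break
            # Build off-path, then publish with a single reference assignment.
            self._active = self._apply(self._active, deltas)
            if self._queue.empty():
                self._overflowed = False  # reset the breaker only once drained

    def _apply(
        self, current: tuple[PolygonNode, ...], deltas: list[ZoneDelta]
    ) -> tuple[PolygonNode, ...]:
        """Minimal rebuild sketch: apply deltas to a copy of the zone map.

        Releasing nodes before the swap is safe only because nothing awaits in
        here, so no reader on the same event loop can observe a recycled node.
        """
        by_zone = {node.zone_id: node for node in current}
        for delta in deltas:
            old = by_zone.pop(delta.zone_id, None)
            if old is not None:
                self._pool.release(old)  # back to the freelist, not the allocator
            if not delta.retire:
                by_zone[delta.zone_id] = self._pool.acquire(
                    delta.zone_id, array.array("d", delta.coords)
                )
        return tuple(by_zone.values())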

Queue semantics dictate failure behavior. A drop-oldest policy keeps telemetry ingestion flowing but risks serving a stale geofence during rapid zone shifts; a block-on-full policy guarantees consistency but pushes producer-side latency spikes back into the upstream IoT gateways. The pattern that holds P99 is a hybrid: a bounded queue whose overflow trips a circuit breaker, falling back to a coarse bounding-box-only check until the compactor drains and the breaker resets — the same backpressure discipline described in the async Python execution patterns for spatial math. Crucially, the compactor reclaims retired nodes through the pool, not through del, so generation-1 collection stays under 2 ms and the freelist absorbs the next burst without touching the allocator.
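The coarse fallback described above can be sketched as a read path that always applies the bounding-box prefilter and skips the exact test while the breaker is tripped. The function names and the degraded flag are illustrative assumptions; the exact test shown is a plain even-odd ray cast over a flat coordinate buffer.

```python
import array


def bbox_contains(bbox: tuple[float, float, float, float], x: float, y: float) -> bool:
    min_x, min_y, max_x, max_y = bbox
    return min_x <= x <= max_x and min_y <= y <= max_y


def polygon_contains(coords: array.array, x: float, y: float) -> bool:
    """Even-odd ray casting over flat [x0, y0, x1, y1, ...] coordinates."""
    inside = False
    n = len(coords) // 2
    j = n - 1
    for i in range(n):
        xi, yi = coords[2 * i], coords[2 * i + 1]
        xj, yj = coords[2 * j], coords[2 * j + 1]
        # Count edges that cross the horizontal ray running right from (x, y).
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside


def check(coords, bbox, x: float, y: float, degraded: bool) -> bool:
    if not bbox_contains(bbox, x, y):
        return False
    if degraded:
        return True  # coarse answer: may report a false positive near the edge
    return polygon_contains(coords, x, y)


triangle = array.array("d", [0.0, 0.0, 4.0, 0.0, 0.0, 4.0])
triangle_bbox = (0.0, 0.0, 4.0, 4.0)
print(check(triangle, triangle_bbox, 3.0, 3.0, degraded=False))
print(check(triangle, triangle_bbox, 3.0, 3.0, degraded=True))
```

The point (3, 3) lies inside the bounding box but outside the triangle, so the two modes disagree on it. That disagreement is exactly the accuracy traded away while the breaker is open.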

Operational Runbook & Failure Mitigation

Footprint work is meaningless without continuous measurement. Run tracemalloc continuously in staging and on a sampled basis in production, diffing snapshots to isolate allocation hotspots, and track three signals: heap allocation rate (MB/s), RSS delta over rolling 5-minute windows, and GC collection duration per cycle. A healthy streaming index holds allocation under 5 MB/s, RSS delta near zero at steady logical size, and generation-2 pauses under 10 ms. Departures from those bands map to specific causes below.
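Snapshot diffing with tracemalloc follows the pattern sketched here. The retained list and simulate_leaky_insert are hypothetical stand-ins for a structure that keeps growing; in a service the two snapshots would bracket a time window rather than a single call.

```python
import tracemalloc

retained: list[list[float]] = []  # stands in for a cache that never releases


def simulate_leaky_insert(n: int) -> None:
    retained.append([float(i) for i in range(n)])


tracemalloc.start(10)  # keep up to 10 frames per allocation
before = tracemalloc.take_snapshot()
simulate_leaky_insert(100_000)
after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Largest change first: each entry carries the source line and the byte delta.
top = after.compare_to(before, "lineno")[:3]
for stat in top:
    print(stat)
```

Tracing adds overhead to every allocation, which is why sampling a subset of production workers is usually preferable to tracing all of them.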

Symptom: RSS climbs steadily while logical polygon count is flat. Capture tracemalloc.take_snapshot() before and after a 10-minute window and compare_to(..., "lineno") to find the growing line. Run objgraph.show_growth() to surface leaked geometry wrappers, and gc.get_objects() length to confirm tracked-object growth. Verify every hot-path class declares __slots__, that retired nodes are returned to the pool (not del-ed), and that no parent-pointer cycle is pinning generation-2 objects. Apply gc.freeze() after building the static base so the immutable substrate is never re-scanned.
Symptom: P99 latency spikes correlate with GC events. Read gc.get_stats() per generation and watch collections and collected counts against the latency timeline. Reduce full-collection frequency with gc.set_threshold(700, 50, 50), which raises the second and third thresholds from their long-standing default of 10; switch any remaining list-backed vertices to pool-backed array.array; and schedule a full-index rebuild during a low-traffic window to defragment. Confirm the win with a py-spy record --native flame graph (native frames are what make the collector visible) showing reduced time in the collector.
Symptom: queue overflow trips the circuit breaker during peak dispatch. Inspect self._queue.qsize() against maxsize to confirm the producer is outrunning the compactor. Increase queue capacity only if tracemalloc shows headroom; otherwise apply Douglas-Peucker simplification on ingest to cut vertex count before insertion, and validate that simplified boundaries stay within the geofence-accuracy SLA. Raise the compactor batch size so one rebuild absorbs a larger burst.
Symptom: allocation latency exceeds 50 ms under churn (fragmentation). Confirm with py-spy dump --native that a stack is parked in the allocator or mmap. Swap the allocator to jemalloc or mimalloc via LD_PRELOAD, and set PYTHONMALLOC=malloc so small objects bypass pymalloc and actually reach the preloaded allocator; both handle high-churn small-object patterns better and typically cut RSS bloat 15–25%. Pre-size the NodePool to the observed high-water count so steady-state churn never reaches the system allocator at all.

Wire rss_delta_5m_mb, gc_gen2_pause_ms, index_queue_depth, and alloc_rate_mb_s into the same dashboard, and alert on gc_gen2_pause_ms > 10, index_queue_depth > 0.8 * maxsize, and a sustained positive rss_delta_5m_mb at flat logical size. Each is an early warning before the worker is OOM-killed and backpressure cascades into the point-in-polygon algorithm benchmarks budget downstream.
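The three alert conditions can be expressed as a small pure function, sketched below. The IndexMetrics container, the polygons_delta field, and the choice of three consecutive windows as the meaning of "sustained" are assumptions for illustration; only the metric names and thresholds come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMetrics:
    rss_delta_5m_mb: tuple[float, ...]  # most recent windows, oldest first
    polygons_delta: int                 # assumed signal: change in active polygon count
    gc_gen2_pause_ms: float
    index_queue_depth: int


def firing_alerts(m: IndexMetrics, queue_maxsize: int, sustained_windows: int = 3) -> list[str]:
    """Return the names of the alert rules that currently fire."""
    alerts: list[str] = []
    if m.gc_gen2_pause_ms > 10:
        alerts.append("gc_gen2_pause_ms")
    if m.index_queue_depth > 0.8 * queue_maxsize:
        alerts.append("index_queue_depth")
    recent = m.rss_delta_5m_mb[-sustained_windows:]
    rss_creeping = len(recent) == sustained_windows and all(d > 0 for d in recent)
    if rss_creeping and m.polygons_delta == 0:
        alerts.append("rss_delta_5m_mb")
    return alerts


sample = IndexMetrics(
    rss_delta_5m_mb=(1.5, 2.0, 1.8),
    polygons_delta=0,
    gc_gen2_pause_ms=4.2,
    index_queue_depth=900,
)
print(firing_alerts(sample, queue_maxsize=1000))
```

Keeping the rules in one function makes them easy to unit-test against recorded incidents before wiring them into an alerting system.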

Architectural Guidance: Choosing a Storage Strategy

There is no single right footprint strategy; the choice tracks churn rate, precision requirements, and how much transient memory headroom the deployment tolerates. The matrix below captures the production decision.

Workload characteristic	Preferred strategy	Rationale
High vertex counts, moderate churn	`array.array('d')` + `__slots__` nodes	Flat doubles cut body memory ~4x and improve cache locality on descent
Very high churn, retire-heavy	Pre-sized `NodePool` + drop-oldest	Reuse collapses fragmentation toward 1.0; allocator never touched at steady state
Precision negotiable, append-heavy	H3 fixed-resolution cells	O(1) integer insert/retire, zero pointer fragmentation, deterministic growth
Huge index, tiny frequent edits	Geography-sharded pools	Rebuild only the touched shard; per-city swap domains keep reclamation local
Pathological GC pauses on static base	`gc.freeze()` + raised thresholds	Immutable substrate stops being re-scanned; full collections grow rarer

Most large platforms converge on a hybrid: pool-backed array-encoded nodes for the precise per-zone geometry, sharded by city so a dense-downtown rebuild never blocks reclamation in a quiet suburb, with an H3 coarse layer in front as the prefilter. That layering keeps the exact-geometry working set small enough to fit in cache while the integer cell layer absorbs the bulk of the churn for free.

Operator FAQ

Why does RSS stay high after I retire thousands of polygons?

Because pymalloc returns memory to the OS only when an entire arena (1 MiB on current 64-bit builds, 256 KiB on 32-bit builds and older releases) is free, and high churn leaves arenas partially occupied. Retiring objects lowers logical size but not RSS. Returning buffers to a pool so future inserts reuse them stops the climb; a periodic full rebuild that frees whole arenas is what brings RSS back down; switching to jemalloc/mimalloc reduces how high it climbs in the first place.

Is numpy always better than array.array for vertex storage?

For small polygons, array.array wins: it has no 96-byte ndarray header and slicing is cheap. For large polygons or where you run vectorised geometry math (bounding-box reductions, point-in-polygon on many candidates), numpy wins because its reductions run in compiled loops that can release the GIL and amortise the header across many vertices. Measure at your median vertex count; the crossover is usually around 50–100 vertices.

How do I tell a real leak from fragmentation?

A leak grows len(gc.get_objects()) and shows a climbing line in a tracemalloc snapshot diff. Fragmentation keeps object count flat but holds RSS high — tracemalloc reports stable Python-tracked memory while the OS RSS stays elevated. If tracked memory is flat and RSS is high, it is fragmentation: pool your buffers or change allocator. If tracked memory climbs, it is a leak: find the cycle or the missing pool release.
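The decision rule above reduces to comparing two growth figures over the same window. In the sketch below, heap_sample and classify are hypothetical helpers and the 1 MiB tolerance is an arbitrary assumption. Reading RSS is left to the caller because it is platform-specific.

```python
import gc
import tracemalloc


def heap_sample() -> tuple[int, int]:
    """Return (tracked object count, Python-traced bytes); tracemalloc must be running."""
    current, _peak = tracemalloc.get_traced_memory()
    return len(gc.get_objects()), current


def classify(tracked_growth_bytes: int, rss_growth_bytes: int,
             tolerance_bytes: int = 1 << 20) -> str:
    """Label a measurement window as 'leak', 'fragmentation', or 'steady'."""
    if tracked_growth_bytes > tolerance_bytes:
        return "leak"           # Python-tracked memory itself is climbing
    if rss_growth_bytes > tolerance_bytes:
        return "fragmentation"  # tracked memory flat, but the OS footprint grew
    return "steady"


print(classify(tracked_growth_bytes=0, rss_growth_bytes=50 << 20))
```

Sampling heap_sample at the start and end of a window, alongside an RSS reading from the host, supplies the two inputs.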

Conclusion

The memory footprint of a streaming polygon index is not a configuration value — it is a dynamic constraint that decides whether the service survives peak load. The invariants engineers must preserve are precise: encode vertices as contiguous doubles, never boxed tuples; declare __slots__ on every hot-path class; reclaim retired geometry through a pre-sized pool rather than the system allocator; bound the mutation queue so a burst becomes backpressure instead of an OOM; and keep generation-2 collections rare with raised thresholds and a frozen static base. Hold those five, profile continuously with tracemalloc, gc.get_stats(), and py-spy, and a streaming index sustains sub-8ms P99 containment checks at 50k events/sec while editing its boundaries thousands of times a minute — with a flat RSS line instead of a creeping one.

Related Pages

Spatial Indexing for Real-Time Checks — parent reference for index structure, mutation, and profiling
Async Index Updates Without Locking
Quadtree vs R-Tree Performance Analysis
Uber H3 Hexagon Indexing for Mobility
Polygon Simplification for High-Throughput Streams
Memory-Constrained Spatial Processing
