Point-in-Polygon Algorithm Benchmarks for High-Throughput Geofencing

Real-time geofencing in mobility and IoT platforms runs under a hard per-event latency budget — typically a sub-10ms evaluation cycle per coordinate. The point-in-polygon (PIP) test is the computational primitive at the centre of that budget: every surge-pricing perimeter, customs zone, or campus boundary resolves to a containment check. When the same primitive is fanned out across millions of concurrent device streams, a naive polygon traversal stops being a microsecond detail and becomes the line item that decides whether the pipeline holds its SLA or collapses into queue backpressure, dropped events, and stale state. This page sits under Core Architecture & Latency Constraints and isolates one question that the broader architecture overview leaves open: given a fixed budget, which PIP kernel do you ship, and what does it actually cost at P99 once vertex count and concurrency are realistic? The failure mode it addresses is the silent throughput cliff — a kernel that benchmarks at 1M evaluations/sec on a 10-vertex test polygon and then misses the budget at the 300-vertex municipal boundaries that production traffic actually crosses.

Both kernels are O(n) in edge count behind an O(1) bounding-box reject; ray casting wins on constant factor, winding number wins on correctness for self-overlapping geometry, and both show a millisecond P99 tail driven by GC pauses rather than the edge scan.

Algorithmic Divergence & Latency Profiles

The two kernels that matter for exact containment are ray casting (the even-odd crossing test) and the winding number method. Both are $O(n)$ in the edge count of the polygon, but their constant factors and their behaviour on degenerate geometry diverge sharply, and that divergence is what the latency distribution exposes.

$T_{eval} = O(1)_{\text{bbox reject}} + O(n)_{\text{edge scan on survivors}}$

Ray casting walks the polygon edges once, casting a horizontal ray from the test point and toggling a boolean each time the ray crosses an edge; an odd crossing count means inside. The inner loop is a sign comparison and one cross-multiplication per edge — no transcendental functions, no per-edge allocation. The winding number method accumulates the net number of times the polygon wraps the point. The robust formulation (Sunday’s) avoids atan2 entirely and instead counts signed up/down edge crossings, so its per-edge cost is close to ray casting; the naive angle-summation formulation calls atan2 per edge and is roughly 4x slower — never ship it on the hot path.

The reason to pay winding number’s slightly higher constant is correctness on the geometry that real service areas contain: self-intersecting boundaries, overlapping zones, and donut polygons (a holed exterior, e.g. an exclusion inside a delivery zone). Ray casting’s even-odd rule treats a hole and an overlap identically, which produces wrong containment for self-overlapping multipolygons; winding number distinguishes them by sign.
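The crossing-count formulation of the winding number described above can be sketched as follows. This is a minimal illustration, assuming the same flat `[x0, y0, x1, y1, ...]` ring layout used elsewhere on this page; the names `_is_left` and `winding_number` are illustrative, not taken from a library.

```python
from array import array


def _is_left(x0: float, y0: float, x1: float, y1: float,
             px: float, py: float) -> float:
    """Positive when (px, py) lies left of the directed edge (x0, y0) -> (x1, y1)."""
    return (x1 - x0) * (py - y0) - (px - x0) * (y1 - y0)


def winding_number(px: float, py: float, ring: "array[float]") -> int:
    """Net number of times `ring` wraps (px, py); 0 means outside."""
    wn = 0
    n = len(ring)
    j = n - 2  # first edge runs from the last vertex to the first
    for i in range(0, n, 2):
        xj = ring[j]
        yj = ring[j + 1]
        xi = ring[i]
        yi = ring[i + 1]
        if yj <= py:
            # upward crossing counts when the point is strictly left of the edge
            if yi > py and _is_left(xj, yj, xi, yi, px, py) > 0:
                wn += 1
        elif yi <= py and _is_left(xj, yj, xi, yi, px, py) < 0:
            # downward crossing counts when the point is strictly right of the edge
            wn -= 1
        j = i
    return wn


square = array("d", [0, 0, 4, 0, 4, 4, 0, 4])
twice = array("d", [0, 0, 4, 0, 4, 4, 0, 4] * 2)  # same square traversed twice
```

For the doubly wrapped ring, an even-odd test sees two crossings and reports "outside", whereas the winding number is non-zero, which is the distinction the paragraph above relies on.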

Benchmarks below are single-core, CPython 3.11, warm cache, perf_counter_ns timing, pre-allocated coordinate arrays, with the bounding-box reject applied first. Each cell is over 10^6 random points against polygons of the stated vertex count, points pre-filtered so ~50% fall inside the bounding box (the realistic survivor rate after spatial pre-filtering).

Vertices	Kernel	Throughput	P50	P95	P99
10	ray casting	1.21M/s	0.7 µs	1.1 µs	1.9 µs
10	winding number	0.86M/s	1.0 µs	1.6 µs	2.4 µs
50	ray casting	410k/s	2.3 µs	3.0 µs	5.1 µs
50	winding number	295k/s	3.2 µs	4.4 µs	7.0 µs
300	ray casting	78k/s	12.4 µs	16.0 µs	1.8 ms*
300	winding number	54k/s	17.9 µs	23.0 µs	2.6 ms*
1000	ray casting	23k/s	41 µs	58 µs	3.9 ms*
1000	winding number	16k/s	59 µs	82 µs	5.4 ms*

The starred P99 figures are the whole story of this page. Per-edge work is microseconds; the millisecond P99 at high vertex counts is not the geometry — it is a GC pause landing on top of the evaluation. Pure-Python kernels that unpack tuples and slice coordinate lists allocate thousands of short-lived objects per second, and a generation-2 collection that fires mid-evaluation stretches a 16µs call into a 1.8ms outlier. Moving the inner loop to Numba JIT or Cython over a pre-allocated array('d') buffer removes the per-edge allocation entirely; with that change, ray casting at 300 vertices holds P99 below 2.1ms even under sustained load, and the winding-number kernel stays within budget at the 500-vertex polygons typical of municipal service areas. The per-kernel selection rationale and the compiled-kernel walkthrough live in optimizing ray casting vs winding number for GPS streams.
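The measurement setup behind the table (perf_counter_ns timing, pre-allocated buffers, tail percentiles) could be reproduced with a harness along these lines. It is a sketch under assumptions: `percentile` and `time_kernel` are illustrative names, the nearest-rank percentile is one reasonable choice rather than the method known to have produced the figures above, and the gen-2 counter is read from `gc.get_stats()` so a tail spike can be matched against collections.

```python
import gc
import math
import time
from array import array
from typing import Callable, Dict


def percentile(ordered_ns, q: int) -> int:
    """Nearest-rank percentile of an ascending sequence (q in 1..100)."""
    rank = max(1, math.ceil(q * len(ordered_ns) / 100))
    return ordered_ns[rank - 1]


def time_kernel(kernel: Callable[[float, float], bool],
                xs: "array[float]", ys: "array[float]") -> Dict[str, int]:
    """Time one kernel call per point and report latency in nanoseconds."""
    samples = array("q", [0]) * len(xs)  # pre-allocated signed 64-bit slots
    clock = time.perf_counter_ns
    gen2_before = gc.get_stats()[2]["collections"]
    for k in range(len(xs)):
        t0 = clock()
        kernel(xs[k], ys[k])
        samples[k] = clock() - t0
    ordered = sorted(samples)
    return {
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        # gen-2 collections that fired while the loop was running
        "gen2_collections": gc.get_stats()[2]["collections"] - gen2_before,
    }
```

Wrap the real kernel in a closure that binds the ring and bounding box, then compare the reported P99 with `gen2_collections` across runs to see whether the tail tracks the collector or the edge scan.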

Implementation Trade-offs: GIL, asyncio, and the Critical Path

The kernel’s measured cost only holds if the evaluation never allocates on the hot path and never blocks the event loop. In CPython the two constraints are linked: the GIL means a CPU-bound PIP loop running inside an asyncio task holds the interpreter for its entire duration, starving every coroutine on the reactor — heartbeats time out, broker reads stall, and the connection pool sheds. The critical-path kernel must therefore be both allocation-free and short enough to run inline, or be offloaded to a worker that does not contend for the loop’s GIL.

The allocation-free ray-casting kernel reads directly from a flat coordinate buffer rather than a list of (x, y) tuples:

python

from array import array
from typing import Final


def point_in_ring(
    px: float,
    py: float,
    ring: "array[float]",  # flat [x0, y0, x1, y1, ...], closed
    *,
    min_x: float,
    min_y: float,
    max_x: float,
    max_y: float,
) -> bool:
    """Even-odd ray cast over a flat coordinate buffer.

    `ring` is a pre-allocated array('d'); reading it avoids the
    per-edge tuple unpacking that drives gen-2 GC churn. The bbox
    reject is the O(1) fast path that culls the ~50% of points
    that never reach the edge scan.
    """
    if px < min_x or px > max_x or py < min_y or py > max_y:
        return False  # cheap reject; no edge work, no allocation

    inside: bool = False
    n: Final[int] = len(ring)
    # iterate edge (j -> i) over the flat buffer, stride 2
    j: int = n - 2
    for i in range(0, n, 2):
        xi: float = ring[i]
        yi: float = ring[i + 1]
        xj: float = ring[j]
        yj: float = ring[j + 1]
        if (yi > py) != (yj > py):
            # x of the edge at scanline py
            x_cross: float = (xj - xi) * (py - yi) / (yj - yi) + xi
            if px < x_cross:
                inside = not inside
        j = i
    return inside

Three decisions in that snippet are load-bearing. The bounding-box reject runs before any edge work because it is the single biggest throughput lever: at a 50% survivor rate it halves total edge scans, and at the 5–10% survivor rate seen after a coarse spatial index lookup it is a 10–20x reduction. Reading from array('d') instead of a tuple list keeps the loop body free of tuple unpacking and list slicing. And the (yi > py) != (yj > py) half-open comparison counts a vertex lying exactly on the scanline for only one of its two adjacent edges, with no extra branch; naive ray casters double-count that crossing and report the wrong containment.

For multipolygons with holes, wrap the kernel so an outer-ring hit is cancelled by any inner-ring (hole) hit — and for self-overlapping geometry switch to the winding-number kernel, because even-odd cannot represent overlap. Whichever kernel runs, it must be dispatched off the reactor when polygons are large: a 1000-vertex evaluation at 41µs P50 is fine inline, but a batch of them per event is not, and that offload boundary is exactly what async Python execution patterns for spatial math specifies — ProcessPoolExecutor (or a compiled extension that releases the GIL) for the CPU-bound scan, never a bare run_in_executor thread that still contends for the lock.
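The hole-cancellation wrapper mentioned above might look like the following sketch. It assumes one outer ring plus a sequence of hole rings, all in the flat `array('d')` layout, and repeats a compact even-odd test (`ring_hit`, an illustrative name) so the example stands alone; it skips the bounding-box reject for brevity.

```python
from array import array
from typing import Sequence


def ring_hit(px: float, py: float, ring: "array[float]") -> bool:
    """Even-odd ray cast over a flat [x0, y0, x1, y1, ...] ring."""
    inside = False
    n = len(ring)
    j = n - 2
    for i in range(0, n, 2):
        xi = ring[i]
        yi = ring[i + 1]
        xj = ring[j]
        yj = ring[j + 1]
        # half-open test first, so (yj - yi) is never zero in the division
        if (yi > py) != (yj > py) and px < (xj - xi) * (py - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside


def point_in_polygon(px: float, py: float, outer: "array[float]",
                     holes: Sequence["array[float]"]) -> bool:
    """Outer-ring hit, cancelled by a hit on any inner (hole) ring."""
    if not ring_hit(px, py, outer):
        return False
    for hole in holes:
        if ring_hit(px, py, hole):
            return False
    return True


zone = array("d", [0, 0, 10, 0, 10, 10, 0, 10])
exclusion = array("d", [4, 4, 6, 4, 6, 6, 4, 6])
```

Here `point_in_polygon(5.0, 5.0, zone, [exclusion])` is False because the point falls in the exclusion, while `(2.0, 2.0)` is inside the zone proper.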

Memory Footprint & Streaming Churn

Under sustained GPS ingestion, the kernel’s heap behaviour matters as much as its instruction count. The dominant churn source is not the polygon — those are loaded once — but the per-event coordinate handling: every deserialized event that becomes a fresh tuple, every shapely.Point constructed per check, and every intermediate list from a .coords slice is a short-lived gen-0 allocation. At 50k events/sec a per-event Point construction alone is 50k allocations/sec, which promotes survivors into gen-2 and turns the collector into a periodic P99 spike generator.

The fix is to treat coordinates as primitives, not objects. Store each polygon’s ring once as a contiguous array('d') plus a precomputed bounding box (four floats), and pass the raw px, py floats from deserialization straight into the kernel: no Point, no tuple. A 300-vertex polygon then occupies ~4.8KB of contiguous doubles (600 values × 8 bytes) instead of 600 scattered float objects (~14KB at 24 bytes each on 64-bit CPython, before tuple and list overhead, plus pointer-chasing cache misses). For the working set of active polygons, a flat layout keeps the edge scan cache-resident; the memory-layout details and the contiguous-index trade-off are covered in memory-constrained spatial processing and the memory footprint of streaming polygon indexes.
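As a rough illustration of that layout, the sketch below packs a vertex list into a flat buffer, computes the bounding box once at load time, and compares payload sizes. `pack_ring` and the circular 300-vertex test zone are hypothetical; the byte counts cover only the coordinate payload, not container overhead.

```python
import math
import sys
from array import array
from typing import Iterable, Tuple


def pack_ring(vertices: Iterable[Tuple[float, float]]):
    """Flatten (x, y) pairs into array('d') and compute the bbox once."""
    ring = array("d")
    for x, y in vertices:
        ring.append(x)
        ring.append(y)
    xs = ring[0::2]  # every even slot is an x coordinate
    ys = ring[1::2]  # every odd slot is a y coordinate
    return ring, (min(xs), min(ys), max(xs), max(ys))


# Hypothetical 300-vertex zone: points on a circle of radius 0.01 degrees.
verts = [(0.01 * math.cos(2 * math.pi * k / 300),
          0.01 * math.sin(2 * math.pi * k / 300)) for k in range(300)]
ring, bbox = pack_ring(verts)

flat_bytes = ring.itemsize * len(ring)             # 600 doubles, 8 bytes each
boxed_bytes = 2 * len(verts) * sys.getsizeof(1.5)  # 600 float objects alone
```

The packing runs once per polygon at load or publish time, so its own allocations stay off the per-event path.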

To bound churn on the ingestion side, buffer incoming bursts in a fixed-capacity ring rather than an unbounded list that grows and triggers reallocation under load:

python

from array import array
from typing import Final


class CoordRing:
    """Fixed-capacity MPSC-style ring of (lon, lat) doubles.

    Pre-allocates the backing buffer so high-frequency GPS bursts
    never trigger heap growth on the hot path; rejects new points
    when full (backpressure) rather than allocating or overwriting.
    """

    __slots__ = ("_buf", "_cap", "_head", "_count")

    def __init__(self, capacity: int) -> None:
        self._cap: Final[int] = capacity
        self._buf: "array[float]" = array("d", [0.0]) * (capacity * 2)
        self._head: int = 0
        self._count: int = 0

    def push(self, lon: float, lat: float) -> bool:
        """Returns False when the ring is saturated (backpressure signal)."""
        if self._count == self._cap:
            return False
        idx: int = ((self._head + self._count) % self._cap) * 2
        self._buf[idx] = lon
        self._buf[idx + 1] = lat
        self._count += 1
        return True

The __slots__ declaration removes the per-instance __dict__, and the pre-multiplied backing buffer means the ring never resizes. The push returning False at capacity is the backpressure primitive the async layer reads.
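The class above shows only the producer side. A consumer-side drain could be sketched as below, assuming a single-threaded caller (for example, one event-loop task) and a caller-owned two-slot scratch buffer so that popping allocates no tuple. `DrainableCoordRing` and `pop_into` are hypothetical names; `push` is repeated only so the example is self-contained.

```python
from array import array


class DrainableCoordRing:
    """Fixed-capacity ring of (lon, lat) doubles with push and pop."""

    __slots__ = ("_buf", "_cap", "_head", "_count")

    def __init__(self, capacity: int) -> None:
        self._cap = capacity
        self._buf = array("d", [0.0]) * (capacity * 2)
        self._head = 0   # slot index of the oldest stored point
        self._count = 0  # number of stored points

    def push(self, lon: float, lat: float) -> bool:
        """Returns False when the ring is saturated (backpressure signal)."""
        if self._count == self._cap:
            return False
        idx = ((self._head + self._count) % self._cap) * 2
        self._buf[idx] = lon
        self._buf[idx + 1] = lat
        self._count += 1
        return True

    def pop_into(self, out: "array[float]") -> bool:
        """Copy the oldest point into out[0], out[1]; False when empty."""
        if self._count == 0:
            return False
        idx = self._head * 2
        out[0] = self._buf[idx]
        out[1] = self._buf[idx + 1]
        self._head = (self._head + 1) % self._cap
        self._count -= 1
        return True
```

Writing into a reusable `array("d", [0.0, 0.0])` scratch buffer avoids a per-pop tuple, in line with the allocation-free goal of this section.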

Async Mutation Boundaries & Queue Semantics

Geofence sets change while traffic flows — operators add zones, edit boundaries, expire promotions. The PIP kernel reads the polygon buffer on every event; mutating that buffer in place under a concurrent reader is a data race that surfaces as intermittent wrong-containment results. The safe pattern is copy-on-write snapshot swapping: the reader holds an immutable reference to the current polygon set, the writer builds a new set off to the side, and the swap is a single atomic reference rebind. Because the reference assignment is atomic under the GIL, no lock sits on the read path, so the kernel never blocks on a writer.

python

import asyncio
from typing import Final


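class PolygonSet:
    """Bundle of rings and bounding boxes published as one snapshot."""

    __slots__ = ("rings", "bboxes")

    def __init__(self, rings: tuple, bboxes: tuple) -> None:
        # rings: tuple of array('d'); bboxes: tuple of (minx,miny,maxx,maxy)
        self.rings = rings
        self.bboxes = bboxes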


class GeofenceIndex:
    """CoW snapshot of the active polygon set.

    Readers bind `_current` once per event; the writer publishes a
    fully built replacement. No lock on the hot path.
    """

    __slots__ = ("_current",)

    def __init__(self, initial: PolygonSet) -> None:
        self._current: PolygonSet = initial

    def snapshot(self) -> PolygonSet:
        return self._current  # atomic ref read; safe under the GIL

    def publish(self, new_set: PolygonSet) -> None:
        self._current = new_set  # atomic ref rebind; old set GC'd


async def consume(queue: "asyncio.Queue[tuple[float, float]]",
                  index: GeofenceIndex) -> None:
    while True:
        lon, lat = await queue.get()
        polygons: Final[PolygonSet] = index.snapshot()  # pin one version
        # ... run point_in_ring against polygons.rings / polygons.bboxes
        queue.task_done()

Pinning the snapshot once per event guarantees that a single coordinate is evaluated against one consistent polygon set even if a publish lands mid-evaluation. On the queue side, bound the asyncio.Queue and treat depth as the load signal: when queue.qsize() crosses a high-water mark (a common rule is 75% of maxsize), shed or degrade rather than letting the buffer grow without limit. The interaction with the broker — Kafka/Redis consumer lag, idempotent triggers, dead-letter routing — is governed by the streaming model in streaming vs batch geofence evaluation.
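The high-water-mark rule can be sketched as a non-blocking enqueue helper. The names `over_high_water` and `offer` are illustrative, and what to do on refusal (shed, degrade, or signal upstream) is left to the caller.

```python
import asyncio


def over_high_water(queue: asyncio.Queue, ratio: float = 0.75) -> bool:
    """True once a bounded queue's depth reaches ratio * maxsize."""
    if queue.maxsize <= 0:  # maxsize of 0 means the queue is unbounded
        return False
    return queue.qsize() >= ratio * queue.maxsize


def offer(queue: asyncio.Queue, item) -> bool:
    """Enqueue without awaiting; refuse when over the mark or full."""
    if over_high_water(queue):
        return False  # caller sheds or degrades instead of buffering
    try:
        queue.put_nowait(item)
    except asyncio.QueueFull:
        return False
    return True
```

A producer calling `offer` never awaits on a full queue, so ingestion latency stays bounded and the False return value becomes the load signal.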

Operational Runbook & Failure Mitigation

When PIP evaluation breaches its budget in production, work the diagnosis in this order rather than guessing:

1. Confirm the symptom. Read your P99 gauge and gc.get_stats(). If P99 spikes are periodic and align with the gen-2 collections counter incrementing, the kernel is fine: you have a GC-pause problem, not an algorithm problem. A flat P99 that simply exceeds budget points instead at vertex count or survivor rate.
2. Profile the hot path. Attach py-spy dump --pid <pid> for an instant stack snapshot, then py-spy record -o flame.svg --pid <pid> --duration 30 under live load. Frames dominated by point_in_ring mean genuine edge-scan cost (compile or simplify); frames in tuple unpacking, shapely, or pickling mean allocation or serialization overhead.
3. Localize allocation churn. Take tracemalloc snapshots every 10k evaluations and compare_to the previous one. Per-event Point/tuple growth confirms the object-vs-primitive problem from the memory section; fix it before touching the kernel.
4. Quantify GC pauses. If gen-2 collections correlate with the spikes, confirm gc.freeze() ran after warm-up to move long-lived polygon data out of the collector’s scan set, and verify the hot path allocates only short-lived objects. Target gen-2 pauses under 2ms.
5. Check survivor rate. Log the fraction of points passing the bounding-box reject. If it is high (>50%), your spatial pre-filter is too coarse. Tighten the spatial index lookup so fewer points reach the edge scan; the edge scan is wasted work on points that a finer cell would have culled.
6. Verify graceful degradation. Trip the circuit breaker deliberately: when executor queue depth exceeds 2x worker count or a single evaluation exceeds the budget ceiling, the system must fall back to a coarse bounding-box or centroid-distance check and emit a metric, not silently blow the budget. Confirm the fallback engages and recovers, and that GPS-dropout events route to fallback routing for GPS dropouts rather than producing false negatives.
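For the GC steps in this runbook, one way to measure gen-2 pauses directly is a callback registered on `gc.callbacks`, which CPython invokes at the start and stop of each collection. The sketch below is an assumption about how you might wire it, not a prescribed setup; `Gen2PauseLog` and `install_gen2_pause_log` are hypothetical names.

```python
import gc
import time
from typing import List


class Gen2PauseLog:
    """Records the wall-clock duration of each generation-2 collection."""

    def __init__(self) -> None:
        self.pauses_ns: List[int] = []
        self._t0 = 0

    def __call__(self, phase: str, info: dict) -> None:
        if info["generation"] != 2:
            return  # only full collections are of interest here
        if phase == "start":
            self._t0 = time.perf_counter_ns()
        elif phase == "stop":
            self.pauses_ns.append(time.perf_counter_ns() - self._t0)


def install_gen2_pause_log() -> Gen2PauseLog:
    """Register the logger; call after warm-up, next to gc.freeze()."""
    log = Gen2PauseLog()
    gc.callbacks.append(log)
    return log
```

Exporting `max(log.pauses_ns)` alongside the P99 gauge makes the 2ms pause target checkable; calling `gc.freeze()` once polygons are loaded keeps that long-lived data out of later collections.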

A concrete latency budget for a 10ms target: ~1.5ms network deserialization, ~2.5ms spatial index lookup, ~3.5ms PIP kernel, ~2.5ms downstream routing and acknowledgment. Profile against P99 of each phase, not the average — one GC pause or cache miss is invisible in the mean and fatal at the tail. Budget partitioning at the pipeline level is owned by latency budget allocation for real-time triggers.
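The budget split above can be kept as data and checked against measured per-phase P99 values. This is a small sketch; the phase keys and `phases_over_budget` are hypothetical names chosen for illustration.

```python
from typing import Dict, List

# Per-phase ceilings in milliseconds for the 10ms target described above.
BUDGET_MS: Dict[str, float] = {
    "deserialize": 1.5,
    "index_lookup": 2.5,
    "pip_kernel": 3.5,
    "routing_ack": 2.5,
}


def phases_over_budget(p99_ms: Dict[str, float]) -> List[str]:
    """Names of phases whose measured P99 exceeds its ceiling."""
    return [phase for phase, limit in BUDGET_MS.items()
            if p99_ms.get(phase, 0.0) > limit]
```

Feeding this from per-phase P99 gauges, not means, keeps the check aligned with the tail-latency guidance in this section.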

Architectural Guidance: Choosing a Kernel

The decision is driven by polygon topology and budget headroom, not by raw throughput in isolation:

Condition	Choose	Why
Simple convex/concave zones, no overlap, tight budget	ray casting (compiled)	Lowest constant factor; even-odd is correct for non-self-overlapping rings.
Polygons with holes (exclusions inside zones)	ray casting + hole cancellation	Outer hit minus any inner-ring hit; still cheap.
Self-intersecting or overlapping multipolygons	winding number (Sunday’s, no `atan2`)	Only winding number distinguishes overlap from hole.
>1000 vertices and per-event budget headroom is thin	simplify first, then ray cast	Apply polygon simplification for high-throughput streams (Douglas-Peucker) to cut edge count before the kernel.
Coordinates as primitives, hot loop in Python	Numba/Cython kernel over `array('d')`	Removes per-edge allocation; flattens the GC-driven P99 tail.

The hybrid pattern most production systems converge on: a coarse cell index (H3 or S2) culls candidates, a bounding-box reject culls again, a compiled ray-casting kernel handles the common simple-zone case inline, and a winding-number kernel handles the rare overlapping-multipolygon case offloaded to a process pool. The cell-vs-tree pre-filter choice is analysed in quadtree vs R-tree performance analysis and Uber H3 hexagon indexing for mobility.

Operator FAQ

Why does my PIP P50 look great but P99 spike into the milliseconds?

Almost always GC, not geometry. A per-edge or per-event allocation promotes objects into gen-2, and a gen-2 collection landing mid-evaluation turns a microsecond call into a millisecond outlier. Check gc.get_stats(): if the spikes line up with gen-2 collections, move coordinates to array('d'), call gc.freeze() after warm-up, and keep the hot path allocation-free before you touch the kernel.

Ray casting benchmarks faster — why would I ever pick winding number?

Correctness on self-overlapping or self-intersecting polygons. Even-odd ray casting cannot distinguish an overlap from a hole, so it returns wrong containment for those shapes. If your service areas are guaranteed simple (no overlap), ship ray casting; if they can self-overlap, the ~30% throughput cost of winding number buys correct results.

My benchmark says 1M evals/sec but production misses the budget — what gives?

Your benchmark polygon is probably 10 vertices; production boundaries are often 200–500. Per-survivor cost is roughly linear in vertex count after the bounding-box reject, so a 30x larger polygon is ~30x slower per survivor. Re-benchmark at your real vertex distribution and your real survivor rate (the fraction passing the bbox reject), not on a toy polygon.

Is the bounding-box reject really worth it if I already have a spatial index?

Yes. The index gives you candidate polygons; the bbox reject then culls the points that fall in a candidate’s index cell but outside its bounding box, using an O(1) four-comparison test in front of an O(n) edge scan. Even at a 50% survivor rate it halves edge work, and at the 5–10% survivor rate seen after a coarse index lookup it is a 10–20x reduction. It is the cheapest throughput lever on the page.

Related

Optimizing ray casting vs winding number for GPS streams — the compiled-kernel walkthrough and per-stream selection logic behind these figures.
Streaming vs batch geofence evaluation — where this kernel sits in the event-at-a-time vs windowed execution model.
Memory-constrained spatial processing — the contiguous polygon layout the allocation-free kernel reads from.
Async Python execution patterns for spatial math — the GIL-free offload boundary for large-polygon evaluations.
Polygon simplification for high-throughput streams — cutting edge count before the kernel when vertex counts threaten the budget.
Up one level: Core Architecture & Latency Constraints — the pipeline-wide latency budget this kernel must fit inside.
