Designing a Solution
51Degrees asks the question: How do you efficiently perform spatial queries on a global dataset using arbitrary polygons?
We needed a geospatial system that could handle global datasets, sparse data, and complex polygon queries without sacrificing performance. Existing approaches got us part of the way there, but none met all our requirements. So, we built our own.
We knew that whatever geospatial system we employed needed to be fast, flexible and adaptable to any dataset. It needed to be able to:
- Store generic data
- Handle sparsely populated values
- Aggregate data intelligently, adapting to the fact that uniform quadrants aren't always the best representation of a dataset
- Query the data structure with a polygonal area (a list of coordinates), returning data for that area
To achieve this, the core features of the data structure needed to be:
- Highly parallelizable for insertion/construction
- Data Efficient - don't store data for regions where none exists
- Data Respondent - Can change shape depending on the data inserted
- Performant for querying
Existing approaches to geographic data lookups met some of these requirements but presented issues during the prototyping phase. NetTopologySuite's Sort-Tile-Recursive R-tree suffers when a vast number of entries need to be added and stored quickly and efficiently.
Both quadtrees and indexes built around mathematically derived cells (Geohash, Google's S2, Uber's H3) assume a relatively uniform data distribution, meaning optimal aggregation of the data was either not possible or not easily achievable.
All node envelopes are represented in WGS84 latitude/longitude coordinates.
We decided to implement our own data structure, taking core traits from quadtrees, R-trees and adaptive raster aggregation strategies.
Spatial Tree Architecture
Our implementation of a geospatial index builds upon the quadtree approach. Each node has an envelope that represents the bounded area of the node's data, and each node is either a structural node (no value set, but contains child nodes) or a leaf node (a value and no children). Unlike a quadtree, each node can have any number of children, and a node's bounds do not have to border or align with the other child nodes within the parent's bounds. This means sparsely populated areas of the dataset simply aren't covered.
Node<T>:
- Envelope (minX, minY, maxX, maxY)
- Value (optional)
- Children (list of Node<T>)
- IsLeaf (bool)
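The production implementation is .NET; as an illustrative sketch only, the node shape above could be rendered in Python like this (field names follow the list above):

```python
from dataclasses import dataclass, field
from typing import Generic, List, Optional, Tuple, TypeVar

T = TypeVar("T")

@dataclass
class Node(Generic[T]):
    """A tree node: either structural (children, no value) or a leaf (value, no children)."""
    envelope: Tuple[float, float, float, float]  # (minX, minY, maxX, maxY) in WGS84
    value: Optional[T] = None
    children: List["Node[T]"] = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        # A leaf carries a value and has no children; structural nodes are the inverse.
        return self.value is not None and not self.children
```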
Division diagram:
RootNode
|
+---------------+------------------+
| | |
Region A Region B Region C
| | |
| +----------+ +-----+-----+
| | | | | |
Leaf A Leaf B1 Leaf B2 (... further subdivisions)
Key:
- Some regions become leaf nodes immediately (uniform data)
- Others subdivide further based on data complexity
- Children are NOT uniform or aligned like a quad tree
Unlike a traditional quad tree, nodes subdivide into irregular regions based on data distribution rather than fixed quadrants.
To traverse the tree, the input polygon's intersection with a node at each depth is calculated. If the query polygon fully contains the node envelope, we can take the node's value directly without traversing deeper. If the node only partially overlaps, we traverse into its children and compute weighted contributions.
The index builder processes the tree breadth-first, level-by-level, in parallel.
Each level is processed as a batch, where nodes are evaluated independently in parallel. Since nodes do not share mutable state, this avoids contention and allows efficient scaling across cores.
This parallel processing is crucial for build times on a global dataset. Each node gets checked: either assign a value or create children. A key invariant we maintain is that every terminal leaf ends up with a non-null value, i.e. we never produce an empty leaf node. This also makes it easy to report build stats per level, showing where nodes are created most and which areas take longest to process.
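The level-by-level build can be sketched as follows (Python for illustration; the real implementation is .NET, and `evaluate` stands in for the provider/divider logic that either assigns a value or creates children):

```python
from concurrent.futures import ThreadPoolExecutor

def make_node(envelope):
    # Minimal node representation: envelope, optional value, children.
    return {"env": envelope, "value": None, "children": []}

def build_levels(root, evaluate):
    """Breadth-first, level-by-level build. `evaluate(node)` must either set
    the node's value (making it a leaf) or populate its children."""
    level, depth = [root], 0
    with ThreadPoolExecutor() as pool:
        while level:
            # Nodes within a level share no mutable state, so the whole
            # level is evaluated in parallel as one batch.
            list(pool.map(evaluate, level))
            # The next level is every child created by this batch.
            level = [c for n in level for c in n["children"]]
            depth += 1
    return depth
```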
Query Performance
Query performance was a primary design consideration from day one. The traversal of the tree is iterative and not recursive, which keeps memory usage predictable and avoids potential stack overflow issues on deep trees.
Traversal of the tree follows three basic rules per node. First, if a node's envelope doesn't intersect the query area, we skip the entire subtree. Second, if the query polygon fully covers a node, we add all its children without checking individual geometries. Third, when only part of a node intersects the query, we weight the contribution by the intersection area ratio, giving accurate results even for complex polygons.
Full polygon intersection calculations are performed only when necessary, as these operations are extremely expensive.
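A minimal sketch of the iterative traversal, using rectangular envelopes for both the node and the query for brevity (the production system intersects real polygons; nodes here are plain dicts):

```python
def overlap_ratio(env, q):
    """Fraction of `env` covered by query rectangle `q`; both are
    (minX, minY, maxX, maxY) envelopes."""
    w = min(env[2], q[2]) - max(env[0], q[0])
    h = min(env[3], q[3]) - max(env[1], q[1])
    if w <= 0 or h <= 0:
        return 0.0
    return (w * h) / ((env[2] - env[0]) * (env[3] - env[1]))

def query(root, q):
    """Iterative (stack-based) traversal returning (value, coverage) pairs."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        r = overlap_ratio(node["env"], q)
        if r == 0.0:
            continue                        # Rule 1: disjoint -> skip the subtree.
        if node["value"] is not None:
            out.append((node["value"], r))  # Rule 3: weight by intersection ratio.
        else:
            # Rule 2: when r == 1.0 every descendant is fully covered, so a real
            # implementation takes the whole subtree without further geometry
            # checks; this sketch simply descends for brevity.
            stack.extend(node["children"])
    return out
```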
Core Components
To build the index, three key components work together to create, subdivide and assign values to nodes.
The Data Provider component holds knowledge of the dataset(s), determines whether a given area contains meaningful data and produces values for that area. It answers two fundamental questions: "Is there any meaningful value within this geographic envelope?" and "If there is, what value should I assign?". In practice, the provider evaluates a node by sampling or aggregating the underlying dataset (e.g. raster pixels or vector intersections) and determining whether the variation within the node falls within an acceptable tolerance.
The Spatial Divider component is solely responsible for deciding, for each node, whether to stop and assign a value or to continue subdividing the node's bounds into children. To do this, the divider first calls the Data Provider to understand whether the current node sufficiently represents the data within its bounds. If it does, a value can be assigned. Otherwise, it creates subdivisions (e.g. 6x6) across the node's bounds, checks which cells contain data and represents this as a binary grid.
1 1 1 0 0 0
1 1 1 0 1 1
1 1 1 0 1 1
0 0 0 0 1 1
1 0 0 1 0 1
0 1 1 1 0 0
[A A A . . .]
[A A A . B B]
[A A A . B B]
[. . . . B B]
[C . . D . E]
[. F F F . .]
Each contiguous region of occupied cells is merged into the largest possible rectangle. These rectangles define the child nodes for the next level of the tree.
It then passes this to the division algorithm (below), which returns the subdivisions as a set of envelopes, which are then converted into new child nodes.
The Division Algorithm component takes as input the binary occupancy grid (mentioned above) and the current node's envelope, finds an optimal way to merge and represent the divisions, and returns these as new envelopes.
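As a toy illustration of this merging step, here is a naive greedy cover that repeatedly claims a large all-occupied rectangle until the grid is accounted for. This is a baseline sketch, not 51Degrees' algorithm, and the per-seed growth (right, then down) is a heuristic rather than a true largest-rectangle search:

```python
def cover_with_rectangles(grid):
    """Greedy cover: repeatedly claim the largest rectangle of unclaimed
    occupied cells found by growing right then down from each seed.
    Returns rectangles as (row0, col0, row1, col1), inclusive."""
    rows, cols = len(grid), len(grid[0])
    free = {(r, c) for r in range(rows) for c in range(cols) if grid[r][c]}
    rects = []
    while free:
        best = None
        for r0, c0 in free:
            # Grow right across free cells to fix a width...
            max_c = c0
            while (r0, max_c + 1) in free:
                max_c += 1
            width = max_c - c0 + 1
            # ...then grow down while every cell in the new row is free.
            r1 = r0
            while all((r1 + 1, c) in free for c in range(c0, c0 + width)):
                r1 += 1
            area = width * (r1 - r0 + 1)
            if best is None or area > best[0]:
                best = (area, r0, c0, r1, c0 + width - 1)
        _, r0, c0, r1, c1 = best
        rects.append((r0, c0, r1, c1))
        free -= {(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)}
    return rects
```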
Query performance is a primary objective of this design, and because the tree's query performance correlates directly with intersection calculations, having the fewest child rectangles per node means fewer intersection calculations, mitigating performance bottlenecks.
To achieve this, we researched algorithms that find the largest rectangle within a given area, since iteratively extracting the largest rectangle until the entire grid is accounted for is a good proxy for the smallest number of representable rectangles. Two existing algorithmic approaches served as reference points: the StackHistogram algorithm and a PrefixCalculation (Approach-2) algorithm.
In order to achieve maximal performance in this area (this operation will be happening millions of times during the building of the index) we developed our own 'Branching' algorithm, which outperformed the stack histogram and the prefix algorithm on small grids in our benchmarks.
Stopping conditions
Subdivision stops when any of the following conditions are met:
- The node size matches the underlying data resolution (e.g. GeoTIFF pixel size),
- All values within the node can be aggregated within the configured tolerance,
- A maximum resolution is reached (to prevent excessive subdivision).
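The stopping logic above can be sketched as a single predicate (parameter names are ours; `values` stands in for the values sampled within the node):

```python
def should_stop(env, values, pixel_size, depth, tolerance, max_depth):
    """Sketch of the three stopping conditions; names and signature are assumptions."""
    width, height = env[2] - env[0], env[3] - env[1]
    # 1. Node size has reached the underlying data resolution (e.g. pixel size).
    at_data_resolution = width <= pixel_size and height <= pixel_size
    # 2. All sampled values fit within the configured aggregation tolerance.
    within_tolerance = len(values) > 0 and max(values) - min(values) <= tolerance
    # 3. Hard cap on subdivision depth.
    return at_data_resolution or within_tolerance or depth >= max_depth
```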
Branching algorithm
Our Branching algorithm does not always guarantee that the largest rectangle will be found, but in practice it produces a similar, and often smaller, number of rectangles. Unlike histogram-based approaches, which evaluate rows independently, the branching algorithm expands outward from high-value seed regions, allowing it to capture larger contiguous areas earlier and reduce fragmentation. The procedural steps of the branching algorithm are as follows:
- Calculate initial priority map based on adjacent boundaries,
- Find the highest-priority available cell as the rectangle seed,
- Expand rectangles in all 4 directions from the seed until constrained by a boundary or absent data,
- Mark the rectangle and update the working grid,
- Recalculate priorities for the affected region only,
- Repeat until no valid cells remain.
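A simplified sketch of these steps. The priority heuristic here, counting neighbouring grid boundaries and absent cells, is our interpretation of the "adjacent boundaries" map, not the exact production implementation, and priorities are recomputed lazily at each seed selection rather than per affected region:

```python
def branching_cover(grid):
    """Branching-style cover: seed at the highest-priority cell, expand the
    rectangle in all four directions, claim it, repeat. Returns rectangles
    as (row0, col0, row1, col1), inclusive."""
    rows, cols = len(grid), len(grid[0])
    free = {(r, c) for r in range(rows) for c in range(cols) if grid[r][c]}

    def priority(cell):
        # Cells hugging the grid edge or absent data make good seeds.
        r, c = cell
        nbrs = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        return sum(n not in free for n in nbrs)

    rects = []
    while free:
        r0, c0 = max(free, key=priority)   # highest-priority available cell
        r1, c1 = r0, c0
        grown = True
        while grown:
            grown = False
            # Try growing up, down, left, right until blocked by a
            # boundary or absent data.
            for dr0, dr1, dc0, dc1 in ((-1, 0, 0, 0), (0, 1, 0, 0),
                                       (0, 0, -1, 0), (0, 0, 0, 1)):
                nr0, nr1 = r0 + dr0, r1 + dr1
                nc0, nc1 = c0 + dc0, c1 + dc1
                band = [(r, c)
                        for r in range(nr0, nr1 + 1)
                        for c in range(nc0, nc1 + 1)
                        if not (r0 <= r <= r1 and c0 <= c <= c1)]
                if band and all(b in free for b in band):
                    r0, r1, c0, c1 = nr0, nr1, nc0, nc1
                    grown = True
        rects.append((r0, c0, r1, c1))
        free -= {(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)}
    return rects
```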
Here are benchmarks run with BenchmarkDotNet on an Apple M2 MacBook Pro - 16GB RAM.
RandomSeed: The grid is randomly generated using a seed. Each cell has a 50% chance of containing data.
The Branching algorithm performs best on small, irregular (random) grids, which more closely resemble real-world geographic data.
Note: As mentioned above, the branching algorithm will likely produce a different grid to StackHistogram and Prefix. These benchmarks are testing the time taken to return a result, not the exact same result.
| Algorithm | GridSize | Mean (μs) | Error (μs) | StdDev (μs) | Ratio | Gen0 | Allocated (KB) | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|
| Branching | 32 | 164.71 | 2.678 | 2.505 | 0.41 | 0.7324 | 6.09 | 0.03 |
| Prefix | 32 | 404.36 | 4.289 | 3.802 | 1.00 | 25.8789 | 213.78 | 1.00 |
| StackHistogram | 32 | 462.36 | 3.356 | 2.975 | 1.14 | 8.7891 | 75.06 | 0.35 |
| Branching | 64 | 2,993.98 | 18.876 | 17.657 | 0.43 | - | 24.09 | 0.37 |
| Prefix | 64 | 6,915.39 | 31.947 | 26.677 | 1.00 | 7.8125 | 64.34 | 1.00 |
| StackHistogram | 64 | 7,956.18 | 38.551 | 36.061 | 1.15 | 62.5000 | 513.05 | 7.97 |
| Branching | 128 | 70,599.47 | 515.376 | 482.083 | 0.42 | - | 96.09 | 0.37 |
| Prefix | 128 | 168,060.07 | 1,669.084 | 1,561.263 | 1.00 | - | 261.69 | 1.00 |
| StackHistogram | 128 | 196,814.76 | 1,380.711 | 1,223.964 | 1.17 | 333.3333 | 3700.96 | 14.14 |
Data-Respondent vs Standard Algorithms
An important distinction can be made between a standard divider algorithm and a 'data-respondent' one. A standard algorithm divides based purely on the presence or absence of data, whereas a data-respondent algorithm also considers the actual values being stored, comparing them in order to aggregate grid cells that are similar enough within a set tolerance. A comparer tailored to the type of data stored in the index tells the algorithm where aggregation should happen.
For example, two cells can be merged if:
abs(valueA - valueB) < tolerance
The value of the tolerance threshold here is entirely dependent on the nature of the dataset and the extent to which lossy compression is acceptable.
If a dataset has a margin of error of 10%, then a 10% tolerance only aggregates adjacent data within bounds the data itself cannot distinguish. The compression is indeed lossy, but it remains consistent with the assumptions already made about the data.
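A sketch of such a comparer, with both the absolute form shown above and a relative variant suited to a percentage margin of error (the relative form is our assumption, not necessarily the production comparer):

```python
def can_merge(value_a, value_b, tolerance):
    """Absolute-tolerance comparer, as in the formula above."""
    return abs(value_a - value_b) < tolerance

def can_merge_relative(value_a, value_b, rel_tolerance):
    """Relative variant: suits datasets quoted with a percentage margin of
    error (e.g. 10%), so larger values tolerate larger absolute differences."""
    baseline = max(abs(value_a), abs(value_b))
    return baseline == 0 or abs(value_a - value_b) / baseline <= rel_tolerance
```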
The distinction between the two types of algorithm matters when an underlying dataset varies in value, not just in the presence of data. For example, a dataset containing the average population for a given area will generally see clusters of larger values near cities, with decreasing values spreading out into the suburbs. An algorithm that is not data-aware would simply lump high-density areas with low-density areas and be forced to resolve them with more subdivisions deeper in the spatial tree. A data-aware algorithm makes more optimal divisions earlier on, differentiating appropriately. This also means fewer structural nodes in the tree, decreasing the query time of the tree as a whole.
Data-respondent algorithms are slower to compute, but they produce more optimal divisions for complex datasets.
Real-world Data: Implementation and Trade-Offs
51Degrees' first production use case of our spatial index implementation is for a global lookup for population values. These are then used to populate values in our IP Intelligence data file for Countries Properties. An IP Address is associated with a list of countries, weighted either by population contribution or geographical area contribution.
To support this, the spatial tree project is used to store global population data, which is extracted and aggregated into these values.
The data flow for building this index works as follows:
- Load country border geometries in GeoJSON format into a NetTopologySuite STRtree for fast spatial lookups,
- Preload all raster tiles (GeoTIFF) containing population data at 1 km pixel resolution,
- The provider combines country geometry checks with raster sampling to determine population values,
- Build the tree using the Branching algorithm,
- A query service wraps the tree, handling polygon and multipolygon inputs with weighted area aggregation.
The resolution strategy is particularly important. To avoid creating unnecessary structural nodes, the tree builder must stop subdividing once a node's size matches the GeoTIFF pixel resolution. This prevents the tree from going deeper than the actual data precision, which would be wasteful and could introduce false precision. Without this cap, the tree may keep trying to subdivide if, for example, a node lies on the border between two countries and the boundary never properly resolves. The population comparer handles tolerance for aggregated values, using a margin-of-error approach to determine when values are similar enough to merge.
A 6x6 subdivision grid size provides good aggregation without being too generic or creating too many children in early tree levels. Creating fewer divisions can lead to nodes with envelopes which are too large and are imprecise at properly representing the underlying data. Conversely, more divisions increase build time without a meaningful query performance improvement.
Geographical Area calculation
Alongside breaking down countries by population, 51Degrees calculates the land coverage in km² for a given area, broken down by country.
Performing accurate area calculations for a given geometry presents its own challenges: projections and planar calculations must be chosen carefully to return accurate values. When dealing with geographic coordinates, the curvature of the Earth matters, especially as the area in question grows larger or sits nearer the poles.
We employ two distinct strategies for area calculation depending on the scale. For individual tree nodes, which generally represent small geographic cells, we use a latitude-adjusted rectangular approximation. This method scales the cell's longitudinal width by the cosine of the midpoint latitude, to account for longitude lines converging toward the poles, and multiplies it by the height derived from the latitude delta. It's computationally lightweight and provides sufficient accuracy for small cells, where spherical distortion remains minimal.
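A sketch of the latitude-adjusted approximation (a spherical Earth radius is assumed; the production code may differ):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean spherical radius (assumption)

def cell_area_km2(min_lon, min_lat, max_lon, max_lat):
    """Latitude-adjusted rectangular approximation for a small WGS84 cell:
    the east-west width shrinks with cos(mid latitude) as meridians
    converge toward the poles."""
    mid_lat = math.radians((min_lat + max_lat) / 2.0)
    width_km = math.radians(max_lon - min_lon) * EARTH_RADIUS_KM * math.cos(mid_lat)
    height_km = math.radians(max_lat - min_lat) * EARTH_RADIUS_KM
    return width_km * height_km
```

For a 1° x 1° cell this yields roughly 12,300 km² at the equator and about half that near 60° latitude, matching the expected meridian convergence.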
For query geometries and country-level attributions, we employ the spherical excess formula, a well-established method for computing polygon areas on a sphere using WGS84 coordinates. The formula walks the polygon's vertices, accumulating the signed contribution of each edge segment and accounting for wrap-around at the antimeridian, producing a result that is exact on a spherical model of the Earth.
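A sketch of this signed-edge accumulation, following the widely used Chamberlain-Duquette style formulation (our choice of formulation; the exact production formula may differ), with antimeridian wrap handling:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean spherical radius (assumption)

def spherical_area_km2(ring):
    """Spherical polygon area for a closed ring of (lon, lat) WGS84 vertices.
    Accumulates the signed contribution of each edge; edges crossing the
    antimeridian are wrapped into (-pi, pi]."""
    total = 0.0
    n = len(ring)
    for i in range(n):
        lon1, lat1 = ring[i]
        lon2, lat2 = ring[(i + 1) % n]
        dlon = math.radians(lon2 - lon1)
        # Wrap antimeridian-crossing edges to the shorter direction.
        if dlon > math.pi:
            dlon -= 2.0 * math.pi
        elif dlon < -math.pi:
            dlon += 2.0 * math.pi
        total += dlon * (2.0 + math.sin(math.radians(lat1))
                             + math.sin(math.radians(lat2)))
    return abs(total) * EARTH_RADIUS_KM ** 2 / 2.0
```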
When traversing the tree, each node's contribution is weighted by its intersection ratio with the query polygon. This weighted aggregation naturally produces results that closely approximate the true geometry area. The values are then normalized, scaling the aggregated total to match the spherical area of the query polygon.
To break down by country, rather than scaling country areas proportionally from the aggregated result, we compute the geometric intersections between the query polygon and each country's border geometry. This produces more accurate country-level distributions, especially where the query area spans multiple countries or cuts across country boundaries irregularly. If the geometric approach fails due to topology issues, we gracefully fall back to proportional scaling.
Key Lessons and Trade-offs
Implementing and using this feature taught us some key lessons about its trade-offs. Higher subdivision counts at build time can improve shape fidelity but increase construction costs. Finding the right balance for a given dataset requires experimentation.
Dataset resolution should inform tree resolution. If the tree subdivides to a resolution deeper than the data's precision, resources are wasted and false precision may be introduced. The provider should respect the inherent resolution of the underlying dataset.
Finally, border handling remains an area for potential improvement. When countries share a border, the algorithm must decide how to handle those edge pixels. Resolving too aggressively means population values can be misattributed in large quantities, widening error margins; this is worse when indexing low-resolution datasets. Resolving too conservatively creates excessive subdivisions. We use a max depth to cap this, but there's room for more sophisticated approaches.
Production Integration
After the spatial index is built, we use geometries derived from real world analysis to query the index and retrieve the population information. This is then aggregated by country and assigned to property values within our IP Intelligence distribution file. The properties are listed here.
As the main source of these properties is the spatial index, they are initially populated only when an Area value is available.
However, because the CountryCode property is populated from latitude and longitude values, the countries properties can fall back to those coordinates.
In essence, the only time where no 'countries' information is returned is when no location data exists at all for the IP address.
Conclusion
Our spatial index provides an efficient, generic spatial tree that works across different data types. It's successfully powering our IP to GeoLocation lookups and has proven adaptable enough to handle various geographic datasets. The three-component architecture (provider, divider and division algorithm) provides flexibility while maintaining strong query performance, allowing us to run, aggregate and export hundreds of thousands of queries in a short space of time. There's still room for improvement in aggregation strategies and border handling, but the core system is performing well in production.