Posted by Oscar Wahltinez, Developer Programs Engineer

In a previous blog post, we introduced the Open Buildings dataset and covered how to download it using a couple of different options. Here, we'll cover how to sample the dataset using different techniques and what biases are associated with those samples.

Footprints extracted from the Open Buildings dataset

Random sampling using Pandas

Assuming that you have sufficient memory available (40-50 GB of RAM) and are only interested in the locations of buildings (ignoring their footprints), you can load the entire dataset as previously covered and then call pandas.DataFrame.sample() to obtain a random sample:

# DataFrame containing all data in-memory.
df: pd.DataFrame = ...

# Sample random elements from the dataset.
sample_size = 100_000
sample = df.sample(sample_size)

This approach has a couple of drawbacks:
  1. It requires a large amount of memory, especially if we also want to get the building footprints.
  2. Clusters with a high building density will be far more represented than rural areas. With a small sample size, you may not get a single building outside of big cities!

Those drawbacks may be acceptable for your use case, but if you want to explore alternative trade-offs, read on.

Class imbalance in buildings data

Random sampling is a great default, but it falls short when the data suffers from class imbalance. Some of the analyses that can be performed on the Open Buildings dataset will likely require data from both urban and rural areas. Random sampling can lead to an under-representation of rural areas, or even miss them altogether. Consider, for example, a set of 10 buildings where 9 are classified as urban and only 1 as rural.
Randomly choosing 5 samples (half of the data) with replacement has a (0.9)^5 ≈ 59% probability of selecting only elements classified as urban. Similarly, in the real dataset the vast majority of buildings are in non-rural areas. Arguably, having few buildings is what makes rural areas, well, rural!
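As a quick sanity check, here is a minimal sketch that estimates this probability by simulation. The 9-urban/1-rural toy dataset is the hypothetical example above, not actual Open Buildings data:

import random

# Toy dataset: 9 buildings labeled urban, 1 labeled rural.
labels = ["urban"] * 9 + ["rural"]

# Estimate the probability that 5 draws with replacement are all urban.
trials = 100_000
all_urban = sum(
    all(random.choice(labels) == "urban" for _ in range(5))
    for _ in range(trials)
)
print(all_urban / trials)  # ~0.59, matching (0.9) ** 5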

This problem can be tackled from multiple angles, and each approach makes a different set of trade-offs. It's your job as a scientist, researcher, or developer to choose the trade-offs appropriate for your particular use case.

Addressing geospatial bias

One way to look at this class imbalance problem is to acknowledge that the samples are clustered around big population centers. To avoid that, we can bias our sample selection process and discard samples that are too close to one another. The simplest method is to randomly draw samples from the dataset, compute the distance to all previously selected buildings, and add each candidate to the selection only if a minimum distance threshold is met. Here's what that might look like:

import pandas as pd
from geopy.distance import great_circle

# Read all buildings within the S2 cell with token 0ef.
url_root = "gs://open-buildings-data/v2"
poly_path = f"{url_root}/points_s2_level_4_gzip"
df = read_pandas_csv(f"{poly_path}/0ef_buildings.csv.gz", compression="gzip")

# Helper function to extract latitude and longitude from building objects.
get_lat_lng = lambda x: (x["latitude"], x["longitude"])

selection = []
sample_size = 1_000
threshold_meters = 500

# Draw random samples until we have 1,000 buildings at least 500m apart.
# `DataFrame.sample(frac=1)` is a shortcut used to shuffle the dataset.
# NOTE: Don't do this unless you need very few samples.
for _, building in df.sample(frac=1).iterrows():
    latlng = get_lat_lng(building)
    # As soon as any distance does not meet the threshold, stop computing them.
    distance_func = lambda x: great_circle(latlng, get_lat_lng(x)).meters
    if all(distance_func(x) >= threshold_meters for x in selection):
        selection.append(building.to_dict())
    if len(selection) >= sample_size:
        break

A few things worth noting in the above code snippet:

  1. We are using the great circle distance formula, which assumes the Earth is a perfect sphere. This is an approximation, but it is good enough for our sampling purposes.
  2. The loop gets slower with each sample added to the selection, because the number of distance comparisons per iteration grows linearly with the number of samples already selected, making the total work roughly quadratic in the sample size.

Unfortunately, this is very slow. It may work well for a small number of samples, but for the sample sizes needed for statistical analysis it is likely not fast enough. The good news is that there is yet another geospatial sampling method we can use.

Using geospatial hashing for sampling

A hash function maps an arbitrarily-sized set of inputs to a fixed-sized set of outputs. This is frequently used for strings and files, but the same concept can be applied to geospatial entities such as buildings. One way to do this is by dividing the region of interest into a grid and mapping each point to the grid cell that it falls in.
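For intuition, here is a minimal sketch of one such hash. The grid_hash helper below is hypothetical (it is not the hash used later in this post); it simply snaps coordinates to a fixed-size grid measured in degrees:

def grid_hash(lat, lng, cell_size_degrees=0.005):
    """Map a <lat, lng> point to the grid cell that contains it."""
    return (int(lat // cell_size_degrees), int(lng // cell_size_degrees))

# Two nearby points map to the same cell; a distant point does not.
print(grid_hash(0.1234, 36.812) == grid_hash(0.1236, 36.813))  # True
print(grid_hash(0.1234, 36.812) == grid_hash(0.4000, 36.900))  # False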

Assuming the hash function is relatively cheap to compute, we can enforce the geographical distribution bias in our sampling by ensuring that each individual cell in the grid contains at most a single sample. Here's an illustration of the methodology:

Illustration of geospatial hashing using arbitrary shape-filling polygons

There are many possible hash functions; the one chosen here maps <lat, lng> coordinates to an S2 cell (from the S2 geometry library) at a predetermined level. There is no guaranteed minimum distance between two samples, since they can fall near the shared edge of two adjacent cells. In those cases, however, the samples from the other neighboring cells will be farther apart. Thus, on average, the distance between a sample and its closest neighbor is expected to be relatively close to the average distance between neighboring cell centroids, assuming every cell produces a sample.

Here's an example implementation of this sampling methodology using the s2sphere Python package:

import s2sphere as s2

df: pd.DataFrame = ...
get_lat_lng = lambda building: ...

def s2_cell_at_lat_lng(lat, lng, level=30):
    """Helper function to retrieve S2 cell of <level> at <lat,lng> coordinates."""
    latlng = s2.LatLng.from_degrees(lat, lng)
    return s2.CellId.from_lat_lng(latlng).parent(level)

cell_level = 14
sample_size = 1_000
selection = []
cell_tokens = set()

# Iterate over all (shuffled) buildings and add them to the selection if there
# are no hash collisions.
# NOTE: df.sample(frac=1) shuffles the DataFrame.
for _, building in df.sample(frac=1).iterrows():
    latlng = get_lat_lng(building)
    token = s2_cell_at_lat_lng(*latlng, level=cell_level).to_token()
    if token not in cell_tokens:
        cell_tokens.add(token)
        selection.append(building.to_dict())
    if len(selection) >= sample_size:
        break

Here, we are using S2 cells at level 14, which are approximately 500 x 500 meters. Therefore, each sampled building is expected to be, on average, roughly 500 meters or more away from its closest sampled neighbor. Here's how you can test that theory:

import scipy.spatial.distance
from geopy.distance import great_circle

# Compute a 2D matrix of pairwise distances.
latlngs = [(x["latitude"], x["longitude"]) for x in selection]
distance_function = lambda a, b: great_circle(a, b).meters
distances = scipy.spatial.distance.pdist(latlngs, metric=distance_function)
distance_matrix = scipy.spatial.distance.squareform(distances)

# Set all distances between a sample and itself to infinity, otherwise the
# average minimum distance will be zero.
distance_matrix[distance_matrix == 0] = float("inf")

# Compute the mean distance between each sample and its closest neighbor.
distance_matrix.min(axis=1).mean()

This produces an average minimum distance of 7,291 meters. Much more than 500 meters! That's because not every cell contains a building and we are only sampling 1.2% of the buildings in this particular cluster.

While the distance-based sampling methodology takes several minutes to finish, applying geospatial hashing completes its sampling in just a few seconds.

Scaling the processing with Apache Beam

Although it takes a few seconds to complete, that's only for the processing of a single S2 cell cluster – and the Open Buildings dataset is divided into 129 clusters.

This is not a huge number, but learning how to parallelize the processing will come in handy for potentially much larger future datasets. There are many ways to achieve this; here we'll cover how to scale the processing of our data using Apache Beam.

Apache Beam is a complex, mature software product. While getting started with it can be relatively easy, developing true expertise can take a very long time. There are many guides out there that teach you how to use Apache Beam, and the documentation has a great programming guide. Covering the basics of Apache Beam is well beyond the scope of this post, but if you've never heard of it before and feel completely lost, here's what a basic Beam pipeline looks like:

[Input PCollection] | [Transform] | [Output Writer]
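As a concrete illustration of that shape, here is a minimal, self-contained sketch; the input elements and the output path are made up for the example:

import apache_beam as beam

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create(["a", "b", "c"])        # Input PCollection
        | beam.Map(str.upper)                 # Transform
        | beam.io.WriteToText("output/demo")  # Output writer
    )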

In this instance, a desirable input PCollection would be the set of S2 clusters. This is because, thanks to the hierarchical properties of S2 cells, our hashing function is guaranteed not to have a collision across different clusters; so we can safely parallelize and process each S2 cluster independently. To make it as simple as possible for the individual workers, we can process the S2 clusters as URLs that can be loaded directly by Pandas:

# Get all S2 cell tokens that contain buildings data.
# NOTE: Reading files directly from GCS is faster than the HTTP REST endpoint.
url_root = "gs://open-buildings-data/v2"
# url_root = "https://storage.googleapis.com/open-buildings-data/v2"
tokens = read_pandas_csv(f"{url_root}/score_thresholds_s2_level_4.csv").s2_token

# The polygon type can be "points" (centroid) or "polygons" (footprint).
poly_type = "points"  #@param ["points", "polygons"]

# Create a list with all URLs that we must download data from.
fnames = [f"{token}_buildings.csv.gz" for token in tokens]
poly_path = f"{url_root}/{poly_type}_s2_level_4_gzip"
urls = [f"{poly_path}/{fname}" for fname in fnames]

# Save the header columns, which are shared across all clusters.
columns = read_pandas_csv(urls[0], compression="gzip", nrows=0).columns

Next, we can write a function that will process each of those URLs and produce individual output records. To avoid the hassle of quoting fields, let's output the records as TSV-formatted strings. This largely builds on the previously defined methodology:

def process_token_url(url, columns=None):
    cell_level = 14
    sample_size = 1_000
    sample_count = 0
    cell_tokens = set()

    # Iterate over all (shuffled) buildings and return them if there are
    # no hash collisions.
    df = read_pandas_csv(url, compression="gzip")
    for _, building in df.sample(frac=1).iterrows():
        latlng = building["latitude"], building["longitude"]
        # Get the S2 cell token corresponding to <lat, lng>.
        token = s2_cell_at_lat_lng(*latlng, level=cell_level).to_token()
        if token not in cell_tokens:
            sample_count += 1
            cell_tokens.add(token)
            # Return the building as a TSV line.
            yield "\t".join(str(building[col]) for col in columns)
        # Once we reach our desired sample size, we can stop.
        if sample_count >= sample_size:
            break

Finally, you can put everything together in a Beam pipeline that outputs all records into sharded TSV files. Note that we use FlatMap instead of Map because process_token_url() yields multiple records per URL, and we want the records from all dataset clusters flattened into a single output collection:

import apache_beam as beam

# These options are used for testing purposes and work on Colab. To take
# advantage of parallel processing, you should adjust them to your resources.
opts = dict(direct_running_mode="multi_threading", direct_num_workers=8)

with beam.Pipeline(options=beam.pipeline.PipelineOptions(**opts)) as pipeline:
    _ = (
        pipeline
        | beam.Create(urls)
        | beam.FlatMap(process_token_url, columns=columns)
        | beam.io.WriteToText("output/data", ".tsv", header="\t".join(columns))
    )

Using Beam's DirectRunner on a modest machine (by today's standards!) with 32 cores, processing and sampling all of the dataset clusters takes less than 10 minutes. Not too bad. 😎

Code and data availability

The code used in this blog post is available as part of the following GitHub repositories and gists: