Spatial data is pervasive and essential. It affects everything from our daily commute to our understanding of global environmental issues. The rapid advancement of technology has led to a massive increase in the availability and diversity of spatial data sources. The affordability and scalability of data storage, processing, and transmission have enabled this unprecedented growth.
However, spatial data scientists still face significant challenges in analyzing and interpreting these large and complex datasets. They require efficient and specialized tools that can handle the distinctive features of spatial data types. In this post, we will explore some of the most useful tools for spatial data science. But first, let us define what spatial data science actually entails.
What is Spatial Data Science?
Drawing on techniques from data science, geographic information systems (GIS) and spatial statistics, spatial data science is focused on unearthing insights from data with a location component. Spatial data science can be used to solve a broad range of problems, such as predicting disease outbreaks, optimizing transportation routes, analyzing patterns of urban growth, and monitoring changes in the environment.
While this work was traditionally conducted by GIS analysts and in geo-related academia, an increasing share of spatial work is now being undertaken by data scientists with access to spatial data. The spatial tooling has changed accordingly, with many powerful data science tools being extended to make use of spatial data.
Spatial data comes in two primary forms: vector and raster. Each requires a different set of tools, though traditional GIS applications can work with both. Ideal for precise representation of distinct entities, such as roads or buildings, vector data can describe points, lines or polygons. Better suited for representing continuous data, such as satellite imagery or topography, raster data is stored as a multidimensional grid, with two of the dimensions usually spatial in nature and aligned with the earth’s surface.
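To make the distinction concrete, here is a minimal sketch in plain Python with no spatial libraries. The building footprint, the elevation grid, the origin, and the pixel size are all invented for illustration, and real tools handle far more detail (projections, holes in polygons, nodata values). The vector feature is a list of coordinates, while the raster is a grid plus the small amount of georeferencing needed to map a real-world coordinate to a cell.

```python
# Vector: a building footprint stored as an ordered ring of (x, y) vertices.
building = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]


def polygon_area(ring):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for i, (x1, y1) in enumerate(ring):
        x2, y2 = ring[(i + 1) % len(ring)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Raster: a grid of elevation values, rows ordered from north to south.
elevation = [
    [10, 12, 14],
    [11, 13, 15],
]
origin_x, origin_y = 100.0, 50.0  # hypothetical top-left corner of the grid
pixel_size = 10.0                 # hypothetical cell size in map units


def sample(x, y):
    """Look up the cell that contains the map coordinate (x, y)."""
    col = int((x - origin_x) // pixel_size)
    row = int((origin_y - y) // pixel_size)
    return elevation[row][col]


footprint_area = polygon_area(building)
height_at_point = sample(115.0, 35.0)
```

The vector feature carries its geometry explicitly, whereas the raster value only has a location because of the origin and pixel size stored alongside the grid.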
Still the workhorses of spatial work, GIS applications focus on a GUI, with some scripting capabilities available. Standing out from the free open source crowd is QGIS. QGIS has been in development for more than two decades. It contains a comprehensive set of tools for managing, manipulating, analyzing, and visualizing both vector and raster data. QGIS can also be used to process and analyze point clouds, a type of point vector data commonly collected by LiDAR sensors.
Despite its age, QGIS is an excellent tool for editing vector data. As such, it is commonly used for labeling satellite imagery in spatial machine learning workflows. More lightweight tools better suited to the modern team-based ML workflow are springing up, such as Microsoft’s Spatial Imagery Labeling Toolkit.
Consider a customer dataset with typical attributes, such as location, personal information, and purchase history. Data of this kind is usually held in a database and queried in the standard way, typically with SQL. Spatial databases extend conventional databases with spatial data types, functions, and indices that allow for efficient queries of a spatial nature that would be difficult or impossible otherwise. In our customer dataset example, a spatial database would allow us to perform basic queries such as selecting customers by geographic area or ranking customers by distance from a location. More advanced queries include spatially joining regional business characteristics with individual customer data, and grouping customers by their nearest business locations.
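The two basic queries can be sketched in plain Python to show what a spatial database is doing on our behalf. This is a brute-force illustration under stated assumptions, not how a spatial database is implemented: the customer records, the bounding box, and the store location are invented, and a real spatial database would use a spatial index rather than scanning every row.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_bbox(customer, min_lon, min_lat, max_lon, max_lat):
    """True if the customer falls inside a longitude/latitude bounding box."""
    return (min_lon <= customer["lon"] <= max_lon
            and min_lat <= customer["lat"] <= max_lat)


# Hypothetical customer table with a location for each row.
customers = [
    {"id": 1, "lat": 41.88, "lon": -87.63},
    {"id": 2, "lat": 40.71, "lon": -74.01},
    {"id": 3, "lat": 34.05, "lon": -118.24},
]

# Query 1: select customers inside a geographic area.
in_area = [c["id"] for c in customers if in_bbox(c, -95.0, 36.0, -80.0, 49.0)]

# Query 2: rank customers by distance from a (hypothetical) store location.
store_lat, store_lon = 41.0, -88.0
ranked = sorted(
    customers,
    key=lambda c: haversine_km(store_lat, store_lon, c["lat"], c["lon"]),
)
ranked_ids = [c["id"] for c in ranked]
```

Every query here touches every customer, which is exactly the cost that spatial indices are designed to avoid on large tables.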
Many database technologies, such as Snowflake and Microsoft SQL Server, have native spatial support to varying degrees. PostGIS is one of the best known and most complete spatial databases available. PostGIS is built on PostgreSQL, a popular open source database, which means it can run anywhere PostgreSQL can, including managed database services, such as Amazon Aurora and Azure Database for PostgreSQL. GIS applications such as QGIS include a PostGIS connector, making it easy to query and modify the spatial database.
Readily understandable for data scientists familiar with pandas, GeoPandas is a Python package that extends the datatypes used by pandas to allow spatial operations on vector data (held as geometry objects). GeoPandas can be used in much the same way that the spatial functions of a spatial database are employed to select, modify, join or aggregate tabular data via the geometry column. GeoPandas is so central to spatial data science that it has spawned its own ecosystem of dependent packages. Notable among these is PySAL, the Python Spatial Analysis Library for spatial data science on vector data.
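As a rough sketch of what this looks like in practice, the example below builds a small GeoDataFrame of customer points and spatially joins it to two rectangular regions. The customer coordinates, region names, and region boundaries are invented for illustration, and the `predicate` keyword assumes a reasonably recent GeoPandas release (older versions used `op` instead).

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Hypothetical customer table with longitude/latitude columns.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "lon": [-87.63, -87.90, -88.50],
    "lat": [41.88, 41.95, 41.70],
})

# Promote the table to a GeoDataFrame by building a geometry column.
customers = gpd.GeoDataFrame(
    customers,
    geometry=gpd.points_from_xy(customers["lon"], customers["lat"]),
    crs="EPSG:4326",
)

# Two made-up rectangular sales regions: box(minx, miny, maxx, maxy).
regions = gpd.GeoDataFrame(
    {"region": ["east", "west"]},
    geometry=[
        box(-88.0, 41.0, -87.0, 42.5),
        box(-89.0, 41.0, -88.0, 42.5),
    ],
    crs="EPSG:4326",
)

# Spatial join: attach the region each customer point falls within.
joined = gpd.sjoin(customers, regions, how="left", predicate="within")

# From here on it is ordinary pandas: count customers per region.
counts = joined.groupby("region")["customer_id"].count()
```

Once the join is done, the result behaves like any other pandas table, which is much of what makes GeoPandas approachable for data scientists.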
Xarray is a Python package for handling gridded data, including rasters. It extends NumPy-like multidimensional arrays with labels for dimensions, coordinates, and other attributes. By doing so, Xarray improves the experience of working with multidimensional arrays in much the same way that pandas improves the way we interact with tabular data. Xarray is particularly useful for spatial data science because it lets us interact with our data via real world coordinates, such as latitude and longitude, instead of by image pixel locations, making it easier to manipulate and analyze rasters. Additionally, Xarray integrates with Dask, which applies parallel processing to array manipulation, allowing us to work with arrays that are larger than memory (a common occurrence) and at greater speed.
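A minimal sketch of coordinate-based access follows. The grid values and the latitude and longitude labels are made up for illustration; a real raster would normally be opened from a file rather than built by hand.

```python
import numpy as np
import xarray as xr

# A tiny made-up elevation grid labelled with latitude/longitude coordinates.
elevation = xr.DataArray(
    np.arange(12, dtype=float).reshape(3, 4),
    coords={
        "lat": [39.0, 40.0, 41.0],
        "lon": [-90.0, -89.0, -88.0, -87.0],
    },
    dims=("lat", "lon"),
    name="elevation",
)

# Select the cell nearest to a real-world location, not a pixel index.
value = elevation.sel(lat=40.2, lon=-88.9, method="nearest")

# Select a geographic window; label-based slices include both endpoints.
subset = elevation.sel(lat=slice(39.0, 40.0), lon=slice(-90.0, -88.0))
```

Neither selection mentions row or column numbers, so the same code keeps working if the raster is resampled to a different resolution.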
Like GeoPandas, Xarray forms the center of an ecosystem of packages. Of note for spatial data science are rioxarray, Xarray Spatial, and stackstac.
rioxarray implements functions for opening common raster formats, performing common raster manipulations such as clipping and merging, and reprojecting rasters between different projections.
Xarray Spatial implements common raster analysis functions for analyzing topographic data, satellite imagery, and rasterized versions of vector data. The latter is particularly useful when working with large, complex vector datasets, such as national road networks or demographics.
stackstac allows us to interact with a STAC as an Xarray object. STAC (SpatioTemporal Asset Catalog) is a metadata specification for raster datasets that standardizes the way developers and applications interact with raster data. The upshot of this is that we can use stackstac to query any STAC, such as for satellite imagery or climate data, with the same syntax and parameters, like time period and geographic extent. stackstac also allows us to work efficiently with large raster datasets held in the cloud because dask uses lazy computation; subsets of data are only downloaded when they are required in the execution graph.
Modern machine learning tools for spatial data lag behind those for other data types, such as regular imagery and natural language. Until recently, spatial ML involved a lot of custom boilerplate code to prepare data and process model outputs. TorchGeo, built on the popular PyTorch framework for ML, provides a continually expanding set of datasets, geographic samplers, common image transforms, model architectures, and pre-trained models for spatial data. Many useful freely available datasets, such as satellite imagery from NASA’s Landsat or ESA’s Sentinel missions, are already covered by TorchGeo.
Spatial data science is a rapidly growing field, both in the range of contexts in which it is used and in the volume of spatial data available. A central challenge, therefore, is finding a high performance toolset that leverages the unique characteristics of spatial data. We hope the set of tools we’ve introduced in this article is a helpful starting point for your next spatial data science project.
If you are looking for expert guidance and support to develop your own geospatial machine learning solution, please contact our team of professionals at Strong Analytics. We are ready to help you achieve your goals!