Enrich locations and parse with GeoPandas¶
Learn how to enrich and prepare your address data with geospatial analytics in mind. This tutorial runs the Geocoding and Places Details APIs on a few thousand address records. We show how to preprocess tabular data as input and briefly discuss address validation. In addition, you will learn how to leverage the GeoJSON format to store results in GeoDataFrames.
You want to repeat all the steps below on your own data? Check out how to install a Conda virutal environment and make sure to add geobatchpy
and geopandas
.
Part 1 - data preprocessing¶
Our dataset for this tutorial consists of 1081 sport stadiums, mostly across Germany, Belgium, Netherlands. We generated the data using Geoapify's Places API.
import numpy as np
import pandas as pd
pd.options.display.max_columns = None # to display all columns
pd.set_option('display.expand_frame_repr', False) # to print without line break
df = pd.read_csv('stadium-sample.csv', usecols=['Name', 'Street', 'Housenumber', 'Postcode', 'City', 'Country'])
print(f'Total number of records: {df.shape[0]}')
df.head()
Total number of records: 1081
Name | Street | Housenumber | Postcode | City | Country | |
---|---|---|---|---|---|---|
0 | NaN | Kloaver Blatt | NaN | 46342 | Velen | Germany |
1 | Sportanlage Königsfeld, SC Lüstringen | Hasewinkel | NaN | 49086 | Osnabrück | Germany |
2 | Ludo Coeckstadion | Roderveldlaan | NaN | 2600 | Antwerp | Belgium |
3 | NaN | Poggenbeeke | NaN | 32457 | Porta Westfalica | Germany |
4 | Standaard Muide | Galvestonstraat | NaN | 9000 | Ghent | Belgium |
Geoapify's geocoding service accepts free text search and structured input, the latter being helpful only if we have a lot of faith in our data quality. I have seen too many data quality issues in real-world structured address records. And my conclusion is to go for the free text search. Here, we parse the structured data into one string per row.
def concat_columns(df: pd.DataFrame, sep: str = ' ', fill_value: str = np.nan) -> pd.Series:
"""Concatenates row-wise all columns of df and returns series with same index.
"""
return (df.apply(lambda s: sep.join(s.dropna().astype(str).str.strip().replace('', np.nan).dropna()), axis=1)
.replace('', fill_value))
df['AddressLine1'] = concat_columns(df=df[['Street', 'Housenumber']])
df['AddressLine2'] = concat_columns(df=df[['Postcode', 'City']])
addresses = concat_columns(df[['Name', 'AddressLine1', 'AddressLine2', 'Country']], sep=', ', fill_value='')
print(addresses[:5])
0 Kloaver Blatt, 46342 Velen, Germany 1 Sportanlage Königsfeld, SC Lüstringen, Hasewin... 2 Ludo Coeckstadion, Roderveldlaan, 2600 Antwerp... 3 Poggenbeeke, 32457 Porta Westfalica, Germany 4 Standaard Muide, Galvestonstraat, 9000 Ghent, ... dtype: object
Part 2 - add geocodes¶
It is time to geocode our addresses. You can do this using our Python API but we prefer the command line. First we prepare the input file using Python:
from geobatchpy.batch import parse_geocoding_inputs, simplify_batch_geocoding_results
from geobatchpy.utils import write_data_to_json_file, read_data_from_json_file
data = {
'api': '/v1/geocode/search', # This tells Geoapify which service to run.
'inputs': parse_geocoding_inputs(locations=addresses),
'params': {'limit': 1, 'format': 'geojson'}, # We strongly recommend using GeoJSON as output format.
'batch_len': 500, # Results in math.ceil(1081 / 500) = 3 jobs in total.
'id': 'tutorial-batch-geocode'
}
write_data_to_json_file(data=data, file_path='tutorial-geocode-input.json')
It is time to switch to the CLI. First we submit jobs to Geoapify servers with
geobatch submit tutorial-geocode-input.json tutorial-geocode-urls.json
The output of the first step tutorial-geocode-urls.json
is the input for the next:
geobatch receive tutorial-geocode-urls.json tutorial-geocode-results.json
Processing on the Geoapify servers can take some time, depending on the size of your inputs, your subscription plan, and how busy the servers are. Pause here and continue after all your jobs are done.
We convert the results into a simplified list of GeoJSON-like Python dictionaries.
results = read_data_from_json_file('tutorial-geocode-results.json')['results']
results = simplify_batch_geocoding_results(results=results, input_format='geojson')
print('Here is how the 1st one looks like:')
results[0]
Here is how the 1st one looks like:
{'type': 'Feature', 'properties': {'datasource': {'sourcename': 'openstreetmap', 'attribution': '© OpenStreetMap contributors', 'license': 'Open Database License', 'url': 'https://www.openstreetmap.org/copyright'}, 'name': 'Kloaver Blatt', 'street': 'Kloaver Blatt', 'city': 'Velen', 'county': 'Kreis Borken', 'state': 'North Rhine-Westphalia', 'postcode': '46342', 'country': 'Germany', 'country_code': 'de', 'lon': 6.9077083, 'lat': 51.9163524, 'formatted': 'Kloaver Blatt, 46342 Velen, Germany', 'address_line1': 'Kloaver Blatt', 'address_line2': '46342 Velen, Germany', 'timezone': {'name': 'Europe/Berlin', 'offset_STD': '+01:00', 'offset_STD_seconds': 3600, 'offset_DST': '+02:00', 'offset_DST_seconds': 7200, 'abbreviation_STD': 'CET', 'abbreviation_DST': 'CEST'}, 'result_type': 'street', 'rank': {'importance': 0.3, 'popularity': 4.2339429069255905, 'confidence': 1, 'confidence_city_level': 1, 'confidence_street_level': 1, 'match_type': 'full_match'}, 'place_id': '51253bdb487ea11b405939ce12094bf54940f00102f90146d27a0000000000c0020492030d4b6c6f6176657220426c617474'}, 'geometry': {'type': 'Point', 'coordinates': [6.9077083, 51.9163524]}, 'bbox': [6.9003041, 51.9159651, 6.9160142, 51.9196131], 'query': {'text': 'Kloaver Blatt, 46342 Velen, Germany', 'parsed': {'street': 'kloaver blatt', 'postcode': '46342', 'city': 'velen', 'country': 'germany', 'expected_type': 'street'}}}
GeoPandas helps us transform the data into a tabular format. The method parses the geometry
into a Shapely geometric object, puts all properties
into separate columns, and ignores the rest. We also set the coordinate reference system (CRS) to 'EPSG:4326' which means that the tuples in the geometries are interpreted as longitude and latitude.
import geopandas as gpd
df_geocodes = gpd.GeoDataFrame.from_features(results, crs='EPSG:4326')
df_geocodes.iloc[:, :4].head()
/opt/miniconda3/lib/python3.9/site-packages/geopandas/_compat.py:112: UserWarning: The Shapely GEOS version (3.10.2-CAPI-1.16.0) is incompatible with the GEOS version PyGEOS was compiled with (3.10.3-CAPI-1.16.1). Conversions between both will be slow. warnings.warn(
geometry | datasource | name | street | |
---|---|---|---|---|
0 | POINT (6.90771 51.91635) | {'sourcename': 'openstreetmap', 'attribution':... | Kloaver Blatt | Kloaver Blatt |
1 | POINT (8.12712 52.26710) | {'sourcename': 'openstreetmap', 'attribution':... | Hasewinkel | Hasewinkel |
2 | POINT (4.44258 51.18693) | {'sourcename': 'openstreetmap', 'attribution':... | Roderveldlaan | Roderveldlaan |
3 | POINT (8.98558 52.25417) | {'sourcename': 'openstreetmap', 'attribution':... | Poggenbeeke | Poggenbeeke |
4 | POINT (3.73198 51.07771) | {'sourcename': 'openstreetmap', 'attribution':... | Galvestonstraat | Galvestonstraat |
And we flatten dict
columns using pandas.json_normalize
:
df_rank = pd.json_normalize(df_geocodes['rank'])
df_rank.head()
importance | popularity | confidence | confidence_city_level | confidence_street_level | match_type | |
---|---|---|---|---|---|---|
0 | 0.30 | 4.233943 | 1.0 | 1.0 | 1.0 | full_match |
1 | 0.41 | 5.600543 | 0.5 | 1.0 | 1.0 | match_by_street |
2 | 0.41 | 6.876627 | 0.5 | 1.0 | 1.0 | match_by_street |
3 | 0.20 | 4.911427 | 1.0 | 1.0 | 1.0 | full_match |
4 | 0.41 | 6.493385 | 0.5 | 1.0 | 1.0 | match_by_street |
Part 3 - validate location data quality¶
Geoapify's Geocoding service primarily is for appending coordinates to your address. But it delivers much more, including what you see in the last table df_rank
. We have published a comprehensive article specifically about data quality here.
The confidence
attributes correlate with quality of the original address inputs. A low score likely means, something went wrong with the corresponding input. That can result in a wrong match, or a match of unexpected type. Below we see what result_type
s have been matched with our data:
df_geocodes['result_type'].value_counts()
street 883 amenity 159 building 19 postcode 8 suburb 8 city 4 Name: result_type, dtype: int64
Often a match of type city
, postcode
, or suburb
is a result of missing relevant attributes in the original data.. But that's something we could have easily found without geocoding. The more interesting cases are those hidden behind invalid combinations of attributes. Row 374
is such an example. Searching online, it turns out that stadium KSV Sottegem
is located in city 9620 Zottegem, street Kloosterstraat. Street Moelde as in the original input is nearby and Godveerdegem seems to be a suburb of Zottegem.
We get a quick overview of all likely erroneous cases with the following block:
use_cols = ['street', 'housenumber', 'postcode', 'city', 'country', 'result_type', 'name']
low_confidence = df_rank['confidence'].le(0.25)
df_geocodes_renamed = df_geocodes[use_cols].rename(columns={col: 'geoapify_' + col for col in use_cols})
pd.concat([df, df_rank[['confidence']], df_geocodes_renamed], axis=1).loc[low_confidence]
Name | Street | Housenumber | Postcode | City | Country | AddressLine1 | AddressLine2 | confidence | geoapify_street | geoapify_housenumber | geoapify_postcode | geoapify_city | geoapify_country | geoapify_result_type | geoapify_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
56 | Alter Sportplatz | Tannenweg | NaN | 31855 | Selxen | Germany | Tannenweg | 31855 Selxen | 0.250000 | NaN | NaN | 31855 | Selxen | Germany | postcode | Selxen |
61 | Bezirkssportanlage Rußheide | B 66 | NaN | 33607 | Bielefeld | Germany | B 66 | 33607 Bielefeld | 0.250000 | NaN | NaN | 33607 | Bielefeld | Germany | suburb | Mitte |
143 | Sportplatz Bierpohlweg | Bierpohlweg | NaN | 32425 | Minden | Germany | Bierpohlweg | 32425 Minden | 0.250000 | NaN | NaN | 32425 | Minden | Germany | suburb | Nordstadt |
174 | Sportplatz | B 64/83 n | NaN | 37671 | Godelheim | Germany | B 64/83 n | 37671 Godelheim | 0.250000 | NaN | NaN | NaN | Höxter | Germany | suburb | NaN |
374 | KSV Sottegem | Moelde | NaN | 9620 | Godveerdegem | Belgium | Moelde | 9620 Godveerdegem | 0.250000 | NaN | NaN | NaN | Godveerdegem | Belgium | city | NaN |
376 | Am Gaa | Am Gaa | NaN | 5483 | Wormeldange-Haut | Luxembourg | Am Gaa | 5483 Wormeldange-Haut | 0.250000 | NaN | NaN | 5485 | Wormeldange-Haut | Luxembourg | postcode | Wormeldange-Haut |
501 | SC Sonnborn 07 | Sonnborner Ufer | NaN | 42327 | Wuppertal | Germany | Sonnborner Ufer | 42327 Wuppertal | 0.250000 | NaN | NaN | 42327 | Wuppertal | Germany | suburb | Gemarkung Vohwinkel |
561 | Sportplatz Lelbach | B 251 | NaN | 34497 | Lelbach | Germany | B 251 | 34497 Lelbach | 0.158333 | Lelbach B | 2 | 34497 | Korbach | Germany | building | NaN |
593 | Sportcentrum Bergheim | A 61 | NaN | 50126 | Thorr | Germany | A 61 | 50126 Thorr | 0.250000 | NaN | NaN | 50126 | Thorr | Germany | postcode | Thorr |
771 | VFB Schrecksbach | B 254 | NaN | 34637 | Schrecksbach | Germany | B 254 | 34637 Schrecksbach | 0.250000 | NaN | NaN | 34637 | Schrecksbach | Germany | postcode | NaN |
819 | Sportplatz Obermeiser | Zum Sportplatz | NaN | 34379 | Obermeiser | Germany | Zum Sportplatz | 34379 Obermeiser | 0.250000 | NaN | NaN | 34379 | Obermeiser | Germany | postcode | Obermeiser |
836 | Bikepark Im Emscherbruch | Im Emscherbruch | 150 | 45892 | Gelsenkirchen | Germany | Im Emscherbruch 150 | 45892 Gelsenkirchen | 0.250000 | NaN | NaN | 45892 | Gelsenkirchen | Germany | suburb | Resse |
873 | Sportverein Seck | K 51 | NaN | 56479 | Seck | Germany | K 51 | 56479 Seck | 0.250000 | NaN | NaN | 56479 | Seck | Germany | city | NaN |
1031 | Stadion am Hötzelenberg | L 117 | NaN | 41812 | Katzem | Germany | L 117 | 41812 Katzem | 0.250000 | NaN | NaN | 41812 | Lövenich | Germany | postcode | Hötzelenberg |
1055 | NaN | B 238 | NaN | Rinteln | Rinteln | Germany | B 238 | Rinteln Rinteln | 0.000000 | Friedrichstraße | NaN | Rinteln | Rinteln | Germany | amenity | Rinteln |
Part 4 - add building details where available¶
Geo-coordinates are the fundament of geographic data analytics. At the very least, we can start visualizing our addresses on a map as points. But we can do much more.
Below we use geo-coordinates as a reference to extract building details. Geoapify also comes with a handy API for this use case called Place Details. It allows us to extract further attributes for a number of different detail typies. We choose building
.
Again, our preference is the CLI, using the batch version of the API:
from geobatchpy.batch import parse_geocodes, simplify_batch_place_details_results
from geobatchpy.utils import write_data_to_json_file
geocodes = parse_geocodes(geocodes=[(item.x, item.y) for _, item in df_geocodes['geometry'].items()])
data = {
'api': '/v2/place-details', # See the Geoapify API docs for other batch APIs.
'inputs': geocodes,
'params': {'features': 'building'}, # can be also several, separated by comma in a single string
'batch_len': 500,
'id': 'tutorial-batch-details'
}
write_data_to_json_file(data=data, file_path='tutorial-details-input.json')
We switch again to the CLI to submit jobs, monitor, and retrieve results:
geobatch submit tutorial-details-input.json tutorial-details-urls.json
The output of the first step tutorial-details-urls.json
is the input for the next:
geobatch receive tutorial-details-urls.json tutorial-details-results.json
Pause here and continue after all your jobs are done. When jobs are done, we parse those again into a simplified list of lists of GeoJSON-like dictionaries.
results = read_data_from_json_file('tutorial-details-results.json')['results']
results = simplify_batch_place_details_results(results)
print('Number of details found per location:')
pd.Series([len(res) for res in results]).value_counts()
Number of details found per location:
0 922 1 159 dtype: int64
In that case, we found building
details for just 159 records. Most likely, because a lot of the stadiums are just sports fields in the middle of nowhere. Or in optimistic words: the absence of building geometries tells us which sports facilities in our dataset are small and which are not.
And out of the 159 buildings not all carry the propoerty of a sport.stadium
. Here is one of the positive examples:
results[26]
[{'type': 'Feature', 'properties': {'feature_type': 'building', 'categories': ['building', 'building.sport', 'sport', 'sport.stadium'], 'datasource': {'sourcename': 'openstreetmap', 'attribution': '© OpenStreetMap contributors', 'license': 'Open Database Licence', 'url': 'https://www.openstreetmap.org/copyright', 'raw': {'name': 'Johan Cruijff ArenA', 'height': '75 m', 'osm_id': -1458682, 'leisure': 'stadium', 'name:mk': 'Амстердам Арена', 'ref:bag': 363100012075730, 'website': 'http://www.amsterdamarena.nl', 'alt_name': 'Johan Cruijf Arena', 'building': 'stadium', 'capacity': 52342, 'operator': 'Stadion Amsterdam N.V.', 'osm_type': 'r', 'wikidata': 'Q207109', 'addr:city': 'Amsterdam', 'wikipedia': 'nl:Johan Cruijff ArenA', 'start_date': 1996, 'addr:street': 'Arena boulevard', 'contact:fax': '0203111380', 'description': 'Johan Cruijff ArenA is a stadium in and has been used for association football, American football, concerts, and other events. The stadium has a retractable roof combined with a grass surface.', 'addr:postcode': '1101AX', 'contact:email': 'info@amsterdamarena.nl', 'contact:phone': '0203111333', 'addr:housenumber': 29}}, 'housenumber': '29', 'street': 'Johan Cruijff Boulevard', 'city': 'Amsterdam', 'state': 'North Holland', 'postcode': '1101AX', 'country': 'Netherlands', 'country_code': 'nl', 'formatted': 'Johan Cruijff ArenA, Johan Cruijff Boulevard 29, 1101 AX Amsterdam, Netherlands', 'address_line1': 'Johan Cruijff ArenA', 'address_line2': 'Johan Cruijff Boulevard 29, 1101 AX Amsterdam, Netherlands', 'lat': 52.314353049999994, 'lon': 4.942843089053988, 'name': 'Johan Cruijff ArenA', 'website': 'http://www.amsterdamarena.nl', 'operator': 'Stadion Amsterdam N.V.', 'description': 'Johan Cruijff ArenA is a stadium in and has been used for association football, American football, concerts, and other events. The stadium has a retractable roof combined with a grass surface.', 'name_other': {'alt_name': 'Johan Cruijf Arena'}, 'name_international': {'mk': 'Амстердам Арена'}, 'contact': {'phone': '0203111333', 'email': 'info@amsterdamarena.nl', 'fax': '0203111380'}, 'wiki_and_media': {'wikidata': 'Q207109', 'wikipedia': 'nl:Johan Cruijff ArenA'}, 'building': {'type': 'stadium', 'start_date': 1996, 'height': '75 m'}, 'area': 30843, 'place_id': '515a84df855dc4134059221c66753b284a40f00101f901fa411600000000009203134a6f68616e20437275696a6666204172656e41'}, 'geometry': {'type': 'Polygon', 'coordinates': [[[4.9400311, 52.314196899], [4.9400318, 52.314195599], [4.9400333, 52.314194599], [4.9406904, 52.313476799], [4.940691, 52.313475499], [4.9406932, 52.313473999], [4.9406964, 52.313473599], [4.9406994, 52.313474299], [4.9407013, 52.313475999], [4.9407015, 52.313477299], [4.9407597, 52.313497099], [4.9407291, 52.313530599], [4.9408613, 52.313575899], [4.9407655, 52.313680699], [4.9407527, 52.313676299], [4.940729, 52.313702199], [4.9408326, 52.313737799], [4.9409801, 52.313576499], [4.9409589, 52.313569199], [4.9409879, 52.313536999], [4.9410768, 52.313521899], [4.9410791, 52.313522699], [4.9411464, 52.313449199], [4.9411441, 52.313448499], [4.941119, 52.313393799], [4.9411485, 52.313361699], [4.9412009, 52.313379599], [4.9412256, 52.313433099], [4.9413477, 52.313371499], [4.9413904, 52.313405299], [4.941426, 52.313388399], [4.9415964, 52.313335499], [4.9415955, 52.313334099], [4.9416065, 52.313331399], [4.9416074, 52.313332799], [4.9418014, 52.313299499], [4.9418009, 52.313297999], [4.9418123, 52.313296699], [4.9418126, 52.313297899], [4.9420133, 52.313288399], [4.9420134, 52.313286899], [4.9420249, 52.313287199], [4.9420249, 52.313288399], [4.9422247, 52.313302999], [4.9422252, 52.313301599], [4.9422367, 52.313303199], [4.9422361, 52.313304699], [4.9424268, 52.313342899], [4.9424278, 52.313341499], [4.942439, 52.313344399], [4.942438, 52.313345799], [4.9425067, 52.313369499], [4.9425044, 52.313372299], [4.9424873, 52.313390399], [4.9425027, 52.313395499], [4.9425035, 52.313394599], [4.9425211, 52.313400499], [4.9425615, 52.313356499], [4.9425603, 52.313356099], [4.942621, 52.313289899], [4.9427627, 52.313340299], [4.9429101, 52.313389499], [4.9428496, 52.313455599], [4.9428485, 52.313455199], [4.9428082, 52.313499199], [4.9428428, 52.313510999], [4.9428587, 52.313490299], [4.9429283, 52.313514199], [4.9429295, 52.313513099], [4.9429391, 52.313517199], [4.9429379, 52.313518199], [4.9430907, 52.313598499], [4.943092, 52.313597699], [4.9431004, 52.313602699], [4.9430989, 52.313603599], [4.9432231, 52.313700499], [4.9432249, 52.313699799], [4.9432316, 52.313705899], [4.9432297, 52.313706599], [4.9433209, 52.313816499], [4.9433229, 52.313815999], [4.9433272, 52.313822699], [4.9433251, 52.313823199], [4.9433796, 52.313941899], [4.9433814, 52.313941699], [4.9433837, 52.313948799], [4.9433816, 52.313949099], [4.9433915, 52.314026199], [4.9433963, 52.314066199], [4.9433955, 52.314084999], [4.9433875, 52.314124699], [4.9433571, 52.314122299], [4.9433547, 52.314140199], [4.9433529, 52.314154099], [4.9433813, 52.314156099], [4.9433725, 52.314201699], [4.9433753, 52.314201899], [4.9433738, 52.314210099], [4.9433103, 52.314326399], [4.9433057, 52.314332999], [4.9433038, 52.314332499], [4.9432054, 52.314439999], [4.9432787, 52.314465199], [4.9431485, 52.314611599], [4.943064, 52.314699999], [4.9429969, 52.314773299], [4.9428835, 52.314897399], [4.9427597, 52.315032799], [4.9426882, 52.315008199], [4.9426804, 52.315016799], [4.9426783, 52.315016099], [4.942588, 52.315114799], [4.9425901, 52.315115699], [4.9425833, 52.315121599], [4.9425706, 52.315115999], [4.942571, 52.315115599], [4.9425579, 52.315109299], [4.9424325, 52.315201199], [4.9424429, 52.315208699], [4.9424435, 52.315208399], [4.9424534, 52.315216199], [4.9424439, 52.315220799], [4.9424424, 52.315219499], [4.9422907, 52.315292599], [4.9422906, 52.315293499], [4.9422697, 52.315302899], [4.9422659, 52.315302399], [4.9420952, 52.315355299], [4.9420961, 52.315356699], [4.9420847, 52.315359399], [4.9420841, 52.315358399], [4.9418905, 52.315391399], [4.9418908, 52.315392699], [4.9418789, 52.315393799], [4.9418786, 52.315392799], [4.9416778, 52.315402399], [4.9416778, 52.315403699], [4.9416661, 52.315403499], [4.9416661, 52.315402299], [4.9414664, 52.315387799], [4.941466, 52.315389099], [4.9414544, 52.315387499], [4.941455, 52.315386099], [4.941264, 52.315348099], [4.9412631, 52.315349299], [4.9412522, 52.315346299], [4.9412531, 52.315345099], [4.9411841, 52.315321399], [4.9412025, 52.315301599], [4.9411697, 52.315290199], [4.9411293, 52.315334299], [4.9411305, 52.315334799], [4.9410699, 52.315400999], [4.9409215, 52.315351499], [4.9407806, 52.315301499], [4.9408414, 52.315235199], [4.9408426, 52.315235599], [4.9408829, 52.315191699], [4.9408498, 52.315180599], [4.940832, 52.315200499], [4.9407632, 52.315176899], [4.9407621, 52.315177899], [4.9407523, 52.315173699], [4.9407535, 52.315172699], [4.9406001, 52.315092399], [4.9405989, 52.315093099], [4.9405909, 52.315087899], [4.940592, 52.315087299], [4.9404679, 52.314990399], [4.9404661, 52.314991199], [4.9404597, 52.314985099], [4.9404615, 52.314984399], [4.9403701, 52.314874499], [4.9403683, 52.314874899], [4.9403641, 52.314868299], [4.9403658, 52.314867899], [4.9403115, 52.314748999], [4.9403096, 52.314749299], [4.9403073, 52.314742199], [4.9403098, 52.314741899], [4.9402949, 52.314624999], [4.9402951, 52.314605799], [4.9403177, 52.314489399], [4.9403156, 52.314489099], [4.9403163, 52.314487399], [4.9402655, 52.314481099], [4.9402538, 52.314496799], [4.9401649, 52.314512099], [4.9401126, 52.314494099], [4.9401417, 52.314461999], [4.9402305, 52.314446699], [4.9402331, 52.314447499], [4.9403007, 52.314374099], [4.9402983, 52.314373099], [4.9402732, 52.314318699], [4.9403027, 52.314286499], [4.9403242, 52.314293999], [4.9405664, 52.314028999], [4.9404628, 52.313993399], [4.9404334, 52.314025599], [4.9404462, 52.314029999], [4.9402613, 52.314232199], [4.9401293, 52.314186599], [4.9400986, 52.314220099], [4.94004, 52.314199899], [4.940037, 52.314200699], [4.9400349, 52.314200499], [4.940033, 52.314199899], [4.9400322, 52.314199399], [4.9400313, 52.314198199], [4.9400311, 52.314196899]], [[4.9409336, 52.314610199], [4.9418733, 52.314932799], [4.9422118, 52.314561099], [4.942288, 52.314477399], [4.9423649, 52.314392899], [4.9426665, 52.314061699], [4.9417196, 52.313725299], [4.9414104, 52.314073299], [4.9413415, 52.314150799], [4.9412606, 52.314241799], [4.9409336, 52.314610199]]]}}]
We use GeoPandas to parse the building details into a GeoDataFrame, with index referencing to our original addresses:
index_name = df.index.name if df.index.name is not None else 'index'
df_details = pd.concat([gpd.GeoDataFrame.from_features(res).assign(**{index_name: idx})
for idx, res in zip(df.index, results)]).set_index(index_name)
# df_details.geometry = gpd.GeoSeries(df_details.geometry).set_crs('EPSG:4326')
df_details = gpd.GeoDataFrame(df_details, crs='EPSG:4326')
df_details.head()
geometry | feature_type | categories | datasource | street | city | state | postcode | country | country_code | formatted | address_line1 | address_line2 | lat | lon | building | area | place_id | housenumber | county | name | operator | facilities | wiki_and_media | website | description | name_other | name_international | contact | opening_hours | heritage | owner | place_of_worship | catering | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
index | ||||||||||||||||||||||||||||||||||
1 | POLYGON ((8.12688 52.26715, 8.12704 52.26712, ... | building | [building, building.residential] | {'sourcename': 'openstreetmap', 'attribution':... | Hasewinkel | Osnabrück | Lower Saxony | 49086 | Germany | de | Hasewinkel, 49086 Osnabrück, Germany | Hasewinkel | 49086 Osnabrück, Germany | 52.267078 | 8.127102 | {'type': 'residential'} | 79.0 | 517b8a16b8024120405948a78d5632224a40f00102f901... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 | POLYGON ((4.83247 50.99205, 4.83254 50.99192, ... | building | [building, building.sport, sport, sport.stadium] | {'sourcename': 'openstreetmap', 'attribution':... | Grote Laakweg | Aarschot | Flemish Brabant | 3200 | Belgium | be | Sporting Club Aarschot, Grote Laakweg 29, 3200... | Sporting Club Aarschot | Grote Laakweg 29, 3200 Aarschot, Belgium | 50.992094 | 4.833120 | NaN | 1448.0 | 514e519e332255134059449d122ffd7e4940f00102f901... | 29 | Leuven | Sporting Club Aarschot | S.C.Aarschot | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
17 | POLYGON ((8.22616 53.14664, 8.22618 53.14653, ... | building | [building, building.sport, sport, sport.stadiu... | {'sourcename': 'openstreetmap', 'attribution':... | Europaplatz | Oldenburg | Lower Saxony | 26123 | Germany | de | kleine EWE Arena, Europaplatz 12, 26123 Oldenb... | kleine EWE Arena | Europaplatz 12, 26123 Oldenburg, Germany | 53.146592 | 8.226715 | {'levels': 3, 'type': 'sports_centre', 'materi... | 4327.0 | 512fd5b5fc1374204059f2c05778c4924a40f00102f901... | 12 | NaN | kleine EWE Arena | Weser-Ems-Halle | {'wheelchair': True} | {'wikidata': 'Q879570', 'wikipedia': 'de:EWE A... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
21 | POLYGON ((6.05971 49.61126, 6.05976 49.61124, ... | building | [building] | {'sourcename': 'openstreetmap', 'attribution':... | Rue de Luxembourg | Bertrange | NaN | 8077 | Luxembourg | lu | 64 Rue de Luxembourg, 8077 Bertrange, Luxembourg | 64 Rue de Luxembourg | 8077 Bertrange, Luxembourg | 49.611280 | 6.059800 | NaN | 152.0 | 515f1ca3f8403d184059870f29cc3ece4840f00102f901... | 64 | Canton Luxembourg | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
26 | POLYGON ((4.94003 52.31420, 4.94003 52.31420, ... | building | [building, building.sport, sport, sport.stadium] | {'sourcename': 'openstreetmap', 'attribution':... | Johan Cruijff Boulevard | Amsterdam | North Holland | 1101AX | Netherlands | nl | Johan Cruijff ArenA, Johan Cruijff Boulevard 2... | Johan Cruijff ArenA | Johan Cruijff Boulevard 29, 1101 AX Amsterdam,... | 52.314353 | 4.942843 | {'type': 'stadium', 'start_date': 1996, 'heigh... | 30843.0 | 515a84df855dc4134059221c66753b284a40f00101f901... | 29 | NaN | Johan Cruijff ArenA | Stadion Amsterdam N.V. | NaN | {'wikidata': 'Q207109', 'wikipedia': 'nl:Johan... | http://www.amsterdamarena.nl | Johan Cruijff ArenA is a stadium in and has be... | {'alt_name': 'Johan Cruijf Arena'} | {'mk': 'Амстердам Арена'} | {'phone': '0203111333', 'email': 'info@amsterd... | NaN | NaN | NaN | NaN | NaN |
Part 5 - quick intro into GeoPandas data manipulation¶
We cover two common GeoPandas data manipulation capabilities:
- Component-wise summary statistics for geometries.
- Geospatial joins.
Remember that df_details
contains geometries about buildings. One such example is given below for the Johan Cruijff Arena
which is line 26 in df_details
:
Now with the help of GeoPandas we can easily compute the occupied area and total length of boundaries. We only need to make sure to use a CRS that makes sense for our data. Here we pick 'EPSG:3035' which works well for European geometries and translates to the metric system:
(pd.concat([df_details.geometry.to_crs(crs='EPSG:3035').area,
df_details.geometry.boundary.to_crs(crs='EPSG:3035').length], axis=1)
.rename(columns={0: 'Area', 1: 'BoundaryLength'}).head())
Area | BoundaryLength | |
---|---|---|
index | ||
1 | 79.595084 | 36.810346 |
10 | 1449.534940 | 213.284523 |
17 | 4335.556706 | 234.403371 |
21 | 152.444623 | 66.759744 |
26 | 30895.358101 | 1366.622927 |
We illustrate geospatial joins with the help of a 2nd dataset of airports from opentraveldata, under CC BY 4.0 license. The data contains more than just geocoordinates and airport names but we keep it simple for now:
from shapely.geometry import Point
df_airports = pd.read_csv('airports.csv')
df_airports = (df_airports.loc[df_airports.type.isin(['medium_airport', 'large_airport'])]
.rename(columns={'name': 'AirportName'}).reset_index(drop=True))
geometry = [Point(xy) for xy in zip(df_airports.longitude_deg, df_airports.latitude_deg)]
df_airports = gpd.GeoDataFrame(df_airports, geometry=geometry, crs='EPSG:4326')[['geometry', 'AirportName']]
df_airports.head()
geometry | AirportName | |
---|---|---|
0 | POINT (-158.61800 59.28260) | Aleknagik / New Airport |
1 | POINT (69.80734 33.28460) | Khost International Airport |
2 | POINT (160.05499 -9.42800) | Honiara International Airport |
3 | POINT (157.26300 -8.32797) | Munda Airport |
4 | POINT (102.35224 32.53154) | Hongyuan Airport |
A geospatial join based on proximity answers us which is the closest airport for every address in our original data, and how far away is that airport (bird distance). Again, we use a CRS working well for European geometries and translating into the metric system:
df_join = (df_geocodes.to_crs('EPSG:3035')
.sjoin_nearest(df_airports.to_crs('EPSG:3035'),
how='left', distance_col='DistanceToAirport')
.to_crs('EPSG:4326').rename(columns={'AirportName': 'ClosestAirport'})
.drop('index_right', axis=1))
show_cols = ['formatted', 'ClosestAirport', 'DistanceToAirport']
df_join[show_cols]
formatted | ClosestAirport | DistanceToAirport | |
---|---|---|---|
0 | Kloaver Blatt, 46342 Velen, Germany | Twente Airport | 40024.716024 |
1 | Hasewinkel, 49086 Osnabrück, Germany | Münster Osnabrück Airport | 33641.511932 |
2 | Roderveldlaan, 2600 Antwerp, Belgium | Antwerp International Airport (Deurne) | 1267.311428 |
3 | Poggenbeeke, 32457 Porta Westfalica, Germany | Bückeburg Air Base | 7128.372785 |
4 | Galvestonstraat, 9000 Ghent, Belgium | Wevelgem Airport | 47020.704761 |
... | ... | ... | ... |
1076 | Mühltalweg, 67551 Worms, Germany | Mannheim-City Airport | 23035.716161 |
1077 | Omer Vanaudenhovelaan, 3290 Diest, Belgium | Beauvechain Air Base | 32774.379561 |
1078 | Uferstraße, 35576 Wetzlar, Germany | Siegerland Airport | 33732.677575 |
1079 | Am Pfad, 52525 Braunsrath, Germany | Geilenkirchen Air Base | 10500.661219 |
1080 | Am Jungenwald, 66346 Püttlingen, Germany | Saarbrücken Airport | 19407.536819 |
1081 rows × 3 columns