import requests
import geopandas as gpd
import pandas as pd
Getting Data from ONS Open Geography Portal

Introduction
This tutorial is for programmers familiar with Python and how to create virtual environments, but perhaps less familiar with the Python requests package or the ArcGIS REST API [1].
If you’re in a rush and just need a snippet that will ingest every UK 2021 LSOA boundary available, here is a GitHub gist just for you.
The Scenario
You would like to use Python to programmatically ingest data from the Office for National Statistics (ONS) Open Geography Portal. This tutorial aims to help you do this. Working with the 2021 LSOA boundaries, we will explore the essential features and quirks of the ArcGIS REST API.
What you’ll need:
requirements.txt
folium
geopandas
mapclassify
matplotlib
pandas
requests
Tutorial
Setting Things Up
- Create a new directory with a requirements file as shown above.
- Create a new virtual environment.
- Install the dependencies listed above.
- Create a file called get_data.py or whatever you would like to call it. The rest of the tutorial will work with this file.
- Add the following lines to the top of get_data.py and run them. This ensures that you have the dependencies needed to run the rest of the code:
import requests
import geopandas as gpd
import pandas as pd
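If you would like a quicker check before running anything heavier, the sketch below reports which of the packages from requirements.txt cannot be found in the active environment. It uses only the standard library, and the helper name missing_packages is my own invention for illustration.

```python
from importlib.util import find_spec

# Import names for the packages listed in requirements.txt.
REQUIRED = [
    "folium",
    "geopandas",
    "mapclassify",
    "matplotlib",
    "pandas",
    "requests",
]


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All dependencies are importable.")
```

An empty result suggests the virtual environment is active and the install step worked.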
Finding The Data Asset
One of the tricky parts of working with the GeoPortal is finding the resource that you need.
- Access the ONS Open Geography Portal homepage [2].
- Using the ribbon menu at the top of the page, navigate to: Boundaries > Census Boundaries > Lower Super Output Areas > 2021 Boundaries.
- Once you have clicked on this option, a page will open with items related to your selection. Click on the item called “Lower Layer Super Output Areas (2021) Boundaries EW BFC”.
- This will bring you to the data asset that you need. The webpage should look like the image below.
Finding the Endpoint
Now that we have the correct data asset, let’s find the endpoint. This is the url that we will need to send our requests to, in order to receive the data that we need.
- Click on the “View Full Details” button.
- Scroll down, under the menu “I want to…”, and expand the “View API Resources” menu.
- You will see two urls labelled “GeoService” and “GeoJSON”. Click the copy button to the right of the url.
- Paste the url into your Python script.
- Edit the url string to remove everything to the right of the word ‘query’, including the question mark. Then assign it to a variable called ENDPOINT as below:
ENDPOINT = "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/Lower_layer_Super_Output_Areas_December_2021_Boundaries_EW_BFC_V10/FeatureServer/0/query"
This ENDPOINT is a url that we can use to flexibly ask for only the data, or metadata, that we require.
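If you would rather not trim the copied url by hand, the step above can be sketched with the standard library. Note that the query string in copied_url is a made-up stand-in, as the portal may give you different parameters, and strip_query is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit

# Illustrative only: the query string below is a made-up stand-in for
# whatever the portal's copy button currently gives you.
copied_url = (
    "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/"
    "Lower_layer_Super_Output_Areas_December_2021_Boundaries_EW_BFC_V10/"
    "FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson"
)


def strip_query(url):
    """Drop the query string and fragment, keeping scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


ENDPOINT = strip_query(copied_url)
print(ENDPOINT)
```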
Requesting a Single Entry
Now that we’re set up to make requests, we can use an example that brings back only a small slice of the database. To do this, we will need to specify some query parameters. These parameters will get added to our endpoint url and will be interpreted by ArcGIS to serve us only the data we ask for. In this example, I will ask for a single LSOA boundary only by specifying the LSOA code with an SQL clause. For more detail on the flexibility of ArcGIS API, please consult the documentation [1].
Define the Python dictionary below, noting the syntax and data formats. Don’t forget to wrap the LSOA21CD value in speech marks:
# requesting a specific LSOA21CD
params = {
    "where": "LSOA21CD = 'W01002029'",  # (1)
    "outSR": 4326,  # (2)
    "f": "geoJSON",  # (3)
    "resultOffset": 0,  # (4)
    "outFields": "*",  # (5)
}
- (1) SQL clauses can go here
- (2) CRS that you want
- (3) Response format
- (4) Parameter used for pagination later
- (5) This will ensure all available fields are returned
Now I will define a function that will make the request and handle the response for us. Go ahead and define this function:
def request_to_gdf(
    url: str, query_params: dict
) -> tuple[requests.Response, gpd.GeoDataFrame]:
    """Send a get request to ArcGIS API & Convert to GeoDataFrame.

    Only works when asking for features and GeoJSON format.

    Parameters
    ----------
    url : str
        The url endpoint.
    query_params : dict
        A dictionary of query parameter : value pairs.

    Returns
    -------
    requests.Response
        The response from ArcGIS API server. Useful for paginated requests
        later.
    gpd.GeoDataFrame
        A GeoDataFrame of the requested geometries in the crs specified by
        the response metadata.

    Raises
    ------
    requests.exceptions.RequestException
        The response was not ok.
    """
    query_params["f"] = "geoJSON"  # (1)
    response = requests.get(url, params=query_params)
    if response.ok:
        content = response.json()  # (2)
        return (
            response,  # (3)
            gpd.GeoDataFrame.from_features(
                content["features"],
                crs=content["crs"]["properties"]["name"],  # (4)
            ),
        )
    else:
        raise requests.RequestException(  # (5)
            f"HTTP Code: {response.status_code}, Status: {response.reason}"
        )
- (1) This approach will only work with geoJSON
- (2) Watch out for JSONDecodeError too…
- (3) We’ll need the response again later for pagination
- (4) Best to get crs from response rather than hard-code your expected crs
- (5) Cases where a traditional bad response may be returned
Briefly, this function ensures that the geoJSON format is asked for, as this is the neatest way to bash the response into a GeoDataFrame. It then queries the ArcGIS API with the endpoint and parameters you specify. It checks whether the response is ok (requests treats any status code below 400 as ok). If not, an exception is raised with the HTTP code and status. Finally, if no exception was raised, the ArcGIS response and a GeoDataFrame of the spatial features are returned.
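To make the key lookups in that function more concrete, here is a hand-written stand-in for the kind of body it expects. The layout is inferred from the keys the function reads, the crs name is an assumed value, and the geometry is a dummy triangle rather than a real LSOA boundary.

```python
# A hand-written stand-in for the GeoJSON body; values are illustrative.
content = {
    "type": "FeatureCollection",
    "crs": {"type": "name", "properties": {"name": "EPSG:4326"}},
    "features": [
        {
            "type": "Feature",
            "properties": {"LSOA21CD": "W01002029", "LSOA21NM": "Newport 009G"},
            "geometry": {
                "type": "Polygon",
                # Dummy triangle, not the real boundary.
                "coordinates": [
                    [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
                ],
            },
        }
    ],
}

# The two lookups that request_to_gdf relies on.
crs_name = content["crs"]["properties"]["name"]
codes = [feature["properties"]["LSOA21CD"] for feature in content["features"]]
print(crs_name, codes)
```

Each entry in "features" becomes one row of the GeoDataFrame, with its "properties" supplying the columns.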
Be careful when handling the response of ArcGIS API. Depending on the query you send, it is possible to return status code 200 responses that seem fine. But if the server was unable to make sense of your SQL query, it may result in a JSONDecodeError or even content with details of your error. It is important to handle the various error conditions if you plan to build something more robust than this tutorial and to be exacting with your query strings. For this reason, I would suggest using the params dictionary approach to introducing query parameters rather than attempting to manually format the url string.
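One way to guard against those two failure modes is sketched below. It assumes that error details arrive under a top-level "error" key, which you should confirm against the responses you actually receive. ArcGISQueryError and parse_body are hypothetical names, not part of requests or ArcGIS.

```python
import json


class ArcGISQueryError(Exception):
    """Hypothetical exception for an error reported inside a response body."""


def parse_body(text):
    """Decode a response body and raise if it describes an error."""
    try:
        content = json.loads(text)
    except json.JSONDecodeError as err:
        raise ArcGISQueryError(f"Body was not valid JSON: {err}") from err
    # Assumption: error details sit under a top-level "error" key.
    if isinstance(content, dict) and "error" in content:
        raise ArcGISQueryError(f"Server reported an error: {content['error']}")
    return content


# A well-formed body passes straight through.
print(parse_body('{"features": []}'))
```

Inside request_to_gdf, this could stand in for the response.json() call by passing it response.text.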
With that function defined, we can go straight to a tabular data format, like below:
_, gdf = request_to_gdf(ENDPOINT, params)
gdf.head()
| | geometry | FID | LSOA21CD | LSOA21NM | LSOA21NMW | BNG_E | BNG_N | LAT | LONG | Shape__Area | Shape__Length | GlobalID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | POLYGON ((-3.06378 51.58946, -3.062 51.58922, ... | 35661 | W01002029 | Newport 009G | Casnewydd 009G | 326721 | 187882 | 51.58501 | -3.05905 | 315524.689297 | 3127.139319 | da32f5a1-4451-46d7-a389-5a3915f3dd7f |
We can use the GeoDataFrame .explore() method to quickly inspect the fruit of our efforts.
gdf.explore()
How Many Records Are There?
Update the params dictionary by changing the value of where to '1=1'.
Show the Solution
params["where"] = "1=1"  # (1)
- (1) Parameter to get max allowed data. This will get encoded to “where=1%3D1”; see [3] for more on URL-encoding.
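To see that encoding for yourself without sending a request, here is a small standard-library illustration. requests does this work for you when you pass the params dictionary, so this is only for inspection.

```python
from urllib.parse import parse_qs, urlencode

# Standard-library illustration of URL-encoding; requests handles this for
# you when you pass the params dictionary.
query_string = urlencode({"where": "1=1"})
print(query_string)  # where=1%3D1

# Decoding the query string recovers the original value.
decoded = parse_qs(query_string)
print(decoded)  # {'where': ['1=1']}
```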
For more on why to do this, consult the ArcGIS docs [1]. This is the way to state ‘where=true’, meaning get every record possible while respecting the maxRecordCount. maxRecordCount limits the number of records available for download to 2,000 records in most cases. This is ArcGIS’ method of limiting service demand while not requiring authentication. It also means we need to handle paginated responses.
It’s a good idea to confirm the number of records available within the database. Have a go at reading through the ArcGIS docs [1] to find the parameter responsible for returning counts only. Query the database for the number of records and store it as an integer called n_records.
Show the Solution
# how many LSOA boundaries should we expect in the full data?
params["returnCountOnly"] = True
response = requests.get(ENDPOINT, params=params)
n_records = response.json()["properties"]["count"]
print(f"There are {n_records} LSOAs in total")
There are 35672 LSOAs in total
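The count lookup above depends on the response format. As a hedged sketch, the helper below reads the count from the "properties" layout used in this tutorial, and falls back to a top-level "count" key, which I am assuming is the layout for other formats such as f=json. extract_count is a hypothetical name.

```python
def extract_count(content):
    """Return the record count from a returnCountOnly response body."""
    if "properties" in content:  # layout seen in this tutorial (f=geoJSON)
        return int(content["properties"]["count"])
    if "count" in content:  # assumed layout for other formats such as f=json
        return int(content["count"])
    raise KeyError("No count found in response content")


# Hand-written bodies standing in for real responses.
print(extract_count({"properties": {"count": 35672}}))
print(extract_count({"count": 35672}))
```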
Paginated Requests
Now we have the number of records, it’s important to go back to collecting geometries. Please update the params dictionary to allow that to happen.
Show the Solution
# lets now return to collecting geometries
del params["returnCountOnly"]  # (1)
- (1) Alternatively set this to False
Have a go at requesting the first batch of LSOA boundaries. Count how many you get without attempting to paginate.
Show the Solution
response, gdf = request_to_gdf(ENDPOINT, params)
print(f"There are only {len(gdf)} LSOAs on this page.")
There are only 2000 LSOAs on this page.
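With 35,672 records in total and 2,000 on the first page, a little arithmetic shows what pagination should look like. This sketch assumes every page except the last is full; the variable names are my own.

```python
import math

n_records = 35672  # total reported by the count query
page_size = 2000  # records returned on the first page

n_pages = math.ceil(n_records / page_size)
offsets = list(range(0, n_records, page_size))
last_page_size = n_records - offsets[-1]

print(f"{n_pages} requests needed")
print(f"resultOffset values: {offsets[0]}, {offsets[1]}, ..., {offsets[-1]}")
print(f"The last page should hold {last_page_size} records")
```

These numbers are a useful sanity check on the paginated requests made later in this tutorial.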
Visualise the first 100 rows of the GeoDataFrame you created in the previous step.
Show the Solution
gdf.head(100).explore()
We need a condition to check if there are more pages left in the database. See if you can find the target parameter by examining the response properties.
Show the Solution
content = response.json()
more_pages = content["properties"]["exceededTransferLimit"]  # (1)
print(
    f"It is {more_pages}, that there are more pages of data to ingest..."
)
- (1) This is conditional on whether more pages are available.
It is True, that there are more pages of data to ingest...
We are nearly ready to ask for every available LSOA boundary. This will be an expensive request. Therefore, to make things go a bit faster, let’s ask for only the default fields by removing params["outFields"].
Show the Solution
del params["outFields"]
Now we need to make use of the resultOffset key in our params dictionary, which has been set to 0 so far. We need to send multiple queries to the server, incrementing the value of resultOffset by the number of records on each page in every consecutive request. This may take quite a while, depending on your connection. Add the code below to your Python script and run it, then make yourself a cup of your chosen beverage.
offset = len(gdf)  # (1)
all_lsoas = gdf  # (2)
while more_pages:
    try:
        params["resultOffset"] += offset  # (3)
        response, gdf = request_to_gdf(ENDPOINT, params)
        content = response.json()
        all_lsoas = pd.concat([all_lsoas, gdf])
        more_pages = content["properties"]["exceededTransferLimit"]
    except KeyError:  # (4)
        more_pages = False

all_lsoas = all_lsoas.reset_index(drop=True)
- (1) Number of records to offset by
- (2) Append our growing gdf of LSOA boundaries to this
- (3) Increment the records to ingest
- (4) Rather than exceededTransferLimit = False, it disappears…
Be careful with the exceededTransferLimit parameter. Instead of being set to False on the last page (as the docs suggest it should be), it actually disappears, which is why I use the try...except clause above. You can attempt to set this parameter explicitly, but I find this makes no difference.
params["returnExceededLimitFeatures"] = "true"
# or
params["returnExceededLimitFeatures"] = True
# both patterns result in the same behaviour as not setting it - the
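The disappearing exceededTransferLimit key described above could also be handled with dictionary defaults rather than try...except. This is a sketch of an alternative, not a correction. has_more_pages is a hypothetical helper and the two bodies are hand-written stand-ins.

```python
def has_more_pages(content):
    """Return True only if the body explicitly flags a transfer limit."""
    properties = content.get("properties", {})
    return bool(properties.get("exceededTransferLimit", False))


# Hand-written stand-ins for a middle page and the final page.
middle_page = {"properties": {"exceededTransferLimit": True}, "features": []}
final_page = {"features": []}

print(has_more_pages(middle_page))  # True
print(has_more_pages(final_page))  # False
```

One reason to prefer this pattern is that a KeyError raised elsewhere, such as a body with no "features" key, is no longer silently treated as the last page.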
Check whether the number of records ingested matches the number expected.
Show the Solution
all_done = len(all_lsoas) == n_records
print(
    f"Does the row count match the expected number of records? {all_done}"
)
Does the row count match the expected number of records? True
Finally, visualise the last 100 records available within the GeoDataFrame.
Show the Solution
all_lsoas.tail(100).explore()
Troubleshooting
One tip I have for troubleshooting queries is to open up the web interface for the ENDPOINT by pasting it into your web browser, for example:
By using the fields to test out your query parameters and clicking the “Query (GET)” button at the bottom of the page, you can get an indication of whether your query is valid. This is a good place to test out more complex SQL statements for the where parameter:
If you encounter an HTTP 403 Forbidden response, check that the endpoint you are using is still valid. At times, ONS Geoportal change the endpoint, and old endpoints will no longer allow access.
Conclusion
In this tutorial, we have:
- Demonstrated how to find resources on ONS Open Geography Portal.
- Found the ArcGIS endpoint url of that resource.
- Had a brief read through the ArcGIS documentation.
- Queried the API for a single LSOA code.
- Discussed a few of the quirks of this API.
- Retrieved the total number of records available.
- Used paginated requests to retrieve every record in the database.
A good next step towards a more robust ingestion method would be to consider adding a retry strategy to the requests [4]. For a great overview of the essentials of geographic data and tools, check out my colleague’s fantastic blog on geospatial need-to-knows [5].
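As a rough idea of what a retry strategy involves, here is a hand-rolled loop with exponential backoff. It is a plain-Python stand-in rather than a built-in feature of requests, and with_retries and its arguments are hypothetical names of my own.

```python
import time


def with_retries(func, attempts=3, backoff=0.5, exceptions=(Exception,)):
    """Call func(), retrying with exponential backoff on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts, so surface the last error
            # Wait backoff, then 2 * backoff, then 4 * backoff, ...
            time.sleep(backoff * 2 ** (attempt - 1))


# Demonstration with a function that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("temporary failure")
    return "ok"


print(with_retries(flaky, attempts=3, backoff=0))
```

In this tutorial's loop, the call could be wrapped as with_retries(lambda: request_to_gdf(ENDPOINT, params), exceptions=(requests.RequestException,)), so that only failed requests are retried.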
Every web API has its own quirks, which is part of the joy of working with web data. I hope this was helpful and all the best with your geospatial data project!
Special Thanks…
…to my colleague Edward, for working through this blog and providing me with really useful feedback.
fin!