Getting Data from ONS Open Geography Portal

Ingesting data using Python requests & ArcGIS REST API.
Author: Rich Leyshon

Published: December 15, 2023

Modified: December 8, 2024

Wikimedia commons UK Map.

Introduction

This tutorial is for programmers familiar with Python and how to create virtual environments, but perhaps less familiar with the Python requests package or ArcGIS REST API [1].

If you’re in a rush and just need a snippet that will ingest every UK 2021 LSOA boundary available, here is a GitHub gist just for you.

This blog was updated with the following changes:

  • The endpoint url for the LSOA 2021 boundaries had changed and has been updated.
  • Replaced iFrames in the blog with images, as some users were being prompted to log in.
  • Amended code chunks.

The Scenario

You would like to use Python to programmatically ingest data from the Office for National Statistics (ONS) Open Geography Portal. This tutorial aims to help you do this. Working with the 2021 LSOA boundaries, we will explore the essential features and quirks of the ArcGIS REST API.

What you’ll need:

requirements.txt
folium
geopandas
mapclassify
matplotlib
pandas
requests

Tutorial

Setting Things Up

  1. Create a new directory with a requirements file as shown above.
  2. Create a new virtual environment.
  3. Install the dependencies listed above.
  4. Create a file called get_data.py or whatever you would like to call it. The rest of the tutorial will work with this file.
  5. Add the following lines to the top of get_data.py and run them. This confirms that you have the dependencies needed to run the rest of the code:
import requests
import geopandas as gpd
import pandas as pd
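
If you would like a gentler check than a failed import, the sketch below reports which of the packages listed in requirements.txt cannot be found in the active environment. The helper name missing_packages is my own invention for illustration, not part of any library.

```python
from importlib.util import find_spec

# Package names taken from the requirements.txt shown earlier.
REQUIRED = [
    "folium",
    "geopandas",
    "mapclassify",
    "matplotlib",
    "pandas",
    "requests",
]


def missing_packages(names: list[str]) -> list[str]:
    """Return the names that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All dependencies found.")
```

If anything is reported as missing, check that the virtual environment is activated before installing again.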

Finding The Data Asset

One of the tricky parts of working with the GeoPortal is finding the resource that you need.

  1. Access the ONS Open Geography Portal homepage [2].
  2. Using the ribbon menu at the top of the page, navigate to:
    Boundaries > Census Boundaries > Lower Super Output Areas > 2021 Boundaries.
  3. Once you have clicked on this option, a page will open with items related to your selection. Click on the item called “Lower Layer Super Output Areas (2021) Boundaries EW BFC”.
  4. This will bring you to the data asset that you need. The webpage should look like the image below.

ONS Open Geography Portal, LSOA 2021 boundaries interface.

Finding the Endpoint

Now that we have the correct data asset, let’s find the endpoint. This is the url that we will need to send our requests to, in order to receive the data that we need.

  1. Click on the “View Full Details” button.
  2. Scroll down, under the menu “I want to…”, and expand the “View API Resources” menu.
  3. You will see two urls labelled “GeoService” and “GeoJSON”. Click the copy button to the right of the url.
  4. Paste the url into your Python script.
  5. Edit the url string to remove everything to the right of the word ‘query’, including the question mark. Then assign it to a variable called ENDPOINT as below:
ENDPOINT = "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/Lower_layer_Super_Output_Areas_December_2021_Boundaries_EW_BFC_V10/FeatureServer/0/query"

This ENDPOINT is a url that we can use to flexibly ask for only the data or metadata that we require.
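
If you would rather not trim the copied url by hand, the standard library can do it for you. This is a minimal sketch: the helper name strip_query and the query string on the example url are my own assumptions, and the url you copy may carry different parameters.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Drop the query string and fragment, keeping everything up to 'query'."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Hypothetical example of a copied url; the parameters after "?" are assumed.
copied_url = (
    "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/"
    "Lower_layer_Super_Output_Areas_December_2021_Boundaries_EW_BFC_V10/"
    "FeatureServer/0/query?outFields=*&where=1%3D1"
)
print(strip_query(copied_url))
```

The printed url should end with the word 'query', matching the ENDPOINT string above.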

Requesting a Single Entry

Now that we’re set up to make requests, we can use an example that brings back only a small slice of the database. To do this, we will need to specify some query parameters. These parameters will get added to our endpoint url and will be interpreted by ArcGIS to serve us only the data we ask for. In this example, I will ask for a single LSOA boundary only by specifying the LSOA code with an SQL clause. For more detail on the flexibility of ArcGIS API, please consult the documentation [1].

Define the Python dictionary below, noting the syntax and data formats. Don’t forget to wrap the LSOA21CD value in speech marks inside the SQL clause:

# requesting a specific LSOA21CD
params = {
    "where": "LSOA21CD = 'W01002029'",  # SQL clauses can go here
    "outSR": 4326,  # CRS that you want
    "f": "geoJSON",  # response format
    "resultOffset": 0,  # parameter used for pagination later
    "outFields": "*",  # this will ensure all available fields are returned
}
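
To get a feel for what happens to these parameters when they are added to the endpoint url, the sketch below encodes them with the standard library. Treat it as an approximation: requests performs its own encoding when you pass a dictionary, and the result may differ in small details. The variable names here are mine.

```python
from urllib.parse import parse_qs, urlencode

# The same parameters as the params dictionary above.
example_params = {
    "where": "LSOA21CD = 'W01002029'",
    "outSR": 4326,
    "f": "geoJSON",
    "resultOffset": 0,
    "outFields": "*",
}

# Spaces, equals signs and speech marks are all percent-encoded for the url.
query_string = urlencode(example_params)
print(query_string)

# Decoding the query string recovers the original SQL clause.
decoded = parse_qs(query_string)
print(decoded["where"][0])
```

Seeing how heavily the SQL clause is escaped is a good argument for letting a library build the url rather than formatting it by hand.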

Now I will define a function that will make the request and handle the response for us. Go ahead and define this function:

def request_to_gdf(
    url: str, query_params: dict
) -> tuple[requests.Response, gpd.GeoDataFrame]:
    """Send a get request to ArcGIS API & Convert to GeoDataFrame.

    Only works when asking for features and GeoJSON format.

    Parameters
    ----------
    url : str
        The url endpoint.
    query_params : dict
        A dictionary of query parameter : value pairs.

    Returns
    -------
    requests.Response
        The response from ArcGIS API server. Useful for paginated requests
        later.
    gpd.GeoDataFrame
        A GeoDataFrame of the requested geometries in the crs specified by
        the response metadata.

    Raises
    ------
    requests.exceptions.RequestException
        The response was not ok.
    """
    # this approach will only work with geoJSON
    query_params["f"] = "geoJSON"
    response = requests.get(url, params=query_params)
    if response.ok:
        # watch out for JSONDecodeError too...
        content = response.json()
        return (
            response,  # we'll need the response again later for pagination
            gpd.GeoDataFrame.from_features(
                content["features"],
                # best to get crs from response rather than hard-code it
                crs=content["crs"]["properties"]["name"],
            ),
        )
    else:
        # cases where a traditional bad response may be returned
        raise requests.RequestException(
            f"HTTP Code: {response.status_code}, Status: {response.reason}"
        )

Briefly, this function is going to ensure the geoJSON format is asked for, as this is the neatest way to bash the response into a GeoDataFrame. It then queries ArcGIS API with the endpoint and parameters you specify. It checks whether the response is ok (requests treats any status code below 400 as ok). If not, an exception is raised with the HTTP code and status. Finally, if no exception was raised, the ArcGIS response and a GeoDataFrame of the spatial features are returned.

Be careful when handling the response of ArcGIS API. Depending on the query you send, it is possible to return status code 200 responses that seem fine. But if the server was unable to make sense of your SQL query, it may result in a JSONDecodeError or even content with details of your error. It is important to handle the various error conditions if you plan to build something more robust than this tutorial and to be exacting with your query strings. For this reason, I would suggest using the params dictionary approach to introducing query parameters rather than attempting to manually format the url string.
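
One way to guard against a status code 200 response that carries error details is to inspect the parsed content before using it. The sketch below assumes the details arrive under a top-level "error" key with "code" and "message" entries. That shape is an assumption on my part, so print a failing response from your own query to confirm it. The function name and the example payloads are hypothetical.

```python
def raise_for_arcgis_error(content: dict) -> dict:
    """Raise if a parsed response body carries an error, else return it.

    Assumes error details sit under a top-level "error" key.
    """
    if "error" in content:
        error = content["error"]
        if isinstance(error, dict):
            code = error.get("code", "unknown")
            message = error.get("message", "no message provided")
            detail = f"{code}: {message}"
        else:
            detail = str(error)
        raise ValueError(f"ArcGIS error {detail}")
    return content


# Hypothetical payloads for illustration only.
good = {"type": "FeatureCollection", "features": []}
bad = {"error": {"code": 400, "message": "Invalid query parameters."}}

raise_for_arcgis_error(good)  # passes through untouched
try:
    raise_for_arcgis_error(bad)
except ValueError as err:
    print(err)
```

A check like this could sit just after the call to response.json() in request_to_gdf, so that a malformed query fails loudly rather than with a confusing KeyError.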

With that function defined, we can go straight to a tabular data format, like below:

_, gdf = request_to_gdf(ENDPOINT, params)
gdf.head()
  | geometry | FID | LSOA21CD | LSOA21NM | LSOA21NMW | BNG_E | BNG_N | LAT | LONG | Shape__Area | Shape__Length | GlobalID
0 | POLYGON ((-3.06378 51.58946, -3.062 51.58922, ... | 35661 | W01002029 | Newport 009G | Casnewydd 009G | 326721 | 187882 | 51.58501 | -3.05905 | 315524.689297 | 3127.139319 | da32f5a1-4451-46d7-a389-5a3915f3dd7f

We can use the GeoDataFrame .explore() method to quickly inspect the fruit of our efforts.

gdf.explore()

Return All LSOAs in a Local Authority

We probably need to work with more than just a single LSOA, but would prefer not to ingest all of them. Have a look at the available columns in the GeoDataFrame.

  | geometry | FID | LSOA21CD | LSOA21NM | LSOA21NMW | BNG_E | BNG_N | LAT | LONG | Shape__Area | Shape__Length | GlobalID
0 | POLYGON ((-3.06378 51.58946, -3.062 51.58922, ... | 35661 | W01002029 | Newport 009G | Casnewydd 009G | 326721 | 187882 | 51.58501 | -3.05905 | 315524.689297 | 3127.139319 | da32f5a1-4451-46d7-a389-5a3915f3dd7f

There is a pattern that we can exploit to request every LSOA in a local authority. Have a go at updating params["where"] with an SQL query that can achieve this.

Show the solution
params["where"] = "LSOA21NM like 'Newport%'"
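
If you plan to reuse this pattern for other local authorities, a small helper can build the clause for you. This is a sketch only: the function name is hypothetical, and doubling single quotes is the usual SQL way of escaping an apostrophe, which I am assuming the service accepts.

```python
def la_where_clause(la_name: str, field: str = "LSOA21NM") -> str:
    """Build a 'starts with' SQL clause for a local authority name."""
    # Double any single quotes so an apostrophe in a name stays inside the string.
    escaped = la_name.replace("'", "''")
    return f"{field} like '{escaped}%'"


print(la_where_clause("Newport"))
print(la_where_clause("King's Lynn and West Norfolk"))
```

Bear in mind that a prefix match returns every LSOA whose name starts with the text you supply, so a short prefix could pick up more than one local authority. Checking the result on a map is a sensible precaution.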

Pass the updated params dictionary to the request_to_gdf function and use the .explore() method to visualise the map. Confirm that the LSOAs returned match what you expected.

Show the solution
_, gdf = request_to_gdf(ENDPOINT, params)
gdf.explore()

How Many Records Are There?

Update the params dictionary by changing the value of where to '1=1'.

Show the Solution
# parameter to get max allowed data, this will get encoded to
# "where=1%3D1", see [3] for more on URL-encoding
params["where"] = "1=1"

For more on why to do this, consult the ArcGIS docs [1]. This is the way to state ‘where=true’, meaning get every record possible while respecting the maxRecordCount. maxRecordCount limits the number of records returned by a single request, to 2,000 records in most cases. This is ArcGIS’ method of limiting service demand while not requiring authentication. It also means we need to handle paginated responses.

It’s a good idea to confirm the number of records available within the database. Have a go at reading through the ArcGIS docs [1] to find the parameter responsible for returning counts only. Query the database for the number of records and store it as an integer called n_records.

Show the Solution
# how many LSOA boundaries should we expect in the full data?
params["returnCountOnly"] = True
response = requests.get(ENDPOINT, params=params)
n_records = response.json()["properties"]["count"]
print(f"There are {n_records} LSOAs in total")
There are 35672 LSOAs in total
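
As a rough check on what pagination will involve, the arithmetic below works out how many requests are needed for this many records. It assumes the 2,000 record limit mentioned above applies to this service. The variable names are mine.

```python
import math

n_records_example = 35672  # the count reported above
page_size = 2000  # assumed maxRecordCount for this service

# Round up, because a partly filled final page still needs its own request.
n_pages = math.ceil(n_records_example / page_size)

# The resultOffset value that each request would need.
offsets = list(range(0, n_records_example, page_size))
last_page_size = n_records_example - offsets[-1]

print(f"{n_pages} requests needed; the last page holds {last_page_size} records")
# prints "18 requests needed; the last page holds 1672 records"
```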

Paginated Requests

Now we have the number of records, it’s important to go back to collecting geometries. Please update the params dictionary to allow that to happen.

Show the Solution
# lets now return to collecting geometries
del params["returnCountOnly"]  # alternatively, set this to False

Have a go at requesting the first batch of LSOA boundaries. Count how many you get without attempting to paginate.

Show the Solution
response, gdf = request_to_gdf(ENDPOINT, params)
print(f"There are only {len(gdf)} LSOAs on this page.")
There are only 2000 LSOAs on this page.

Visualise the first 100 rows of the GeoDataFrame you created in the previous step.

Show the Solution
gdf.head(100).explore()

We need a condition to check if there are more pages left in the database. See if you can find the target parameter by examining the response properties.

Show the Solution
content = response.json()
# this is conditional on whether more pages are available
more_pages = content["properties"]["exceededTransferLimit"]
print(
    f"It is {more_pages}, that there are more pages of data to ingest..."
    )
It is True, that there are more pages of data to ingest...

We are nearly ready to ask for every available LSOA boundary. This will be an expensive request. Therefore to make things go a bit faster, let’s ask for only the default fields by removing params["outFields"].

Show the solution
del params["outFields"]

Now we need to make use of the resultOffset parameter that we added to our params dictionary earlier. We need to send multiple queries to the server, incrementing the value of resultOffset by the number of records on each page in every consecutive request. This may take quite a while, depending on your connection. Add the code below to your Python script and run it, then make yourself a cup of your chosen beverage.

offset = len(gdf)  # number of records to offset by
all_lsoas = gdf  # append our growing gdf of LSOA boundaries to this
while more_pages:
    try:
        params["resultOffset"] += offset  # increment the records to ingest
        response, gdf = request_to_gdf(ENDPOINT, params)
        content = response.json()
        all_lsoas = pd.concat([all_lsoas, gdf])
        more_pages = content["properties"]["exceededTransferLimit"]
    except KeyError:
        # rather than exceededTransferLimit = False, it disappears...
        more_pages = False

all_lsoas = all_lsoas.reset_index(drop=True)

Be careful with the exceededTransferLimit parameter. Instead of being set to False on the last page (as the docs suggest it should), it actually disappears, which is why I use the try:...except clause above. You can attempt to set the returnExceededLimitFeatures parameter explicitly, but I find this makes no difference.

params["returnExceededLimitFeatures"] = "true"
# or
params["returnExceededLimitFeatures"] = True
# both patterns result in the same behaviour as not setting it
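
An alternative to catching the KeyError is to read the flag with a default, so a missing key is simply treated as False. The sketch below wraps that idea in a generator and exercises it against a stand-in for the server, so it runs without a network connection. The names iter_pages and fake_fetch are hypothetical, and the fake response shape only mimics the keys used in this tutorial.

```python
from typing import Callable, Iterator


def iter_pages(fetch: Callable[[dict], dict], params: dict) -> Iterator[list]:
    """Yield each page of features until the server stops flagging more."""
    page_params = dict(params)  # copy, so the caller's dictionary is untouched
    page_params.setdefault("resultOffset", 0)
    while True:
        content = fetch(page_params)
        features = content["features"]
        yield features
        # The flag is missing on the last page, so default to False.
        more = content.get("properties", {}).get("exceededTransferLimit", False)
        if not more or not features:
            break
        page_params["resultOffset"] += len(features)


# A stand-in for the server: 5 records served 2 at a time.
RECORDS = list(range(5))


def fake_fetch(query: dict) -> dict:
    start = query["resultOffset"]
    content = {"features": RECORDS[start:start + 2]}
    if start + 2 < len(RECORDS):
        content["properties"] = {"exceededTransferLimit": True}
    return content


pages = list(iter_pages(fake_fetch, {"where": "1=1"}))
print(pages)  # prints [[0, 1], [2, 3], [4]]
```

In real use, fetch would be a function that sends the request and returns response.json(). Reading the flag with a default also means that a KeyError raised for some other reason is not silently swallowed.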

Check whether the number of records ingested matches the number expected.

Show the Solution
all_done = len(all_lsoas) == n_records
print(
    f"Does the row count match the expected number of records? {all_done}"
    )
Does the row count match the expected number of records? True

Finally, visualise the last 100 records available within the GeoDataFrame.

Show the Solution
all_lsoas.tail(100).explore()

Troubleshooting

One tip I have for troubleshooting queries is to open up the web interface for the ENDPOINT, by pasting it into your web browser, for example:

LSOA 2021 Endpoint Interface.

By using the fields to test out your query parameters and clicking the “Query (GET)” button at the bottom of the page, you can get an indication of whether your query is valid. This is a good place to test out more complex SQL statements for the where parameter:

LSOA 2021 Endpoint Interface.

If you encounter an HTTP 403 Forbidden response, check that the endpoint you are using is still valid. At times, the ONS Geoportal changes its endpoints, and old endpoints will no longer allow access.

Conclusion

In this tutorial, we have:

  • Demonstrated how to find resources on ONS Open Geography Portal.
  • Found the ArcGIS endpoint url of that resource.
  • Had a brief read through the ArcGIS documentation.
  • Queried the API for a single LSOA code.
  • Discussed a few of the quirks of this API.
  • Retrieved the total number of records available.
  • Used paginated requests to retrieve every record in the database.

A good next step towards a more robust ingestion method would be to consider adding a retry strategy to the requests [4]. For a great overview of the essentials of geographic data and tools, check out my colleague’s fantastic blog on geospatial need-to-knows [5].
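
To give a flavour of what a retry strategy might look like, here is a minimal, generic sketch that retries a function with exponential backoff. It is not the adapter-based approach described in [4], just a standard library illustration. The name with_retries and the flaky function are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: tuple = (Exception,),
) -> T:
    """Call func, retrying with exponential backoff on failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts, surface the last error
            # Wait base_delay, then double it after each failed attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))


# Demonstration with a function that fails twice before succeeding.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = with_retries(flaky, attempts=3, base_delay=0.01)
print(result, calls["count"])
```

With the function from this tutorial, you might wrap the call as with_retries(lambda: request_to_gdf(ENDPOINT, params), retry_on=(requests.RequestException,)), so that only request failures trigger another attempt.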

Every web API has its own quirks, which is part of the joy of working with web data. I hope this was helpful and all the best with your geospatial data project!

Special Thanks…

…to my colleague Edward, for working through this blog and providing me with really useful feedback.

fin!

References

[1]
[2] Office for National Statistics, “Office for National Statistics Open Geography Portal.” https://geoportal.statistics.gov.uk/
[3] Refsnes Data, “HTML URL Encoding Reference.” https://www.w3schools.com/tags/ref_urlencode.ASP
[4] ZenRows, “Python Requests: Retry Failed Requests [2023].” https://www.zenrows.com/blog/python-requests-retry
[5] Nicci Potts, “Geospatial need-to-knows.” https://niccipotts.netlify.app/blog/geospatialintro/