library(here)
library(janitor)
library(glue)
library(httr)
library(jsonlite)
library(listviewer)
library(tidyverse)
<-
obo_offence_category_stats_csv read_csv("https://www.dhi.ac.uk/api/data/csv/agg/oldbailey_offence?series=offence_category")
API terms and statistics
Introduction
There were a couple of convenient features in the old OBAPI whose absence I’ve been asked about a number of times since the API changed. The first was the terms endpoint, which returned a JSON list of all the fields that could be queried, their types and possible values. The second was the breakdown
parameter which could be added to API queries to include a frequency table of terms in the results, and was very handy for statistical analysis.
Until very recently I thought there was no way to reproduce either functionality in the new API without constructing lists manually and/or running a lot of queries using the main search endpoint. And, as far as I know, there aren’t any exact equivalents. But it turns out you can get (most of) the same information, plus more that wasn’t queryable before.
However, it’s undocumented and not easy to use without some background knowledge about the underlying data and how search functions work. So I’m writing this to fill in the gaps (and provide a resource that I can point enquirers to in the future).
Note that I’m still learning about the functions and this post may be subject to minor revisions.
Discovery
The solution to both problems lies in the statistics search. When you run a search there, you can see a “Download this data” link at the bottom of the table or visualisation it displays; if you click on the link it’ll download a CSV file. That’s pretty useful.
But it gets better when you inspect the URL in the download link. (If you’re not familiar with doing that, the Programming Historian has a quick primer.)
This is the download URL for a default offence categories search:
- https://www.dhi.ac.uk/api/data/csv/agg/oldbailey_offence?series=offence_category
It looks very much like an API URL (“https://www.dhi.ac.uk/api/data/…”), and I could infer that it aggregates data into a CSV file related to oldbailey_offences… but it isn’t mentioned anywhere in the API documentation.
We all like a good mystery, don’t we? Let’s have a quick look at what it fetches.
This looks nice. There’s a key
column with what I know are our offence categories’ names and a doc_count
column, presumably counts of results. Counting what though? The default for statistical offence searches is to count by offence, so that’s the most likely (and is confirmed on the site results page).
obo_offence_category_stats_csv
# A tibble: 9 × 2
key doc_count
<chr> <dbl>
1 breakingPeace 7696
2 damage 1054
3 deception 15879
4 kill 5378
5 miscellaneous 3233
6 royalOffences 10681
7 sexual 7224
8 theft 151011
9 violentTheft 9016
Let’s test it out by running breakingPeace
through an API search query.
<- fromJSON("https://www.dhi.ac.uk/api/data/oldbailey_record?offence=breakingPeace&div_type=trialAccount") breaking_peace_json
That gets, as anticipated, the first ten results for offences in the top level category of Breaking Peace. The total trial hits
(7465) are lower than the doc_count
(7696).
jsonedit(breaking_peace_json$hits)
Digging deeper
I’m calling these agg
queries to differentiate them from the API search and single record queries I’ve already written about.
After various tests and snafus, I worked out the key components of the URLs:
- the fixed part of the URL is https://www.dhi.ac.uk/api/data/csv/agg/
- the unit to count by is in the next path segment before the query string, eg /oldbailey_offence; it’s always prefixed by “oldbailey_”
- the important bits of the query string are
series
androws
NB: you may also see a countby
in query strings when exploring searches. I don’t advise using it in API queries; I tested it out quite a bit and I think it either doesn’t do anything at all or the oldbailey_*
path parameter always takes precedence.
what to count by?
I should emphasise that what you count by in OBO statistical searches can matter a lot.
There are in fact six options: offence, verdict, punishment, defendant, victim and trial. Why so many?!
The OBO data is not simple tabular data. In any given trial, there can be multiple defendants, offences, verdicts, punishments and victims. So you can get (as seen with the breakingPeace
query) very different results when you change what you count by, and the most appropriate choice is likely to depend on what kind of search or analysis you’re doing.
We spent quite a while, way back in about 2008, discussing the most appropriate countby defaults for each search; if you run a few queries on the site changing the search category but keeping the default counts, you can see how it changes. The complexity of the statistics tool led us to write a guide to its use.
It doesn’t become any less complex just because you interact it with it via the API rather than the search form on the site. In some respects it needs even more care, because searches on the site return much more information about the search (in the display it tells you exactly what was counted and what it was counted by). The agg
query downloads lose much of that context. Plus, when constructing an agg
query, you have to specify what you want to count by; you can’t just let the site do it for you.
Let’s look again at the offence categories query, and this time change to /oldbailey_trial. Now the total for breakingPeace
matches the json hits total.
<-
obo_offence_trial_query read_csv("https://www.dhi.ac.uk/api/data/csv/agg/oldbailey_trial?series=offence_category")
obo_offence_trial_query
# A tibble: 9 × 2
key doc_count
<chr> <dbl>
1 breakingPeace 7465
2 damage 1027
3 deception 14116
4 kill 5301
5 miscellaneous 3163
6 royalOffences 10477
7 sexual 6943
8 theft 142539
9 violentTheft 8628
$hits$total breaking_peace_json
[1] 7465
In summary:
The one time it shouldn’t matter too much what you count by is if you simply want a list of the available terms for a search query.
If you want counts that will match the hits totals returned by the search API, use /oldbailey_trial
. (You’ll also need to add div_type=trialAccount
to the search query to ensure they match exactly.)
If you want counts for a statistical analysis and you want your results to be reproducible, choose the most appropriate count carefully and document your choice every time.
query strings
If you’re looking at the statistics search form it has four main sections (apart from filtering options in the sidebar, which I’m not going into here but would expect to work in much the same way as other API queries):
- select a search category (required)
- select a second category (optional) - adding this generates more complex tables (crosstabs)
- count by (optional)
- display output (optional)
1 and 2 use identical lists of variables, which I’ve pulled out of the search form for convenience.
<-
obo_agg_variables_csv read_csv(here::here("data/obo_agg_variables.csv"))
$name obo_agg_variables_csv
[1] "offence_category" "offence_subcategory" "plea"
[4] "verdict_category" "verdict_subcategory" "punishment_category"
[7] "punishment_subcategory" "defendant_gender" "defendant_age"
[10] "victim_gender" "victim_age" "victim_hisco_class_1"
[13] "victim_hisco_label" "victim_institution" "decade"
[16] "year"
Translating from the search form to agg
query strings, the first search box is equivalent to series
and the second, if you want it, is rows
. As far as I can tell from testing you’ll never need anything else. You can have any combination of variables you like in crosstabs (though some are likely to make more sense than others).
a crosstabs example
Let’s say I’d like a breakdown of offence subcategories per year, counting by offence.
<- "https://www.dhi.ac.uk/api/data/csv/agg/oldbailey_offence?rows=offence_subcategory&series=year"
agg_query_url
<- read_csv(agg_query_url) offence_subcategory_year_csv
The result:
offence_subcategory_year_csv
# A tibble: 240 × 74
key doc_count animalTheft bigamy burglary coiningOffences grandLarceny
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1674 41 5 1 12 1 1
2 1675 59 4 0 8 1 7
3 1676 87 4 5 11 1 9
4 1677 91 6 2 8 5 7
5 1678 132 16 4 2 8 26
6 1679 131 10 0 6 10 8
7 1680 127 5 1 16 9 21
8 1681 135 8 3 11 2 22
9 1682 174 5 4 7 5 34
10 1683 215 17 3 5 10 34
# ℹ 230 more rows
# ℹ 67 more variables: highwayRobbery <dbl>, housebreaking <dbl>,
# infanticide <dbl>, killOther <dbl>, murder <dbl>, rape <dbl>,
# receiving <dbl>, religiousOffences <dbl>, theftFromPlace <dbl>,
# theftOther <dbl>, arson <dbl>, assault <dbl>, forgery <dbl>,
# miscellaneousOther <dbl>, pervertingJustice <dbl>, pettyLarceny <dbl>,
# pettyTreason <dbl>, pocketpicking <dbl>, robbery <dbl>, …
Scaling up
Chances are you’ll want to get more than one thing at a time! Here are a couple of examples.
download multiple CSVs
Maybe you want to get the CSVs for all the search variables, ready to use in search scripts. Here’s one way to do it, for the simple series
search for each variable.
First build the query URLs and filenames.
# construct a) URLs and b) filenames from the agg variables list
# I like to embed info about the data in the filename:
# name of variable + count by + date of download
<- today() |> format('%Y%m%d')
date_stamp
<-
obo_agg_variables |>
obo_agg_variables_csv mutate(url = glue("https://www.dhi.ac.uk/api/data/csv/agg/oldbailey_trial?series={name}")) |>
mutate(filename = glue("{name}-bytrial-{date_stamp}"))
# pull out the variable names
<- obo_agg_variables$name
obo_agg_variables_names
# pull out the filenames
<- obo_agg_variables$filename obo_agg_variables_filenames
Then the process for fetching and saving is much the same as I used in my previous API search example
# function to write the CSVs to files
# the outputs/csv/single/ folder must already exist
<- function(data, names) {
output_obo_single_csv <- here::here("outputs/csv/single")
folder_path write_csv(data, paste0(folder_path, "/", names, ".csv") )
}
# slightly slow down read_csv with slowly() when it's used in map.
<-
slow_csv slowly(read_csv, rate=rate_delay(1))
# use purrr::map to fetch the data for each variable
<- map(obo_agg_variables$url, slow_csv)
obo_agg_variables_csvs
# add the variable names back to the list elements
names(obo_agg_variables_csvs) <- obo_agg_variables_names
# write the CSVs with purrr::pmap.
invisible(
list(
data=obo_agg_variables_csvs,
names=obo_agg_variables_filenames
|>
) pmap(output_obo_single_csv)
)
Checking out one of the files:
<-
obo_agg_punishment_category_csv read_csv(here::here("outputs/csv/single/punishment_category-bytrial-20250529.csv"))
obo_agg_punishment_category_csv
# A tibble: 6 × 2
key doc_count
<chr> <dbl>
1 corporal 8910
2 death 10486
3 imprison 80937
4 miscPunish 12492
5 noPunish 4838
6 transport 40521
all the combinations for crosstabs
There is a neat R function, combn(), to make a set of all unique pairs from a list, when you want to avoid reverse-duplicates (ie, you don’t need both “a”+“b” and “b”+“a”). The only downside is that it returns a matrix, so you need to convert back to a tibble/dataframe.
<-
obo_agg_variables_pairs combn(obo_agg_variables_names, m=2 ) |>
# it's a *very* wide matrix; t() transposes rows and columns
t() |>
# convert matrix to tibble; .name_repair to create column names
as_tibble(.name_repair = ~c("series", "rows"))
obo_agg_variables_pairs
# A tibble: 120 × 2
series rows
<chr> <chr>
1 offence_category offence_subcategory
2 offence_category plea
3 offence_category verdict_category
4 offence_category verdict_subcategory
5 offence_category punishment_category
6 offence_category punishment_subcategory
7 offence_category defendant_gender
8 offence_category defendant_age
9 offence_category victim_gender
10 offence_category victim_age
# ℹ 110 more rows
On the other hand, if you want all the possible combinations including reverse-duplicates, you could instead use tidyr::expand_grid() which will “create a tibble from all combinations of inputs”. That’ll include identical pairs, so you need to filter those out.
<-
obo_agg_variables_pairs_all expand_grid(rows=obo_agg_variables_names,
series=obo_agg_variables_names) |>
filter(rows != series)
obo_agg_variables_pairs_all
# A tibble: 240 × 2
rows series
<chr> <chr>
1 offence_category offence_subcategory
2 offence_category plea
3 offence_category verdict_category
4 offence_category verdict_subcategory
5 offence_category punishment_category
6 offence_category punishment_subcategory
7 offence_category defendant_gender
8 offence_category defendant_age
9 offence_category victim_gender
10 offence_category victim_age
# ℹ 230 more rows