Tutorial: BigQuery arrays and structs

Screenshot 2019-12-27 at 08.14.15
It’s nested and repeated 🙂

The first time I encountered the BigQuery export schema this year my heart sank: arrays and structs were not something covered in my SQL intro course! But having spent a few months extracting data like this I’ve come to appreciate the logic. These are all the ‘notes to self’ I wish I’d had at the outset!

I’m going to start by unpacking some of the examples found in the Working with arrays documentation in a bit more detail, and then all subsequent examples query The Met collection vision_api_data in BigQuery which is a public dataset so you should be able to run them for yourself and then experiment with more.

The main concepts

The easiest way to understand how to get the data OUT is to start by looking at what happens when we put the data IN. By the end of this section, you’ll also be able to describe the difference between an array and a struct.

The array

An array is required when there are multiple field values associated with a single record – it represents a 1:many relationship. The following are all examples of when an array might be suitable:

  • List of artworks per room
  • List of movies recommended per subscriber
  • List of wines per region

Let’s create a simple array example: 3 records, each containing an array. We insert array elements within square brackets as shown. By using UNION ALL we end up with 3 records, each containing their own array of values:

SELECT ['painting', 'sculpture', 'installation'] AS artworks
UNION ALL
SELECT ['drawing', 'painting'] AS artworks
UNION ALL
SELECT ['painting', 'drawing', 'collage'] AS artworks

view raw
bq_nest4.txt
hosted with ❤ by GitHub

The result is Row 1 with 3 associated values, Row 2 with 2 associated values and Row 3 with 3 associated values:

Screenshot 2019-12-22 at 18.13.08

Arrays always contain variables of the same type, so the above statement creates 3 arrays that all contain the STRING data type.

Now there are 2 basic ways to get this data out – here is the most obvious:

WITH paintings AS
(
SELECT ['painting', 'sculpture', 'installation'] AS artworks
UNION ALL
SELECT ['drawing', 'painting'] AS artworks
UNION ALL
SELECT ['painting', 'drawing', 'collage'] AS artworks
)
SELECT
artworks
FROM
paintings

view raw
bq_nest5.txt
hosted with ❤ by GitHub

This gives us exactly what we started with: our 3 records, each with their associated arrays:

Screenshot 2019-12-25 at 18.44.15

But sometimes we would rather get those values out in a non-array format – otherwise known as flattened. For this we can use CROSS JOIN UNNEST() – and the thing we unnest is the array (in this case artworks):

WITH paintings AS
(
SELECT ['painting', 'sculpture', 'installation'] AS artworks
UNION ALL
SELECT ['drawing', 'painting'] AS artworks
UNION ALL
SELECT ['painting', 'drawing', 'collage'] AS artworks
)
SELECT
artworks
FROM
paintings
CROSS JOIN UNNEST(artworks) AS artworks

view raw
bq_nest6.txt
hosted with ❤ by GitHub

Notice we now have 8 ‘normal’ rows containing our data:

Screenshot 2019-12-25 at 18.56.01

The struct

A struct is required when there are sub-types of information for a single record. The following are all examples of when a struct might be suitable:

  • List of several attributes of artworks per room
  • List of actors and directors of movies recommended per subscriber
  • List of chemical characteristics of wines per region

Let’s create a simple example from the documentation – notice how we are actually constructing an array of STRUCTS (see those square brackets surrounding the whole thing?):

SELECT "800M" AS race,
[STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits),
STRUCT("Makhloufi" as name, [24.5, 25.4, 26.6, 26.1] as splits),
STRUCT("Murphy" as name, [23.9, 26.0, 27.0, 26.0] as splits),
STRUCT("Bosse" as name, [23.6, 26.2, 26.5, 27.1] as splits),
STRUCT("Rotich" as name, [24.7, 25.6, 26.9, 26.4] as splits),
STRUCT("Lewandowski" as name, [25.0, 25.7, 26.3, 27.2] as splits),
STRUCT("Kipketer" as name, [23.2, 26.1, 27.3, 29.4] as splits),
STRUCT("Berian" as name, [23.7, 26.1, 27.0, 29.3] as splits)]
AS participants

view raw
bq_nest7.txt
hosted with ❤ by GitHub

We have essentially only created 2 items at the highest level: race (a STRING) and participants (a STRUCT) but because of the way the struct is set up, participants has 2 nested fields: name (a STRING) and splits (an array). So we effectively end up with 3 fields of data, one of which is an array with multiple values (splits):

Screenshot 2019-12-25 at 19.43.26

To extract the data in the normal ‘vanilla’ way, you can only refer to those top 2 items we created: races and participants:

WITH races AS (
SELECT "800M" AS race,
[STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits),
STRUCT("Makhloufi" as name, [24.5, 25.4, 26.6, 26.1] as splits),
STRUCT("Murphy" as name, [23.9, 26.0, 27.0, 26.0] as splits),
STRUCT("Bosse" as name, [23.6, 26.2, 26.5, 27.1] as splits),
STRUCT("Rotich" as name, [24.7, 25.6, 26.9, 26.4] as splits),
STRUCT("Lewandowski" as name, [25.0, 25.7, 26.3, 27.2] as splits),
STRUCT("Kipketer" as name, [23.2, 26.1, 27.3, 29.4] as splits),
STRUCT("Berian" as name, [23.7, 26.1, 27.0, 29.3] as splits)]
AS participants)
SELECT
race,
participants
FROM races

view raw
bq_nest8.txt
hosted with ❤ by GitHub

We get all our data BUT this is not always convenient format…

Screenshot 2019-12-25 at 19.50.08

What if you only wanted to get the names of participants? Well remember that we created this data as an array of structs so we can unnest that participants array:

WITH races AS (
SELECT "800M" AS race,
[STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits),
STRUCT("Makhloufi" as name, [24.5, 25.4, 26.6, 26.1] as splits),
STRUCT("Murphy" as name, [23.9, 26.0, 27.0, 26.0] as splits),
STRUCT("Bosse" as name, [23.6, 26.2, 26.5, 27.1] as splits),
STRUCT("Rotich" as name, [24.7, 25.6, 26.9, 26.4] as splits),
STRUCT("Lewandowski" as name, [25.0, 25.7, 26.3, 27.2] as splits),
STRUCT("Kipketer" as name, [23.2, 26.1, 27.3, 29.4] as splits),
STRUCT("Berian" as name, [23.7, 26.1, 27.0, 29.3] as splits)]
AS participants)
SELECT
race,
name
FROM races AS r
CROSS JOIN UNNEST(r.participants) as participant

view raw
bq_nest9.txt
hosted with ❤ by GitHub

And voilĂ , we can now refer to the participants’ name field individually to extract it:

Screenshot 2019-12-25 at 19.56.59

Now, what about splits? That is an array within a struct… well, we can just keep on unnesting…

WITH races AS (
SELECT "800M" AS race,
[STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits),
STRUCT("Makhloufi" as name, [24.5, 25.4, 26.6, 26.1] as splits),
STRUCT("Murphy" as name, [23.9, 26.0, 27.0, 26.0] as splits),
STRUCT("Bosse" as name, [23.6, 26.2, 26.5, 27.1] as splits),
STRUCT("Rotich" as name, [24.7, 25.6, 26.9, 26.4] as splits),
STRUCT("Lewandowski" as name, [25.0, 25.7, 26.3, 27.2] as splits),
STRUCT("Kipketer" as name, [23.2, 26.1, 27.3, 29.4] as splits),
STRUCT("Berian" as name, [23.7, 26.1, 27.0, 29.3] as splits)]
AS participants)
SELECT
r.race,
participant.name,
splits
FROM races AS r
CROSS JOIN UNNEST(r.participants) as participant
CROSS JOIN UNNEST(participant.splits) AS splits

view raw
bq_nest10.txt
hosted with ❤ by GitHub

And now we have a fully-flattened structure:

Screenshot 2019-12-25 at 20.00.24.png

One more point before we move on: you’ll notice that we often alias as we unnest – you don’t always have to do this BUT, as with other SQL scenarios you’ll find that it helps avoid potential problems with duplicate names down the line.

A typical dataset – nested and repeated

So, having seen how we construct and extract array and struct data, let’s have a look at a typical dataset that has been set up using these features: `bigquery-public-data.the_met.vision_api_data`. Remember this is a public dataset so you’ll be able to query it too…

First, let’s look at the start of the schema: this is a fairly typical scenario:

Screenshot 2019-12-21 at 16.12.02

object_id is the unique identifier for each artwork in The Met collection. faceAnnotations is a struct containing multiple different data points to do with faces in the artworks.

For each object, there may be multiple faceAnnotations entries (this makes sense right? think of a family portrait – there is one object (the portrait) but there could be multiple faces in the portrait: each requiring unique annotations about their properties.

In fact, we could change the schema to look like this and then the various 1:1 or 1:many relationships become more obvious:

Screenshot 2019-12-26 at 20.34.07

  • RECORD REPEATED indicates a 1:many relationship
  • RECORD NULLABLE indicates a 1:1 relationship

Let’s have a look at one of the images in detail:

bigquery-public-data.the_met.images​object_id = 647580

If we want to pull out ALL the data for faceAnnotations then no fancy footwork is required. By running this:

SELECT
object_id
, faceAnnotations
FROM
`bigquery-public-data.the_met.vision_api_data`
WHERE
object_id = 647580

view raw
bq_nest1.txt
hosted with ❤ by GitHub

… we get this:

Screenshot 2019-12-22 at 09.58.45

Notice how the data is returned:

  • There is only one face in our image, so we only have one set of record data per object_id to deal with in this particular scenario
  • Our lady is wearing a hat, and the headwearLikelihood is VERY_LIKELY – fabulous!
  • Our bounding box has 4 points: one for each corner, and we see these vertices listed as expected.

Now, what if all we want from this data is whether or not there is headwear in the picture? This is where our CROSS JOIN UNNEST() comes in…

SELECT
object_id
, headwearLikelihood
FROM
`bigquery-public-data.the_met.vision_api_data` AS main
CROSS JOIN UNNEST(faceAnnotations) AS faceA
WHERE
object_id BETWEEN 647500 and 647600

view raw
bq_nest2.txt
hosted with ❤ by GitHub

Having unnested faceAnnotations we can now just extract headwearLikelihood:

Screenshot 2019-12-22 at 10.18.32

What about one level down? Supposing we want to extract only the data for those x and y vertices? We unnest faceAnnotations and boundingPoly.vertices and we’re good to go:

SELECT
object_id
, y
, x
FROM
`bigquery-public-data.the_met.vision_api_data` AS main
CROSS JOIN UNNEST(faceAnnotations) fA
CROSS JOIN UNNEST(fA.boundingPoly.vertices) vertices
WHERE
object_id BETWEEN 647500 and 647600

view raw
bq_nest3.txt
hosted with ❤ by GitHub

Here are our results:

Screenshot 2019-12-22 at 16.46.47

Additional cool functions

ARRAY_AGG() – get it back into array format

Having un-nested and extracted just the desired data, you may want to get it back into an array format. This is where ARRAY_AGG() comes in handy – consider the following example: we’re getting out description from labelAnnotations, and then putting those resulting descriptions back into array format so that overall we still end up with the 41 records which represent the 41 Claude Monet works currently in The Met collection:

SELECT
main.object_id
, title
, artist_display_name
, ARRAY_AGG(lA.description) AS description
FROM `bigquery-public-data.the_met.objects` AS main
INNER JOIN
`bigquery-public-data.the_met.vision_api_data` AS api
ON main.object_id = api.object_id
CROSS JOIN UNNEST(labelAnnotations) lA #notice how UNNEST is placed after the INNER JOIN
WHERE
LOWER(artist_display_name) = 'claude monet'
GROUP BY
1, 2, 3

view raw
bq_nest11.txt
hosted with ❤ by GitHub

Here are the resulting descriptions of the first couple of titles:

Screenshot 2019-12-27 at 07.38.29

Remember that without ARRAY_AGG() we would have ended up with a fully flattened format, like this:

Screenshot 2019-12-27 at 08.28.58

ARRAY_TO_STRING() – get it into delimited format

For further processing, especially in Python, it might actually be more useful to have those descriptions in a single, delimited field. What we do is take the resulting array formed by ARRAY_AGG() and use ARRAY_TO_STRING() to convert it to a delimited string – you can specify a delimiter of your choice:

SELECT
main.object_id
, title
, artist_display_name
, ARRAY_TO_STRING(ARRAY_AGG(lA.description), ';') AS description
FROM `bigquery-public-data.the_met.objects` AS main
INNER JOIN
`bigquery-public-data.the_met.vision_api_data` AS api
ON main.object_id = api.object_id
CROSS JOIN UNNEST(labelAnnotations) lA
WHERE
LOWER(artist_display_name) = 'claude monet'
GROUP BY
1, 2, 3

view raw
bq_nest12.txt
hosted with ❤ by GitHub

Now we have an easily exportable format with one line per record:

Screenshot 2019-12-27 at 08.03.16

 

My understanding may still need some refinement so if you have corrections or suggestions for me you’re most welcome to get in touch. Thanks for reading!

Comments are closed.

Create a website or blog at WordPress.com

Up ↑

%d bloggers like this: