
The first time I encountered the BigQuery export schema this year my heart sank: arrays and structs were not something covered in my SQL intro course! But having spent a few months extracting data like this I’ve come to appreciate the logic. These are all the ‘notes to self’ I wish I’d had at the outset!
I’m going to start by unpacking some of the examples found in the Working with arrays documentation in a bit more detail, and then all subsequent examples query The Met collection vision_api_data in BigQuery which is a public dataset so you should be able to run them for yourself and then experiment with more.
The main concepts
The easiest way to understand how to get the data OUT is to start by looking at what happens when we put the data IN. By the end of this section, you’ll also be able to describe the difference between an array and a struct.
The array
An array is required when there are multiple field values associated with a single record – it represents a 1:many relationship. The following are all examples of when an array might be suitable:
- List of artworks per room
- List of movies recommended per subscriber
- List of wines per region
Let’s create a simple array example: 3 records, each containing an array. We insert array elements within square brackets as shown. By using UNION ALL we end up with 3 records, each containing their own array of values:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
SELECT ['painting', 'sculpture', 'installation'] AS artworks | |
UNION ALL | |
SELECT ['drawing', 'painting'] AS artworks | |
UNION ALL | |
SELECT ['painting', 'drawing', 'collage'] AS artworks |
The result is Row 1 with 3 associated values, Row 2 with 2 associated values and Row 3 with 3 associated values:
Arrays always contain variables of the same type, so the above statement creates 3 arrays that all contain the STRING data type.
Now there are 2 basic ways to get this data out – here is the most obvious:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
WITH paintings AS | |
( | |
SELECT ['painting', 'sculpture', 'installation'] AS artworks | |
UNION ALL | |
SELECT ['drawing', 'painting'] AS artworks | |
UNION ALL | |
SELECT ['painting', 'drawing', 'collage'] AS artworks | |
) | |
SELECT | |
artworks | |
FROM | |
paintings |
This gives us exactly what we started with: our 3 records, each with their associated arrays:
But sometimes we would rather get those values out in a non-array format – otherwise known as flattened. For this we can use CROSS JOIN UNNEST() – and the thing we unnest is the array (in this case artworks):
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
WITH paintings AS | |
( | |
SELECT ['painting', 'sculpture', 'installation'] AS artworks | |
UNION ALL | |
SELECT ['drawing', 'painting'] AS artworks | |
UNION ALL | |
SELECT ['painting', 'drawing', 'collage'] AS artworks | |
) | |
SELECT | |
artworks | |
FROM | |
paintings | |
CROSS JOIN UNNEST(artworks) AS artworks |
Notice we now have 8 ‘normal’ rows containing our data:
The struct
A struct is required when there are sub-types of information for a single record. The following are all examples of when a struct might be suitable:
- List of several attributes of artworks per room
- List of actors and directors of movies recommended per subscriber
- List of chemical characteristics of wines per region
Let’s create a simple example from the documentation – notice how we are actually constructing an array of STRUCTS (see those square brackets surrounding the whole thing?):
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
SELECT "800M" AS race, | |
[STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits), | |
STRUCT("Makhloufi" as name, [24.5, 25.4, 26.6, 26.1] as splits), | |
STRUCT("Murphy" as name, [23.9, 26.0, 27.0, 26.0] as splits), | |
STRUCT("Bosse" as name, [23.6, 26.2, 26.5, 27.1] as splits), | |
STRUCT("Rotich" as name, [24.7, 25.6, 26.9, 26.4] as splits), | |
STRUCT("Lewandowski" as name, [25.0, 25.7, 26.3, 27.2] as splits), | |
STRUCT("Kipketer" as name, [23.2, 26.1, 27.3, 29.4] as splits), | |
STRUCT("Berian" as name, [23.7, 26.1, 27.0, 29.3] as splits)] | |
AS participants |
We have essentially only created 2 items at the highest level: race (a STRING) and participants (a STRUCT) but because of the way the struct is set up, participants has 2 nested fields: name (a STRING) and splits (an array). So we effectively end up with 3 fields of data, one of which is an array with multiple values (splits):
To extract the data in the normal ‘vanilla’ way, you can only refer to those top 2 items we created: races and participants:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
WITH races AS ( | |
SELECT "800M" AS race, | |
[STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits), | |
STRUCT("Makhloufi" as name, [24.5, 25.4, 26.6, 26.1] as splits), | |
STRUCT("Murphy" as name, [23.9, 26.0, 27.0, 26.0] as splits), | |
STRUCT("Bosse" as name, [23.6, 26.2, 26.5, 27.1] as splits), | |
STRUCT("Rotich" as name, [24.7, 25.6, 26.9, 26.4] as splits), | |
STRUCT("Lewandowski" as name, [25.0, 25.7, 26.3, 27.2] as splits), | |
STRUCT("Kipketer" as name, [23.2, 26.1, 27.3, 29.4] as splits), | |
STRUCT("Berian" as name, [23.7, 26.1, 27.0, 29.3] as splits)] | |
AS participants) | |
SELECT | |
race, | |
participants | |
FROM races |
We get all our data BUT this is not always convenient format…
What if you only wanted to get the names of participants? Well remember that we created this data as an array of structs so we can unnest that participants array:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
WITH races AS ( | |
SELECT "800M" AS race, | |
[STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits), | |
STRUCT("Makhloufi" as name, [24.5, 25.4, 26.6, 26.1] as splits), | |
STRUCT("Murphy" as name, [23.9, 26.0, 27.0, 26.0] as splits), | |
STRUCT("Bosse" as name, [23.6, 26.2, 26.5, 27.1] as splits), | |
STRUCT("Rotich" as name, [24.7, 25.6, 26.9, 26.4] as splits), | |
STRUCT("Lewandowski" as name, [25.0, 25.7, 26.3, 27.2] as splits), | |
STRUCT("Kipketer" as name, [23.2, 26.1, 27.3, 29.4] as splits), | |
STRUCT("Berian" as name, [23.7, 26.1, 27.0, 29.3] as splits)] | |
AS participants) | |
SELECT | |
race, | |
name | |
FROM races AS r | |
CROSS JOIN UNNEST(r.participants) as participant |
And voilà, we can now refer to the participants’ name field individually to extract it:
Now, what about splits? That is an array within a struct… well, we can just keep on unnesting…
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
WITH races AS ( | |
SELECT "800M" AS race, | |
[STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits), | |
STRUCT("Makhloufi" as name, [24.5, 25.4, 26.6, 26.1] as splits), | |
STRUCT("Murphy" as name, [23.9, 26.0, 27.0, 26.0] as splits), | |
STRUCT("Bosse" as name, [23.6, 26.2, 26.5, 27.1] as splits), | |
STRUCT("Rotich" as name, [24.7, 25.6, 26.9, 26.4] as splits), | |
STRUCT("Lewandowski" as name, [25.0, 25.7, 26.3, 27.2] as splits), | |
STRUCT("Kipketer" as name, [23.2, 26.1, 27.3, 29.4] as splits), | |
STRUCT("Berian" as name, [23.7, 26.1, 27.0, 29.3] as splits)] | |
AS participants) | |
SELECT | |
r.race, | |
participant.name, | |
splits | |
FROM races AS r | |
CROSS JOIN UNNEST(r.participants) as participant | |
CROSS JOIN UNNEST(participant.splits) AS splits |
And now we have a fully-flattened structure:
One more point before we move on: you’ll notice that we often alias as we unnest – you don’t always have to do this BUT, as with other SQL scenarios you’ll find that it helps avoid potential problems with duplicate names down the line.
A typical dataset – nested and repeated…
So, having seen how we construct and extract array and struct data, let’s have a look at a typical dataset that has been set up using these features: `bigquery-public-data.the_met.vision_api_data`. Remember this is a public dataset so you’ll be able to query it too…
First, let’s look at the start of the schema: this is a fairly typical scenario:
object_id is the unique identifier for each artwork in The Met collection. faceAnnotations is a struct containing multiple different data points to do with faces in the artworks.
For each object, there may be multiple faceAnnotations entries (this makes sense right? think of a family portrait – there is one object (the portrait) but there could be multiple faces in the portrait: each requiring unique annotations about their properties.
In fact, we could change the schema to look like this and then the various 1:1 or 1:many relationships become more obvious:
- RECORD REPEATED indicates a 1:many relationship
- RECORD NULLABLE indicates a 1:1 relationship
Let’s have a look at one of the images in detail:

bigquery-public-data.the_met.images
object_id = 647580If we want to pull out ALL the data for faceAnnotations then no fancy footwork is required. By running this:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
SELECT | |
object_id | |
, faceAnnotations | |
FROM | |
`bigquery-public-data.the_met.vision_api_data` | |
WHERE | |
object_id = 647580 |
… we get this:
Notice how the data is returned:
- There is only one face in our image, so we only have one set of record data per object_id to deal with in this particular scenario
- Our lady is wearing a hat, and the headwearLikelihood is VERY_LIKELY – fabulous!
- Our bounding box has 4 points: one for each corner, and we see these vertices listed as expected.
Now, what if all we want from this data is whether or not there is headwear in the picture? This is where our CROSS JOIN UNNEST() comes in…
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
SELECT | |
object_id | |
, headwearLikelihood | |
FROM | |
`bigquery-public-data.the_met.vision_api_data` AS main | |
CROSS JOIN UNNEST(faceAnnotations) AS faceA | |
WHERE | |
object_id BETWEEN 647500 and 647600 |
Having unnested faceAnnotations we can now just extract headwearLikelihood:
What about one level down? Supposing we want to extract only the data for those x and y vertices? We unnest faceAnnotations and boundingPoly.vertices and we’re good to go:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
SELECT | |
object_id | |
, y | |
, x | |
FROM | |
`bigquery-public-data.the_met.vision_api_data` AS main | |
CROSS JOIN UNNEST(faceAnnotations) fA | |
CROSS JOIN UNNEST(fA.boundingPoly.vertices) vertices | |
WHERE | |
object_id BETWEEN 647500 and 647600 |
Here are our results:
Additional cool functions
ARRAY_AGG() – get it back into array format
Having un-nested and extracted just the desired data, you may want to get it back into an array format. This is where ARRAY_AGG() comes in handy – consider the following example: we’re getting out description from labelAnnotations, and then putting those resulting descriptions back into array format so that overall we still end up with the 41 records which represent the 41 Claude Monet works currently in The Met collection:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
SELECT | |
main.object_id | |
, title | |
, artist_display_name | |
, ARRAY_AGG(lA.description) AS description | |
FROM `bigquery-public-data.the_met.objects` AS main | |
INNER JOIN | |
`bigquery-public-data.the_met.vision_api_data` AS api | |
ON main.object_id = api.object_id | |
CROSS JOIN UNNEST(labelAnnotations) lA #notice how UNNEST is placed after the INNER JOIN | |
WHERE | |
LOWER(artist_display_name) = 'claude monet' | |
GROUP BY | |
1, 2, 3 |
Here are the resulting descriptions of the first couple of titles:
Remember that without ARRAY_AGG() we would have ended up with a fully flattened format, like this:
ARRAY_TO_STRING() – get it into delimited format
For further processing, especially in Python, it might actually be more useful to have those descriptions in a single, delimited field. What we do is take the resulting array formed by ARRAY_AGG() and use ARRAY_TO_STRING() to convert it to a delimited string – you can specify a delimiter of your choice:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
SELECT | |
main.object_id | |
, title | |
, artist_display_name | |
, ARRAY_TO_STRING(ARRAY_AGG(lA.description), ';') AS description | |
FROM `bigquery-public-data.the_met.objects` AS main | |
INNER JOIN | |
`bigquery-public-data.the_met.vision_api_data` AS api | |
ON main.object_id = api.object_id | |
CROSS JOIN UNNEST(labelAnnotations) lA | |
WHERE | |
LOWER(artist_display_name) = 'claude monet' | |
GROUP BY | |
1, 2, 3 |
Now we have an easily exportable format with one line per record:
My understanding may still need some refinement so if you have corrections or suggestions for me you’re most welcome to get in touch. Thanks for reading!