GCP BigQuery Scripting

BigQuery stored procedures are incredibly useful. Google defines them as “a collection of statements that can be called from other queries or other stored procedures. A procedure can take input arguments and return values as output.”

Lovely! So we can declare some variables and then feed them as input arguments to our stored procedure when it runs, for example:

DECLARE from_date DATE DEFAULT CURRENT_DATE() - 7;
DECLARE to_date DATE DEFAULT CURRENT_DATE();
DECLARE company STRING DEFAULT 'Company A';
CALL my_project.my_dataset.my_stored_procedure(from_date, to_date, company);

More complex scenarios

BUT what do we do when there are, for example, multiple companies? And perhaps each company’s data is stored in a different table? And perhaps each company has some unique WHERE clauses? And the outputs need to be written to different tables? Essentially the code to process the data can be run through one stored procedure, but we need the flexibility to deal with complex scenarios while avoiding lots of boilerplate code or hard-coding.

Scripting allows us to create a single stored procedure which can be used to update data for multiple companies / tables, etc. Rather than, e.g. creating 4 stored procedures or scheduled queries to update 4 different tables, we create a single stored procedure and use variables to insert the correct source and destination tables and any other specifics like where clauses at runtime.

This process cuts down on boilerplate code, and makes it very simple to add additional companies or other types of variable without having to create new scripts.

The following is just one way to realize this dream. It does take a little extra up-front thought and effort, but once in place it’s incredibly simple to set up, define, troubleshoot and change – and it even reduces the chances of errors occurring when updates are required.

Scripting overview

There are only 2 main components needed to set up this scripting methodology:

  • The scripting config table – this will hold all variables required to run each variant of your stored procedure
  • The stored procedure itself – which reads in and uses the variables stored in the scripting config table

Scripting config table

This table can be stored anywhere, but it would be advisable to keep it in a dataset of its own to prevent accidental deletion! You also need to make sure that you can retrieve past values, just in case of accidental changes (it’s so easy to run TRUNCATE TABLE in BigQuery, and no questions are asked 🤣).
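One safety net worth knowing about here is BigQuery time travel, which by default lets you query a table as it was at any point within the last seven days. A minimal sketch, assuming the config table name used later in this article:

```sql
-- Recover the config table as it was one hour ago
-- (time travel; the default retention window is 7 days)
SELECT *
FROM `my_project.bq_scripting.metadata_config`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
```

You can combine this with CREATE TABLE AS or an INSERT to restore accidentally changed rows.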

Key fields in the table include:

  • record_uuid – unique identifier for each record in the table
  • process_name – descriptive name for the process driven by the record, e.g. ‘update_customer_scores’. This will be used to select which process should be referenced when the stored procedure runs.
  • company – or other primary identifier, depending on your use case. This will be used to make sure the right source tables, target tables and where clauses are inserted at run time for each company.
  • source_tables – a JSON field where you can specify multiple source tables required by the stored procedure
  • source_where_clauses – a JSON field where you can specify multiple WHERE clauses or other types of variables required by the stored procedure
  • target_tables – a JSON field where you can specify multiple target tables required by the stored procedure

Below is an example of how this table might be populated:

INSERT `my_project.bq_scripting.metadata_config` (record_uuid, process_name, company, source_tables, source_where_clauses, target_tables)
VALUES
(
GENERATE_UUID()
, 'update_customer_scores'
, 'Company A'
, JSON '{"Scores":"my_project.customers.daily_scores_company_a","ScoreAttributions":"my_project.customers.attributions_company_a"}'
, JSON '{"AppName":"CompanyA","MainSiteIds":"(8, 9, 10, 12, 46, 47, 48, 53, 55, 56, 57, 73)"}'
, JSON '{"Target":"my_project.dashboards.customer_profiling"}'
),
(
GENERATE_UUID()
, 'update_customer_scores'
, 'Company B'
, JSON '{"Scores":"my_project.customers.daily_scores_company_b","ScoreAttributions":"my_project.customers.attributions_company_b"}'
, JSON '{"AppName":"CompanyB","MainSiteIds":"(58)"}'
, JSON '{"Target":"my_project.dashboards.customer_profiling"}'
);

In the above scenario the first JSON entry refers to source_tables: Scores is how we will refer to the table containing the relevant company scores, and ScoreAttributions is how we will refer to the table containing the relevant attributions. Notice how the values for each of these aliases differ between Company A and Company B. The second JSON entry holds the custom values required for the various source_where_clauses for each company. Finally, notice that the Target entries in the third JSON entry, which refers to target_tables, are the same for both companies: this is because both sets of results will be inserted into a single table.
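As a quick sanity check on the config table (a sketch, assuming the table and JSON keys shown above), JSON_EXTRACT_SCALAR can pull the individual per-company values back out of the JSON columns:

```sql
-- Illustrative only: inspect the per-company values stored in the JSON config columns
SELECT
  company
, JSON_EXTRACT_SCALAR(source_tables, '$.Scores') AS Scores
, JSON_EXTRACT_SCALAR(source_where_clauses, '$.MainSiteIds') AS MainSiteIds
, JSON_EXTRACT_SCALAR(target_tables, '$.Target') AS Target
FROM `my_project.bq_scripting.metadata_config`
WHERE process_name = 'update_customer_scores';
```

This is exactly the extraction pattern the stored procedure below uses at runtime.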

Stored procedure usage

The essential outline of what we will do in this step is to create a stored procedure which will:

  1. Read the variables per company that we set up in the scripting config table
  2. Compile a query per company with the appropriate variables
  3. Insert each query into an array of queries
  4. Loop through each entry in the scripting config table to assemble each query and execute it

Below is a skeleton outline for creating a script, with each part’s contribution explained inline for clarity. This can be used as a template to create your own stored procedures. Highlights to note include:

  • You may use a combination of stored procedure input parameters and config table input parameters as required
  • DECLARE a variable of type STRING for each query to be executed
  • DECLARE a variable of type ARRAY<STRING> which will handle the execution (one query to be executed per line in the config table)
  • Note that the EXECUTE IMMEDIATE statements are what actually run the code; these can be commented out for testing purposes if required
  • Each EXECUTE IMMEDIATE statement can only run one DDL or DML statement, so you’ll need to declare a variable for each delete, insert, merge, etc. that you intend to run – they cannot be run altogether in one statement
  • The text of the query to be run should be enclosed in triple quotes ("""...""")
  • Variable names to be referenced in the query should be enclosed in triple quotes and double pipes ("""||...||""")
  • Add additional punctuation around the variable names as required, e.g. in this example backticks are added for the table name: DELETE FROM `"""||Target||"""`
  • Use JSON_EXTRACT_SCALAR to extract individual values from the relevant JSON column in the config table, using the format '$.VariableName' to specify which value you want to extract
  • Filter by process_name (and company, if required) in your config table query so you only get the values required for this particular stored procedure
  • The actual execution is run in a for loop, which can be sequenced if that is important, e.g. if Company A should run before Company B
  • Gotcha 👾👾👾… Any regex inside the script will need to be written with extra escape characters, so e.g. REGEXP_EXTRACT(LOWER(Url),r'([a-z\-0-9]*\-[0-9]{8}(?:\-[0-9]*)?)(?:[?].*?)?$') will be expressed as REGEXP_EXTRACT(LOWER(Url),r'([a-z\\-0-9]*\\-[0-9]{8}(?:\\-[0-9]*)?)(?:[?].*?)?$') in the script. When the final code is assembled, this resolves to its ‘normal’ format.
/* Create or update the stored procedure, which can have input parameters in addition
to the parameters which will be used from metadata_config. If desired, specify a name
and type for each input parameter to be used */
CREATE OR REPLACE PROCEDURE
`my_project.stored_procedures.update_customer_scores`(key_date DATE)
BEGIN
/* Each EXECUTE IMMEDIATE statement (see below) can only run ONE DDL or DML
statement, so we need to declare separate variables for each one, hence the
delete_* and insert_* variants shown in the following example */
DECLARE delete_sql ARRAY<STRING>; -- array of all delete scripts to be run
DECLARE delete_text STRING; -- individual delete query to be run
DECLARE insert_sql ARRAY<STRING>; -- array of all insert scripts to be run
DECLARE insert_text STRING; -- individual insert query to be run
/* Now we create an array of the scripts to be run (delete_sql), composed of the
individual scripts to be run with their company-specific variables (delete_text) */
SET delete_sql = (SELECT ARRAY_AGG(delete_text) FROM
(
/* Main script with variables */
SELECT
-- Enclose the query text in triple quotes. Enclose variable names in triple quotes
-- and double pipes ("""||VariableName||""") and add any additional punctuation
-- around them as required, e.g. backticks or ''
"""
DELETE FROM `"""||Target||"""` WHERE DATE(score_date) = '"""||KeyDate||"""';
"""
AS delete_text
FROM
(
/* Compile variables for use above */
SELECT
-- Some version of this can be used to determine the order in which records
-- are processed, if that is important
RANK() OVER(ORDER BY record_uuid) AS record_number
-- Note that JSON_EXTRACT_SCALAR lets us extract individual values from a JSON
-- column: specify the JSON column and then use the format '$.VariableName'
-- to specify which value you want
, JSON_EXTRACT_SCALAR(target_tables, '$.Target') AS Target
-- Here we are using our DATE input parameter in addition to the parameters from
-- the config table; CAST it to STRING so it can be concatenated with ||
, CAST(key_date AS STRING) AS KeyDate
FROM `my_project.bq_scripting.metadata_config`
WHERE process_name = 'update_customer_scores'
)
)
);
/* Now we create an array of the scripts to be run (insert_sql), composed of the
individual scripts to be run with their company-specific variables (insert_text) */
SET insert_sql = (SELECT ARRAY_AGG(insert_text) FROM
(
/* Main script with variables */
SELECT
"""
INSERT `"""||Target||"""`
WITH subquery AS
(
SELECT
a.somefield1
, b.somefield2
FROM `"""||Scores||"""` a
INNER JOIN `"""||ScoreAttributions||"""` b
ON a.somefield1 = b.somefield1
WHERE somefield1 IN """||MainSiteIds||"""
AND score_date = '"""||KeyDate||"""'
)
SELECT * FROM subquery
"""
AS insert_text
FROM
(
/* Compile variables for use above */
SELECT
RANK() OVER(ORDER BY record_uuid) AS record_number
, JSON_EXTRACT_SCALAR(source_where_clauses, '$.MainSiteIds') AS MainSiteIds
, JSON_EXTRACT_SCALAR(source_tables, '$.Scores') AS Scores
, JSON_EXTRACT_SCALAR(source_tables, '$.ScoreAttributions') AS ScoreAttributions
, JSON_EXTRACT_SCALAR(target_tables, '$.Target') AS Target
, CAST(key_date AS STRING) AS KeyDate
FROM `my_project.bq_scripting.metadata_config`
WHERE process_name = 'update_customer_scores'
)
)
);
/* We use a for loop to do the actual execution, first to perform the deletes */
FOR record IN
(
SELECT RANK() OVER(ORDER BY record_uuid) AS record_number
, company
, source_tables
, source_where_clauses
, target_tables
FROM `my_project.bq_scripting.metadata_config`
WHERE process_name = 'update_customer_scores'
)
DO
-- Here we pick out the script built from the variables for each record in turn
SET delete_text = delete_sql[ORDINAL(record.record_number)];
-- And finally we execute the script. Comment out EXECUTE IMMEDIATE
-- to view / test the outputs without actually performing updates
EXECUTE IMMEDIATE delete_text;
END FOR;
/* We use a second for loop to perform the inserts */
FOR record IN
(
SELECT RANK() OVER(ORDER BY record_uuid) AS record_number
, company
, source_tables
, source_where_clauses
, target_tables
FROM `my_project.bq_scripting.metadata_config`
WHERE process_name = 'update_customer_scores'
)
DO
-- Here we pick out the script built from the variables for each record in turn
SET insert_text = insert_sql[ORDINAL(record.record_number)];
-- And finally we execute the script. Comment out EXECUTE IMMEDIATE
-- to view / test the outputs without actually performing updates
EXECUTE IMMEDIATE insert_text;
END FOR;
END;

Testing & troubleshooting

To do a test run which generates each script with its associated variables without executing anything, comment out all of your EXECUTE IMMEDIATE statements and then call your stored procedure, inspecting the results of each statement in turn.

The first two items (SELECT STRUCT, etc.) contain the code to be executed for each record: click on VIEW RESULTS and you will see the query text generated for each script.

To validate whether your script would have run successfully, copy the query text into a new BigQuery window and troubleshoot any errors as usual, updating your script as you go.

The second two items (SELECT RANK(), etc.) show the variables that were compiled for each record. This is also useful as a quick check that errors have not been introduced into the config table.
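For example, inside each for loop you might temporarily swap execution for inspection (a sketch of the idea, not required verbatim):

```sql
-- Test mode: inspect the assembled query text instead of running it
-- EXECUTE IMMEDIATE delete_text;
SELECT delete_text AS query_to_run;
```

Each iteration then surfaces the fully-assembled query as a result set you can copy out and validate.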

Productionizing & updating

Once you are happy that your scripts will run, remember to put back your EXECUTE IMMEDIATE statements! Typically you would now use a scheduled query to run your stored procedure daily. Don’t forget that multiple stored procedures can be included in a single scheduled query, for example:

CALL `my_project.stored_procedures.update_customer_scores`(CURRENT_DATE());
CALL `my_project.stored_procedures.update_vendor_scores`(CURRENT_DATE());
CALL `my_project.stored_procedures.update_department_scores`(CURRENT_DATE());

The first CALL is the imaginary scenario that we worked through, where Company A and Company B were required. If we receive a request to add Company C, we simply update the config table and our work is done! For example:

INSERT `my_project.bq_scripting.metadata_config` (record_uuid, process_name, company, source_tables, source_where_clauses, target_tables)
VALUES
(
GENERATE_UUID()
, 'update_customer_scores'
, 'Company C'
, JSON '{"Scores":"my_project.customers.daily_scores_company_c","ScoreAttributions":"my_project.customers.attributions_company_c"}'
, JSON '{"AppName":"CompanyC","MainSiteIds":"(94, 97)"}'
, JSON '{"Target":"my_project.dashboards.customer_profiling"}'
);

Happy scripting 😃