Beginner’s Guide To GCP BigQuery (Part 1)

Josh Kim
Published in Level Up Coding
8 min read · Jun 20, 2023


In my 7 years as a data scientist, I’ve been exposed to a number of different databases, including but not limited to Oracle Database, MS SQL, MySQL, EDW, and Apache Hadoop. Since the advent of cloud data warehouses, I was lucky enough to get a good amount of exposure to Google Cloud Platform early in that era, which became my competitive edge in this wild job market.

Many of you who are already in the data science field will be familiar with BigQuery and its advantages. This article is dedicated to aspiring or entry-level data scientists and engineers who wish to break into the industry, or who want to explore data science as a future career path.

Before we dive into some of the key features of BigQuery, let’s look at the key advantages of GCP BQ over traditional database systems:

  1. Automatic scaling: With BigQuery, you don’t have to worry about setting up servers or maintaining infrastructure. It will automatically scale queries to handle any size data set, so you can focus on analyzing your data.
  2. Serverless: BigQuery is a serverless solution, which means you don’t have to worry about capacity planning, software updates, or patching.
  3. Cost-effective: BigQuery provides a cost-effective solution, as you only pay for the queries and storage you use. This makes it affordable for businesses of all sizes.
  4. Speed: BigQuery provides lightning-fast query performance, which means you can analyze your data quickly and make better decisions based on the insights you gain.
  5. Integration: BigQuery seamlessly integrates with other Google Cloud Platform tools and services, so you can easily use it together with other solutions like Google Data Studio, Cloud Storage, Vertex AI, and more.

Overall, my experience with BigQuery has been nothing but pleasant as it provides a cost-effective serverless solution that is incomparably fast and seamlessly integrates with other services inside the Google platform.

Now let’s get into the main topic of the article. I want to quickly introduce some of the features available in BigQuery and how each of them is distinguished from the others.

For Part 1 of this article, I wanted to cover Tables, Views, Stored Procedures, and Materialized Views.

And as a spoiler, in part 2, I will cover more advanced features such as partitioned tables, scheduled queries, scalar functions, and external tables.

Image source: https://usercentrics.com/knowledge-hub/deal-with-google-bigquery/

Tables

A table in GCP BigQuery is a collection of rows and columns that can store and manage massive amounts of data. It’s a managed, cloud-based service designed to handle big data processing with ease. Tables inherit the key characteristics of the BigQuery platform, which gives them an upper hand over traditional databases.

There are a number of different ways to create a table in BQ, and I want to introduce a few of them.

The most commonly used way (for me, at least) is a DDL statement like the one below:

CREATE OR REPLACE TABLE example_dataset.table_name AS
SELECT * FROM example_dataset.source_table_name;

Another way to create a table in a similar manner, but using more of a GUI than pure code, is from the query results. All you need to do is run a SELECT statement and then choose SAVE RESULTS → BigQuery Table.

A third way is to use the integration with a GCS bucket. It’s as simple as uploading a file (a .txt or .csv file, preferably) to Cloud Storage, then using the Create table feature under the dataset and setting a few simple parameters.
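
Under the hood, that console flow simply runs a load job; the LOAD DATA statement is the SQL equivalent. Here is a minimal sketch, assuming a hypothetical bucket, file, and destination table:

-- Load a CSV file from Cloud Storage into a BigQuery table
-- (bucket, file, and table names are hypothetical)
LOAD DATA INTO example_dataset.table_from_gcs
FROM FILES (
  format = 'CSV',
  uris = ['gs://example-bucket/example_file.csv'],
  skip_leading_rows = 1
);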

There are of course other ways than those listed above; however, these are the three I use most often when creating tables in BigQuery. Some of the other ways are creating a table 1) using the command line in the Google Cloud console, 2) using the APIs, or 3) from Vertex AI Workbench. I’ll leave these methods to you to research and familiarize yourself with.

Views

Views in GCP BigQuery are virtual tables defined by a SQL query that can display the results of a query or be used as the base for other queries. They are not physical tables and do not store any data; instead, they reference data stored in other tables. What’s great about views is that once you create one, you can query it just as you would query a table.

Some of the additional characteristics of views are:

  1. Views are not materialized, meaning they don’t store data but instead represent a query on top of a table.
  2. They are read-only and cannot be modified by DML statements.
  3. They can be used to control access to the underlying data.
  4. They can be used to simplify complex queries by breaking them into smaller, more manageable pieces.
  5. They can be used to abstract the underlying data from end users or clients.

Unlike tables, views are often created to serve more specific purposes, such as limiting users’ data access or breaking up complex queries. A well-designed database uses views in the right place at the right time.

The single most common way to create a view in a dataset is with the CREATE VIEW DDL statement; you can refer to the official documentation to explore more options.

CREATE OR REPLACE VIEW example_dataset.example_view_name AS
SELECT name, email, phone_number
FROM example_dataset.example_table_name;
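
Once the view exists, end users can query it just like a table, without touching the underlying table directly. For example, using the view created above:

SELECT name, email
FROM example_dataset.example_view_name
WHERE phone_number IS NOT NULL;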

Stored Procedures

Stored procedures are a way to store and execute SQL statements in GCP BigQuery. They allow you to write a reusable piece of code that can be called from anywhere in your queries. You can use stored procedures to handle complex ETL processes, make API calls, and perform data validation.

Personally, stored procedures are my favourite BigQuery feature, and you can handle a variety of use cases with them. If you’re a Pythonista or savvy in any programming language, you can think of them as user-defined functions where you provide custom parameters to return a desired output.

In a similar manner, stored procedures have the following characteristics:

  1. They use SQL as their programming language.
  2. Stored procedures are stored in a dataset in a specific project.
  3. A stored procedure is executed with a CALL statement, which can be run on its own or from inside another procedure or script.
  4. It allows you to parameterize queries, which can be used to make the code more modular and reduce redundancy.
  5. You can use logic such as loops, conditionals, and exceptions in your procedures.
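
To illustrate the last point, here is a minimal sketch of those procedural constructs (the table name and threshold are hypothetical). The same statements can be run as a standalone script in the console and behave identically inside a procedure body:

-- A variable, a conditional, and a raised error; names are illustrative
DECLARE row_threshold INT64 DEFAULT 100;
DECLARE row_count INT64;

SET row_count = (SELECT COUNT(*) FROM example_dataset.example_table_name);

IF row_count < row_threshold THEN
  RAISE USING MESSAGE = FORMAT('Expected at least %d rows, found %d', row_threshold, row_count);
END IF;

Loops (WHILE … END WHILE and LOOP … END LOOP) follow the same pattern.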

The use cases are countless, as you can imagine. I’ll list a few here to give you a high-level idea and potentially help you brainstorm.

Data Validation
With stored procedures, you can validate data fields, data types, and constraints on data input to maintain data quality.

Complex Data Transformations
SPs allow more complex and repetitive data transformations to be executed on large datasets.

Automated Data Maintenance
By using stored procedures, repetitive tasks such as automated data archiving or data deletions in large databases can be easily scheduled.

Performance Improvements
Stored procedures can improve the performance of queries and database operations by reducing network traffic and latency, and optimizing data retrieval patterns.

Enhanced Security
Stored procedures can be used to restrict access to sensitive data, by allowing only authenticated users to execute certain queries and code.

Creating a stored procedure is as easy as creating a table or a view. SPs are created with a CREATE PROCEDURE statement, and the execution logic is enclosed in a BEGIN...END block.

The following example is a multi-statement procedure that sets a variable, truncates a destination table, inserts into it, then displays the result as a table. The procedure accepts an integer as a parameter, which is then used to set the variable value (yes, this step is redundant and is written for demonstration purposes only).

CREATE OR REPLACE PROCEDURE example_dataset.sp_example_procedure(val3 INT64)
BEGIN
  DECLARE variable_name INT64;
  SET variable_name = val3;

  TRUNCATE TABLE example_dataset.example_table_name;

  INSERT INTO example_dataset.example_table_name (column1, column2, column3)
  VALUES ('value1', 'value2', variable_name);

  SELECT * FROM example_dataset.example_table_name;
END;
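
Once created, the procedure is executed with a CALL statement; the argument below is arbitrary and simply flows into the val3 parameter:

CALL example_dataset.sp_example_procedure(100);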

Materialized Views

A materialized view in GCP BigQuery is like a combination of a table and a view. It stores the result of a SQL query in a table-like structure, similar to how a table stores data. However, unlike a regular table, a materialized view automatically keeps its stored results up to date as the underlying tables change, without requiring you to re-run the query.

Think of a materialized view as a snapshot of a query result that is saved like a table and kept up to date as the underlying data changes. This can be really helpful for big datasets or queries that take a long time to run, as the results can be pre-computed in the materialized view for fast access.

Here are some characteristics of materialized views in GCP BigQuery:

  1. They can be used to improve query performance for complex and time-consuming queries.
  2. They reduce the number of times a query needs to be executed on the underlying data, thus saving computation cost.
  3. They can be used for caching queries on large datasets that are frequently queried by multiple users.
  4. They are updated periodically or on-demand and you can control the frequency and mode of the updates.
  5. They do not support all SQL operations, such as ORDER BY or LIMIT, so it is important to know their limitations before using them.

In essence, materialized views require little to no maintenance and return fresh results every time they are queried, regardless of how recently the base tables were updated.

CREATE MATERIALIZED VIEW example_dataset.sample_view AS
SELECT column1, column2, SUM(column3) AS total_column3
FROM example_dataset.sample_table
GROUP BY column1, column2;
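
If you want to control the refresh behaviour mentioned in point 4 above, you can set refresh options when creating the materialized view or trigger a refresh on demand. A minimal sketch, with a hypothetical view name and an illustrative interval:

-- Control automatic refresh at creation time (interval is illustrative)
CREATE MATERIALIZED VIEW example_dataset.sample_view_hourly
OPTIONS (
  enable_refresh = true,
  refresh_interval_minutes = 60
) AS
SELECT column1, column2, SUM(column3) AS total_column3
FROM example_dataset.sample_table
GROUP BY column1, column2;

-- Trigger an on-demand refresh of an existing materialized view
CALL BQ.REFRESH_MATERIALIZED_VIEW('example_dataset.sample_view_hourly');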

In this article, we’ve covered some of the key advantages of BigQuery and its most commonly used features. Many of these concepts may already be familiar to you; even so, I hope it served as a good recap of the data objects available in BigQuery and when to use each one.

In the next article, I will cover the rest of the key features you can utilize in BigQuery, which will hopefully open more doors for you to build a database that is not only efficient but also cost-effective and serves the exact purpose you aim to achieve.

— Stay tuned for Part 2 —

References: https://cloud.google.com/bigquery/docs/resource-hierarchy

Hi, I am Josh, a Senior Data Scientist based in Toronto, Canada. I share my Data Science & Machine Learning journey on Medium. Please follow to support.