June 14, 2023

How to Set Up a CI/CD Pipeline for Snowflake to Automate Data Pipelines

By Hiresh Roy

In recent years, data engineering teams working with the Snowflake Data Cloud platform have embraced the continuous integration/continuous delivery (CI/CD) software development process to develop data products and manage ETL/ELT workloads more efficiently.

The CI/CD pipeline plays a crucial role by automating the deployment process of various Snowflake objects such as tables, views, streams, tasks, stored procedures, etc. Automating this process significantly reduces administrative burdens and cycle times.

Ultimately, the goal of a CI/CD pipeline is to ensure the safe deployment of new changes to both Snowflake’s non-production and production environments.

In this blog, we will explore the benefits of enabling the CI/CD pipeline for database platforms. We will specifically focus on how to enable it for the Snowflake cloud platform, taking into consideration the account and schema-level object hierarchy. 

We will also discuss the difference between imperative and declarative database change management approaches. Lastly, we will provide a detailed, step-by-step guide on how to enable the CI/CD pipeline using Flyway, GitHub, and GitHub Actions.

What Are the Benefits of a CI/CD Pipeline for Snowflake?

CI/CD pipelines offer data engineering teams many benefits beyond increased productivity. Here are a few common advantages:

  • Version control with change logs
  • Quality, consistency, and security of code
  • Flexibility in the schedule for builds and deployments
  • Developer agreement and collaboration via approval and workflows
  • Dashboards and reports that provide insight into build or delivery errors
  • Stability in the environment with automated rollback features
  • A reliable, standardized build process

How CI/CD Pipelines Work With Snowflake

While working with the Snowflake cloud data warehouse platform to build data products, the data engineering team creates various objects at the account, database, and schema levels. These objects include integration objects, tables, stages, pipes, tasks, streams, stored procedures, and more.

Figure-01 below illustrates the Snowflake object hierarchy and phData’s provisioning tool, which can help enterprises automate the creation of Snowflake workspaces such as collaboration (or sandbox), integration (or QA/UAT), and production environments.

These environments house the database and schema objects required for both governed and non-governed instances. All schema-level objects that play a crucial role in building end-to-end data pipelines, such as tables, file formats, stages, streams, tasks, stored procedures, and views, should be included in your CI/CD pipelines.

Figure-01 - Snowflake Object Hierarchy (Account Level & Schema level Objects)

Different team members, including database administrators, need to know which changes are coming up, what their testing status is, and which schema changes have reached the various production and non-production environments. To give everyone this visibility into the progress of changes, the engineering team can:

  • Keep all database schema changes in a version control system such as GitHub.
  • Use a tool that tracks and records the execution of changes in different environments and captures the results.

These practices also establish a unified and reliable source of information for all changes, ensuring that the history of changes is readily accessible for auditing purposes.

A common approach for versioning database changes is to record each change as a migration script and maintain it under version control. Each migration SQL script is assigned a unique sequence number to facilitate the correct order of application. 

The subsequent step involves ensuring that every database instance has a table that keeps track of the applied migrations. By managing the database schema in this manner, you can utilize a tool to execute migration scripts and transition the database to the intended schema version.
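
To make this concrete, here is a rough sketch of what such a migrations-tracking table could look like in Snowflake. It is illustrative only: a tool such as Flyway creates and manages its own schema history table automatically, and the ADMIN schema and column names below are assumptions rather than Flyway’s actual definitions.

-- Illustrative only: a hand-rolled migrations table, similar in spirit to the
-- schema history table that a migration tool such as Flyway manages for you.
CREATE TABLE IF NOT EXISTS ADMIN.SCHEMA_MIGRATIONS (
   VERSION      VARCHAR,       -- e.g. '1', '2', ... matching the V1__, V2__ script prefixes
   SCRIPT_NAME  VARCHAR,       -- e.g. 'V1__INVENTORY_TABLE_DDL.SQL'
   CHECKSUM     VARCHAR,       -- detects edits to scripts that were already applied
   APPLIED_BY   VARCHAR,       -- user or service account that ran the script
   APPLIED_AT   TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
   SUCCESS      BOOLEAN
);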

Snowflake Database Change Management & Flyway

Flyway is a simple, platform-independent tool for managing your Snowflake database objects. As described in Figure-02, Flyway follows an imperative-style approach to Database Change Management (DCM).

When combined with a version control system such as GitHub and an automation service such as GitHub Actions, database changes can be approved and deployed through a pipeline using modern software delivery practices.

Figure-02 - Imperative Vs. Declarative Database Change Management Approaches

For insights into database change management tool selection for Snowflake, check out this article.

End-To-End Data Pipeline Use Case & Flyway Configuration

Let’s consider a scenario where you have the requirement to ingest and process inventory data on an hourly basis. In order to achieve this, you need to establish a robust and efficient continuous data processing and data pipeline solution.

The steps involved in this process can be outlined as follows in figure-03:

Figure-03 - Snowflake Continuous Data Ingestion & Processing ETL/ELT Pipeline

Step-A

The inventory data file arrives at the designated AWS S3 Stage location on an hourly basis.

Step-B

The objective of this step is to copy the inventory data file from the AWS S3 location to the staging inventory table using the Snowflake pipe object. To establish continuous data ingestion from the stage location to the Snowflake stage table, you must create the following four objects in the specified order:

    • Stage Inventory Table (V1__INVENTORY_TABLE_DDL.SQL)
    • File Format (V2__FILE_FORMAT_DDL.SQL)
    • Stage Object (V3__STAGE_DDL.SQL)
    • Pipe Object  (V4__PIPE_DDL.SQL)

Step-C

In this step, we’ll want to capture the changes using the stream object and execute a task to insert/merge curated inventory data into the curated inventory table. To facilitate the smooth transfer of data from the stage inventory table to the curated inventory table, you need to create three objects in the specified order:

    • Stream Object (V5__STREAM_DDL.SQL)
    • Curated Inventory Table (V6__CURATED_TABLE_DDL.SQL)
    • Task Object  (V7__TASK_DDL.SQL)

Here are the SQL files containing the object creation statements, logically organized with versioned file names that indicate the sequence in which these objects are created.

-- V1__INVENTORY_TABLE_DDL.SQL
-- Staging table that receives the raw inventory rows loaded from S3.
CREATE OR REPLACE TABLE STAGE.INVENTORY(
   ID INT,
   COLUMN1 NUMBER,
   COLUMN2 DATE,
   COLUMN3 VARCHAR
);

-- V2__FILE_FORMAT_DDL.SQL
-- CSV file format used when copying the inventory files.
CREATE OR REPLACE FILE FORMAT STAGE.CSV_FILE_FORMAT
   TYPE = 'CSV'
   FIELD_DELIMITER = ','
   SKIP_HEADER = 1;

-- V3__STAGE_DDL.SQL
-- External stage pointing at the S3 location where the hourly files arrive.
CREATE OR REPLACE STAGE STAGE.MY_S3_STAGE
   STORAGE_INTEGRATION = S3_INT
   URL = 's3://mybucket/encrypted_files/'
   FILE_FORMAT = STAGE.CSV_FILE_FORMAT;

-- V4__PIPE_DDL.SQL
-- Pipe that continuously loads new files from the stage into STAGE.INVENTORY.
CREATE OR REPLACE PIPE STAGE.MY_PIPE
   AUTO_INGEST = TRUE
   AS
   COPY INTO STAGE.INVENTORY FROM @STAGE.MY_S3_STAGE;

-- V5__STREAM_DDL.SQL
-- Stream that captures newly inserted rows on the staging table.
CREATE OR REPLACE STREAM STAGE.INVENTORY_STM
   ON TABLE STAGE.INVENTORY
   APPEND_ONLY = TRUE;

-- V6__CURATED_TABLE_DDL.SQL
-- Curated table that holds the merged inventory data.
CREATE OR REPLACE TABLE CURATED.INVENTORY(
   ID INT,
   COLUMN1 NUMBER,
   COLUMN2 DATE,
   COLUMN3 VARCHAR
);

-- V7__TASK_DDL.SQL
-- Hourly task that merges the stream changes into the curated table.
CREATE OR REPLACE TASK STAGE.MERGE_INVENTORY
   WAREHOUSE = ETL_WH
   SCHEDULE = '60 MINUTE'
   WHEN SYSTEM$STREAM_HAS_DATA('STAGE.INVENTORY_STM')
   AS
   MERGE INTO CURATED.INVENTORY TARGET
   USING STAGE.INVENTORY_STM SOURCE
     ON TARGET.ID = SOURCE.ID
   WHEN MATCHED THEN UPDATE SET
       TARGET.COLUMN1 = SOURCE.COLUMN1,
       TARGET.COLUMN2 = SOURCE.COLUMN2,
       TARGET.COLUMN3 = SOURCE.COLUMN3
   WHEN NOT MATCHED THEN
     INSERT (ID, COLUMN1, COLUMN2, COLUMN3)
     VALUES (SOURCE.ID, SOURCE.COLUMN1, SOURCE.COLUMN2, SOURCE.COLUMN3);

ALTER TASK STAGE.MERGE_INVENTORY RESUME;

Once our SQL scripts are prepared and the execution sequences of the objects are tested in our sandbox workspace, it is essential to check in and organize all the SQL scripts within a dedicated folder, as depicted in Figure-04. 

Additionally, we need to incorporate Flyway variables into the Flyway configuration file. This ensures that when these scripts are executed in different environments, such as integration or production, the variables are dynamically replaced during runtime. As a result, schema-level objects are created specifically for the respective environment.

Figure-04 SQL Script & Flyway Variables Using Placeholders
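
For example, a migration script can reference a placeholder with Flyway’s ${...} syntax, and Flyway substitutes the value at runtime. In the sketch below, the placeholder name env and the database naming pattern are assumptions; the value would be supplied per environment through the flyway.placeholders.env property in the Flyway configuration file (or with -placeholders.env=... on the command line).

-- V1__INVENTORY_TABLE_DDL.SQL (sketch using a Flyway placeholder)
-- ${env} is replaced at runtime, e.g. with DEV, QA, or PROD, so the same
-- script creates the object in the correct database for each environment.
CREATE OR REPLACE TABLE ${env}_DB.STAGE.INVENTORY(
   ID INT,
   COLUMN1 NUMBER,
   COLUMN2 DATE,
   COLUMN3 VARCHAR
);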

Setting Up the CI/CD Pipeline Using PRs & GitHub Actions

Once we have organized our SQL scripts according to the Flyway naming convention, it is time to delve into setting up continuous integration using GitHub.

At a high level, the continuous integration and continuous deployment process using GitHub, Flyway, and GitHub Actions can be illustrated as shown in Figure-05 below.

Figure-05 Snowflake CI/CD Pipeline Workflow Using GitHub + Flyway + GitHub Actions

This continuous integration and continuous deployment (CI/CD) process for Snowflake encompasses several steps, including defining the branching strategy and pull request (PR) workflow. Let’s explore these steps in more detail:

Step 1

Create a Git repository (using GitHub) where all the database scripts following the Flyway naming convention (refer to Figure-04) can be versioned.

Step 2

Enable multiple branches with appropriate privileges to support collaboration and SQL script deployment to the Snowflake workspaces. Each branch serves a specific purpose, as defined below.

Figure-06 Snowflake Flyway Setup & Multiple Branching Strategy
  • Collaboration branch (or feature branch) – The data engineering team can push database scripts (DDLs/DMLs) to this branch without restriction and collaborate effectively with other team members.
  • Integration branch (or QA/UAT branch) – This branch does not accept direct pushes; code changes arrive through pull requests. Once a pull request (PR) raised from the collaboration branch is approved, the changes (DDLs & DMLs) are merged into the integration branch.
  • Production branch (or main branch) – Once validation is completed in the integration workspace and the pull request is approved, the database scripts are merged into the production/main branch.

Step 3

Configure GitHub Actions so that, after pull request approval (from collaboration to integration and from integration to production), the database scripts are deployed into the Snowflake workspace (integration or production context) automatically.

Figure-07 GitHub Action Setup To Trigger Flyway Migration
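
The workflow definition itself is a YAML file in the repository. The minimal sketch below assumes the migration scripts live in a migrations/ folder, that Flyway runs through the official flyway/flyway Docker image, and that the Snowflake account, user, and password are stored as GitHub repository secrets; the secret names, database, warehouse, and role shown here are placeholders you would replace with your own.

# .github/workflows/flyway-deploy.yml (illustrative sketch)
name: Deploy Snowflake changes with Flyway

on:
  push:
    branches: [main]   # runs after a PR is merged into the production branch

jobs:
  flyway-migrate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      # Run Flyway from the official Docker image against the Snowflake JDBC endpoint.
      - name: Run Flyway migrations
        run: |
          docker run --rm \
            -v ${{ github.workspace }}/migrations:/flyway/sql \
            flyway/flyway:latest \
            -url="jdbc:snowflake://${{ secrets.SNOWFLAKE_ACCOUNT }}.snowflakecomputing.com/?db=PROD_DB&warehouse=ETL_WH&role=SYSADMIN" \
            -user="${{ secrets.SNOWFLAKE_USER }}" \
            -password="${{ secrets.SNOWFLAKE_PASSWORD }}" \
            migrate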

Step 4

Leverage the right branching strategy by including one or more approval processes for the integration and production branches.

Figure-08 Branching Strategy & PR Approval

Auditing & Monitoring Using GitHub Workflows

After completing the Flyway configuration and setting up GitHub Actions, along with establishing the branching strategies and PR approval process, the next step is to push the code to GitHub and obtain approval for the merge request. 

Once these conditions are met, the GitHub workflow will automatically trigger the execution of the scripts, creating the necessary Snowflake objects in the target environment.

Figure-09 shows the database change management history, capturing the execution of all SQL scripts along with their checksums and the user responsible for executing each script. This ensures a comprehensive record of the changes made to the Snowflake environment, enabling better tracking and accountability.

Figure-09 Flyway Audit History
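
In addition to Figure-09, you can inspect this audit trail directly in Snowflake by querying the history table that Flyway maintains (named flyway_schema_history by default) in the schema it manages. The schema shown below (STAGE) and the identifier quoting are assumptions; the exact casing depends on how Flyway created the table in your account.

-- Inspect Flyway's change history directly in Snowflake.
-- Assumes the default history table name; adjust the schema and quoting to
-- match how Flyway created the table in your environment.
SELECT *
FROM STAGE."flyway_schema_history"
ORDER BY "installed_rank";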

The GitHub workflow also tracks successful executions as well as failures, and a complete audit log, including the time taken to run the scripts, can be reviewed.

Figure-10 GitHub Workflow History

Conclusion

Enabling the CI/CD pipeline for Snowflake brings numerous benefits to data engineering teams. It ensures version control with change logs, improves code quality and consistency, enhances security, provides scheduling flexibility for builds and deployments, promotes collaboration among developers, offers insights through dashboards and reports, and ensures environment stability with automated rollback capabilities. 

By leveraging the CI/CD pipeline, data teams can achieve faster and more accurate deployments, reducing the possibility of human errors.

If you’re interested in exploring further best practices for Snowflake and CI/CD, we recommend downloading our comprehensive Getting Started with Snowflake Guide. This guide offers actionable steps that will assist you in maximizing the benefits of the Snowflake Data Cloud for your organization.
