In modern data pipelines, continuous data integration and frequent updates can cause unintended data errors. Blue-green deployment provides a final checkpoint to ensure accurate results. In this blog, we detail how to implement blue-green deployment in a data warehouse project.
Have you ever experienced an unexpected breakdown in your data pipeline that caused your dependencies to crash? Did it result in blank charts on your dashboards, inconsistent KPIs, or machine learning models trained with incomplete or outdated data? Discovering data quality issues from users is never a pleasant experience for the data team. Even worse, it can cause internal clients to lose confidence in the data. If you have faced this situation before, Blue-Green Deployment is a solution that can help.
In modern data pipelines, the source data is ingested continuously, and the data warehouse is rebuilt repeatedly on top of the latest data. A new data warehouse version, built with the latest transformation specifications, is released to users at ever-shortening intervals: weekly, daily, multiple times a day, or even hourly, depending on the situation.
From time to time, these updates to data and transformations have unintended consequences, and failures happen. These failures may cause some or all of the final exposed data to be missing or wrong. Of course, there are usually tests (standard tests plus custom tests for previously known scenarios) that catch possible failures and alert the owners that something went wrong - but the data still reaches production in that state, with errors. While the root cause is investigated and fixes are applied, the data continues to be exposed with errors or remains unavailable. This is not a desired outcome at all.
Ideally, there needs to be a final checkpoint to ensure everything has run as expected. If yes, the new data warehouse version can be exposed to the users. Blue-green Deployment does just that with the steps below:
While one environment (Blue) is in production, available for its dependencies, the other environment (Green) is waiting for the next data refresh.
When the data is updated, the transformation steps are complete, and all tests indicate that the update was successful, the two environments "swap" places.
The swap happens almost instantaneously in the data platform, and your ML models, reports, and dashboards now all point to the new data version - without interruption and with minimized (ideally eliminated) risk of disruption.
If any test fails, the “swap” does not take place. The applications continue consuming data from the Blue environment, so data unavailability and inaccurate data scenarios are avoided. During this time, the root cause of the test failure is investigated and addressed, and the refresh is rerun.
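In Snowflake, the swap itself is a single atomic statement. A minimal sketch, using the example database names from the implementation below:

alter database my_transformation_prod_blue swap with my_transformation_prod_green;

-- After this statement, the name my_transformation_prod_blue points at the freshly
-- built data, while my_transformation_prod_green holds the previous production data.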
Please note: The implementation steps below are based on a dbt and Snowflake environment. The essential concepts in this article will still be helpful even if dbt and Snowflake are not part of your current data stack.
Below, we walk you through all the steps to implement blue-green deployment in a dbt project (let's say the project is called my_transformation) that uses Snowflake as the data warehouse.
First, we must ensure that we have the necessary structure for the blue-green deployment. For that, we need the following:
In this example, we will use the following:
my_transformation_prod_blue - the production database that applications read from
my_transformation_prod_green - the database where each new refresh is built and tested
my_transformation_snapshots - the database where dbt snapshots are stored
All applications should read from the “blue” database, and your development branches should be merged to the green database.
We'll use the same schema name, my_schema, for the blue and green databases.
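If these objects do not exist yet, they can be created in Snowflake along these lines (a sketch; adjust ownership and grants to your environment):

create database if not exists my_transformation_prod_blue;
create database if not exists my_transformation_prod_green;
create database if not exists my_transformation_snapshots;

-- the blue and green databases share the same schema name
create schema if not exists my_transformation_prod_blue.my_schema;
create schema if not exists my_transformation_prod_green.my_schema;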
Add the target_database configuration for snapshots in your dbt_project.yml:
snapshots:
  +target_database: my_transformation_snapshots
a. ref()
This macro overrides the default behavior of dbt's ref macro. Instead of returning the fully qualified path for a table, it omits the database name. This is important because we will eventually swap the blue and green databases - and we do not want a hard-coded green database reference to persist.
E.g.,
Default behavior: database_name.schema_name.table_name
New behavior: schema_name.table_name
{% macro ref(model_name) %}
    {# Resolve the relation using dbt's built-in ref #}
    {% set rel = builtins.ref(model_name) %}
    {% if rel.database == 'my_transformation_snapshots' %}
        {# Snapshots live in a fixed database, so keep the fully qualified path #}
        {% do return(builtins.ref(model_name)) %}
    {% else %}
        {# Drop the database name so the reference survives the blue-green swap #}
        {% do return(builtins.ref(model_name).include(database=false)) %}
    {% endif %}
{% endmacro %}
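With this override in place, a model that selects from another model compiles without a database prefix. A quick illustration (the model names are hypothetical):

-- models/my_model.sql (hypothetical)
select * from {{ ref('stg_orders') }}

-- compiles to:  select * from my_schema.stg_orders
-- instead of:   select * from my_transformation_prod_green.my_schema.stg_orders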
b. get_snapshot_write_schema()
Based on the dbt target name, this macro returns the schema name where snapshot models will be written. The behavior we want is to write to the production snapshots schema only when we are running in production.
{% macro get_snapshot_write_schema() %}
    {% if target.name == 'prod' %}
        {# In production, write snapshots to the main snapshots schema #}
        {% set var_target_schema = 'snapshots' %}
    {% else %}
        {# In other environments, write to a target-specific schema, e.g. snapshots_dev #}
        {% set var_target_schema = 'snapshots_' ~ target.name %}
    {% endif %}
    {% do return(var_target_schema) %}
{% endmacro %}
Make sure you add the following config to all snapshot models in your dbt project:
{{
config(
target_schema=get_snapshot_write_schema()
)
}}
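Putting it together, a snapshot definition might look like the sketch below (the source model, unique key, and timestamp strategy are hypothetical):

{% snapshot my_snapshot_model %}

{{
    config(
        target_schema=get_snapshot_write_schema(),
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

-- hypothetical source; replace with the model you want to snapshot
select * from {{ ref('stg_orders') }}

{% endsnapshot %}

The target_database comes from the dbt_project.yml configuration added earlier, so it does not need to be repeated here.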
c. ref_snapshot()
This macro gets the relation for a snapshot model. Even in a dev environment, we want our models to read from production snapshots, which helps us emulate the production environment more closely while developing.
{% macro ref_snapshot(snapshot_name) %}
    {% set snapshot_relation = ref(snapshot_name) %}
    {# Always resolve snapshots to the production snapshots database and schema #}
    {% set prod_snapshot_relation = adapter.get_relation(
        database='my_transformation_snapshots',
        schema='snapshots',
        identifier=snapshot_relation.identifier
    ) %}
    {{ return(prod_snapshot_relation) }}
{% endmacro %}
Make sure you replace the ref macro with ref_snapshot when referencing a snapshot model, e.g.:
with some_cte as (
select * from {{ref('non_snapshot_model')}}
),
some_snapshot as (
select * from {{ref_snapshot('my_snapshot_model')}}
)
...
d. swap_database()
This macro uses Snowflake's swap feature to exchange the blue and green databases. It is invoked with a run-operation command in the next step.
{% macro swap_database() %}
    {# Atomically swap the blue and green databases in Snowflake #}
    {% set sql = 'alter database my_transformation_prod_blue swap with my_transformation_prod_green' %}
    {% do run_query(sql) %}
    {{ log("database swapped", info=True) }}
{% endmacro %}
Add the swap database operation to the commands list in your dbt Cloud job, after the dbt build command.
dbt build
dbt run-operation swap_database
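dbt build only succeeds - and the swap only runs - when every model builds and every test passes, so the tests in your project act as the final gate. A small illustrative schema test file (the model and column names are hypothetical):

# models/schema.yml (illustrative)
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: order_amount
        tests:
          - not_null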
Set the target name
Ensure the target name under your environment equals “prod”.
Set the database name.
Make sure the database name under your environment is equal to “my_transformation_prod_green”.
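For reference, if you run the same job with dbt Core instead of dbt Cloud, the equivalent settings live in profiles.yml. A minimal sketch, assuming a Snowflake connection (the account, user, role, and warehouse values are placeholders):

my_transformation:
  target: prod
  outputs:
    prod:
      type: snowflake
      account: your_account                              # placeholder
      user: dbt_service_user                             # placeholder
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"    # placeholder
      role: transformer                                  # assumption
      warehouse: transforming                            # assumption
      database: my_transformation_prod_green
      schema: my_schema
      threads: 4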
Whenever your project runs with the above blue-green implementation, the following events occur:
dbt build builds your models in the green database and runs all tests.
If every step succeeds, the swap_database run-operation swaps the green and blue databases, and the freshly built version immediately becomes the one your applications read.
If any model or test fails, the job stops before the swap, and applications keep reading the last known-good version from the blue database while the issue is fixed.
This checkpoint-before-deployment approach is one of the essential elements of data quality in your data warehouse.
Blue-green deployment becomes even more effective when a comprehensive test set is built into your transformation project. You can find the 205 Data Lab data quality checklist in this blog post: Data Quality Part I
Additionally, if you have any questions or would like to learn more about our specialized solutions, please don't hesitate to reach out to us at hello@205datalab.com