This entry-level Hands-On Lab exercise is designed to help you master the basics of Coalesce. In this lab, you'll explore the Coalesce interface, learn how to easily transform and model your data with our core capabilities, understand how to deploy and refresh version-controlled data pipelines, and build an ML Forecast node that forecasts sales values for a fictional food truck company.
By completing the steps we've outlined in this guide, you'll have mastered the basics of Coalesce and can venture into our more advanced features.
Coalesce is the first cloud-native, visual data transformation platform built for Snowflake. Coalesce enables data teams to develop and manage data pipelines in a sustainable way at enterprise scale and collaboratively transform data without the traditional headaches of manual, code-only approaches.
Coalesce's unique architecture is built on the concept of column-aware metadata, meaning that the platform collects, manages, and uses column- and table-level information to help users design and deploy data warehouses more effectively. This architectural difference gives data teams the best that legacy ETL and code-first solutions have to offer in terms of flexibility, scalability, and efficiency.

With Coalesce, you can:

- define data warehouses with column-level understanding
- standardize transformations with data patterns (templates)
- model data at the column level
Coalesce also uses column metadata to track past, current, and desired deployment states of data warehouses over time. This provides unparalleled visibility and control of change management workflows, allowing data teams to build and review plans before deploying changes to data warehouses.
Coalesce currently supports only Snowflake as its target database. Because you will be using a trial Coalesce account created via Partner Connect, your basic database settings will be configured automatically and you can start building immediately.
A Coalesce organization is a single instance of the UI, set up specifically for a single prospect or customer. It is set up by a Coalesce administrator and is accessed via a username and password. By default, an organization will contain a single Project and a single user with administrative rights to create further users.
Projects provide a way of completely segregating elements of a build, including the source and target locations of data, the individual pipelines and ultimately the git repository where the code is committed. Therefore teams of users can work completely independently from other teams who are working in a different Coalesce Project.
Each Project requires access to a git repository and Snowflake account to be fully functional. A Project will default to containing a single Workspace, but will ultimately contain several when code is branched.
A Coalesce Workspace is an area where data pipelines are developed that point to a single git branch and a development set of Snowflake schemas. One or more users can access a single Workspace. Typically there are several Workspaces within a Project, each with a specific purpose (such as building different features). Workspaces can be duplicated (branched) or merged together.
A Coalesce Environment is a target area where code and job definitions are deployed to. Examples of an environment would include QA, PreProd and Production.
It isn't possible to develop code directly in an Environment; you can only deploy to it from a particular Workspace (branch). Job definitions in Environments can only be run via the CLI or API (not the UI). Environments are shared across an entire Project, so their definitions are accessible from all Workspaces. Environments should always point to different target schemas (and ideally different databases) than any Workspace.
As the lead Data & Analytics manager for TastyBytes Food Trucks, you're responsible for building and managing data pipelines that deliver insights to the rest of the company. There are customer-related questions the business needs answered that will help with inventory planning and marketing. Included in this is building a machine learning forecast that will allow management to predict sales volume for each item on the menu.
In order to help your extended team answer these questions, you'll need to build a customer data pipeline first.
To complete this lab, please create free trial accounts with Snowflake and Coalesce by following the steps below. You have the option of setting up Git-based version control for your lab, but this is not required to perform the following exercises. Please note that none of your work will be committed to a repository unless you set Git up before developing.
We recommend using Google Chrome as your browser for the best experience.
Once you are logged into your Snowflake account, sign up for a free Coalesce trial account using Snowflake Partner Connect. Check your Snowflake account profile to make sure that it contains your first and last name.
Congratulations! You've successfully created your Coalesce trial account.
frostbyte_tasty_bytes setup is now complete.
Once you've activated your Coalesce trial account and logged in, you will land in your Projects dashboard. Projects are a useful way of organizing your development work by a specific purpose or team goal, similar to how folders help you organize documents in Google Drive.
Your trial account includes a default Project to help you get started. Click on the Launch button next to your Development Workspace to get started.
You will need to add the ML Forecast node into your Coalesce workspace in order to complete this lab.
Let's get familiar with Coalesce by walking through the basic components of the user interface.
Nodes are visual representations of objects in Snowflake that can be arranged to create a Directed Acyclic Graph (DAG or Graph). A DAG is a conceptual representation of the nodes within your data pipeline and their relationships to and dependencies on each other.
Before you can begin transforming your data, you will need to configure storage locations. Storage locations represent a logical destination in Snowflake for your database objects such as views and tables.
Create a CUSTOMER storage location for your raw customer data and a POS storage location for your raw point-of-sale data. Coalesce has already created a database for our work to be mapped to (your WORK location).

Map the CUSTOMER location to the FROSTBYTE_TASTY_BYTES database and the RAW_CUSTOMER schema. Map the POS location to the FROSTBYTE_TASTY_BYTES database and the RAW_POS schema. Keep your WORK location mappings the same. Select Save to ensure Coalesce stores your mappings.

Let's start to build the foundation of your data pipeline by creating a Graph (DAG) and adding data in the form of Source nodes.
Select the CUSTOMER source node (CUSTOMER_LOYALTY), along with the MENU, ORDER_DETAIL, and ORDER_HEADER nodes from the POS source storage location. Alternatively, if available, you can use ORDER_HEADER_14M and ORDER_DETAIL_38M. Then click the Add Sources button.

Now that you've added your Source nodes, let's prepare the data by adding business logic with Stage nodes. Stage nodes represent staging tables within Snowflake where transformations can be previewed and performed. Let's start by adding a standardized "stage layer" for all sources.
Create Stage nodes for each of your sources (for example, STG_ORDER_HEADER from the POS location). Your Node Editor is used to edit the node object, the columns within it, and its relationship to other nodes. There are a few different components to the Node Editor.

Open your STG_CUSTOMER_LOYALTY node to open up your Node Editor. The large middle section is your Mapping grid, where you can see the structure of your node along with column metadata like transformations, data types, and sources.

Now let's apply a single column transformation in your STG_CUSTOMER_LOYALTY node by applying a consistent naming convention to our customer names. This will ensure that any customer names are standardized.
Select the FIRST_NAME column and click the blank Transform field in the Mapping grid. Enter the following transformation and then press Enter:

UPPER({{SRC}})

Apply the same transformation to the LAST_NAME column:

UPPER({{SRC}})
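To see what this does, note that Coalesce substitutes {{SRC}} with the qualified source column when it renders the node's SQL. The generated SELECT is roughly the following sketch (column list abbreviated; the exact qualification and aliasing Coalesce emits may differ):

```sql
-- Sketch of the rendered SELECT for STG_CUSTOMER_LOYALTY after the
-- UPPER({{SRC}}) transforms; abbreviated and illustrative, not exact output.
SELECT
    "CUSTOMER_LOYALTY"."CUSTOMER_ID" AS "CUSTOMER_ID",
    UPPER("CUSTOMER_LOYALTY"."FIRST_NAME") AS "FIRST_NAME",
    UPPER("CUSTOMER_LOYALTY"."LAST_NAME") AS "LAST_NAME"
FROM "FROSTBYTE_TASTY_BYTES"."RAW_CUSTOMER"."CUSTOMER_LOYALTY"
```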
In order to prepare the data within your data pipeline, it needs to be processed further with an additional stage layer.
Join your STG_ORDER_HEADER data with your STG_ORDER_DETAIL data by holding down the Shift key and clicking on both nodes. Right click on either node, select Join Nodes, and then select Stage. Configure the join as follows:

FROM {{ ref('WORK', 'STG_ORDER_HEADER') }} "STG_ORDER_HEADER"
INNER JOIN {{ ref('WORK', 'STG_ORDER_DETAIL') }} "STG_ORDER_DETAIL"
ON "STG_ORDER_HEADER"."ORDER_ID" = "STG_ORDER_DETAIL"."ORDER_ID"

Because both staging nodes contain ORDER_ID and DISCOUNT_ID columns, we will need to rename or remove one of each to avoid getting a duplicate column name error. Let's delete the ORDER_ID and DISCOUNT_ID columns that came from STG_ORDER_DETAIL. Rename the node STG_ORDER_MASTER.

Create and run your STG_ORDER_MASTER node, then preview the contents of your node by clicking Fetch Data.

Now let's experiment with creating Dimension nodes. These nodes are generally descriptive in nature and can be used to track particular aspects of data over time (such as time or location). Coalesce currently supports Type 1 and Type 2 slowly changing dimensions, which we will explore in this section.
Select your STG_CUSTOMER_LOYALTY node, right click, and create a Dimension node. Because the demographics of your customer base change over time, creating a Dimension node on top of this data will allow you to capture customer changes over time.

In the Config section, under Business Key, check the box next to CUSTOMER_ID and press the > button to select it as your business key. Then select the FIRST_NAME, LAST_NAME, EMAIL, and PHONE_NUMBER columns under Change Tracking. This will allow you to track changes to these columns over time.

Next, select your STG_MENU node. Right click and select Add Node > Dimension from the dropdown menu. Within the DIM_MENU node, navigate to the right and expand the Options drop down in the Config section. Under Business Key, check the box next to MENU_ITEM_ID and press the > button to select it as your business key.

Fact nodes represent Coalesce's implementation of a Kimball Fact Table, which consists of the measurements, metrics, or facts of a business process and is typically located at the center of a star or snowflake schema surrounded by dimension tables.
Let's create a Fact table to allow us to query the metrics surrounding our customer orders.
Select your STG_ORDER_MASTER node. Then right click and select Add Node > Fact. Under Business Key, select ORDER_DETAIL_ID, then press the > button to select it as your business key. Create and run your FCT_ORDER_MASTER node.

Now let's create a view for our analysts to query, allowing them to answer questions they have about our customers.
Select FCT_ORDER_MASTER, DIM_CUSTOMER_LOYALTY, and DIM_MENU, then right click and select Join Nodes > View. Name the view V_CUSTOMER_ORDER_HISTORY.

To keep the data relevant to customer order history, we're going to delete many of these columns. First, delete all of the SYSTEM_ columns. Next, delete the CUSTOMER_ID column that came from the DIM_CUSTOMER_LOYALTY node and the MENU_ITEM_ID column that came from the DIM_MENU node. Then configure the join:

FROM {{ ref('WORK', 'FCT_ORDER_MASTER') }} "FCT_ORDER_MASTER"
LEFT JOIN {{ ref('WORK', 'DIM_MENU') }} "DIM_MENU"
ON "FCT_ORDER_MASTER"."MENU_ITEM_ID" = "DIM_MENU"."MENU_ITEM_ID"
LEFT JOIN {{ ref('WORK', 'DIM_CUSTOMER_LOYALTY') }} "DIM_CUSTOMER_LOYALTY"
ON "FCT_ORDER_MASTER"."CUSTOMER_ID" = "DIM_CUSTOMER_LOYALTY"."CUSTOMER_ID"
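With the view in place, analysts can start answering customer questions directly in Snowflake. For example, a query along these lines (illustrative; it assumes columns such as FIRST_NAME, LAST_NAME, and ORDER_AMOUNT remain in the view after the cleanup) surfaces your highest-spending loyalty customers:

```sql
-- Illustrative analyst query against the new view; assumes FIRST_NAME,
-- LAST_NAME, and ORDER_AMOUNT survived the column cleanup.
SELECT "FIRST_NAME", "LAST_NAME", SUM("ORDER_AMOUNT") AS TOTAL_SPEND
FROM V_CUSTOMER_ORDER_HISTORY
GROUP BY "FIRST_NAME", "LAST_NAME"
ORDER BY TOTAL_SPEND DESC
LIMIT 10;
```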
While you were building your pipeline, you may have noticed there was a VARIANT column containing JSON data in our STG_MENU node. You'll want to use this data to enrich your nodes and make more informed decisions. This means you first need to derive the JSON mappings from the VARIANT column into individual columns. Then, you need to propagate those columns to all of your downstream nodes in order to use them for decision making.
In your STG_MENU node, select the MENU_ITEM_HEALTH_METRICS_OBJ column. You'll notice the Data type is VARIANT. Right click on the column and select Derive Mappings > From JSON. Coalesce will automatically unnest this data into separate columns.

The derived columns are prefixed with the name of the original VARIANT column. You can bulk edit these columns and quickly update the names to remove the unneeded prefix. Select the menu_item_health_metrics_ingrediants column, hold shift, and select menu_item_id. This will highlight all of the columns you just parsed. Right click on any of the columns and select Bulk Edit. Then enter the following rename template:

{{ column.name | replace("menu_item_health_metrics_", "") | upper }}
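Under the hood, derived JSON mappings correspond to Snowflake's colon-path syntax for querying VARIANT data; each derived column is an expression of roughly this shape (field names and casts below are illustrative, not the exact SQL Coalesce generates):

```sql
-- Illustrative: extracting JSON fields from a VARIANT column in Snowflake.
SELECT
    "MENU_ITEM_HEALTH_METRICS_OBJ":"menu_item_id"::NUMBER AS "MENU_ITEM_ID",
    "MENU_ITEM_HEALTH_METRICS_OBJ":"menu_item_health_metrics" AS "HEALTH_METRICS"
FROM STG_MENU
```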
Now that you have parsed out the JSON column and bulk edited the new columns to be more understandable, you can propagate those columns to all of the applicable downstream nodes.

In the STG_MENU node, select the ingredients column, hold down shift, and select the IS_NUT_FREE_FLAG column to highlight all of the columns you want to propagate. Click on the ellipses next to any of the columns and select Propagate Addition. Select DIM_MENU and V_CUSTOMER_ORDER_HISTORY in order to propagate the columns to these nodes.

Select the FCT_ORDER_MASTER
node and the DIM_MENU node by holding down the shift key and clicking on both nodes. Right click on either node and select Join Nodes, and then select Stage. This will bring you to the node editor of your new Stage node. Configure the join:

FROM {{ ref('WORK', 'DIM_MENU') }} "DIM_MENU"
INNER JOIN {{ ref('WORK', 'FCT_ORDER_MASTER') }} "FCT_ORDER_MASTER"
ON "DIM_MENU"."MENU_ITEM_ID" = "FCT_ORDER_MASTER"."MENU_ITEM_ID"

Rename the node STG_ORDER_MASTER_ITEMS. Delete the duplicate MENU_ITEM_ID column, as well as all the SYSTEM_ columns, and then create and run the node.

Now we will prepare the data for our ML forecast by selecting the columns we want to work with and applying single column transformations to that data.
The ML Forecast node we will use later in the lab will rely on three columns to produce a forecast:

- MENU_ITEM_NAME -> Series column
- ORDER_AMOUNT -> Target column
- ORDER_TS -> Timestamp column

We will structure our STG_ORDER_MASTER_ITEMS node so that it only contains the columns we will need for our ML Forecast node.

Select the ORDER_AMOUNT, ORDER_TS, and MENU_ITEM_NAME columns, then right click on any of the columns and select Bulk Edit. We will add a FORECAST prefix to each of the columns we want to keep. Enter the template below, then select Preview -> Update to rename the columns:

FORECAST_{{column.name}}

Select the remaining columns without the FORECAST_ prefix, right click on them, and delete them.

Select the FORECAST_ORDER_TS
column and click into the transform field in the mapping grid. You can cast a timestamp as a date, removing the seconds, minutes, and hours, by using the transformation below:
{{SRC}}::date
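For example, in Snowflake the ::date cast truncates the time-of-day portion of a timestamp:

```sql
-- The ::date cast drops hours, minutes, and seconds.
SELECT '2023-05-01 13:45:02'::timestamp::date;  -- 2023-05-01
```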
Next, apply an aggregate transformation to the FORECAST_ORDER_AMOUNT column:

sum({{SRC}})

We will leave the FORECAST_MENU_ITEM_NAME column as it is. However, because we have included an aggregate function, we will need a GROUP BY in the node. Navigate to the Join tab so we can finish preparing this node. Within the editor, hit the Enter key to add a new line. Then type WHERE to add a filter clause.
For the conditions in our filter, we will first filter where the ORDER_TS column is greater than or equal to 2022-01-01. We will also add a filter where the ITEM_CATEGORY is equal to Beverage. You can copy and paste the code for this filter below:
WHERE "ORDER_TS" >= '2022-01-01' and "ITEM_CATEGORY" = 'Beverage'
Next, we will add the GROUP BY statement. Within the editor, add a new line by hitting the Enter key. Then type the clause below, which groups by the two non-aggregated columns, MENU_ITEM_NAME and ORDER_TS:

GROUP BY ALL
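Putting the transformations, filter, and grouping together, the SQL Coalesce renders for STG_ORDER_MASTER_ITEMS should look roughly like this sketch (the FROM clause is the DIM_MENU/FCT_ORDER_MASTER join defined on the Join tab, abbreviated here):

```sql
-- Sketch of the rendered node SQL; illustrative rather than exact output.
SELECT
    "MENU_ITEM_NAME"    AS "FORECAST_MENU_ITEM_NAME",
    "ORDER_TS"::date    AS "FORECAST_ORDER_TS",
    SUM("ORDER_AMOUNT") AS "FORECAST_ORDER_AMOUNT"
FROM ...  -- DIM_MENU joined to FCT_ORDER_MASTER, as configured on the Join tab
WHERE "ORDER_TS" >= '2022-01-01' AND "ITEM_CATEGORY" = 'Beverage'
GROUP BY ALL
```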
As the final step of building out this data pipeline for operationalizing the ML Forecast Node, we will add in the ML Forecast Node to the pipeline.
Navigate back to the STG_ORDER_MASTER_ITEMS node that we have been working with and right click on it. Select Add Node -> CortexML -> Forecast.

You'll notice three new columns: FORECAST, LOWER_BOUND, and UPPER_BOUND. These are generated automatically by the ML Forecast node and will hold the forecasted values for the configured node. Within the Config section, select the Forecast Model Input dropdown. This is where we will configure the node to forecast the values we want to predict. Since there are multiple menu items that we will be predicting order amounts for, we will leave the Multi-Series Forecast toggle on.

Rename the node ORDER_ITEM_FORECAST. For the Series Column, select the FORECAST_MENU_ITEM_NAME column; this is the column used as the series we predict values for. For the Timestamp Column, select the FORECAST_ORDER_TS column. For the Target Column, select the FORECAST_ORDER_AMOUNT column; the Target Column contains the values we want to predict. We are presented with a Days to Forecast input that defaults to 30 days. We can update this to any value of our choosing, but for the sake of this lab guide, we will leave it at 30.
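Behind the scenes, the node builds on Snowflake's Cortex ML forecasting functions. Conceptually, the work it generates is similar to creating and calling a SNOWFLAKE.ML.FORECAST object as sketched below (the model and input names here are illustrative, not the exact DDL Coalesce emits):

```sql
-- Conceptual sketch of Snowflake Cortex ML forecasting; names are hypothetical.
CREATE OR REPLACE SNOWFLAKE.ML.FORECAST order_item_model(
    INPUT_DATA        => SYSTEM$REFERENCE('TABLE', 'STG_ORDER_MASTER_ITEMS'),
    SERIES_COLNAME    => 'FORECAST_MENU_ITEM_NAME',
    TIMESTAMP_COLNAME => 'FORECAST_ORDER_TS',
    TARGET_COLNAME    => 'FORECAST_ORDER_AMOUNT'
);

-- Generate 30 days of predictions for each menu item (series).
CALL order_item_model!FORECAST(FORECASTING_PERIODS => 30);
```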
Now that we have configured, created, and run the ML Forecast node within our Snowflake account, we can navigate to Snowflake to see how our forecasted values show up.
The forecast results land in a table in the PC_COALESCE_DB.PUBLIC database and schema, likely called ML_FCSTS_ORDER_MASTER_ITEMS, unless you manually changed the name during the last section. Within Snowflake, run a SELECT statement that queries all of the data within the table. You can copy and paste the code below, assuming you are using the default table name for your ML Forecast node:
select *
from pc_coalesce_db.public.ml_fcsts_order_master_items
In the results, you'll see the forecasted values, with FORECAST_ORDER_AMOUNT as the column value.

Congratulations on completing your lab! You've mastered the basics of Coalesce and are now equipped to venture into our more advanced features. Be sure to reference this exercise if you ever need a refresher.
We encourage you to continue working with Coalesce by using it with your own data and use cases and by using some of the more advanced capabilities not covered in this lab.
Continue with your free trial by loading your own sample or production data and exploring more of Coalesce's capabilities with our documentation and resources. For more technical guides on advanced Coalesce and Snowflake features, check out Snowflake Quickstart guides covering Dynamic Tables and Cortex ML functions.
Reach out to our sales team at coalesce.io or by emailing sales@coalesce.io to learn more!