Prepare training and validation dataset for facies classification using a Snowflake OAuth connection and Amazon SageMaker Canvas

February 2024: This post was reviewed and updated for accuracy.

This post is co-written with Thatcher Thornberry from bpx energy.

Facies classification is the process of segmenting lithologic formations from geologic data at the wellbore location. During drilling, wireline logs are obtained, which have depth-dependent geologic information. Geologists are deployed to analyze this log data and determine depth ranges for potential facies of interest from the different types of log data. Accurately classifying these regions is critical for the drilling processes that follow.

Facies classification using AI and machine learning (ML) has become an increasingly popular area of investigation for many oil majors. Many data scientists and business analysts at large oil companies don’t have the necessary skillset to run advanced ML experiments on important tasks such as facies classification. To address this, we show you how to easily prepare and train a best-in-class ML classification model on this problem.

In this post, aimed primarily at those who are already using Snowflake, we explain how you can import both training and validation data for a facies classification task from Snowflake into Amazon SageMaker Canvas and subsequently train the model using a 3+ category prediction model.

Solution overview

Our solution consists of the following steps:

Upload facies CSV data from your local machine to Snowflake. For this post, we use data from the following open-source GitHub repo.
Configure AWS Identity and Access Management (IAM) roles for Snowflake and create a Snowflake integration (optional; see the Configure Snowflake Data Import Permissions section of these docs).
Set up OAuth for Snowflake.
Create a secret for Snowflake credentials (mandatory for OAuth connections).
Import Snowflake data directly into Canvas.
Build a facies classification model.
Analyze the model.
Run batch and single predictions using the multi-class model.
Share the trained model using Amazon SageMaker Studio.

Prerequisites

Prerequisites for this post include the following:

An AWS account.
Canvas set up, with an Amazon SageMaker user profile associated with it.
A Snowflake account. For steps to create a Snowflake account, refer to How to: Create a Snowflake Free Trial Account.
The Snowflake CLI. For steps to connect to Snowflake by CLI, refer to SnowSQL, the command line interface for connecting to Snowflake, and Snowflake SnowSQL: Command Line Tool to access Snowflake.
An existing database within Snowflake.

Upload facies CSV data to Snowflake

In this section, we take two open-source datasets and upload them directly from our local machine to a Snowflake database. From there, we set up an integration layer between Snowflake and Canvas.

Download the training_data.csv and validation_data_nofacies.csv files to your local machine. Make note of where you saved them.
Make sure that you have the correct Snowflake credentials and have installed the Snowflake CLI, then log in. For more information, refer to Log into SnowSQL.
Select the appropriate Snowflake warehouse to work within, which in our case is COMPUTE_WH:
```
USE WAREHOUSE COMPUTE_WH;
```
Choose a database to use for the remainder of the walkthrough:
```
use demo_db;
```
Create a named file format that will describe a set of staged data to access or load into Snowflake tables.

This can be run either in the Snowflake CLI or in a Snowflake worksheet on the web application. For this post, we run a SnowSQL query in the web application. See Getting Started With Worksheets for instructions to create a worksheet on the Snowflake web application.

Create a table in Snowflake using the CREATE statement.

The following statement creates a new table in the current or specified schema (or replaces an existing table).

It’s important that the data types and the order in which they appear are correct, and align with what is found in the CSV files that we previously downloaded. If they’re inconsistent, we’ll run into issues later when we try to copy the data across.

Do the same for the validation table.

Note that the schema is slightly different from that of the training data. Again, ensure that the data types and the column (feature) order are correct.
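Because the column order has to match the CSV files exactly, one way to keep the two tables consistent is to render both CREATE statements from a single ordered column list. The following Python sketch illustrates the idea. The column names are taken from the import queries used later in this post; the Snowflake data types and the helper name `render_create_table` are assumptions for illustration, so check them against your CSV files before running the generated SQL.

```python
# Ordered column specification. Names follow the import queries in this post;
# the data types are ASSUMPTIONS and should be verified against the CSV files.
TRAINING_COLUMNS = [
    ("FACIES", "INT"),
    ("FORMATION", "VARCHAR"),
    ("WELL_NAME", "VARCHAR"),
    ("DEPTH", "FLOAT"),
    ("GR", "FLOAT"),
    ("ILD_LOG10", "FLOAT"),
    ("DELTAPHI", "FLOAT"),
    ("PHIND", "FLOAT"),
    ("PE", "FLOAT"),
    ("NM_M", "INT"),
    ("RELPOS", "FLOAT"),
]

# The validation file carries no FACIES label, so its table omits that column.
VALIDATION_COLUMNS = [col for col in TRAINING_COLUMNS if col[0] != "FACIES"]


def render_create_table(table_name, columns):
    """Render a CREATE OR REPLACE TABLE statement, preserving column order."""
    body = ",\n".join(f"    {name} {dtype}" for name, dtype in columns)
    return f"CREATE OR REPLACE TABLE {table_name} (\n{body}\n);"


print(render_create_table("TRAINING_DATA", TRAINING_COLUMNS))
print(render_create_table("VALIDATION_DATA", VALIDATION_COLUMNS))
```

You can paste the printed statements into a Snowflake worksheet or the SnowSQL CLI.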

Load the CSV data file from your local system into the Snowflake staging environment:
1. The following is the syntax of the statement for Windows:
```
put file://D:\path-to-file.csv @DB_NAME.PUBLIC.%table_name;
```
2. The following is the syntax of the statement for macOS:
```
put file:///path-to-file.csv @DB_NAME.PUBLIC.%table_name;
```

The following screenshot shows an example command and output from within the SnowSQL CLI.

Copy the data into the target Snowflake table.

Here, we load the training CSV data to the target table, which we created earlier. Note that you have to do this for both the training and validation CSV files, copying them into the training and validation tables, respectively.
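Since the stage and copy steps are repeated for each file, it can help to generate both statements from the same inputs. The following Python sketch is one possible way to do that, not the exact commands used for this post. It assumes the files were staged to the table stage shown in the PUT syntax above, and that the named file format created earlier is called my_csv_format (the name referenced in the external stage section). The local paths and the helper name `render_load_statements` are hypothetical.

```python
def render_load_statements(local_path, database, table, file_format="my_csv_format"):
    """Return the PUT and COPY INTO statements for one CSV file and table.

    The file_format default is an ASSUMPTION; use the name of the file
    format you actually created.
    """
    qualified_table = f"{database}.PUBLIC.{table}"
    table_stage = f"@{database}.PUBLIC.%{table}"  # per-table stage
    put_stmt = f"PUT file://{local_path} {table_stage};"
    copy_stmt = (
        f"COPY INTO {qualified_table} FROM {table_stage} "
        f"FILE_FORMAT = (FORMAT_NAME = '{file_format}');"
    )
    return put_stmt, copy_stmt


# Hypothetical local paths; replace them with where you saved the files.
files = [
    ("/tmp/training_data.csv", "TRAINING_DATA"),
    ("/tmp/validation_data_nofacies.csv", "VALIDATION_DATA"),
]
for path, table in files:
    for statement in render_load_statements(path, "DEMO_DB", table):
        print(statement)
```

On Windows, pass the path in the form `D:\path-to-file.csv` so that the rendered PUT statement matches the Windows syntax shown earlier.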

Verify that the data has been loaded into the target table by running a SELECT query (you can do this for both the training and validation data):
```
select * from TRAINING_DATA;
```

Configure Snowflake IAM roles and create the Snowflake integration [Optional]

In this section, we cover creating the policy required for Snowflake to access data in an Amazon Simple Storage Service (Amazon S3) bucket of your choosing. If you already have a policy and role that allows access to the S3 bucket you plan to use for the Amazon SageMaker Data Wrangler output, you can skip this section and the next section, and start creating your storage integration in Snowflake. By default, Data Wrangler queries the data in Snowflake without creating a copy of it in an Amazon S3 location. Use the following information if you’re configuring a storage integration with Snowflake. Your users can use a storage integration to store their query results in an Amazon S3 location.

Retrieve the IAM user for your Snowflake account

Once you have successfully configured your Snowflake storage integration, run the following DESCRIBE INTEGRATION command to retrieve the ARN for the IAM user that was created automatically for your Snowflake account:

```
DESC INTEGRATION SAGEMAKER_CANVAS_INTEGRATION;
```

Record the following values from the output:

STORAGE_AWS_IAM_USER_ARN – The IAM user created for your Snowflake account
STORAGE_AWS_EXTERNAL_ID – The external ID needed to establish a trust relationship

Update the IAM role trust policy

Now we update the trust policy:

On the IAM console, choose Roles in the navigation pane.
Choose the role you created.
On the Trust relationship tab, choose Edit trust relationship.

Modify the policy document as shown in the following code, using the DESC INTEGRATION output values you recorded in the previous step:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "AWS": "<snowflake_user_arn>"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "<snowflake_external_id>"
                }
            }
        }
    ]
}
```

Choose Update trust policy.
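If you prefer to generate the trust policy document rather than edit it by hand, the following Python sketch fills in the two recorded values and prints the JSON to paste into the IAM console. The function name `build_trust_policy` and the example ARN and external ID are placeholders for illustration, not real values.

```python
import json


def build_trust_policy(snowflake_user_arn, snowflake_external_id):
    """Build the trust policy from the values reported by DESC INTEGRATION."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                # STORAGE_AWS_IAM_USER_ARN from the DESC INTEGRATION output
                "Principal": {"AWS": snowflake_user_arn},
                "Action": "sts:AssumeRole",
                # STORAGE_AWS_EXTERNAL_ID from the DESC INTEGRATION output
                "Condition": {
                    "StringEquals": {"sts:ExternalId": snowflake_external_id}
                },
            }
        ],
    }


# Placeholder values for illustration only.
policy = build_trust_policy(
    "arn:aws:iam::123456789012:user/example-snowflake-user",
    "EXAMPLE_EXTERNAL_ID",
)
print(json.dumps(policy, indent=4))
```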

Create an external stage in Snowflake

We use an external stage within Snowflake for loading data from an S3 bucket in your own account into Snowflake. In this step, we create an external (Amazon S3) stage that references the storage integration you created. For more information, see Creating an S3 Stage.

This requires a role that has the CREATE STAGE privilege for the schema as well as the USAGE privilege on the storage integration. You can grant these privileges to the role as shown in the code in the next step.

Create the stage using the CREATE STAGE command with placeholders for the external stage and S3 bucket and prefix. The stage also references a named file format object called my_csv_format:

```
-- <snowflake_role> is the Snowflake role that creates the stage
grant create stage on schema public to role <snowflake_role>;
grant usage on integration SAGEMAKER_CANVAS_INTEGRATION to role <snowflake_role>;
create stage <external_stage>
storage_integration = SAGEMAKER_CANVAS_INTEGRATION
url = 's3://<s3_bucket>/<prefix>/'
file_format = my_csv_format;
```

Set up OAuth for Snowflake

Canvas supports authenticating to Snowflake through an identity provider, so that users do not have to enter their credentials directly into Canvas.

Canvas supports Okta, Ping Federate, and Azure AD as identity providers. Refer to the Snowflake documentation for the identity provider that you use.

The following process describes the general steps you must take. For more detailed instructions about performing these steps, you can refer to the Setting up Snowflake OAuth Access section in the Data Wrangler documentation for importing data from Snowflake.

To set up OAuth for Snowflake, do the following:

Register Canvas as an application with the identity provider. This requires specifying a redirect URL to Canvas, which should follow this format: https://<domain-id>.studio.<region>.sagemaker.aws/canvas/default
Within the identity provider, create a server or API that sends OAuth tokens to Canvas so that Canvas can access Snowflake. When setting up the server, use the authorization code and refresh token grant types, specify the access token lifetime, and set a refresh token policy. Additionally, within the External OAuth Security Integration for Snowflake, enable external_oauth_any_role_mode.
Get the following information from the identity provider: token URL, authorization URL, client ID, client secret. For Azure AD, also retrieve the OAuth scope credentials.
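The redirect URL in the first step is built from your SageMaker Domain ID and AWS Region. As a small illustration, the following Python sketch assembles it from those two values. The helper name and the example Domain ID are hypothetical.

```python
def canvas_redirect_url(domain_id, region):
    """Build the Canvas redirect URL to register with the identity provider."""
    if not domain_id or not region:
        raise ValueError("domain_id and region are both required")
    return f"https://{domain_id}.studio.{region}.sagemaker.aws/canvas/default"


# Hypothetical Domain ID and Region for illustration.
print(canvas_redirect_url("d-exampledomain", "us-east-1"))
```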

Store the information retrieved in the previous step in an AWS Secrets Manager secret.

For Okta and Ping Federate, the secret should use the following format:

```
{"token_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/token",
"client_id":"example-client-id", "client_secret":"example-client-secret", "identity_provider":"OKTA"|"PING_FEDERATE",
"authorization_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/authorize"}
```

For Azure AD, the secret should also include the OAuth scope credentials in the datasource_oauth_scope field.
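To avoid typos in the secret value, you can assemble it programmatically. The following Python sketch builds the JSON string from the fields described above and requires the scope only for Azure AD. Treat it as an illustration: the helper name is hypothetical, and the identity provider value "AZURE_AD" is an assumption that you should confirm in the Data Wrangler documentation.

```python
import json

# OKTA and PING_FEDERATE appear in the format above; AZURE_AD is an ASSUMED value.
SUPPORTED_PROVIDERS = {"OKTA", "PING_FEDERATE", "AZURE_AD"}


def build_oauth_secret(identity_provider, token_url, authorization_url,
                       client_id, client_secret, datasource_oauth_scope=None):
    """Return the secret value as a JSON string for AWS Secrets Manager."""
    if identity_provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported identity provider: {identity_provider}")
    secret = {
        "token_url": token_url,
        "client_id": client_id,
        "client_secret": client_secret,
        "identity_provider": identity_provider,
        "authorization_url": authorization_url,
    }
    if identity_provider == "AZURE_AD":
        # Azure AD secrets also carry the OAuth scope.
        if not datasource_oauth_scope:
            raise ValueError("datasource_oauth_scope is required for Azure AD")
        secret["datasource_oauth_scope"] = datasource_oauth_scope
    return json.dumps(secret)


print(build_oauth_secret(
    "OKTA",
    "https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/token",
    "https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/authorize",
    "example-client-id",
    "example-client-secret",
))
```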

Create an OAuth for Snowflake secret in secrets manager

After configuring the identity provider, you must store the information in an AWS Secrets Manager secret and add that secret to your Amazon SageMaker Domain or user profile. Note that you can add a secret to both a Domain and a user profile, but Canvas looks for secrets in the user profile first.

To add a secret to your Domain or user profile, do the following:

Go to the Amazon SageMaker console.
Choose Domains in the navigation pane.
From the list of Domains, choose your Domain.
1. If adding your secret to your Domain, do the following:
  1. Choose the Domain.
  2. On the Domain settings page, choose the Domain settings tab.
  3. Choose Edit.
2. If adding the secret to your user profile, do the following:
  1. Choose the user’s Domain.
  2. On the Domain settings page, choose the user profile.
  3. On the User Details page, choose Edit.
In the navigation pane, choose Canvas settings.
For OAuth settings, choose Add OAuth configuration.
For Data source, select Snowflake.
For Secret Setup, select Create a new secret. Alternatively, if you already created an AWS Secrets Manager secret with your credentials, enter the ARN for the secret. If creating a new secret, do the following:
1. For Identity Provider, select SNOWFLAKE.
2. For Client ID, Client Secret, Authorization URL, and Token URL, enter all of the information you gathered from the identity provider in the previous procedure.
Save your Domain or user profile settings.

You should now be able to create a connection to your data in Snowflake from Canvas.

Import Snowflake data directly into Canvas

To import your facies dataset directly into Canvas, complete the following steps:

On the SageMaker console, choose Amazon SageMaker Canvas in the navigation pane.
Choose your user profile and choose Open Canvas.
On the Canvas landing page, choose Data Wrangler on the left navigation pane.
You now have two options: (a) import a standalone dataset, or (b) create a Data Wrangler data flow within Canvas. For this post, we use option (a).
On the Datasets tab, choose Create, then Tabular.
Name your dataset, choose Snowflake as the data source, then choose Add Connection.
If you have not yet successfully created an OAuth connection within the SageMaker console, you will receive the following error. In that case, revisit the step “Create an OAuth for Snowflake secret in secrets manager”.
If you have successfully created an OAuth connection, you can proceed to enter a new connection name and your account ID. Canvas will do the rest.

If all the entries are valid, you should see all the databases associated with the connection in the navigation pane (see the following example for NICK_FACIES).

Choose the TRAINING_DATA table, then choose Preview dataset.

If you’re happy with the data, you can edit the custom SQL in the data visualizer.

Choose Edit in SQL.

Run the following SQL command before importing into Canvas. (This assumes that the database is called NICK_FACIES. Replace this value with your database name.)

```
SELECT "FACIES", "FORMATION", "WELL_NAME", "DEPTH", "GR", "ILD_LOG10", "DELTAPHI", "PHIND", "PE", "NM_M", "RELPOS" FROM "NICK_FACIES"."PUBLIC"."TRAINING_DATA";
```

Something similar to the following screenshot should appear in the Import preview section.

If you’re happy with the preview, choose Import data.
Choose an appropriate dataset name, ensuring that it’s unique and fewer than 32 characters long.
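If you import many datasets, a quick check of the naming rule can save a failed import. The following Python sketch encodes the two constraints stated above (unique, and fewer than 32 characters). The function name and the list of existing names are hypothetical, and Canvas may enforce additional rules that are not covered here.

```python
def is_valid_dataset_name(name, existing_names):
    """Check that a dataset name is non-empty, unique, and under 32 characters."""
    name = name.strip()
    return 0 < len(name) < 32 and name not in existing_names


# Hypothetical names for illustration.
existing = {"facies-training"}
print(is_valid_dataset_name("facies-validation", existing))
print(is_valid_dataset_name("facies-training", existing))
```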

Use the following query to import the validation dataset, using the same method as earlier:

```
SELECT "FORMATION", "WELL_NAME", "DEPTH", "GR", "ILD_LOG10", "DELTAPHI", "PHIND", "PE", "NM_M", "RELPOS" FROM "NICK_FACIES"."PUBLIC"."VALIDATION_DATA";
```
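The training and validation queries differ only in the column list and the table name, so they can be generated from one helper. The following Python sketch shows one way to do this with double-quoted identifiers; the helper names are hypothetical, and NICK_FACIES is the example database name used in this post.

```python
def quote_identifier(name):
    """Wrap an identifier in double quotes, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def render_select(database, table, columns, schema="PUBLIC"):
    """Render a SELECT statement with fully qualified, quoted identifiers."""
    column_list = ", ".join(quote_identifier(col) for col in columns)
    source = ".".join(quote_identifier(part) for part in (database, schema, table))
    return f"SELECT {column_list} FROM {source};"


FEATURES = ["FORMATION", "WELL_NAME", "DEPTH", "GR", "ILD_LOG10",
            "DELTAPHI", "PHIND", "PE", "NM_M", "RELPOS"]

# The training query includes the FACIES label; the validation query does not.
print(render_select("NICK_FACIES", "TRAINING_DATA", ["FACIES"] + FEATURES))
print(render_select("NICK_FACIES", "VALIDATION_DATA", FEATURES))
```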

Build a facies classification model

To build your facies classification model, complete the following steps:

Choose Models in the navigation pane, then choose New Model.
Give your model a suitable name and choose Predictive analysis problem type.
On the Select tab, choose the recently imported training dataset, then choose Select dataset.
On the Build tab, drop the WELL_NAME column by clearing its check box.

We do this because well names are arbitrary labels that we use to distinguish one well from another. They carry no useful information for the ML model.

Choose FACIES as the target column.
Leave Model type as 3+ category prediction.
Validate the data.
Choose Standard build (in the drop-down below the Quick build).

Your page should look similar to the following screenshot just before building your model.

After you choose Standard build, the model enters the analyze stage. You’re provided an expected build time. You can now close this window, log out of Canvas (in order to avoid charges), and return to Canvas at a later time.

Analyze the facies classification model

To analyze the model, complete the following steps:

Log back in to Canvas.
Locate your previously created model, choose View, then choose Analyze.
On the Overview tab, you can see the impact that individual features are having on the model output.
In the right pane, you can visualize the impact that a given feature (X axis) is having on the prediction of each facies class (Y axis).

These visualizations will change accordingly depending on the feature you select. We encourage you to explore this page by cycling through all 9 classes and 10 features.

On the Scoring tab, we can see the predicted vs. actual facies classification.
Choose Advanced metrics to view F1 scores, average accuracy, precision, recall, and AUC.

Again, we encourage viewing all the different classes.

Choose Download to download an image to your local machine.

In the following image, we can see a number of different advanced metrics, such as the F1 score. The F1 score is the harmonic mean of the precision and the recall of a classification model, and is computed using the following equation: F1 = 2 * (Precision * Recall) / (Precision + Recall).
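To make the metric concrete, the following Python sketch computes precision, recall, and F1 for each class from a list of actual and predicted labels, treating each class one-vs-rest. It is only an illustration of the formula, not how Canvas computes its metrics internally, and the small label lists are made up.

```python
def per_class_scores(actual, predicted):
    """Compute one-vs-rest precision, recall, and F1 for every class label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    scores = {}
    for label in sorted(set(actual) | set(predicted)):
        pairs = list(zip(actual, predicted))
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 = 2 * (Precision * Recall) / (Precision + Recall)
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Made-up labels for three facies classes.
actual = [1, 1, 2, 2, 3, 3]
predicted = [1, 2, 2, 2, 3, 1]
for label, metrics in per_class_scores(actual, predicted).items():
    print(label, metrics)
```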

Run batch and single prediction using the multi-class facies classification model

To run a prediction, complete the following steps:

Choose Single prediction to modify the feature values as needed, and get a facies classification returned on the right of the page.

You can then copy the prediction chart image to your clipboard, and also download the predictions into a CSV file.

Choose Batch prediction and then choose Select dataset to choose the validation dataset you previously imported.
Choose Generate predictions.

You’re redirected to the Predict page, where the Status will read Generating predictions for a few seconds.

After the predictions are returned, you can preview, download, or delete the predictions by choosing the options menu (three vertical dots) next to the predictions.

The following is an example of a predictions preview.

Share a trained model in Studio

You can now share the latest version of the model with another Studio user. This allows data scientists to review the model in detail, test it, make any changes that may improve accuracy, and share the updated model back with you.

The ability to share your work with a more technical user in Studio is a key feature of Canvas, because analysts and data scientists work in different ways. It supports collaboration between cross-functional teams with differing technical abilities.

Choose Share to share the model.
Choose which model version to share. (Note that you can only share one version per model.)
Enter the Studio user to share the model with.
Add an optional note.
Choose Share.

Conclusion

In this post, we showed how, with just a few clicks in Amazon SageMaker Canvas, you can import your data from Snowflake, train a multi-class classification model, analyze its estimated accuracy, verify which columns are impactful, and generate new individual or batch predictions. We’re excited to hear your feedback and help you solve even more business problems with ML. To build your own models, see Getting started with using Amazon SageMaker Canvas.

About the Authors

Nick McCarthy is a Machine Learning Engineer in the AWS Professional Services team. He has worked with AWS clients across various industries including healthcare, finance, sports, telecoms and energy to accelerate their business outcomes through the use of AI/ML. Working with the bpx data science team, Nick recently finished building bpx’s Machine Learning platform on Amazon SageMaker.

Thatcher Thornberry is a Machine Learning Engineer at bpx Energy. He supports bpx’s data scientists by developing and maintaining the company’s core Data Science platform in Amazon SageMaker. In his free time he loves to hack on personal coding projects and spend time outdoors with his wife.

Audit History

Last reviewed and updated in February 2024 by Nick McCarthy | Sr. Machine Learning Engineer
