In this post, we show how to configure a new OAuth-based authentication feature for using Snowflake in Amazon SageMaker Data Wrangler. Snowflake is a cloud data platform that provides data solutions from data warehousing to data science. Snowflake is an AWS Partner with multiple AWS accreditations, including AWS competencies in machine learning (ML), retail, and data and analytics.
Data Wrangler simplifies the data preparation and feature engineering process, reducing the time it takes from weeks to minutes by providing a single visual interface for data scientists to select and clean data, create features, and automate data preparation in ML workflows without writing any code. You can import data from multiple data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, Amazon EMR, and Snowflake. With this new feature, you can use your own identity provider (IdP), such as Okta, Azure AD, or Ping Federate, to connect to Snowflake via Data Wrangler.
Solution overview
In the following sections, we provide steps for an administrator to set up the IdP, Snowflake, and Studio. We also detail the steps that data scientists can take to configure the data flow, analyze the data quality, and add data transformations. Finally, we show how to export the data flow and train a model using SageMaker Autopilot.
Prerequisites
For this walkthrough, you should have the following prerequisites:
- For admin:
- A Snowflake user with permissions to create storage integrations and security integrations in Snowflake.
- An AWS account with permissions to create AWS Identity and Access Management (IAM) policies and roles.
- Access and permissions to configure the IdP to register the Data Wrangler application and set up the authorization server or API.
- For data scientist:
Administrator setup
Instead of having your users directly enter their Snowflake credentials into Data Wrangler, you can have them use an IdP to access Snowflake.
The following steps are involved to enable Data Wrangler OAuth access to Snowflake:
- Configure the IdP.
- Configure Snowflake.
- Configure SageMaker Studio.
Configure the IdP
To set up your IdP, you must register the Data Wrangler application and set up your authorization server or API.
Register the Data Wrangler application within the IdP
Refer to the following documentation for the IdPs that Data Wrangler supports:
Use the documentation provided by your IdP to register your Data Wrangler application. The information and procedures in this section help you understand how to properly use the documentation provided by your IdP.
Specific customizations in addition to the steps in the respective guides are called out in the subsections.
- Select the configuration that starts the process of registering Data Wrangler as an application.
- Provide the users within the IdP access to Data Wrangler.
- Enable OAuth client authentication by storing the client credentials as a Secrets Manager secret.
- Specify a redirect URL using the following format:
https://domain-ID.studio.AWS-Region.sagemaker.aws/jupyter/default/lab
You're specifying the SageMaker domain ID and AWS Region that you're using to run Data Wrangler. You must register a URL for each domain and Region where you're running Data Wrangler. Users from a domain and Region that don't have redirect URLs set up for them won't be able to authenticate with the IdP to access the Snowflake connection.
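As a quick illustration of the URL format above, the redirect URL can be assembled from the domain ID and Region; the values below are hypothetical placeholders, not taken from this post:

```python
# Minimal sketch: building the Data Wrangler redirect URL to register with
# your IdP. Both values below are hypothetical placeholders.
domain_id = "d-xxxxxxxxxxxx"   # your SageMaker domain ID
region = "us-east-1"           # the Region where you run Data Wrangler

redirect_url = f"https://{domain_id}.studio.{region}.sagemaker.aws/jupyter/default/lab"
print(redirect_url)
# → https://d-xxxxxxxxxxxx.studio.us-east-1.sagemaker.aws/jupyter/default/lab
```

Register one such URL per domain and Region combination you use.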
- Make sure that the authorization code and refresh token grant types are allowed for your Data Wrangler application.
Set up the authorization server or API within the IdP
Within your IdP, you must set up an authorization server or an application programming interface (API). For each user, the authorization server or the API sends tokens to Data Wrangler with Snowflake as the audience.
Snowflake uses the concept of roles, which are distinct from the IAM roles used in AWS. You must configure the IdP to use ANY Role to use the default role associated with the Snowflake account. For example, if a user has systems administrator as the default role in their Snowflake profile, the connection from Data Wrangler to Snowflake uses systems administrator as the role.
Use the following procedure to set up the authorization server or API within your IdP:
- From your IdP, begin the process of setting up the server or API.
- Configure the authorization server to use the authorization code and refresh token grant types.
- Specify the lifetime of the access token.
- Set the refresh token idle timeout.
The idle timeout is the time after which the refresh token expires if it's not used. If you're scheduling jobs in Data Wrangler, we recommend making the idle timeout greater than the frequency of the processing job. Otherwise, some processing jobs might fail because the refresh token expired before they could run. When the refresh token expires, the user must reauthenticate by accessing the connection that they've made to Snowflake through Data Wrangler.
Note that Data Wrangler doesn't support rotating refresh tokens. Using rotating refresh tokens might result in access failures or users needing to log in frequently.
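A quick sanity check of the idle-timeout recommendation above, with hypothetical values: if a processing job runs every 60 minutes, the refresh token idle timeout must exceed 60 minutes, or the token can expire between runs.

```python
# Hypothetical values illustrating the recommendation: the refresh token
# idle timeout must exceed the interval between scheduled processing jobs.
job_frequency_minutes = 60        # job scheduled to run hourly
refresh_token_idle_timeout = 90   # IdP idle timeout setting, in minutes

# If this check fails, scheduled jobs may start failing once the refresh
# token expires between runs and the user must reauthenticate.
assert refresh_token_idle_timeout > job_frequency_minutes
print("idle timeout is safe for this schedule")
```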
- Specify `session:role-any` as the new scope.
For Azure AD, you must also specify a unique identifier for the scope.
After you've set up the OAuth provider, you provide Data Wrangler with the information it needs to connect to the provider. You can use the documentation from your IdP to get values for the following fields:
- Token URL – The URL of the token that the IdP sends to Data Wrangler
- Authorization URL – The URL of the authorization server of the IdP
- Client ID – The ID of the IdP
- Client secret – The secret that only the authorization server or API recognizes
- OAuth scope – This is for Azure AD only
Configure Snowflake
To configure Snowflake, complete the instructions in Import data from Snowflake.
Use the Snowflake documentation for your IdP to set up an external OAuth integration in Snowflake. See the previous section Register the Data Wrangler application within the IdP for more information on how to set up an external OAuth integration.
When you're setting up the security integration in Snowflake, make sure you turn on `external_oauth_any_role_mode`.
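The post doesn't reproduce the integration DDL, so the following is a hedged sketch of what an external OAuth security integration for Okta might look like; every name and URL is a hypothetical placeholder to adapt from the Snowflake documentation, and the key point is that `EXTERNAL_OAUTH_ANY_ROLE_MODE` is set to `'ENABLE'`:

```python
# Hedged sketch of the Snowflake security integration DDL for external OAuth.
# All identifiers and URLs are hypothetical placeholders; follow the Snowflake
# documentation for your IdP to fill in real values, then run this in Snowflake.
create_integration_sql = """
CREATE SECURITY INTEGRATION dw_okta_oauth
  TYPE = EXTERNAL_OAUTH
  ENABLED = TRUE
  EXTERNAL_OAUTH_TYPE = OKTA
  EXTERNAL_OAUTH_ISSUER = 'https://example.okta.com/oauth2/example-id'
  EXTERNAL_OAUTH_JWS_KEYS_URL = 'https://example.okta.com/oauth2/example-id/v1/keys'
  EXTERNAL_OAUTH_AUDIENCE_LIST = ('https://example.snowflakecomputing.com')
  EXTERNAL_OAUTH_TOKEN_USER_MAPPING_CLAIM = 'sub'
  EXTERNAL_OAUTH_SNOWFLAKE_USER_MAPPING_ATTRIBUTE = 'login_name'
  EXTERNAL_OAUTH_ANY_ROLE_MODE = 'ENABLE'
"""

# The requirement called out in this post: any-role mode must be turned on.
assert "EXTERNAL_OAUTH_ANY_ROLE_MODE = 'ENABLE'" in create_integration_sql
print("integration DDL prepared")
```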
Configure SageMaker Studio
You store the fields and values in a Secrets Manager secret and add it to the Studio Lifecycle Configuration that you're using for Data Wrangler. A Lifecycle Configuration is a shell script that automatically loads the credentials stored in the secret when the user logs in to Studio. For information about creating secrets, see Move hardcoded secrets to AWS Secrets Manager. For information about using Lifecycle Configurations in Studio, see Use Lifecycle Configurations with Amazon SageMaker Studio.
Create a secret for Snowflake credentials
To create your secret for Snowflake credentials, complete the following steps:
- On the Secrets Manager console, choose Store a new secret.
- For Secret type, select Other type of secret.
- Specify the details of your secret as key-value pairs.
Key names require lowercase letters due to case sensitivity. Data Wrangler gives a warning if you enter any of these incorrectly. Enter the secret values as key-value pairs using the Key/value option if you'd like, or use the Plaintext option.
The following is the format of the secret used for Okta. If you are using Azure AD, you must also add the `datasource_oauth_scope` field.
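The original snippet isn't reproduced here, so the following is a hedged sketch of an Okta-style secret body; the key names are an assumption based on the fields listed earlier, and every value is a hypothetical placeholder:

```python
import json

# Hedged sketch of the Okta-style secret body. Key names are lowercase, as the
# post requires; all values are hypothetical placeholders for your own IdP.
secret_value = {
    "token_url": "https://example.okta.com/oauth2/example-id/v1/token",
    "authorization_url": "https://example.okta.com/oauth2/example-id/v1/authorize",
    "client_id": "example-client-id",
    "client_secret": "example-client-secret",
    "identity_provider": "OKTA",
    # For Azure AD, also add: "datasource_oauth_scope": "api://example-scope"
}

secret_string = json.dumps(secret_value)
print(secret_string)
```

This JSON string is what you would paste via the Plaintext option, or enter field by field via the Key/value option.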
- Update the preceding values with your choice of IdP and information gathered after application registration.
- Choose Next.
- For Secret name, add the prefix `AmazonSageMaker` (for example, our secret is `AmazonSageMaker-DataWranglerSnowflakeCreds`).
- In the Tags section, add a tag with the key `SageMaker` and value `true`.
- Choose Next.
- The rest of the fields are optional; choose Next until you have the option to choose Store to store the secret.
After you store the secret, you're returned to the Secrets Manager console.
- Choose the secret you just created, then retrieve the secret ARN.
- Store this in your preferred text editor for use later when you create the Data Wrangler data source.
Create a Studio Lifecycle Configuration
To create a Lifecycle Configuration in Studio, complete the following steps:
- On the SageMaker console, choose Lifecycle configurations in the navigation pane.
- Choose Create configuration.
- Choose Jupyter server app.
- Create a new lifecycle configuration or append an existing one with the following content:
The configuration creates a file with the name `.snowflake_identity_provider_oauth_config`, containing the secret, in the user's home folder.
- Choose Create Configuration.
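The script body isn't reproduced in this post, so the following is a hedged sketch of what such a Lifecycle Configuration might contain, shown here being prepared for registration via the SageMaker API (the secret ARN and the exact JSON the script writes are assumptions; substitute the ARN you saved earlier):

```python
import base64

# Hedged sketch: the Lifecycle Configuration is a shell script that writes the
# secret ARN into ~/.snowflake_identity_provider_oauth_config at login. The
# ARN and the JSON shape are hypothetical placeholders.
lifecycle_script = """#!/bin/bash
set -eux
cat > ~/.snowflake_identity_provider_oauth_config <<EOL
{
  "secret_arn": "arn:aws:secretsmanager:us-east-1:111122223333:secret:AmazonSageMaker-DataWranglerSnowflakeCreds-abc123"
}
EOL
"""

# If you register the configuration via boto3 rather than the console, the
# API expects the script base64-encoded, roughly like:
#   sagemaker.create_studio_lifecycle_config(
#       StudioLifecycleConfigName="snowflake-oauth-config",
#       StudioLifecycleConfigContent=encoded,
#       StudioLifecycleConfigAppType="JupyterServer",
#   )
encoded = base64.b64encode(lifecycle_script.encode()).decode()
print(encoded[:40], "...")
```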
Set the default Lifecycle Configuration
Complete the following steps to set the Lifecycle Configuration you just created as the default:
- On the SageMaker console, choose Domains in the navigation pane.
- Choose the Studio domain you'll be using for this example.
- On the Environment tab, in the Lifecycle configurations for personal Studio apps section, choose Attach.
- For Source, select Existing configuration.
- Select the configuration you just made, then choose Attach to domain.
- Select the new configuration and choose Set as default, then choose Set as default again in the pop-up message.
Your new settings should now be visible under Lifecycle configurations for personal Studio apps as default.
- Shut down the Studio app and relaunch it for the changes to take effect.
Data scientist experience
In this section, we cover how data scientists can connect to Snowflake as a data source in Data Wrangler and prepare data for ML.
Create a new data flow
To create your data flow, complete the following steps:
- On the SageMaker console, choose Amazon SageMaker Studio in the navigation pane.
- Choose Open Studio.
- On the Studio Home page, choose Import & prepare data visually. Alternatively, on the File drop-down, choose New, then choose SageMaker Data Wrangler Flow.
Creating a new flow can take a few minutes.
- On the Import data page, choose Create connection.
- Choose Snowflake from the list of data sources.
- For Authentication method, choose OAuth.
If you don't see OAuth, verify the preceding Lifecycle Configuration steps.
- Enter details for Snowflake account name and Storage integration.
- Enter a connection name and choose Connect.
You're redirected to an IdP authentication page. For this example, we're using Okta.
- Enter your user name and password, then choose Sign in.
After the authentication is successful, you're redirected to the Studio data flow page.
- On the Import data from Snowflake page, browse the database objects, or run a query for the targeted data.
- In the query editor, enter a query and preview the results.
In the following example, we load the Loan Data dataset and retrieve all columns from 5,000 rows.
- Choose Import.
- Enter a dataset name (for this post, we use `snowflake_loan_dataset`) and choose Add.
You're redirected to the Prepare page, where you can add transformations and analyses to the data.
Data Wrangler makes it easy to ingest data and perform data preparation tasks such as exploratory data analysis, feature selection, and feature engineering. We've only covered a few of the capabilities of Data Wrangler in this post on data preparation; you can use Data Wrangler for more advanced data analysis such as feature importance, target leakage, and model explainability using an easy and intuitive user interface.
Analyze data quality
Use the Data Quality and Insights Report to perform an analysis of the data that you've imported into Data Wrangler. Data Wrangler creates the report from the sampled data.
- On the Data Wrangler flow page, choose the plus sign next to Data types, then choose Get data insights.
- Choose Data Quality And Insights Report for Analysis type.
- For Target column, choose your target column.
- For Problem type, select Classification.
- Choose Create.
The insights report has a brief summary of the data, which includes general information such as missing values, invalid values, feature types, outlier counts, and more. You can either download the report or view it online.
Add transformations to the data
Data Wrangler has over 300 built-in transformations. In this section, we use some of these transformations to prepare the dataset for an ML model.
- On the Data Wrangler flow page, choose the plus sign, then choose Add transform.
If you're following the steps in this post, you're directed here automatically after adding your dataset.
- Verify and modify the data type of the columns.
Looking through the columns, we identify that `MNTHS_SINCE_LAST_DELINQ` and `MNTHS_SINCE_LAST_RECORD` should most likely be represented as a number type rather than a string.
- After applying the changes and adding the step, you can verify the column data type is changed to float.
Looking through the data, we can see that the fields `EMP_TITLE`, `URL`, `DESCRIPTION`, and `TITLE` will likely not provide value to our model in our use case, so we can drop them.
- Choose Add Step, then choose Manage columns.
- For Transform, choose Drop column.
- For Column to drop, specify `EMP_TITLE`, `URL`, `DESCRIPTION`, and `TITLE`.
- Choose Preview and Add.
Next, we want to look for categorical data in our dataset. Data Wrangler has built-in functionality to encode categorical data using both ordinal and one-hot encodings. Looking at our dataset, we can see that the `TERM`, `HOME_OWNERSHIP`, and `PURPOSE` columns all appear to be categorical in nature.
- Add another step and choose Encode categorical.
- For Transform, choose One-hot encode.
- For Input column, choose `TERM`.
- For Output style, choose Columns.
- Leave all other settings as default, then choose Preview and Add.
The `HOME_OWNERSHIP` column has four possible values: `RENT`, `MORTGAGE`, `OWN`, and other.
- Repeat the preceding steps to apply a one-hot encoding approach on these values.
Finally, the `PURPOSE` column has several possible values. For this data, we use a one-hot encoding approach as well, but we set the output to a vector rather than columns.
- For Transform, choose One-hot encode.
- For Input column, choose `PURPOSE`.
- For Output style, choose Vector.
- For Output column, we call this column `PURPOSE_VCTR`.
This keeps the original `PURPOSE` column, in case we decide to use it later.
- Leave all other settings as default, then choose Preview and Add.
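The steps above are performed in the Data Wrangler UI without code; for intuition, a rough pandas equivalent of the drop and one-hot column transforms looks like this (column names are from the post, but the sample rows are hypothetical):

```python
import pandas as pd

# Hypothetical sample standing in for a few of the Loan Data columns.
df = pd.DataFrame({
    "TERM": ["36 months", "60 months", "36 months"],
    "HOME_OWNERSHIP": ["RENT", "MORTGAGE", "OWN"],
    "EMP_TITLE": ["engineer", "teacher", "nurse"],
})

# Drop a column unlikely to add value to the model (the Manage columns step).
df = df.drop(columns=["EMP_TITLE"])

# One-hot encode the categorical columns (the Encode categorical step).
encoded = pd.get_dummies(df, columns=["TERM", "HOME_OWNERSHIP"])
print(sorted(encoded.columns))
```

Data Wrangler's Vector output style has no direct `get_dummies` analogue; it packs the encoded values into a single vector column instead of one column per category.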
Export the data flow
Finally, we export this entire data flow to a feature store with a SageMaker Processing job, which creates a Jupyter notebook with the code pre-populated.
- On the data flow page, choose the plus sign and Export to.
- Choose where to export. For our use case, we choose SageMaker Feature Store.
The exported notebook is now ready to run.
Export data and train a model with Autopilot
Now we can train the model using Amazon SageMaker Autopilot.
- On the data flow page, choose the Training tab.
- For Amazon S3 location, enter a location for the data to be saved.
- Choose Export and train.
- Specify the settings in the Target and features, Training method, Deployment and advanced settings, and Review and create sections.
- Choose Create experiment to find the best model for your problem.
Clean up
If your work with Data Wrangler is complete, shut down your Data Wrangler instance to avoid incurring additional charges.
Conclusion
In this post, we demonstrated connecting Data Wrangler to Snowflake using OAuth, transforming and analyzing a dataset, and finally exporting it to the data flow so that it could be used in a Jupyter notebook. Most notably, we created a pipeline for data preparation without having to write any code at all.
To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler.
About the authors
Ajjay Govindaram is a Senior Solutions Architect at AWS. He works with strategic customers who are using AI/ML to solve complex business problems. His experience lies in providing technical direction as well as design assistance for modest to large-scale AI/ML application deployments. His knowledge ranges from application architecture to big data, analytics, and machine learning. He enjoys listening to music while resting, experiencing the outdoors, and spending time with his loved ones.
Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped large technology companies design data analytics solutions and has led engineering teams in designing and implementing data analytics platforms and data products.
Matt Marzillo is a Sr. Partner Sales Engineer at Snowflake. He has 10 years of experience in data science and machine learning roles both in consulting and with industry organizations. Matt has experience developing and deploying AI and ML models across many different organizations in areas such as marketing, sales, operations, clinical, and finance, as well as advising in consultative roles.
Huong Nguyen is a product leader for Amazon SageMaker Data Wrangler at AWS. She has 15 years of experience creating customer-obsessed and data-driven products for both enterprise and consumer spaces. In her spare time, she enjoys audiobooks, gardening, hiking, and spending time with her family and friends.