Working at a big retail client in the UK, we had a problem: we had some very sensitive data in S3 buckets, and we needed to move that data into Snowflake without a person ever having access to, or being able to see, that data.

Many people drew big diagrams involving multiple Lambdas, SQS, SNS and other AWS services to make this work. However, we started looking at a new feature called “Auto Ingest”.

This is how we set it up.

Setup

The setup for this may seem straightforward; however, there are some constraints within our organisation’s configuration that make it somewhat more complicated. This guide therefore details how to set up an Auto Ingest pipeline.

Firstly, it’s important to conceptualise how this works. In this setup, we are actually using the S3 bucket as a staging area. It’s referred to as an “external stage”.

We then create SQS notifications from that bucket, which trigger a pipe to copy new files into a table.

Table

To create the table, we first need to run:

USE SCHEMA GDPR_DATA;

CREATE OR REPLACE TABLE TestGDPRTable (
    DATA variant
);

You can then confirm the table was created with

SELECT * FROM GDPR_DATA.TestGDPRTable;

Stage

We then need to create the stage. This is essentially telling Snowflake to use an S3 bucket as a stage:

CREATE OR REPLACE STAGE TestGDPRStage
    url = 's3://gdpr-data-holding/'
    credentials = (aws_role = 'arn:aws:iam::0000:role/snowflake_access_role')
    encryption = (type = 'AWS_SSE_KMS' kms_key_id = 'aws/key');

The ARN here is for a role in the AWS account containing the bucket. The role doesn’t have to exist yet. You can confirm the stage was created with

SHOW STAGES;

Note this stage won’t actually work until we have configured security later on. If you are unable to create it at this point, wait until you get to the “Create A Role” section.

Pipe

We now create a pipe that transfers data from the stage to the table:

CREATE OR REPLACE PIPE GDPR_DATA.TestGDPRPipe auto_ingest = true as
    copy into GDPR_DATA.TestGDPRTable
    from @GDPR_DATA.TestGDPRStage
    file_format = (type = 'JSON');

As with the stage and table, we can confirm the pipe has been created with:

SHOW PIPES;

S3 Notification

At this point we need to create a notification on the S3 bucket that targets the SQS queue belonging to the aforementioned Snowflake pipe. To do this, log into AWS, go to your S3 bucket and, under Properties, configure an Event Notification with the following settings (a boto3 sketch of the same call follows the list):

  • Name: a name for the event notification (e.g. Auto-ingest-UseCaseName).
  • Events: select the ObjectCreate (All) option - we will need to use the prefix processed_<usecaseName>.
  • Send to: select SQS Queue from the dropdown list.
  • SQS: select Add SQS queue ARN from the dropdown list.
  • SQS queue ARN: paste the SQS queue ARN (the notification_channel value) from the SHOW PIPES output.
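If you would rather script this step than click through the console, the same configuration can be applied with boto3. This is only a minimal sketch, not our exact setup: the bucket name, notification name and prefix are the examples from above, and the queue ARN is a placeholder you must replace with the notification_channel value from SHOW PIPES.

import boto3

s3 = boto3.client("s3")

# WARNING: this call replaces the bucket's entire notification configuration,
# so merge in any existing rules before running it.
s3.put_bucket_notification_configuration(
    Bucket="gdpr-data-holding",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "Id": "Auto-ingest-UseCaseName",
                # Replace with the notification_channel ARN from SHOW PIPES
                "QueueArn": "arn:aws:sqs:<region>:<account>:sf-snowpipe-placeholder",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "processed_usecaseName/"}
                        ]
                    }
                },
            }
        ]
    },
)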

Security

Theoretically, we could follow Snowflake’s guide to setting up security (https://docs.snowflake.net/manuals/user-guide/data-load-s3-config.html). However, this doesn’t fully work for us, as even users with admin access on an account in our organisation do not have the iam:UpdateAssumeRolePolicy permission.

Therefore we must do the following, inside the AWS account with the bucket you are trying to use:

Create a Policy

Under IAM create a policy named snowflake_access with this JSON policy document:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
              "s3:PutObject",
              "s3:GetObject",
              "s3:GetObjectVersion",
              "s3:DeleteObject",
              "s3:DeleteObjectVersion"
            ],
            "Resource": "arn:aws:s3:::<bucket_name>/<prefix>/*"
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::<bucket_name>",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "mydata/*"
                    ]
                }
            }
        }
    ]
}


If you don’t use a prefix, you can omit the prefix from the resource name and completely remove this section

"Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "mydata/*"
                    ]
                }
            }
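
If you have permission to create policies programmatically, the same policy can be created with boto3 rather than the console. This is just a sketch, shown here without a prefix to match the gdpr-data-holding example bucket; substitute your own bucket name (and prefix, if you use one).

import json
import boto3

iam = boto3.client("iam")

# The same policy document as above, with the placeholders filled in
# for the example bucket (no prefix).
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion",
            ],
            "Resource": "arn:aws:s3:::gdpr-data-holding/*",
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::gdpr-data-holding",
        },
    ],
}

response = iam.create_policy(
    PolicyName="snowflake_access",
    PolicyDocument=json.dumps(policy_document),
)
print(response["Policy"]["Arn"])  # needed when attaching the policy to the role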

Create A Role

Now we create a role for the Snowflake user (a scripted equivalent is sketched after this list).

  • The type of trusted entity should be “Another AWS account”, using Snowflake’s account ID: 520339422657
  • Select the option to require an external ID and, for now, use: 0000
  • When attaching a policy, ensure you attach the policy created in the step above.
  • The name you use must match the role name set in the stage above, so in this case “snowflake_access_role” will generate a role with this ARN:
arn:aws:iam::0000:role/snowflake_access_role
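
The scripted equivalent of this step, again only a sketch: the trust policy below is what the console wizard generates for “Another AWS account” with an external ID required, and the policy ARN is assumed to come from the create_policy call (or the console) above.

import json
import boto3

iam = boto3.client("iam")

# Trust policy equivalent to the console wizard's "Another AWS account"
# option with an external ID required (0000 is the temporary value).
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::520339422657:root"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": "0000"}},
        }
    ],
}

iam.create_role(
    RoleName="snowflake_access_role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the snowflake_access policy created earlier (replace the account ID)
iam.attach_role_policy(
    RoleName="snowflake_access_role",
    PolicyArn="arn:aws:iam::<account_id>:policy/snowflake_access",
)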

At this point if you were unable to create a stage earlier on, run the SQL for CREATE STAGE.

Update Role

Now we must update the role with details from the stage, which we can see with:

desc stage TestGDPRStage;

You would now normally edit the Trust Relationship of the role mentioned above, setting the values of:

  • sts:ExternalId to the value listed as AWS_EXTERNAL_ID in the DESC STAGE output
  • AWS under Principal to the value of SNOWFLAKE_IAM_USER

However, even with admin access you won’t be able to do this. The best way around it is to create a new role, this time replacing the external ID of 0000 with the value of AWS_EXTERNAL_ID.

According to the guide, we should also set the AWS principal to the SNOWFLAKE_IAM_USER value; however, you can’t do this using the wizard, so the principal is granted to the Snowflake account root instead. This will still work. You will then have a trust relationship policy document like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::520339422657:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "ORG_SFCRole=123_xxx"
        }
      }
    }
  ]
}
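
For reference, if your account does grant iam:UpdateAssumeRolePolicy, the trust-relationship edit the Snowflake guide describes can be applied directly instead of recreating the role. A sketch only, with both values copied from the DESC STAGE output (the ARN and external ID below are placeholders):

import json
import boto3

iam = boto3.client("iam")

# Values taken from the DESC STAGE output (placeholders shown here)
snowflake_iam_user = "arn:aws:iam::520339422657:user/xxxx"  # SNOWFLAKE_IAM_USER
aws_external_id = "ORG_SFCRole=123_xxx"                     # AWS_EXTERNAL_ID

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": snowflake_iam_user},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": aws_external_id}},
        }
    ],
}

iam.update_assume_role_policy(
    RoleName="snowflake_access_role",
    PolicyDocument=json.dumps(trust_policy),
)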

Finally

At this point you should be good to go. You may need to re-run the create pipe command, as you have recreated your stage.

Note that you SHOULD NOT re-create your stage, or you will have to go through the process of updating the role’s trust relationship again with the new output from that stage.

You can attempt to list all the files in your S3 bucket:

list @TestGDPRStage;

You can view the status of your pipe with:

SELECT SYSTEM$PIPE_STATUS( 'GDPR_DATA.TestGDPRPipe' );

The pipe will only pull files into the table from your stage if there is an SQS Notification. This means that at the point of creating all this, any files already in the bucket won’t be automatically pulled into a table. You can, however, run a command to start the process off for existing files using:

ALTER PIPE GDPR_DATA.TestGDPRPipe REFRESH;

In my experience, it can take a fairly long time (1+ hour) for even small files to appear in the table. After waiting, you can check that they have arrived with:

SELECT * FROM GDPR_DATA.TestGDPRTable;

Good luck, and please let me know if this helped!

Additional Info

For our work, we simply wanted to get the JSON data from a file in S3 into a single column of a table. At this point we’d grant another team access to the table we created, so they can move the data into whatever table/schema/format they wish using Snowflake’s PARSE_JSON() method. A rough sketch of what that downstream step might look like is below.
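
This is purely illustrative: the connection details, target key customer_id and the query are made up for the example. Since the pipe loads the files into a VARIANT column, the other team can query JSON paths on it directly; PARSE_JSON() is only needed where the raw JSON arrives as a plain string.

import snowflake.connector

# Connection details are placeholders for this example
conn = snowflake.connector.connect(
    user="<user>",
    password="<password>",
    account="<account>",
    database="<database>",
    schema="GDPR_DATA",
)

cur = conn.cursor()

# DATA is already a VARIANT, so JSON paths can be queried directly;
# customer_id is a hypothetical key used purely for illustration.
cur.execute(
    "SELECT DATA:customer_id::string FROM GDPR_DATA.TestGDPRTable"
)
for (customer_id,) in cur.fetchall():
    print(customer_id)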

