By Younes Soliman
03-Jan-2021
Snowflake has firmly put itself on the map of cloud data warehouses, even though other analytics data management solutions (data warehouses) that used to be on-prem have also gone the cloud way. On the Magic Quadrant we can see MS SQL (or Azure Synapse), Oracle, SAP, Teradata, Amazon Redshift, Google BigQuery and others (names are not in any particular order), with Snowflake appearing as one of the challengers. So what makes Snowflake attractive to some companies? Perhaps its flexibility in ingesting both structured and semi-structured data, its connectivity with many of the tools commonly used in the market, its SQL compatibility while also supporting Python, Node.js, JavaScript and other languages, and maybe other reasons. Other solutions might have these features as well; I'm not making a comparison here. The purpose is simply to describe one of the solutions that could be used for data loading.
If you are using Snowflake and don't want to use a third-party tool for data integration (ELT/ETL), here is one ELT approach you can implement:
Let's suppose you want to integrate source data from different on-prem or cloud databases. The non-real-time solution can be divided into the following steps.
1- Scheduled export of new or changed data into one of the cloud storage solutions such as AWS S3, Azure or Google Cloud Storage, or even into a Snowflake staging area. You can create a stage in Snowflake and use it for loading data files. Depending on the use case and loading mechanism, you can choose between the different types of stages that Snowflake provides: https://docs.snowflake.com/en/user-guide/data-load-local-file-system-create-stage.html
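As a rough sketch (the stage name, file format options and connection parameters below are placeholders, not taken from any specific project), a named internal stage could be created through the Python connector like this:
import snowflake.connector

# Minimal sketch: create a named internal stage for CSV files.
# The connection parameters and object names are placeholders.
con = snowflake.connector.connect(
    user='XXXX',
    password='XXXX',
    account='XXXX'
)
con.cursor().execute("""
    CREATE STAGE IF NOT EXISTS db.schema.my_csv_stage
      FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"')
""")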
To upload data:
1.a You could use SnowSQL installed on a VM or an EC2 instance, or even a Lambda function that extracts and uploads the data, or any similar mechanism. Below is an example of the SnowSQL connect and PUT commands:
$ snowsql -a <accountName> -u <userName>
-- then, inside the SnowSQL session:
PUT file:///folder/abc*.csv @db.schema.%abc;
1.b You could also use a Python script that uses the Snowflake Python connector to put the files into the stage.
import snowflake.connector

# Connect to Snowflake (replace the placeholders with your credentials).
con = snowflake.connector.connect(
    user='XXXX',
    password='XXXX',
    account='XXXX',
    session_parameters={
        'QUERY_TAG': 'EndOfMonthFinancials',
    }
)

# Upload the local CSV files to the table stage of table abc.
con.cursor().execute("PUT file:///folder/abc*.csv @db.schema.%abc;")
2- After ensuring that the files have been uploaded to the stage or to cloud storage, you can use one of several ways to copy the data into Snowflake tables.
2.a- You can create what is called a Snowpipe, which automatically picks up uploaded files matching specified criteria and loads them into the specified tables. Keep in mind that Snowpipe might incur a higher cost and is mostly used for (near) real-time data. Snowpipe can watch for data in table stages, internal Snowflake stages, or AWS/Google/Azure cloud storage and load the data from there.
Here are more details about Snowpipe:
https://docs.snowflake.com/en/user-guide/data-load-snowpipe-intro.html
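As a minimal sketch (the credentials, pipe, stage and table names are placeholders, and the COPY options simply mirror the example further below), a pipe can be defined with a CREATE PIPE statement issued through the Python connector:
import snowflake.connector

# Sketch only: the credentials, pipe, stage and table names are placeholders.
con = snowflake.connector.connect(user='XXXX', password='XXXX', account='XXXX')

# Wrap the COPY statement in a pipe; without AUTO_INGEST, loads for this
# pipe are triggered by calling the Snowpipe REST API.
con.cursor().execute("""
    CREATE PIPE IF NOT EXISTS db.schema.abc_pipe AS
      COPY INTO db.schema.abc
      FROM @db.schema.my_csv_stage
      FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"')
""")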
2.b- Instead of Snowpipe, you could use the COPY INTO command from SnowSQL to load specified files into the specified tables.
Here is an example of the COPY INTO command:
COPY INTO abc
FROM @%abc
FILE_FORMAT = (type = csv field_optionally_enclosed_by='"')
PATTERN = '.*abc[1-5].csv.gz'
ON_ERROR = 'skip_file';
In addition to SnowSQL, you could use a Python script with the Snowflake connector to load the data from the files into the table.
# Run the same COPY INTO statement through the Python connector.
con.cursor().execute("""
    COPY INTO abc
    FROM @%abc
    FILE_FORMAT = (type = csv field_optionally_enclosed_by='"')
    PATTERN = '.*abc[1-5].csv.gz'
    ON_ERROR = 'skip_file'
""")
3- Data transformation can take place after the loading, using either:
3.a Snowflake stored procedures with SQL/JavaScript,
3.b or another connector from Python or any other supported language (a minimal sketch follows below).
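As a minimal sketch of option 3.b (the credentials, the tables abc and abc_curated, and the transformation itself are made up for illustration), a Python script could run a plain SQL transformation from the loaded table into a curated table:
import snowflake.connector

# Sketch only: credentials and the tables abc / abc_curated are placeholders,
# and the transformation is just an illustration.
con = snowflake.connector.connect(user='XXXX', password='XXXX', account='XXXX')

# Example transformation: deduplicate and cast while moving rows
# from the raw loaded table into a curated table.
con.cursor().execute("""
    INSERT INTO abc_curated (id, amount, created_at)
    SELECT DISTINCT id, TO_NUMBER(amount), TO_TIMESTAMP(created_at)
    FROM abc
""")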
So, using either SnowSQL or an automated Python script, you can combine the PUT and COPY commands into a single scheduled script. Note that COPY performs bulk loading from either compressed or uncompressed files and is a fast loading option.
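To illustrate, here is a rough sketch of such a combined script (the paths, object names and credentials are placeholders); it could be scheduled with cron, Task Scheduler, or a similar tool:
import snowflake.connector

# Sketch only: replace the placeholders with your own credentials, paths and object names.
con = snowflake.connector.connect(user='XXXX', password='XXXX', account='XXXX')
cur = con.cursor()

try:
    # 1- Upload the exported CSV files to the table stage of table abc.
    cur.execute("PUT file:///folder/abc*.csv @db.schema.%abc")

    # 2- Bulk load the staged files into the table.
    cur.execute("""
        COPY INTO db.schema.abc
        FROM @db.schema.%abc
        FILE_FORMAT = (type = csv field_optionally_enclosed_by='"')
        PATTERN = '.*abc[1-5].csv.gz'
        ON_ERROR = 'skip_file'
    """)
finally:
    cur.close()
    con.close()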
For near real-time data integration, let's say from social media, one solution could be to stream the data into AWS Kinesis Data Firehose (or any similar service) with S3 as the destination. Snowpipe will then take care of the data in the cloud and load it into the corresponding tables in near real time.
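As a hedged sketch of the Snowflake side of that setup (the credentials, the storage integration my_s3_integration, the bucket path and the object names are assumptions, and the S3 event notifications still have to be pointed at the pipe's notification queue), an external stage over the Firehose destination bucket plus an auto-ingest pipe could look like this:
import snowflake.connector

# Sketch only: credentials, my_s3_integration, the bucket path and the
# object names are placeholders.
con = snowflake.connector.connect(user='XXXX', password='XXXX', account='XXXX')
cur = con.cursor()

# External stage over the S3 prefix that Kinesis Data Firehose writes to
# (assumes a storage integration named my_s3_integration already exists).
cur.execute("""
    CREATE STAGE IF NOT EXISTS db.schema.firehose_stage
      URL = 's3://my-firehose-bucket/social/'
      STORAGE_INTEGRATION = my_s3_integration
""")

# Auto-ingest pipe: loads new files as soon as S3 notifies Snowflake about them.
# Assumes db.schema.social_events has a single VARIANT column for the raw JSON.
cur.execute("""
    CREATE PIPE IF NOT EXISTS db.schema.social_pipe
      AUTO_INGEST = TRUE
      AS
      COPY INTO db.schema.social_events
      FROM @db.schema.firehose_stage
      FILE_FORMAT = (TYPE = JSON)
""")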
The options listed above are for those who, as mentioned, don't want to use a third-party tool and prefer open-source or scripted methods instead. However, Snowflake does support third-party tools such as Talend, Informatica, Alteryx and many others.