Snowflake Zero-Copy Cloning: Accelerating the Cloning Process
Consider a real-world scenario where an ETL/ELT production database is synchronized from its source every 15 minutes, and the business needs an exact, read-only replica of that database as of a specific point in time (using Time Travel), several times over, for reporting or analytical purposes. Such a replica is referred to as a clone of the original database.
To understand more about Zero-Copy Cloning, click here.
Now, with the above scenario in mind, let's consider a sample database consisting of 20 schemas and 50 tables. Cloning a database of this kind can sometimes take a while (20+ minutes), depending on the size of the objects in it.
So in this blog we will explore a way to improve the performance of this cloning process. By combining Python and Snowflake SQL, we can clone the individual schemas concurrently, so that all the schemas are cloned by multiple parallel tasks. Let's see this in action.
Note: For this demo I have used a small database with a few tables in it. In the real world, the overall size and number of objects will be larger, which usually makes creating the clones take much longer.
Pre-requisites
This article assumes that the following objects are already configured:
— Snowflake account with the required access/grants to perform the clone.
— Python version 3.8 or later.
— The Python package “snowflake-connector-python”. Click here
Let's perform this cloning process in two different ways and compare the results.
1. Consecutive sessions.
2. Concurrent sessions.
1. Consecutive sessions
In this method we clone the complete database, with its 20 schemas and 50 tables, in a single statement. Run the code below from the Snowflake UI. This is the usual way of creating a complete database clone in Snowflake.
create database PROD_DB_WITHOUT_CONCURRENT_SESSION clone DEV_DB;
The above code took 11m 15s to complete.
Essentially, Snowflake creates the copy of the original database and all the objects in it (schemas, tables, etc.) in a sequential manner.
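If you want to verify the elapsed time yourself rather than reading it from the UI, one way (a sketch — the one-hour window and the query-text filter are my own choices) is to query Snowflake's QUERY_HISTORY table function:

```sql
-- Find the clone statement and its elapsed time (seconds) in the last hour.
select query_text,
       total_elapsed_time / 1000 as elapsed_seconds
from table(information_schema.query_history(
    end_time_range_start => dateadd('hour', -1, current_timestamp())))
where query_text ilike 'create database PROD_DB_WITHOUT_CONCURRENT_SESSION%'
order by start_time desc;
```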
2. Concurrent sessions
In this method we clone all the schemas concurrently by making asynchronous calls to Snowflake, executing the cloning SQL statements with the connector's execute_async method.
Create a Python script file with the code below and run it on a server or machine where Python is installed and a connection to Snowflake can be established.
Sometimes, for security purposes, Snowflake connections are restricted to a limited set of network IDs or machines; in that case this access needs to be enabled. By default, Snowflake allows connections from any device.
import snowflake.connector
from snowflake.connector import DictCursor

def clone_database(con, source_database, target_database):
    # Create an empty target database, then clone each schema into it asynchronously.
    con.cursor().execute(f"create or replace database {target_database};")
    cursor = con.cursor(DictCursor)
    cursor.execute(f"show schemas in database {source_database};")
    for schema in cursor.fetchall():
        if schema["name"] not in ("INFORMATION_SCHEMA", "PUBLIC"):
            con.cursor().execute_async(
                f"create or replace schema {target_database}.{schema['name']} "
                f"clone {source_database}.{schema['name']};"
            )

con = snowflake.connector.connect(
    user='*******',
    password='********',
    account='*******.us-east-1',
    warehouse='COMPUTE_WH',
    session_parameters={'QUERY_TAG': 'CONCURRENT SESSIONS'},
)

clone_database(con, "DEV_DB", "DEV_DB_WITH_CONCURRENT_SESSION")  # Pass the required source & target database names
Note: In the above script, enter your Snowflake user name, password & account identifier. For more details about the connection parameters, click here.
The above code took just 39 seconds to complete.
Essentially, the Python script reads the schema names available in the source database and makes asynchronous calls to Snowflake, executing multiple cloning SQL commands concurrently.
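One caveat with execute_async is that the script returns as soon as the statements are dispatched, not when they finish. If you need to block until all the clones are done, you can collect each cursor's sfqid and poll the connector's get_query_status/is_still_running methods. The helper below is my own illustrative addition (not part of the original script); it takes the status check as a callable so it works with any connection:

```python
import time

def wait_for_queries(is_running, query_ids, poll_interval=1.0):
    """Block until none of the given query IDs is still running.

    `is_running` is any callable mapping a query ID to True/False; with the
    Snowflake connector you would pass
    lambda qid: con.is_still_running(con.get_query_status(qid)).
    """
    pending = set(query_ids)
    while pending:
        # Keep only the queries that are still running, then wait and re-check.
        pending = {qid for qid in pending if is_running(qid)}
        if pending:
            time.sleep(poll_interval)
    return True
```

In clone_database you would keep each execute_async cursor, collect its sfqid into a list, and call wait_for_queries with that list before reporting success.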
Enhancements:
- The above process can be taken a level deeper, to schema objects such as tables, which can also be created concurrently to reduce the completion time even further. But this has its own challenges, such as the rate at which queries can be dispatched and executed concurrently.
- We can use the Time Travel feature to take schema clones as of an exact point in time.
- In the Python script I have hardcoded the Snowflake credentials. This can be secured by creating a configuration file with the access credentials and restricting access to it, or by using environment variables to store and access the configuration details.
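For the Time Travel enhancement, the clone statement accepts an AT/BEFORE clause, so each schema can be cloned as of a specific point in time (the schema names and timestamp below are illustrative):

```sql
-- Clone the schema as it existed at a specific timestamp (Time Travel).
create or replace schema CLONE_DB.SALES
  clone DEV_DB.SALES
  at (timestamp => '2024-01-15 09:00:00'::timestamp_ltz);

-- Or as of a relative offset in seconds, e.g. 15 minutes ago:
create or replace schema CLONE_DB.SALES
  clone DEV_DB.SALES
  at (offset => -60 * 15);
```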
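For the first enhancement, dispatching one statement per table can run into those concurrency limits, so it helps to cap the number of in-flight statements. A hedged sketch using a thread pool (the run_sql callable and the table list are placeholders for real connector calls; in production you may also want one connection per thread):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def clone_tables(run_sql, tables, source_schema, target_schema, max_workers=8):
    """Clone each table concurrently, but never more than max_workers at once.

    `run_sql` is any callable executing one SQL statement; with the
    connector you would pass lambda sql: con.cursor().execute(sql).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one clone statement per table; the pool bounds concurrency.
        futures = {
            pool.submit(
                run_sql,
                f"create or replace table {target_schema}.{t} "
                f"clone {source_schema}.{t};",
            ): t
            for t in tables
        }
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```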
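For the credentials enhancement, a minimal sketch of building the connection arguments from environment variables (the variable names here are my own choice, not a connector convention):

```python
import os

def connection_params(env=os.environ):
    """Build the kwargs for snowflake.connector.connect(**params) from the environment."""
    return {
        "user": env["SNOWFLAKE_USER"],
        "password": env["SNOWFLAKE_PASSWORD"],
        "account": env["SNOWFLAKE_ACCOUNT"],
        # Fall back to the warehouse used earlier in this article.
        "warehouse": env.get("SNOWFLAKE_WAREHOUSE", "COMPUTE_WH"),
    }
```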
To Summarize:
Running the cloning scripts concurrently, using “snowflake-connector-python” to make asynchronous calls to Snowflake, speeds up the database cloning process considerably.
Complete Results
Thank you for taking the time to read this article. I hope it helps you understand the cloning process and how to improve its performance.
I have tried to explain the process in detail. I’d love to hear your thoughts and insights, so don’t hesitate to leave a comment.
About Me:
I work as a Cloud Data Engineer and love to write about AWS, Snowflake and other cloud offerings. I have been working with SQL, ETL/ELT, Data Warehouse/BI and cloud technologies such as AWS and Snowflake.
I am an AWS Certified Solutions Architect and am working on earning the SnowPro Core Certification as well.
Follow me on Medium to catch up on new articles on AWS & Snowflake cloud offerings. Feel free to connect with me on LinkedIn !!!