Snowflake

Pranav Kumar
5 min read · Nov 18, 2022

To appreciate the beauty of a snowflake, it is necessary to stand out in the cold.

Source — Google Images
  1. Snowflake is a cloud data platform.
  2. It's a relatively new entrant in data warehousing; it came into existence in 2012.
  3. It is a fast, cloud-based analytical database.
  4. It is purpose-built for the cloud and cannot run on-premises.
  5. You can host Snowflake on top of AWS, Azure, or GCP.
  6. It can easily manage both structured and semi-structured data.
  7. It is capable of challenging both traditional data warehousing platforms and big data platforms.
  8. Because it runs on the cloud, you don't have to take care of maintenance, backups, etc.; you can simply contact the support team when needed.
  9. Monitoring, disaster recovery, and scaling up or down are handled entirely by Snowflake.
  10. With a cloud warehouse, we can focus primarily on the data.

#) FEATURES

  1. Pay as you go — you pay only for the services you use; there is no license contract.
  2. Unlimited scalability — when your workload grows, auto-scaling balances things out; scalability is available on demand.
  3. Support for diverse data — it supports both structured and semi-structured data: XML, AVRO, JSON, CSV, and PARQUET are all supported.
  4. All of your users — in other words, concurrency: multiple users and multiple workloads can connect to Snowflake at the same time, running queries, batch processing, etc.
  5. Secure views — sharing can only be done on secure views and tables. Consumers cannot see the structure of a secure view but can see its data. You have full control over who has access to your data: you can share specific tables, limit which members can see shared data, and withdraw sharing permissions at any time.
  6. Encrypted data — data in Snowflake is encrypted, so no one from outside can see it. For example, even if your data physically resides in an AWS S3 bucket, you cannot access it directly from AWS; you have to be logged into your Snowflake account to access the data.
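As a minimal sketch of the secure-view sharing feature above (all object names here are made up for illustration), a secure view hides the underlying query definition while its data can be shared:

```sql
-- Expose only an aggregate through a secure view; consumers cannot
-- see the view's definition, only the data it returns.
CREATE SECURE VIEW sales_summary_v AS
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region;

-- Share the view with another account; permissions can be revoked later.
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON VIEW sales_db.public.sales_summary_v TO SHARE sales_share;
```

Withdrawing access is just a matter of revoking the grant or dropping the share.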

#) LOADING PROCESS — STAGING

  1. When data is loaded in Data Warehouse it is not directly loaded. Rather, data is put in a staging area.
  2. Staging area is a temporary place where data is stored before putting it into the target table.
  3. Staging area can be inside or outside of data warehouse.
  4. In Snowflake, such a staging area is known as an INTERNAL STAGE.
  5. The staging area can also be outside the data warehouse, e.g. a file server or cloud storage location, which in Snowflake is referred to as an EXTERNAL STAGE.
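A quick sketch of both kinds of stage (the stage names, bucket, and credentials are placeholders, not real values):

```sql
-- Internal stage: storage managed inside Snowflake.
CREATE STAGE my_internal_stage;

-- External stage: points at cloud storage outside the warehouse.
CREATE STAGE my_external_stage
    URL = 's3://my-example-bucket/data/'
    CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

-- Files can be uploaded to an internal stage with PUT (run from SnowSQL):
-- PUT file:///tmp/orders.csv @my_internal_stage;
```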

#) HYBRID ARCHITECTURE

  1. Snowflake uses a unique hybrid architecture that combines the shared-nothing and shared-disk architectures.
  2. It uses a single copy of data similar to the shared disk architecture but also provides capability to run multiple compute engines on top.
  3. At its core, Snowflake uses cloud platform storage, which in the case of AWS is S3 and in the case of Azure is Blob storage.
  4. Users don't have to worry about the format or management of the data; Snowflake converts the data into the required columnar format before storing it.
  5. The backend storage is cloud-based, so the amount of storage available to a Snowflake instance is virtually unlimited, since S3 itself offers virtually unlimited storage.
  6. Here, compute and storage are independent of each other.
  7. The compute here is composed of one or more clusters consisting of compute nodes that are used to perform operations such as loading and querying data.
  8. These compute clusters are known as VIRTUAL WAREHOUSE.
  9. Virtual warehouses don't store any data on their compute nodes. This means a warehouse can be turned off when not needed, as there is segregation between storage and compute.
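Because compute is independent of storage, a virtual warehouse is just a disposable cluster you create and size yourself. A minimal sketch (the warehouse name and settings are illustrative):

```sql
-- Compute is provisioned separately from storage.
CREATE WAREHOUSE analytics_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND   = 60       -- suspend after 60 seconds of inactivity
    AUTO_RESUME    = TRUE;    -- wake up automatically when a query arrives

-- Since no data lives on the warehouse, suspending it loses nothing:
ALTER WAREHOUSE analytics_wh SUSPEND;
ALTER WAREHOUSE analytics_wh RESUME;
```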

#) HOW TO LOAD DATA INTO D.W.

  1. The data is first staged so that the D.W. platform can access it.
  2. Snowflake is a cloud platform, so the data has to be staged in the cloud itself.
  3. The data can be loaded from an S3 bucket or Azure Blob storage.
  4. This can be done using the COPY feature of Snowflake.
  5. We can also use ETL tools such as TALEND or INFORMATICA to transfer the data over ODBC/JDBC connectivity.
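A minimal sketch of the COPY-based load described above (table, stage, and column names are examples):

```sql
-- Load staged CSV files into a target table.
COPY INTO orders
    FROM @my_external_stage/orders/
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
    ON_ERROR = 'CONTINUE';

-- COPY also allows basic transformations, e.g. re-ordering columns
-- by selecting positional fields ($1, $2, ...) from the staged files:
COPY INTO orders (order_id, amount)
    FROM (SELECT $2, $1 FROM @my_external_stage/orders/)
    FILE_FORMAT = (TYPE = 'CSV');
```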

#) DATA LOADING OPTIONS

  1. BULK LOAD — Snowflake's COPY command is used for batch loading of data that is already available in cloud storage or an internal stage. The COPY command uses VIRTUAL WAREHOUSE compute resources and allows basic transformations such as re-ordering columns, truncating strings, etc.
  2. CONTINUOUS LOAD — Snowpipe is used for loading streaming data. It uses a serverless approach, scaling up/down automatically, and does not use virtual warehouse compute resources.
  3. EXTERNAL TABLES — you can use external tables to access data that stays outside Snowflake. This is useful when a lot of data is stored externally and you want to query only a fraction of it.
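A sketch of the external-table option (stage path, table, and column names are made up; external tables expose each file row as a VARIANT named VALUE, from which columns are derived):

```sql
-- Query externally stored Parquet files in place, without loading them.
CREATE EXTERNAL TABLE events_ext (
    event_date DATE    AS (VALUE:event_date::DATE),
    payload    VARIANT AS (VALUE:payload)
)
LOCATION = @my_external_stage/events/
FILE_FORMAT = (TYPE = 'PARQUET')
AUTO_REFRESH = TRUE;

-- Only the fraction you query is scanned:
SELECT payload FROM events_ext WHERE event_date = '2022-11-18';
```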

#) SNOWPIPE

  1. Snowpipe is a mechanism to enable loading of data as soon as it becomes available in a stage.
  2. Using Snowpipe you can achieve micro-batched loading of data.
  3. It is used where data is arriving continuously like transactions or events.
  4. It is used in business where data has to be made available immediately.
  5. It uses a serverless architecture, so it does not use a virtual warehouse.
  6. It has its own processing and is billed separately.

#) HOW DOES SNOWPIPE WORK

  1. A pipe contains a COPY statement, which Snowflake uses to load the data.
  2. Snowpipe may be triggered automatically (e.g. by cloud storage event notifications) or manually (via its REST API) to load data.
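Both steps can be sketched in one statement (pipe, table, and stage names are illustrative):

```sql
-- A pipe wraps a COPY statement; AUTO_INGEST = TRUE lets cloud storage
-- event notifications trigger the load automatically as files arrive.
CREATE PIPE orders_pipe
    AUTO_INGEST = TRUE
AS
    COPY INTO orders
    FROM @my_external_stage/orders/
    FILE_FORMAT = (TYPE = 'CSV');
```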

#) PARTITION YOUR DATA

  1. Very large tables can be partitioned to improve performance.
  2. Snowflake maintains a table’s micro partitions without requiring a DBA’s involvement.
  3. For very large tables, the automatic partitioning scheme may not be the most optimal.
  4. You can create custom cluster keys to partition the table.
  5. Normally, we would use those columns in cluster keys which are frequently used in WHERE clauses or are frequently used in JOINS or ORDER BY statements.
  6. The cardinality of a column should also guide the choice of cluster keys: columns with extremely low or extremely high cardinality generally make poor cluster keys.
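Defining a custom cluster key is a one-line change (table and column names here are examples, chosen as columns that would appear in WHERE and JOIN clauses):

```sql
-- Cluster an existing large table on commonly filtered columns.
ALTER TABLE sales CLUSTER BY (sale_date, region);

-- The key can also be declared at creation time:
CREATE TABLE sales (
    sale_date DATE,
    region    STRING,
    amount    NUMBER
) CLUSTER BY (sale_date, region);
```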

#) FAILSAFE

  1. FAILSAFE is part of continuous data protection provided by SNOWFLAKE.
  2. It is separate & distinct from time travel.
  3. It ensures that data is protected in the event of a failure.
  4. It provides a 7-day period during which historical data can be recovered.
  5. The FAILSAFE PERIOD starts immediately after the time travel period ends.
  6. Recovery from FAILSAFE is performed by Snowflake itself: you have to contact Snowflake support, and recovery may take from several hours to a few days.

#) ZERO COPY CLONING

  1. Creates a copy of a database, schema, or table.
  2. Metadata Operation, doesn’t actually copy the data.
  3. SNAPSHOT of original data is made available to the cloned object.
  4. The cloned object is independent of original object.
  5. Modifications to the original object and the cloned object are independent of each other.
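The cloning itself is a single CLONE clause (object names are illustrative):

```sql
-- A clone is a metadata operation; no data is physically copied.
CREATE TABLE orders_dev CLONE orders;

-- Databases and schemas can be cloned the same way:
CREATE DATABASE analytics_dev CLONE analytics;
```

Since only metadata is written, cloning even a very large table is nearly instantaneous, and subsequent writes to either copy diverge independently.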

#) MICRO PARTITIONING

  1. Snowflake has a unique way of partitioning known as MICRO-PARTITIONING.
  2. It overcomes the limitations of static partitioning.
  3. Each micro-partition contains between 50 MB and 500 MB of uncompressed data. However, once this data gets loaded into Snowflake tables, the stored size will be much smaller than that of a normal table, because Snowflake compresses the data.
  4. Each micro-partition contains group of rows in a columnar storage.
  5. Snowflake stores metadata about all the rows in each micro-partition, such as the range of values for each column, which lets it prune irrelevant micro-partitions during query execution.
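You can inspect how well a table's micro-partitions line up with a set of columns using a built-in function (the table and column names are examples):

```sql
-- Returns a JSON summary of micro-partition overlap and depth
-- for the given columns, useful when evaluating cluster keys.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date)');
```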
