What is Snowflake?


(6 minute read)

Founded in 2012 by three data warehousing experts, Snowflake (aka Snow flake) is a cloud-based data warehouse. Just ten short years later, the company is valued at over $50 billion and is growing at a rate of 100% per year. But what is Snowflake, and why is this data warehouse, built entirely for the cloud, taking the analytics world by storm?

Although not intended as a Snowflake data warehouse tutorial, this article explains what Snowflake is, which platforms it supports, and the key aspects of this ground-breaking technology.

Snowflake: The Multi-Cloud Data Platform

 
Snowflake Multi-Cloud Platform
 

Snowflake was first available on Amazon Web Services (AWS), and is a software-as-a-service platform to load, analyse and report on massive data volumes. Unlike traditional on-premises solutions, which require hardware to be deployed (potentially costing millions), Snowflake is deployed in the cloud within minutes and is charged by the second using a pay-as-you-use model.

It is possible to register and create an account within minutes. This includes $400 of free credit, which is enough to store a terabyte of data and run a small data warehouse for nearly two weeks, on a system that will support a small team of developers.

In July 2018, Snowflake announced its launch on the Microsoft Azure cloud platform. As this is essentially the exact same code base as on AWS, customers have a choice of cloud platform, which is a significant advantage to large corporates as it enables a multi-cloud deployment strategy.

How does Snowflake Work?

Many incredible features are built into Snowflake, but the most remarkable is the ability to spin up an unlimited number of virtual warehouses (each effectively an independent MPP cluster). This means users can run any number of independent workloads against the same data without competing with each other for compute resources, as illustrated in the diagram below.

Snowflake Data Warehouse Architecture

In addition, each warehouse can be resized within milliseconds from a single-node extra-small cluster to a massive 128-node monster. This means users don’t have to put up with poor performance, as the machine size can be adjusted throughout the day to match the workload. In one benchmark test, I reduced the time to process 1.3 terabytes of data from 5 hours to under 3 minutes.
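To make the scale-up trade-off concrete, here is a minimal Python sketch. It rests on assumptions that are mine rather than measured: each size step doubles the node count (from 1 node at extra-small to 128 nodes at the largest size), the price scales with the node count, a single-node hour costs roughly $2.00 (the figure quoted in the pricing section of this article), and the workload speeds up perfectly linearly, which real workloads rarely do. The size labels and the `estimate_run` function are purely illustrative.

```python
# Assumed size ladder: each step doubles the node count, from 1 to 128 nodes.
WAREHOUSE_NODES = {
    "XS": 1, "S": 2, "M": 4, "L": 8,
    "XL": 16, "2XL": 32, "3XL": 64, "4XL": 128,
}

# Assumed price in USD for one node running for one hour.
PRICE_PER_NODE_HOUR = 2.00


def estimate_run(baseline_hours_on_xs, size):
    """Return (elapsed_hours, cost_usd) for a job on the given size.

    Assumes perfectly linear speed-up, which is an idealisation.
    """
    nodes = WAREHOUSE_NODES[size]
    elapsed_hours = baseline_hours_on_xs / nodes
    cost_usd = elapsed_hours * nodes * PRICE_PER_NODE_HOUR
    return elapsed_hours, cost_usd


if __name__ == "__main__":
    # A hypothetical job that takes 5 hours on an extra-small warehouse.
    for size in WAREHOUSE_NODES:
        hours, cost = estimate_run(5.0, size)
        print(f"{size:>3}: {hours * 60:7.1f} minutes, ${cost:.2f}")
```

Under these idealised assumptions the cost is the same at every size, and only the elapsed time changes. In practice speed-up is rarely perfectly linear, so treat this as a rough planning aid rather than a prediction.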

Finally, in addition to scaling up for larger data volumes, it’s also possible to automatically scale out to support massive numbers of users. The diagram below illustrates how the Snowflake multi-cluster feature automatically scales out and then back in during the day, and the user is only charged for the time the clusters are actually running.

Snowflake automatic concurrency management
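The scale-out behaviour can be approximated with a short Python sketch. This is a deliberately simplified model and not how Snowflake decides internally: I assume each cluster comfortably handles a fixed number of concurrent queries (8 here, a hypothetical figure), and that the warehouse is configured with a minimum and maximum cluster count. The names `clusters_needed` and `plan_day` are illustrative only.

```python
import math


def clusters_needed(concurrent_queries, per_cluster_capacity=8,
                    min_clusters=1, max_clusters=4):
    """Clusters to run for a given load, clamped to the configured range."""
    wanted = math.ceil(concurrent_queries / per_cluster_capacity)
    return max(min_clusters, min(max_clusters, wanted))


def plan_day(hourly_queries, **limits):
    """Return the cluster count for each hour of a load profile."""
    return [clusters_needed(q, **limits) for q in hourly_queries]


if __name__ == "__main__":
    # Hypothetical concurrent-query counts over eight consecutive hours.
    load = [2, 5, 12, 30, 45, 20, 6, 1]
    plan = plan_day(load)
    print("clusters per hour:", plan)
    print("cluster-hours billed:", sum(plan))
    print("cluster-hours if always at maximum:", 4 * len(load))
```

The point of the comparison in the last two lines is that scaling back in during quiet hours means paying only for the cluster-hours actually used, rather than provisioning for the peak all day.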

Is Snowflake an MPP database?

MPP stands for Massively Parallel Processing and is a database architecture successfully deployed by Teradata and Netezza. Unlike traditional Symmetric Multi-Processing (SMP) hardware, which runs several CPUs in a single machine, the MPP architecture deploys a cluster of independently running machines, with data distributed across the system. In addition to the ability to handle massive data volumes, it supports a scale-out architecture, as additional nodes can be added to the cluster. However, this can take from hours to days to deploy.

EPP stands for Elastic Parallel Processing and was pioneered by Snowflake Computing. This uses a number of independently running MPP clusters connected to a shared data pool. This architecture has the advantage that new clusters can be started within seconds, to elastically grow or shrink resources as needed.

What are the three layers of Snowflake architecture?

The diagram below illustrates the layers in the Snowflake service:

1. Cloud Service Layer:  Is “the brains” of the operation. This provides connectivity to the database and handles infrastructure, transaction management, SQL performance optimisation, security and metadata.

2. Compute Services Layer:  Hosts a potentially unlimited number of virtual warehouses, whereby each warehouse consists of a cluster of database servers which execute SQL statements. Although the virtual warehouse consists of CPUs, memory and SSD storage, this is purely a transient storage layer.

3. Cloud Storage Layer:  Provides an infinite pool of permanent data storage. All data is stored in the cloud storage and is automatically replicated to three separate data centres, which provides a built-in layer of disaster recovery.

Three Layers of Snowflake Database Architecture


The layers of the architecture work transparently to service end user SQL queries, although it is possible to start and suspend virtual warehouses manually.

How much does a Snowflake credit cost?

Snowflake compute resources are charged at a rate of $0.00056 per second for a credit on an on-demand Standard Edition platform. This works out at around $2.00 per hour for an extra-small virtual warehouse on AWS Europe. Snowflake only charges for compute time while the virtual warehouse is running, and billing is applied on a per-second basis after the first 60 seconds.
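The arithmetic above can be checked with a few lines of Python. This is a minimal sketch using the per-second rate quoted in this article, and it assumes that each time a warehouse starts, the first 60 seconds are billed in full even if it runs for less. The function name `compute_cost` and the `credits_per_hour` parameter (1 for an extra-small warehouse, with larger sizes assumed to consume proportionally more) are illustrative.

```python
# Rate quoted in this article: USD per second for one credit,
# on-demand Standard Edition.
PRICE_PER_CREDIT_SECOND = 0.00056

# Assumed minimum billed time each time a warehouse starts.
MINIMUM_BILLED_SECONDS = 60


def compute_cost(run_seconds, credits_per_hour=1):
    """Estimate the compute charge in USD for one warehouse run."""
    if run_seconds <= 0:
        return 0.0  # a warehouse that never ran costs nothing
    billed_seconds = max(run_seconds, MINIMUM_BILLED_SECONDS)
    return billed_seconds * credits_per_hour * PRICE_PER_CREDIT_SECOND


if __name__ == "__main__":
    print(f"One hour on extra-small: ${compute_cost(3600):.3f}")
    print(f"A 10-second query:       ${compute_cost(10):.4f}")
```

One full hour comes to $2.016, which is where the "around $2.00 per hour" figure comes from. Prices vary by region and edition, so check the current price list before relying on these numbers.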

Storage is charged separately as a pass-through cost from the underlying provider, and on AWS works out at around $23 per terabyte per month. This means it’s possible to store a 10-terabyte data warehouse for around $230 per month. In reality, as Snowflake applies columnar compression to the data, it’s likely that storage will work out much cheaper on Snowflake than storing the raw data on (for example) S3.
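As a rough illustration of the storage sums, the Python sketch below uses the $23 per terabyte per month figure from this article. The compression ratio is a parameter you would need to measure on your own data; the 4:1 value in the example is purely hypothetical, and `monthly_storage_cost` is an illustrative name.

```python
# Storage price quoted in this article: USD per terabyte per month on AWS.
STORAGE_PRICE_PER_TB_MONTH = 23.0


def monthly_storage_cost(raw_terabytes, compression_ratio=1.0):
    """Estimate monthly storage cost in USD.

    compression_ratio is raw size divided by stored size, so 4.0 means
    the data shrinks to a quarter of its raw size (a hypothetical figure).
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    stored_terabytes = raw_terabytes / compression_ratio
    return stored_terabytes * STORAGE_PRICE_PER_TB_MONTH


if __name__ == "__main__":
    print(f"10 TB uncompressed:       ${monthly_storage_cost(10):.2f}")
    print(f"10 TB at an assumed 4:1:  ${monthly_storage_cost(10, 4.0):.2f}")
```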

What SQL does Snowflake use?

Snowflake supports a standard set of SQL, a subset of the ANSI SQL:1999 and SQL:2003 standards. This means most SQL statements which currently execute against Teradata, Netezza, Oracle or Microsoft will also execute on Snowflake, often with no changes needed. Indeed, Snowflake includes a number of extensions to ensure SQL can be quickly migrated.

Is Snowflake a Data Lake?

The Data Lake architecture became popular as a method of storing massive data volumes in their raw form, rather than transforming and loading data into a data warehouse, which inevitably leads to selectivity and consequent data loss. This architecture was traditionally deployed on Hadoop platforms, as it often includes semi-structured and unstructured data which were challenging to handle on traditional relational platforms.

Unlike legacy data warehouses, Snowflake supports both structured and semi-structured data including JSON, Avro and Parquet, and these can be directly queried using SQL. Unlike Hadoop, Snowflake independently scales compute and storage resources, and is therefore a far more cost-effective platform for a data lake.

As a result, many customers moving to a cloud-based deployment are implementing their data lake directly in Snowflake, as it provides a single platform to manage, transform and analyse massive data volumes. The ability to seamlessly combine JSON and structured data in a single query is a compelling advantage of Snowflake, and avoids operating a different platform for the Data Lake and Data Warehouse.

In his excellent article, Tripp Smith explains the benefits of the EPP Snowflake architecture, which can save up to 300:1 on storage compared to Hadoop or MPP platforms.

Why was the company called Snowflake?

Despite a long tradition of technology companies having non-tech names (for example Apple, Google and Amazon), Snowflake was not named by a marketing team. According to the founders, it was named because of their shared love of snow and skiing.

I was lucky enough to attend a meeting with the founders, where the French-born founder Thierry Cruanes explained in a full French accent how difficult it was to pronounce the name of his previous company, Oracle. At least now, he joked, people could understand “Snowflake”.

Snowflake Data Warehouse pros and cons

The advantages of cloud-based data warehousing have been extensively reviewed. The main advantages of Snowflake over traditional on-premises solutions are:

  • Machine Size:  Is no longer an issue.  Unlike traditional systems that typically involve deploying a massive server with plans to upgrade a few years later, Snowflake can be deployed on a single extra-small cluster and scaled up and down as needed.

  • Disk Space:  This is no longer an issue.  Data storage from cloud providers is both inexpensive and practically infinite in size.

  • Security: Is baked into the system.  Snowflake includes many security features, including IP whitelisting, multi-factor authentication and strong AES-256 end-to-end encryption.

  • Disaster Recovery:  Is no longer an issue.  Data is automatically replicated across three availability zones and can withstand the loss of any two data centres.

  • Software Upgrades:  Are no longer required.  As Snowflake is provided as a software service, both operating system and database upgrades are silently and transparently applied.

  • Performance:  Is no longer an issue, as clusters can be resized on-the-fly to deal with unexpectedly high data volumes.

  • Concurrency:  Is no longer an issue, as each cluster can also be configured to automatically scale out to satisfy massive numbers of users, then scale back when no longer needed.

  • Tuning and Maintenance:  Is no longer an issue, as Snowflake supports no indexes, and aside from a few well documented best practices, there is no need to tune the database.  Built for simplicity, there's little requirement for DBA resources.

In terms of the disadvantages, there is not much to write about. Customers on legacy Oracle, Netezza, Teradata or IBM platforms will need to migrate to Snowflake, and this should be considered as part of an overall cloud strategy. Otherwise, there are no significant drawbacks for a data warehouse platform.


Disclaimer: The opinions expressed on this site are entirely my own, and will not necessarily reflect those of my employer.

John Ryan
