Sunday, February 16, 2020

Data Platform Tips 81 - Azure Synapse Analytics - Distribution Strategies

Azure Synapse Analytics (formerly known as Azure SQL Data Warehouse) uses three ways to distribute data. A distribution is the basic unit of storage and processing for parallel queries that run on distributed data. When SQL Analytics runs a query, the work is divided into 60 smaller queries that run in parallel using massively parallel processing (MPP) engine.

  • Round robin
  • Hashing
  • Replication

Round robin: Data is spread evenly among all the 60 underlying distributions without any regard to data values. There is no specific key used to distribute data.

Hashing: Most common and effective data distribution method. Data is distributed according to the most common distributed according to the value of a selected column.
Distributed table

Replication: Data is copied to every database rather than being split up. Replication is good small lookup tables or dimension tables that are frequently joined with other big tables.
Replicated table

No comments:

Post a Comment