Redshift distribution keys

7/27/2023

As John rightly says, you LIKELY want to lean towards improving join performance over data skew, but this is based on a ton of likely-true assumptions. The distribution (disk-based) skew is on a major fact table. The other tables are also distributed on the join-on key. The joins are usually on the raw tables, or group-bys are performed on the dist key.

Redshift is a networked cluster, and the interconnects between nodes are the lowest-bandwidth aspect of the architecture (not low bandwidth, just lower than the other aspects). Moving very large amounts of data between nodes is an anti-pattern for Redshift and should be avoided whenever possible.

Disk skew is a measure of where the data is stored around the cluster and, without query-based information, only impacts how efficiently the data is stored. The bigger impact of disk skew is execution skew: the difference in the amount of work each CPU (slice) does when executing a query. Since the first step of every query is for each slice to work on the data it "owns", disk skew leads to some amount of execution skew. How much depends on many factors, but especially on the query in question.

Now (nearly) all queries have to perform some amount of data redistribution when executing. Since the slice performance of Redshift is high, execution skew OFTEN isn't the #1 factor driving performance. Disk skew can lead to issues, and in some cases this CAN outweigh redistribution costs.
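To get a feel for how a distribution key turns into disk skew, here is a minimal Python sketch. It is a toy model, not Redshift itself: the crc32 hash, the 8-slice cluster, and the 100,000-row table are all assumptions made up for illustration, and the function names (rows_per_slice, skew_ratio) are hypothetical. It assigns rows to slices by hashing the key value, then compares the fullest slice against the average slice.

```python
import zlib


def rows_per_slice(keys, num_slices):
    """Count how many rows land on each slice when rows are placed by key hash."""
    counts = [0] * num_slices
    for key in keys:
        # crc32 is only a stand-in; Redshift's actual hash function is internal.
        slice_id = zlib.crc32(str(key).encode("utf-8")) % num_slices
        counts[slice_id] += 1
    return counts


def skew_ratio(counts):
    """Fullest slice divided by the average slice (1.0 means perfectly even)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0


# Hypothetical fact table with 100,000 rows on an assumed 8-slice cluster.
even_keys = range(100_000)                     # high-cardinality key
hot_keys = [0] * 60_000 + list(range(40_000))  # one key value holds 60% of rows

even = rows_per_slice(even_keys, num_slices=8)
hot = rows_per_slice(hot_keys, num_slices=8)

print("even key rows per slice:", even)
print("even key skew ratio:", round(skew_ratio(even), 2))
print("hot key rows per slice:", hot)
print("hot key skew ratio:", round(skew_ratio(hot), 2))
```

In this model every row with the same key value lands on the same slice, so a single hot value piles its rows onto one slice, and that slice then has the most work to do in the first step of any query that scans the table. Whether that cost outweighs the redistribution saved by distributing on the join-on key still depends on the query in question.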
