Query Optimization¶

(From JianMin Wang's slides)

Introduction¶

We have so many alternative ways of evaluating a given query:

Defines exactly what algorithm is used for each operation, and how the execution of the operations is coordinated.

choose the cheapest plan based on the estimated cost.

The estimation is based on:

Transform through equivalence relational algebra expressions.

\(\sigma_{\theta_1 \wedge \theta_2} = \sigma_{\theta_1} \circ \sigma_{\theta_2} = \sigma_{\theta_2} \circ \sigma_{\theta_1}\)
\(\prod_{L_1}(\prod_{L_2}( \cdots(\prod_{L_n}(E)))) = \prod_{L_1}(E)\)
\(\sigma_{\theta}(E_1 \times E_2) = E_1 \Join_\theta E_2\)
\(\sigma_{\theta_1}(E_1 \Join_{\theta_2}E_2) = E_1 \Join_{\theta_1 \wedge \theta_2} E_2\)
\(E_1 \Join_{\theta} E_2 = E_2 \Join_{\theta} E_1\)
\((E_1 \Join E_2) \Join E_3 = E_1 \Join (E_2 \Join E_3)\)
\((E_1 \Join_{\theta_1} E_2) \Join_{\theta_2 \wedge \theta_3} E_3 = E_1 \Join_{\theta_1 \wedge \theta_3} (E_2 \Join_{\theta_2} E_3)\)
\(\sigma_{\theta_0}(E_1 \Join_{\theta} E_2) = (\sigma_{\theta_0}(E_1)) \Join_{\theta} E_2\)
\(\sigma_{\theta_1 \wedge \theta_2}(E_1 \Join_{\theta} E_2) = (\sigma_{\theta_1}(E_1)) \Join_{\theta}(\sigma_{\theta_2}(E_2))\)
\(\prod_{L_1 \cup L_2}(E_1 \Join_{\theta} E_2) = (\prod_{L_1}(E_1)) \Join_{\theta}(\prod_{L_2}(E_2))\)
\(\prod_{L_1 \cup L_2}(E_1 \Join_{\theta} E_2) = \prod_{L_1 \cup L_2} ((\prod_{L_1 \cup L_3}(E_1)) \Join_\theta (\prod_{L_2 \cup L_4}(E_2)))\) (pushdown production)
set operations
\(\sigma_{\theta}({}_G \gamma_A(E)) = {}_G\gamma_A(\sigma_\theta(E))\)
full outer join is commutative

Usually used ones:

pushdown something, reduce size earlier

Assume that data are uniform distribution in \([\min(A,r), \max(A,r)]\)

Selectivity \(s_r\) under these operation can be calculated by simple production rule.

\(V(A,r)\) is count of distinct values in attribute A of r

\(R \cap S = \emptyset\) : 直接乘起来就行
\(R \cap S\) is a super key of \(R\)(\(S\)) : then the maximum number of join is just the size of \(S\)(\(R\))
\(R \cap S\) is not any key for \(R\) or \(S\), we use ,the estimation of \(\frac{n_r \times n_s}{\max{(V(A,s), V(A,r))}}\)

Use Dynamic Programming to minimum join operation's cost (like matrix chain multiply)

Rules:

Perform selection early (reduces the number of tuples)
Perform projection early (reduces the number of attributes)
Perform most restrictive selection and join operations (i.e. with smallest result size) before other similar operations.