Abstract | In recent years, a variety of complex synopsis structures have been proposed to capturestatistical correlations present in database relations. The goal of that work has been to improve
cardinality estimation during query optimization by improving statistics on base tables. In order
to use the correlation information for multi-table estimations as well, synopses must also be
constructed on intermediate result tables during query optimization. We address that problem
here, and show how to efficiently integrate multi-dimensional histograms and other complex
synopsis structures into the search algorithm of a traditional query optimizer such as that of
System R.
A main challenge in this work is minimizing the computational expense of dynamically
computing synopses on intermediate tables during query optimization. We show that one only
needs to construct a very small subset of all possible synopses to get the full estimation bene-
fits of multi-dimensional modeling. We also show that choosing this subset unwisely can have
unacceptable performance overheads. This leads to an “estimation planning” problem that has
been ignored to date in literature: how to choose the most efficiently-computable collection of
intermediate synopses to construct, so that all the required cardinality estimates can be com-
puted as accurately as possible from the base-table synopses?
In this paper, we identify and formalize the estimation planning problem in the context of a
System R-style optimizer extended with complex synopsis techniques such as multi-dimensional
histograms. We propose algorithms that find very good estimation plans efficiently, and discuss
properties of synopsis structures that make them amenable to efficient use in an optimizer.
|