Query Optimization in Distributed Database Systems

Muhammad Haroon

MS170400594

Distributed Database Management System (CS702)

Virtual University of Pakistan, Lahore.

Abstract – Query Optimization refers to the execution of a query in earliest
possible time by consuming a reasonable disk space. A query execution plan is
generated before execution and the optimal operator tree is got in the search
space. Research with different examples is conducted in this paper by providing
up to dated information and example work.

1. INTRODUCTION

The global success of databases
especially at commercial level is very much dependent on the query
optimization. SQL queries run on a specified time but as we know, as Joins are
very useful in SQL queries on one hand, they are much expensive on the other
hand. A join and non-join query imparts a handsome difference in terms of time
cost. The query that works according to our requirement is not always useful
for us i.e. if data is updating on non-bearable time. We need to optimize query
for better performance. This is actually called Query Optimization. It becomes
more challenging if this is need to be done in distributed database systems where
there are many replications and fragments spread over different servers and
platforms.

1.1 Query
Processing

Query processing involves the
conversation and translation of high level SQL query into some low level
instructions that database engine can read and execute the query.

As shown in fig.1, the inputted SQL
query is first parsed by Query Parser and then translated by Query
Translator. It is optimized then and move towards Execution Engine
where required result is achieved.

Fig. 1.1: Query Processing

In distributed database system
environment, the query is broken down in different steps and then forwarded
towards the fragments for execution.

1.2 Query
Optimization

Executing a query by controlling and
limiting space and time refers to Query Optimization. Every query gives
same result but the query with best time and space average is what we need to
have and use professionally. Referring to fig 1.1, the step ‘Distributed Query
Optimizer’ is the step where optimization of query is performed. The Query
Execution Plan (QEP) is generated which is the plan to optimize query by
considering all parameters.

2. QUERY
OPTIMIZATION DESCRIPTION

In relational databases, first task
is to analyse the SQL query for its possible optimization. There are some cases
where query optimization becomes NP-Hard problem i.e. where number of relations
in query is not fixed. The second task is to determine the access strategy i.e.
sequential scan or index.

Sample join query is:

SELECT admin.*

FROM employee
e, admin a, student s, teacher t

WHERE e.id =
a.id AND a.id = s.id AND s.id = t.id

The execution plan for this query
will be:

Fig.2.1: Operator
tree for Join Query (A)

To execute join based query, related
tables must be located on same server or at least related fragments on same
server. With these, one must specify access strategy with redistribution prior
to the join.

The module that does query
optimization is called Query Optimizer.

2.1 Components
of Query Optimizer

Query Optimizer obtains query from
query translator after parsing from query parser and works on three components:

       
i.           
Search Space

     
ii.           
Search Strategy

   
iii.           
Cost Model

2.1.1 Search
Space

Search Space refers to all feasible
sets of operator tree for a join or cartesian product query.

Permutations of the join order in a
query imparts a meaningful effect on the execution plan of the query.

 

Fig.2.2: Operator
tree for Join Query (B)

In comparison of Fig.2.1 and
Fig.2.2, the query gives same result but there is a difference in join
permutation that leads to a cost difference between the two.

Search space is actually to having
all the possible and feasible permutation of operator tree of SQL query.

The same case occurs with the
cartesian product based queries. Similarly with the joins, the cartesian
product is also a costly functionality to use and there need to be very careful
to use this.

Fig.2.3: Operator
tree for Cartesian Product Query

Cartesian product also generates
very much permutations and acquire a large search space which requires very
carefulness dealing with this.

2.1.1.1 Limitations
of Search Space

Number of alternative operator trees
for a query by applying commutative and associative rules iswhich becomes un-economical in case of a complex SQL queries
including many numbers of relations and operators (joins or cartesian products)
because factorial elevates exponentially.

The other limitation or restriction
is to apply needless heuristics that generates the useless operator trees in
search space which is not recommended at all. For example, the possible
operator trees are having joins and cartesian operator trees whereas there may
be a case where cartesian product trees are not required.

Orientation of the operator trees is
also a limitation. An SQL query produces two operator trees Fig.2.1 and
Fig.2.2. The both trees have different orientations in which one is more
economical than other. In this way, there is a restriction over orientation of
operator trees. In distributed environment, the tree in Fig.2.2 looks more
economical due to its parallelism in joins.

2.1.2 Search Strategy

Query optimizer uses search strategy
to produce the best operator tree by applying the approach of dynamic programming.
As the dynamic programming is a deterministic approach, so the technique is to
produce operator tree step by step. Dynamic Programming uses breadth first
search (BFS) to calculate the best one and discard the rest to avoid the
wastage of search space.

On contrary, the Greedy Search
approach produces only one plan which is the best by not doing exhaustive
search. It uses depth first search (DFS).

Fig.2.4 shows the query optimizer
actions in deterministic strategy done by dynamic programming of the same query
we’ve discussed before. It progresses step by step by executing one join, then
second and so on. But this approach becomes expensive in case of too many
relations and operators.

 

Fig.2.4: Query
Optimizer Actions in Deterministic Strategy

2.1.3 Cost
Model

The model that describes the cost of
the query execution in the system. It is the estimated cost of a complete plan
for the operator trees in the search space.

2.1.3.1 Cost
Model in DBMS

Cost model in Database Management
System (DBMS) depend upon various factors that includes: Access Type, Selection
of optimal operator tree, CPU and I/O cost etc.

The access type refers to the method
of accessing data from database i.e. sequential scan or index based. Of course
the sequential scan requires more cost than index scan because in sequential
scan, there need to be full scan of the relations by looking up tuples one by
one that requires more time and space cost. On the other hand, the index based
scan does not require that much time and cost space.

The selection of optimal operator
tree is also very important in determining the cost. As we discussed in section
2.1, the operator tree can be generated by different means and all the operator
trees give the same result but it makes a lot of difference in their cost
calculation. This is why, the selection of operator tree means a lot in
calculating the cost.

The peripherals cost like CPU and
I/O cost also matters because these parameters are directly proportional to the
execution time of the system.

Whereas:

The total cost is accumulation of CPU Cost and the Input
Output Cost and the CPU works on the unit instruction cost by the number of
instructions.

 2.1.3.2 Cost Model in Distributed DBMS

Execution of query in distributed
database management systems is different from centralized database management
system. In distributed environment, an SQL query is sent to different fragments
and replicated segments for execution. There are two types of time consumptions
in distributed environment i.e. Total Time and Response Time. Total Time
is the sum of all time consumptions while executing the query by ignoring the concurrency
factor. Whereas the Response Time is the time that a user has to wait
for the result of a query by considering concurrency factor. Concurrency is one
of the most important factors while dealing with database.

A query is executed in various
phases. The operator trees in search space is split into various phases. Let there
is an operator tree T.

is the set of phases of operator tree T.

is one individual phase of operator tree T.

is the set of operations i.e. joins or cartesian products of
a phase .

The total time is the time of all
the operations while executing the query. In distributed environment, it is
stated as:

 

Total cost is very much like the
cost of centralized DBMS but there is an extra parameter i.e. communication cost
which is the cost of the distributed communication of sending and receiving
time from multiple fragments resided on multiple servers or platforms and that communication
cost is defines as the sum of query initialization and the transmission cost.

There are some pipelined operations
as well in a phase. These are such operations that are in waiting queue and
pipelines to be executed in response of some query result. The pipelined
operations play vital role while calculating the response time of a query
execution.

The response time is the summation of the execution time (time
to execute a query) and pipeline wait (time to deliver a phase) of all
operations in a query (i=1 to ).

The Execution Time is the
time that is combination of the time to execute an operation i and the
transmission time of getting some result and forward to some other process. It also
depends upon the selected algorithm. For example, in fig.2.1, the first join operation
is performed and rest are pipelined. Then the result of first join operation is
executed with the other join and so on. The transmission time is actually the
management of this trading. In distributed environment, it is the major
challenge to minimize this transmission time. With the reference to fig.2.1,
the execution time can be defined as:

This is an example formula for above mentioned query and
operator tree in fig.2.1.

In distributed DBMS, different
fragments and the replicated pieces of database are spread over multiple sites
and multiple servers so the query execution needs to go to every fragment to
check predicates and to get relevant data and of course there are some
communication links in between these sites and servers which result in
communication and transmission cost.

For example, there is a database
spread over four sites and there are some communication links in between the
sites as shown is fig.2.5.

Fig.2.5: Distributed
Database Fragments Structure over four sites

There are four sites site 1, site 2,
site 3 and site 4 with communication links w, x, y and z.

The total time of a query execution
over this structure will be:

And the response time will be:

This is how one can calculate the
costs in distributed environment.

CONCLUSION:

Query optimization is always been
under discussion and research for years. This is one of the important and
challenging task in database systems area. In addition to distributed environment,
it becomes more interesting and difficult too. Query processing is also somehow
complex with distributed DBMS and hence query optimization too. The understanding
of distributed system is required to tackle this issue and write about it.
Minimize the cost factor is actual challenge to deal with.

REFERENCES:

1 Ozsu M.T., Valduriez, P.
Principles of Distributed Database Systems. Prentice-Hall, 1999.

2 M. M. Zloof. Query-by-Example: A
Data Base Language. IBM Syst. J. 1997.

3 Hasan, W, Optimization of
SQL Queries for Parallel Machines.LNCS 1182,

Springer-Verlag, 1996.

4 Chaudhuri, S., Shim K.
Optimization of Queries with Userdefined Predicates. In Proc.

of VLDB. Mumbai. 1996.

5 Ganski. R.A., Long, H.K.T.
Optimization of Nested SQL Queries Revisited. In Proc.

of ACM SIGMOD. San Francisco, 1987.

 

Post Author: admin