
In this series of blog posts. I will talk about query optimization in general. However, before doing so; In this blog post we will cover
These topics will lay the foundation for query optimization, will establish a common vocabulary and hopefully make reader motivate for subsequent parts.
In subsequent parts: I plan to cover following optimizations:
SQL (Structured Query Language) is a descriptive language that is used to interact with databases. An SQL query specifies the desired output by the user. However, it doesn’t specify how to calculate desired output during execution. As an example, following SQL query:
SELECT name, surname
FROM student_table
WHERE major='Mathematics'
ORDER BY id;
can be translated to english as
Bring me the name, and surnames of the each student who majors in Mathematics where result is ordered by student id.
However, it says nothing about how to accomplish this. For most of the queries, especially for complex queries there are many possible ways to calculate desired result with various tradeoffs.
From this specification, a query is pretty opaque to end user which is fine for most of the users who don’t and shouldn’t care. However, if we work on performance and optimization. We need to inspect internals of the query execution better. LogicalPlan and PhysicalPlan enables us to see what a query does, how it executes in a friendly format.
Before continuing, in case anyone wants to try out the same queries locally. Here is the query to generate necessary table locally.
CREATE TABLE IF NOT EXISTS student_table(id INT, name VARCHAR, surname VARCHAR, major VARCHAR, country VARCHAR)
AS VALUES(103, 'Mustafa', 'Demir', 'Mathematics', 'TUR'),
(101, 'Jack', 'Hu', 'CS', 'US'),
(105, 'Gerard', 'Depardieu', 'Mathematics', 'FR'),
(102, 'Steven', 'Smith', 'Mathematics', 'UK'),
(104, 'Ekaterina', 'Mihailov', 'Physics', 'RUS');
Also please note that I execute these queries using cli tool of Apache Datafusion. This is the system I am most familiar with, However results and corresponding queries should be (almost) same in other query engines (configuration options might be pretty different though).
A LogicalPlan contains the necessary logical steps to compute query result. However, this representation is in high level, and omits lots of details related to the actual execution. I like to think LogicalPlan more like pseudo code, where you can see the general picture without bogging down in detail or specificities. In contrast PhysicalPlan is more like implementation of a pseudocode in a specific language. Anyone knowing the internals of the query engine can understand what is going on (like reading a code) by reading the PhysicalPlan. However, for this blog post Logical Plan representation will be enough for our use cases. Let’s see theLogical Plan Datafusion generates for the query above by executing following command:
-- Generate only logical plan
set datafusion.explain.logical_plan_only = true;
-- See the plan for query
EXPLAIN SELECT name, surname
FROM student_table
WHERE major='Mathematics'
ORDER BY id;
where original query is prepended with EXPLAIN keyword and configuration is set logical plan only mode to not clutter result with physical plan for now. Command above produces following Logical Plan:
+--------------+-----------------------------------------------------------------------------+
| plan_type | plan |
+--------------+-----------------------------------------------------------------------------+
| logical_plan | Projection: student_table.name, student_table.surname |
| | Sort: student_table.id ASC NULLS LAST |
| | Projection: student_table.name, student_table.surname, student_table.id |
| | Filter: student_table.major = Utf8("Mathematics") |
| | TableScan: student_table projection=[id, name, surname, major] |
+--------------+-----------------------------------------------------------------------------+
A Logical Plan contains the necessary steps to produce the desired result according to specifications of the SQL query, where each step is represented as single line and executes from bottom to top. To better understand this representation, let’s look the equivalent Logical Plan as tree.

where each arrow show the data flow during execution. These two representations are completely same. In the first representation dependency between operators are specified by the indentation level (e.g The operator(s) with 2 indentation level ahead of an operator is the child(ren) of the operator). First representation is more concise, more text friendly. However, it maybe intimidating at the start. Second representation is more friendly, however fills up lots of space.
Logical Plan above describes following steps in order
id, name, surname, major are used.major is ‘Mathematics’.name, surname, id respectively, where column major is pruned also.name, surname respectively, where column id is pruned.To see what is going on better, let’s see the data that is passed between operators. At the source(this can be a file, a kafka topic, etc.) following table resides
| id | name | surname | major | country |
|---|---|---|---|---|
| 103 | Mustafa | Demir | Mathematics | TUR |
| 101 | Jack | Hu | CS | US |
| 105 | Gerard | Depardieu | Mathematics | FR |
| 102 | Steven | Smith | Mathematics | UK |
| 104 | Ekaterina | Mihailov | Physics | RUS |
TableScan operator emits following table to its output (where Column=country is pruned.)
| id | name | surname | major |
|---|---|---|---|
| 103 | Mustafa | Demir | Mathematics |
| 101 | Jack | Hu | CS |
| 105 | Gerard | Depardieu | Mathematics |
| 102 | Steven | Smith | Mathematics |
| 104 | Ekaterina | Mihailov | Physics |
Filter operator emits following table
| id | name | surname | major |
|---|---|---|---|
| 103 | Mustafa | Demir | Mathematics |
| 105 | Gerard | Depardieu | Mathematics |
| 102 | Steven | Smith | Mathematics |
where 2nd, 5th row in the input table is removed, since their major value is not Mathematics.
First Projection emits following table
| name | surname | id |
|---|---|---|
| Mustafa | Demir | 103 |
| Gerard | Depardieu | 105 |
| Steven | Smith | 102 |
where Column: major is removed, then columns are re-ordered.
Sort operator emit following table, where rows are re-ordered such that id column is Ascending.
| name | surname | id |
|---|---|---|
| Steven | Smith | 102 |
| Mustafa | Demir | 103 |
| Gerard | Depardieu | 105 |
Then, final Projection emits following table:
| name | surname |
|---|---|
| Steven | Smith |
| Mustafa | Demir |
| Gerard | Depardieu |
where id column id removed. This is the result seen at the output by the user. For the query
SELECT name, surname
FROM student_table
WHERE major='Mathematics'
ORDER BY id;
Datafusion emits following result:
+---------+-----------+
| name | surname |
+---------+-----------+
| Steven | Smith |
| Mustafa | Demir |
| Gerard | Depardieu |
+---------+-----------+
If we carefully track data passed between operators we can see that only absolutely necessary data is passed between operators which is good. To see the difference a query optimization have, let’s see an alternative valid plan that would generate the same result.
+--------------+------------------------------------------------------------------------------------+
| plan_type | plan |
+--------------+------------------------------------------------------------------------------------+
| logical_plan | Projection: student_table.name, student_table.surname |
| | Filter: student_table.major = Utf8("Mathematics") |
| | Sort: student_table.id ASC NULLS LAST |
| | TableScan: student_table projection=[id, name, surname, major, country] |
+--------------+------------------------------------------------------------------------------------+
Let’s examine data passed between operators for this hypothetical plan. As before source is following table:
| id | name | surname | major | country |
|---|---|---|---|---|
| 103 | Mustafa | Demir | Mathematics | TUR |
| 101 | Jack | Hu | CS | US |
| 105 | Gerard | Depardieu | Mathematics | FR |
| 102 | Steven | Smith | Mathematics | UK |
| 104 | Ekaterina | Mihailov | Physics | RUS |
and TableScan emits exact same table, which is:
| id | name | surname | major | country |
|---|---|---|---|---|
| 103 | Mustafa | Demir | Mathematics | TUR |
| 101 | Jack | Hu | CS | US |
| 105 | Gerard | Depardieu | Mathematics | FR |
| 102 | Steven | Smith | Mathematics | UK |
| 104 | Ekaterina | Mihailov | Physics | RUS |
Sort operator emits following table (where rows are ordered such that Column=id is ascending):
| id | name | surname | major | country |
|---|---|---|---|---|
| 101 | Jack | Hu | CS | US |
| 102 | Steven | Smith | Mathematics | UK |
| 103 | Mustafa | Demir | Mathematics | TUR |
| 104 | Ekaterina | Mihailov | Physics | RUS |
| 105 | Gerard | Depardieu | Mathematics | FR |
Filter operator emits table below:
| id | name | surname | major | country |
|---|---|---|---|---|
| 102 | Steven | Smith | Mathematics | UK |
| 103 | Mustafa | Demir | Mathematics | TUR |
| 105 | Gerard | Depardieu | Mathematics | FR |
where rows that do not have major=Mathematics (rows:{1,4}) are removed as before.
Then Projection at the end emits following data
| name | surname |
|---|---|
| Steven | Smith |
| Mustafa | Demir |
| Gerard | Depardieu |
which is the same result as before. Unlike previous case, this plan carries lots of unnecessary data between operators.
Let’s examine the problems in the second plan
Sort operator receives whole table, then reorders its rows. Since sorting is relatively costly, we want to minimize data entering sort as much as possible. After sorting, Filter operator filters out some of the rows that is sorted. In this case there was no point in sorting those rows in the first place. As comparison Sort in the first plan receives 3 rows, however in the second plan it receives 5 rows. If the Filter were to prune 90% of the rows, the difference would be more dramatic. With this observation, we want Filter operation to be as below as possible during the execution to not carry around some rows unnecessarily.Filter operator receives Column=id, Column=country However, Filter itself doesn’t use these columns (it requires Column=major for its calculation). Also this column is not asked by the end user in the query (where user wants to see only name, surname). This column is passed to subsequent operator only to be pruned by projection there (This might not be important in terms of computation. However, it is important in terms of memory usage, and data traffic. Table at the source could have 100s of columns).In this blog post, we have analyzed two Logical Plans to see the difference query optimization can make (In more complex queries difference can be much more dramatic). In subsequent posts we will analyze
with these optimization steps, we can convert following naive plan
+--------------+------------------------------------------------------------------------------------+
| plan_type | plan |
+--------------+------------------------------------------------------------------------------------+
| logical_plan | Projection: student_table.name, student_table.surname |
| | Filter: student_table.major = Utf8("Mathematics") |
| | Sort: student_table.id ASC NULLS LAST |
| | TableScan: student_table projection=[id, name, surname, major, country] |
+--------------+------------------------------------------------------------------------------------+
to its optimized version
+--------------+-----------------------------------------------------------------------------+
| plan_type | plan |
+--------------+-----------------------------------------------------------------------------+
| logical_plan | Projection: student_table.name, student_table.surname |
| | Sort: student_table.id ASC NULLS LAST |
| | Projection: student_table.name, student_table.surname, student_table.id |
| | Filter: student_table.major = Utf8("Mathematics") |
| | TableScan: student_table projection=[id, name, surname, major] |
+--------------+-----------------------------------------------------------------------------+
where unnecessary data is not carried between operators.
1 - Apache Datafusion Documentation
2 - Apache Datafusion Project Github Page