There
are a lot of operators in the query execution plan that we generally overlook(
I do this for those operators which I have no clue about). Sometimes these
operators prove out to be pivotal in the performance of the query and we are
forced to take good care of them. Spool operators are one such category of
operators that we generally don't pay much attention to. But the very existence
of spools in your query plan hints(most of the times) that the query is not
optimized and SQL Server was forced to use spools to make the best out of your
query. In this post, I will talk about what spools are and when they are used.
I will go into details of Eager spool, Lazy Spool and Row Count Spool and we
will see an interesting problem- known as the Halloween Problem. In the next
post I will talk about the remaining spool operators- Table spool,
Non-Clustered Index Spool and Window Spool and few guidelines to avoid spools.
Sounds interesting ? Let's dive in then.
The
query optimizer uses many techniques to optimize the queries we write and to
avoid any problems, so that the query performs better. The spool operators are
good examples of how optimizer works on behalf of the developer to make sure
the query performs well. Though there are many spool operators (five to be
precise in SQL Server 2008/ six in Denali), they all share the fact that they
save their intermediate query results on the TempDB database and use this temporary
area to search a value. This is not an explicit #temp table, but a work table
for operations within the processing necessary for a given query's behavior.
The optimizer uses this process whenever it knows that the intermediate result
is complex to calculate. So the computations are done once and the results
stored in the temporary space are used for further operations. Also, the
optimizer might go for spooling if it finds out that the density( the more
dense a column is, more rows that column returns for a given query) of the
column is very high.
Before
getting started with different spool operators, I will talk about few terms. To
start with let's understand Rewind and Rebind. A rewind is an execution using the
same value that has already been calculated in the immediately preceding
execution. A rebind is defined as an execution that needs a value that is
still to be calculated. For a spool it is very important to distinguish between
rewinds and rebinds. Showplan operators can be categorized into Blocking and Non-Blocking operators. A blocking operator is one
which halts the query execution until all rows are read from the input. Only
then the actions will be performed and result would be returned. The SORT
operator is a blocking operator. It reads all the rows, sorts the data and then
returns the ordered rows. Non-blocking operators read one row from their input
and return the output for each row.
Eager
Spool
Eager
spool stores all the rows received from the preceding operator into a hidden
temp table in TempDB. "Eager" means that the spool will read all the
rows that were returned from the previous operator at one time. So that makes
the eager spool a blocking operator as it fetches all the rows at one go and
stores it in a temp table. If the operator is rewound and no rebinding is
necessary, the spooled data is used instead of rescanning the input. If
rebinding is required all the spooled data is discarded and the spool is rebuilt
by rescanning the input.
To
make it clear I will resort to storytelling, literally ! There were a bunch of
DBAs, hardworking and dedicated as ever, who were sacrificing their night of
Halloween and burning some midnight oil for the greater good. They were trying
to give a 10 % hike to every employee in the organisation who earned less than
$25000. So what's the big deal you might say. Its just a matter of one update.
And update they did. But then something happened that nobody was able to
believe. While for other employees it may have been proved a Halloween gift, it
was no less than horror for those poor DBAs. They found that all the employees
were earning more than $25000 after the update. This is what the kids know as
the Halloween Problem.
Let's
try to figure out what actually happened with the update. Wikipedia defines
this problem as "the Halloween
Problem refers to a phenomenon in database in which an
update operation causes a change in the physical location of a row,
potentially allowing the row to be visited more than once during the operation.
This could even cause an infinite loop in some cases where updates continually
place the updated record ahead of the scan performing the update operation."
This is exactly what happened there. The update kept on updating the records
unless each record satisfied the where clause.
Let's
dive a little deeper. When we update a field which is a key in an index, first
the index is used to fetch the data for a row and when the update is done the
index is also updated with the new value. Now, consider the salary field as the
updatable column. Now this is sorted in the index. Once the first value for
salary is updated in the index, it becomes greater than the second. So it is
moved back to the second place as the index is sorted. Now for the update, when
the parent query wants the second row, again the index is used. As the index
has already returned the first row it will now return the second row. But that
was already updated!!!
SQL
Server deals with the Halloween Problem using an Eager spool. So when the
update query is run, it again calls for the index. But instead of fetching one
row, it spools the entire table. Now for the second row, records are not read
from the index but from the temp table that the spool has created. So there is
no multiple updates on the same row and DBAs can now rest on Halloween nights.
SQL Server query optimizer may also create eager spools in other scenarios,
when it feels like data should be cached for further operations. So much for the
eager spool, let's talk about the Lazy Spool now.
Lazy
Spool
Being
a member of the great Spool family, Lazy Spool is also destined to get data
from some table and store in some temporary table in the TempDB. But, Lazy
Spools work a little differently than Eager spools. Lazy Spools read data only
when individual rows are required. That means a Lazy Spool reads data and
stores the rows in temporary tables only when the parent operator asks for a
row. So the difference with the Eager Spool is visible as an Eager Spool reads
all rows at once into a temporary table. Again, not to say Lazy Spool is a
non-blocking operator. The query optimizer uses the Lazy Spool operator to
optimize some queries by avoiding reading the same values multiple times.
Effectively, SQL Server works with one chunk of data in all operations( until
the parent operator needs different data). Otherwise, it would be constantly
fetching new data with each iteration of the parent operator.
Row
Count Spool
Quoting
from msdn, "The Row
Count Spool operator scans the input, counting how many
rows are present and returning the same number of rows without any data in
them. This operator is used when it is important to check for the existence of
rows, rather than the data contained in the rows. For example, if a Nested Loops operator
performs a left semi join operation and the join predicate applies to inner
input, a row count spool may be placed at the top of the inner input of
the Nested
Loops operator. Then the Nested Loops operator
can determine how many rows are output by the row count spool (because the
actual data from the inner side is not needed) to determine whether to return
the outer row." That's all that I could have said to explain it but I have
a sample query to show when a Row Count Spool might be used and then we
will try to figure out why the query optimizer does that and how it helps with
respect to our query.
Use AdventureWorks
GO
select * from
Sales.SalesOrderDetail
SOD1
where not exists( select 1 from Sales.SalesOrderDetail
SOD2
where SOD2.modifieddate= '2001-07-01 00:00:00.000'
and SOD2.orderqty>10)
When we look at the query plan for this query we will see
this,
So you can see that the plan is using the Row Count Spool
operator to check the Exists sub query. The clustered index scan reads all rows
in the table and passes the rows to the Nested loop operator to be joined with
the subquery result. The Left Anti Semi join here means that only one side of
the join will be actually returned to the select query. So, only the rows
returned from SOD1 will be returned by the select. For each row in SOD1, the
loop will see the Row Count Spool to check if the value exists or not. If SQL
Server would not have used Row Count Spool for this query, it would have
required to execute the subquery for each row returned by the parent query. So
effectively, SQL Server uses Row Count Spool to avoid reading the SOD2 once for
every row returned by the parent query. If you want to see the difference in
query execution had SQL Server not chose to create a Row Count Spool, you might
want to refer to Fabiano’s articles on Spool operators(links shared below).
I hope you enjoyed this tour of Spool Operators as much as I
enjoyed learning about them. Though spools are one of the ways SQL Server query
optimizer tries to tune our ill-written queries, they should also be seen as an
opportunity to tune the query. I will talk about the remaining spool operators
and few guidelines to avoid them in your query plans in the next article. Your
comments and feedback are welcome as always.
You might want to go through the fabulously written articles
by Fabiano,
No comments:
Post a Comment