Rambling about database technologies: Spool Operators: Index spool and Window spool

In my previous article I talked about a few spool operators (Eager Spool, Lazy Spool and Rowcount Spool), now its time to talk about the remaining spool operators that the query optimizer uses to make our queries run faster. In this article I will talk about the Non Clustered Index Spool. Then I will talk about Window Spools which are used by a number of window functions being introduced in Denali. In the end, I will talk about the common scenarios SQL Server query optimizer employs spool operators and how can we avoid them by simply keeping in mind a few things. Let’s get the things rolling then.

Non Clustered Index Spool

Every DBA who has had the opportunity to do performance tuning knows the importance of indexes in SQL Server. Still we will see tables without indexes (or without proper indexes). The Non Clustered Index Spool is used to better the read performance of queries which access such tables which are not properly indexed (or indexed at all). When SQL Server encounters such queries, where it needs to read data from tables that are not indexed it might create a temporary index on the spooled data in the TempDB. The query optimizer then uses this index as any other non clustered index to boost the performance of the query. The Index Spool scans the input rows and places a copy of the rows in a hidden file stored in TempDB (just like any other spool) and builds a non clustered index on the rows. Now this hidden index can be used in the query for seek operations just like any other index is used. To better understand index spool operators I will show just one query and the execution plan that SQL Server query optimizer uses for the query,

USE AdventureWorks

SELECT *

FROM sales.SalesOrderDetail S1

WHERE S1.LineTotal > (

SELECT AVG(S2.linetotal)

FROM sales.SalesOrderDetail AS S2

WHERE S2.ModifiedDate < S1.ModifiedDate)

What I am doing with this query is pulling all the sales orders which have the order value (LineTotal) more than the average order values for all the orders that were placed before them. To figure out what the query optimizer thinks about this query, we will see the execution plan

So as you can see, the query optimizer thinks that it should create an index on the table to fetch all the data in the TempDB so it is going for creating an index spool(and that is taking about half of the time the query is using). So according to the query plan, first a clustered index scan is used to read all the data from the table and that data is being pulled by the index spool which then creates an index on that data in the tempdb. As the spool is an Eager spool( I talked about Eager and Lazy spools in my previous article) it will pull all the data at one go from the table scan. Let’s see if the query optimizer is smart enough to take his own decisions and to do that we need to check the existing indexes on the SalesOrderDetail,

So, the query optimizer is once again spot on as there is no index covering for either of the two columns (LineTotal and ModifiedDate) used in the query. The best thing about index spools are that they tell you exactly what you can do to remove them. So if we create a non-clustered index on the table covering the columns that are used in the query, we should see something interesting in the query plan,

create nonclustered index IX_Test on sales.SalesOrderDetail(ModifiedDate,linetotal)

Now if we run the same query again and see the execution plan we can see the index spool is gone. Now instead of going for a clustered index scan, the query optimizer goes for an index seek on the index we just created. Also, if you want to check the total query cost, it has also decreased by a substantial amount and the improvement will only get better once the table size grows as the query will never go for a complete table scan now.

Window Spool

This will be the last spool operator that I will discuss. Window spools are new to SQL Server just like a lot of window functions that are coming bundled with SQL Server 2012 aka Denali. Window functions are set of functions that allow us to work on a set of data that is known as the Window. Different database systems are offering a number of Window functions for quite some time now but SQL Server has generally missed out on these functions. There were some advances with SQL Server 2005 but after that we had to wait till Denali to see further substantial improvements in Window functions. But do we need them anyways? The answer would be yes as Window functions allow us to have a critical functionality of dealing with details within an aggregation. If we group by some field in a select query we lose access to the details within the group but sometimes we need at least some information about the groups that we lose because of aggregations. One such example would be to get the rank of salespersons per geographical region based on their targets achieved. Window functions let us do exactly that. They partition the data into different windows (in the example above, windows will be different geographical regions) and let us work on different windows of data. We will see just one example and then we will move to our spool operator,

USE TempDB

IF OBJECT_ID('Window_tbl') IS NOT NULL

DROP TABLE Window_tbl

CREATE TABLE Window_tbl (Value INT)

INSERT INTO Window_tbl(value)

values

(4),

(5),

(3),

(6)

SELECT Value,

LEAD(Value, 2) OVER(ORDER BY Value DESC) AS "Lead"

FROM Window_tbl

And the result is this,

So the Lead function tells us the value in the row that falls after the offset defined by the second parameter. So in the example above, as the offset is two it gives us the values that are separated by the current value by two rows. NULL values will be returned if there are no rows defined by the offset parameter. So for 4 and 3 there are NULL values as there are no rows which are separated by two rows from the rows having 4 and 3. Let’s see the execution plan for the code that we just ran,

So the query optimizer is using a Window spool. So SQL Server stores the frame data( or the Window data) in a temporary ‘worktable’ in TempDB. SQL Server actually uses two different ways to store the frame data. It can either go for a disk-based worktable or it can store all the frame data into an in-memory worktable. The difference being the in-memory worktable is much faster than the disk based worktable for obvious reasons. It goes for fewer scans/ logical/ physical reads than the disk based worktable. The question is can we control SQL Server to go for one particular type of worktable for the spools? Yes, we can. The in-memory worktable is used when we define the frame as rows and the frame contains less than 10000 rows. The on-disk worktable is used when the window functions are used with the default frame (i.e range) or the frame has more than 10000 rows.

OVER (
       [ <PARTITION BY clause> ]
       [ <ORDER BY clause> ]
       [ <ROW or RANGE clause> ]
      )

We can easily check which kind of worktable is being used by SQL Server. We can either use the xEvent ‘window_spool_ondisk_warning’ in SQL Server 2012 or we can set the statistics IO on and look at the output.

SET STATISTICS IO ON

SELECT Value,

LEAD(Value, 2) OVER(ORDER BY Value DESC) AS "Lead"

FROM Window_tbl

(5 row(s) affected)

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Window_tbl'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The count values for the ‘Worktable’ tell us that it has used an in-memory worktable. For a disk-based worktable you will see these counts to be much higher. Because of the performance differences, it is always good to deine the window as ROWS instead of RANGE.

Avoiding Spools

As with all performance tuning methods, there is no sureshot method to avoid spools but definitely there are some best practices that can help to a good extent. SQL Server might go for spools for queries using ‘Select into’ if the isolation level is read isolation. It can be avoided by lowering the isolation level to read committed. Also, use of NOLOCK hints should also help in such cases. (NOTE: Be careful before you change the isolation level of your systems). SQL Server might go for spools in user defined functions to ensure that no DDL change will break the ongoing operation of the function. This scenario can be avoided by using schema binding in functions. If the UDF is non-schema bound then the query optimizer will generate the spool operator. In the case of a UDF where it is not accessing any base table or view, such spooling will never be utilized. So in the case of the schema bound UDF this spool operation does not occur and performance is increased. Window spools can be avoided easily by defining the frame as ROWS instead of RANGE (provided the frame size is less than 10000 rows). Similarly, its pretty simple to avoid non clustered index spools. Mostly, you would be creating indexes covering the columns used in the query( other wise the optimizer will create the same index in tempdb known as spools). Reworking a query that used an OR in the where statement might help sometimes.

So this was an attempt to make the spool operators a little more understandable. I hope you enjoyed learning about different types of spools and ways to avoid them. Your comments and feedback are welcome as always.

Rambling about database technologies

Sunday, 15 July 2012

Spool Operators: Index spool and Window spool

2 comments: