Rambling about database technologies: June 2012

There are a lot of operators in the query execution plan that we generally overlook( I do this for those operators which I have no clue about). Sometimes these operators prove out to be pivotal in the performance of the query and we are forced to take good care of them. Spool operators are one such category of operators that we generally don't pay much attention to. But the very existence of spools in your query plan hints(most of the times) that the query is not optimized and SQL Server was forced to use spools to make the best out of your query. In this post, I will talk about what spools are and when they are used. I will go into details of Eager spool, Lazy Spool and Row Count Spool and we will see an interesting problem- known as the Halloween Problem. In the next post I will talk about the remaining spool operators- Table spool, Non-Clustered Index Spool and Window Spool and few guidelines to avoid spools. Sounds interesting ? Let's dive in then.

The query optimizer uses many techniques to optimize the queries we write and to avoid any problems, so that the query performs better. The spool operators are good examples of how optimizer works on behalf of the developer to make sure the query performs well. Though there are many spool operators (five to be precise in SQL Server 2008/ six in Denali), they all share the fact that they save their intermediate query results on the TempDB database and use this temporary area to search a value. This is not an explicit #temp table, but a work table for operations within the processing necessary for a given query's behavior. The optimizer uses this process whenever it knows that the intermediate result is complex to calculate. So the computations are done once and the results stored in the temporary space are used for further operations. Also, the optimizer might go for spooling if it finds out that the density( the more dense a column is, more rows that column returns for a given query) of the column is very high.

Before getting started with different spool operators, I will talk about few terms. To start with let's understand Rewind and Rebind. A rewind is an execution using the same value that has already been calculated in the immediately preceding execution. A rebind is defined as an execution that needs a value that is still to be calculated. For a spool it is very important to distinguish between rewinds and rebinds. Showplan operators can be categorized into Blocking and Non-Blocking operators. A blocking operator is one which halts the query execution until all rows are read from the input. Only then the actions will be performed and result would be returned. The SORT operator is a blocking operator. It reads all the rows, sorts the data and then returns the ordered rows. Non-blocking operators read one row from their input and return the output for each row.

Eager Spool

Eager spool stores all the rows received from the preceding operator into a hidden temp table in TempDB. "Eager" means that the spool will read all the rows that were returned from the previous operator at one time. So that makes the eager spool a blocking operator as it fetches all the rows at one go and stores it in a temp table. If the operator is rewound and no rebinding is necessary, the spooled data is used instead of rescanning the input. If rebinding is required all the spooled data is discarded and the spool is rebuilt by rescanning the input.

To make it clear I will resort to storytelling, literally ! There were a bunch of DBAs, hardworking and dedicated as ever, who were sacrificing their night of Halloween and burning some midnight oil for the greater good. They were trying to give a 10 % hike to every employee in the organisation who earned less than $25000. So what's the big deal you might say. Its just a matter of one update. And update they did. But then something happened that nobody was able to believe. While for other employees it may have been proved a Halloween gift, it was no less than horror for those poor DBAs. They found that all the employees were earning more than $25000 after the update. This is what the kids know as the Halloween Problem.

Let's try to figure out what actually happened with the update. Wikipedia defines this problem as "the Halloween Problem refers to a phenomenon in database in which an update operation causes a change in the physical location of a row, potentially allowing the row to be visited more than once during the operation. This could even cause an infinite loop in some cases where updates continually place the updated record ahead of the scan performing the update operation." This is exactly what happened there. The update kept on updating the records unless each record satisfied the where clause.

Let's dive a little deeper. When we update a field which is a key in an index, first the index is used to fetch the data for a row and when the update is done the index is also updated with the new value. Now, consider the salary field as the updatable column. Now this is sorted in the index. Once the first value for salary is updated in the index, it becomes greater than the second. So it is moved back to the second place as the index is sorted. Now for the update, when the parent query wants the second row, again the index is used. As the index has already returned the first row it will now return the second row. But that was already updated!!!

SQL Server deals with the Halloween Problem using an Eager spool. So when the update query is run, it again calls for the index. But instead of fetching one row, it spools the entire table. Now for the second row, records are not read from the index but from the temp table that the spool has created. So there is no multiple updates on the same row and DBAs can now rest on Halloween nights. SQL Server query optimizer may also create eager spools in other scenarios, when it feels like data should be cached for further operations. So much for the eager spool, let's talk about the Lazy Spool now.

Lazy Spool

Being a member of the great Spool family, Lazy Spool is also destined to get data from some table and store in some temporary table in the TempDB. But, Lazy Spools work a little differently than Eager spools. Lazy Spools read data only when individual rows are required. That means a Lazy Spool reads data and stores the rows in temporary tables only when the parent operator asks for a row. So the difference with the Eager Spool is visible as an Eager Spool reads all rows at once into a temporary table. Again, not to say Lazy Spool is a non-blocking operator. The query optimizer uses the Lazy Spool operator to optimize some queries by avoiding reading the same values multiple times. Effectively, SQL Server works with one chunk of data in all operations( until the parent operator needs different data). Otherwise, it would be constantly fetching new data with each iteration of the parent operator.

Row Count Spool

Quoting from msdn, "The Row Count Spool operator scans the input, counting how many rows are present and returning the same number of rows without any data in them. This operator is used when it is important to check for the existence of rows, rather than the data contained in the rows. For example, if a Nested Loops operator performs a left semi join operation and the join predicate applies to inner input, a row count spool may be placed at the top of the inner input of the Nested Loops operator. Then the Nested Loops operator can determine how many rows are output by the row count spool (because the actual data from the inner side is not needed) to determine whether to return the outer row." That's all that I could have said to explain it but I have a sample query to show when a Row Count Spool might be used and then we will try to figure out why the query optimizer does that and how it helps with respect to our query.

Use AdventureWorks

select * from

Sales.SalesOrderDetail SOD1

where not exists( select 1 from Sales.SalesOrderDetail SOD2

where SOD2.modifieddate= '2001-07-01 00:00:00.000'

and SOD2.orderqty>10)

When we look at the query plan for this query we will see this,

So you can see that the plan is using the Row Count Spool operator to check the Exists sub query. The clustered index scan reads all rows in the table and passes the rows to the Nested loop operator to be joined with the subquery result. The Left Anti Semi join here means that only one side of the join will be actually returned to the select query. So, only the rows returned from SOD1 will be returned by the select. For each row in SOD1, the loop will see the Row Count Spool to check if the value exists or not. If SQL Server would not have used Row Count Spool for this query, it would have required to execute the subquery for each row returned by the parent query. So effectively, SQL Server uses Row Count Spool to avoid reading the SOD2 once for every row returned by the parent query. If you want to see the difference in query execution had SQL Server not chose to create a Row Count Spool, you might want to refer to Fabiano’s articles on Spool operators(links shared below).

I hope you enjoyed this tour of Spool Operators as much as I enjoyed learning about them. Though spools are one of the ways SQL Server query optimizer tries to tune our ill-written queries, they should also be seen as an opportunity to tune the query. I will talk about the remaining spool operators and few guidelines to avoid them in your query plans in the next article. Your comments and feedback are welcome as always.

You might want to go through the fabulously written articles by Fabiano,

http://www.simple-talk.com/sql/learn-sql-server/showplan-operator-of-the-week---row-count-spool/

http://www.simple-talk.com/sql/learn-sql-server/operator-of-the-week---spools,-eager-spool/

http://www.simple-talk.com/sql/learn-sql-server/showplan-operator-of-the-week---lazy-spool/

According to Books Online (BOL), many of the system tables from earlier releases of SQL Server are now implemented as a set of views in SQL Server 2005. These views are known as compatibility views, and they are meant for backward compatibility only. The compatibility views expose the same metadata that was available in SQL Server 2000. However, the compatibility views do not expose any of the metadata related to features that are introduced in SQL Server 2005. Therefore, when you use new features, such as Service Broker or partitioning, you must switch to using the catalog views.

For example, sysobjects  is just a system table in SQL Server 2000. However, in SQL Server 2005, there is no such system table named “sysobjects”, it is implemented as a system view, just as background compatibility view only. Therefore, when we use new features, we need to switch to using the catalog views, here is sys.objects. In SQL Server 2005, since many of new features are used and stored in the system database, so these two views (sysobjects and sys.objects) return the result will certainly different.

For more information, please visit the following links:

Compatibility Views (Transact-SQL): http://msdn.microsoft.com/en-us/library/ms187376(v=SQL.90).aspx

Mapping SQL Server 2000 System Tables to SQL Server 2005 System Views: http://msdn.microsoft.com/en-us/library/ms187997(v=SQL.90).aspx

User-Schema Separation: http://msdn.microsoft.com/en-us/library/ms190387(v=SQL.90).aspx

Rambling about database technologies

Wednesday, 20 June 2012

Spool Operators: Eager spool, Lazy spool and Rowcount Spool

Thursday, 7 June 2012

Difference between sys.objects and sysobjects