
Wednesday, 22 August 2012

Columnar Databases and SQL Server Columnstore Indexes


Lately I have been working on Vertica, a column based database from HP, and I have to admit that I am amazed by its capabilities and the functionality it offers, especially when we are talking about relatively big data. The basic difference between SQL Server and Vertica is the way they store data: SQL Server is a row store database, while Vertica is a column store database. With Denali, SQL Server has also come up with something along similar lines in ColumnStore indexes (codenamed Apollo). In this article I will talk about columnar databases, how they are different from the row store databases that we use every day, and why we need them. I will also try to figure out what SQL Server has to offer with Apollo and how it stands up to other column store databases.

Column store databases
A column store database is a database which stores table data as sections of columns rather than as rows. That means all the values for one particular column are stored together, and these sets of data may live in different locations for different columns. Different column values for the same row are linked by internal pointers. When a query is run against a row store database, it fetches data from all the columns for the rows that match the query conditions. In contrast, when the same query is run against a column store database, only the columns required by the query are read. So if you consider a table with 30 columns and your query needs only five of those fields (queries often touch fewer than 15% of the columns in a typical fact table, according to a Microsoft whitepaper), the column store database can give you roughly 6x performance over its row store counterpart (provided the data in each column is similar). Considering that disk access is still the slowest operation for our computers, this becomes significant because far fewer pages have to be read from disk. Also, the columns that are not used in the query are not loaded into memory at all, which is essential to avoid frequent PLE drops when dealing with large amounts of data.
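As a sketch, consider a hypothetical wide fact table (the name `FactSales` and its columns are illustrative, not from any real schema). An analytic query like the one below touches only a handful of its columns, which is exactly the access pattern a column store optimizes for:

```sql
-- Hypothetical fact table with ~30 columns; the query reads only 5 of them.
-- A column store fetches just these 5 columns from disk; a row store must
-- read entire rows (all 30 columns) to answer the same query.
SELECT   ProductKey,
         RegionKey,
         SUM(SalesAmount) AS TotalSales,
         SUM(Quantity)    AS TotalQty
FROM     FactSales
WHERE    OrderDateKey BETWEEN 20120101 AND 20121231
GROUP BY ProductKey, RegionKey;
```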



Also, most column store databases compress the data they store very aggressively. The extremely high compression ratios are a result of the redundancy we usually see in columns. Moreover, the values in the same column are invariably of the same type, unlike the data you will see in a row, which facilitates even more compression. Most column store database systems use more than one compression scheme, chosen based on the type of values stored in the column. For instance, Vertica uses Run Length Encoding to store column values that are highly repetitive. These encoding schemes allow the database system to compress each column separately based on its underlying data values, which can't be done in a row store database because each row mixes values of different types across its columns.

So if column store databases are so good, why do we use row store databases at all? Moreover, why are row store databases ruling the market when they seem so inept compared with column store databases? The point is that row store databases have their own advantages. You can't use a column store database for a small table with hundreds of OLTP transactions per minute. A row store database will be much faster than a column store database if you want to see all the columns of the table in your resultset (because in a column store, different columns are stored at different places on disk and need multiple disk seeks). In a nutshell, there are specific use cases where you should use column store databases. The rise of column store databases has a lot to do with the rise of BI and big data. Generally, column store databases should be used where traditional row based databases fail to deliver satisfactory performance. Usually, such cases arise when you are trying to run analytic queries on significantly big data, or when you need only a few columns of a large table with many columns (a fact table in a star schema perfectly suits the description). Row store databases fail here because they are simply not designed for the purpose.

ColumnStore indexes in SQL Server
Now that you have a basic understanding of what a column store is, let's talk about this new feature of Denali known as the ColumnStore Index (codenamed Apollo). A columnstore index stores each column in a separate set of disk pages, as opposed to a heap/B-tree, which stores multiple rows per page.

 

It brings the obvious advantages of a column store, easily compressible data and fetching only the data that the query actually needs, to a normal SQL Server table. The columnstore index in SQL Server employs Microsoft's proprietary VertiPaq technology. SQL Server columnstore indexes are pure column stores: the data is stored and compressed column-wise, and individual columns can be accessed separately from other columns. The other type of column store is the hybrid column store, which stores data as sets of rows, but within each set of rows, data is organized and compressed column-wise.
Once you create a columnstore index on a table, the query optimizer will choose between the heap/B-tree and the index you created based on the query and its estimated cost. If the optimizer chooses the columnstore index but you feel that the underlying B-tree can perform better, you can always use query hints to force the B-tree instead of the columnstore index. Remember that seeks are not supported by columnstore indexes. That means that if you use the table hint FORCESEEK, the query optimizer will not consider the columnstore index and will instead go for the underlying B-tree or heap. Also, SQL Server columnstore indexes can use batch mode processing to improve query performance even further.
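A minimal sketch of both sides of this (the table and index names are hypothetical; the syntax and the hint are as documented for SQL Server 2012):

```sql
-- Create a non-clustered columnstore index on the columns used by
-- typical analytic queries against this (hypothetical) fact table.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
ON dbo.FactSales (OrderDateKey, ProductKey, RegionKey, SalesAmount, Quantity);

-- If the optimizer picks the columnstore but the B-tree would do better,
-- this query hint tells it to ignore the columnstore for one query.
SELECT   ProductKey, SUM(SalesAmount) AS TotalSales
FROM     dbo.FactSales
GROUP BY ProductKey
OPTION (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX);
```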
While the performance benefits of columnstore indexes are quite impressive, one can argue for alternative methods to achieve the same functionality, namely pre-computed summary tables or indexed views. These alternatives will surely work, but they don't offer the flexibility that columnstore indexes do. Pre-computed summary tables will approximate the effect of a columnstore, but they break down as soon as you have to change your query. Even if you only have to add one aggregate function to the query, you will need to change the structure of the summary table. Having a columnstore index on the relevant columns gives you the flexibility to run any query with any aggregate function without changing anything in the base table. Also, if you don't have a fixed set of queries to be run, it can become very difficult to maintain pre-computed tables. Another alternative is a covering index. In most cases, the columnstore index will be much more compressed than the covering index. If the query is very selective, the covering index might work faster than the columnstore index (keep in mind that column stores are almost always better for big data and almost always poorer for OLTP queries). Also, a columnstore index can use batch mode processing to substantially speed up the query.

Limitations on using a columnstore index
Now that you know the advantages of the columnstore index in SQL Server, it's high time you saw the flip side of the coin. The SQL Server columnstore index comes with its own set of limitations, and unfortunately there are a lot of them. The most glaring one is the inability to update a table with a columnstore index. Microsoft proposes some workarounds for this in its whitepaper on columnstore indexes, but it can definitely turn out to be the single biggest factor that stops you from using one. A columnstore index can't be clustered, though Microsoft says that will change in future releases of SQL Server. Also, you cannot modify a columnstore index using the ALTER INDEX statement; if you want to change it, you must drop and recreate it. You cannot apply a custom sort order with ASC/DESC, because columnstore indexes are ordered according to the compression algorithms. You cannot combine columnstore indexes with either replication or change data capture. And since columnstore indexes already use compression, you cannot combine them with the page or row level compression offered by SQL Server. There are some more limitations that you can refer to in the whitepaper or on the MSDN page for columnstore indexes.
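One of the documented workarounds for the read-only limitation is to disable the columnstore index, modify the data, and then rebuild it, roughly like this (table and index names are hypothetical):

```sql
-- Disable the columnstore index so the table becomes updatable again.
ALTER INDEX NCCI_FactSales ON dbo.FactSales DISABLE;

-- ... perform the INSERT / UPDATE / DELETE operations here ...

-- Rebuild the index to make it usable for queries again. Note that the
-- rebuild reads and re-compresses the whole table, so this pattern suits
-- periodic batch loads, not trickle updates.
ALTER INDEX NCCI_FactSales ON dbo.FactSales REBUILD;
```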

I agree that the excitement created by the use of column stores in a mainstream row store database is somewhat dampened by the numerous limitations that come bundled with it, but it is nonetheless a great initiative by Microsoft to come up with something meant for big data. The future looks destined for huge growth in the amount of data that we deal with, and that will only prompt the growth of columnar databases. Right now, SQL Server columnstore indexes stand nowhere when compared to a full columnar database in terms of functionality and performance, but at least the guys at Redmond know what they are doing, and we can definitely hope for a plethora of improvements in columnstore indexes in future SQL Server releases. I am also looking forward to a more direct and tighter integration of columnstore indexes with SQL Server Analysis Services. I hope you enjoyed reading this little article. Your comments and feedback are welcome as always.

Wednesday, 25 April 2012

Indexes in SQL Server

Introduction

When we talk of enhancing the performance of our database management systems (DBMS), the first thing that comes into focus is indexes. It would not be an exaggeration to say that indexes are the single biggest behind-the-scenes player responsible for a DBMS performing well. In this article, I will explain the basics of indexing in SQL Server 2008.

Indexes

An index in SQL Server can be visualized much like the index in a book. A book index contains a list of topics and the page numbers where they appear, arranged alphabetically for quick reference. A database index, similarly, holds an ordered list of values with pointers to the rows where each value and its corresponding data reside. For a table without indexes, the query engine has to search the whole table for any query, which is like reading the whole book from the start to get to a topic. Succinctly, an index can tremendously reduce the query time for your tables.

In SQL Server, there are two types of indexes: clustered and non-clustered. Non-clustered indexes come in many flavors; however, let's start by defining the term "clustered index."

Clustered Indexes

A clustered index stores the actual data rows at its leaf levels. This means the clustered index for a table and the data of that table are stored in the same place. Clustered indexes are organized into ranges of data, so if you need to search on ranges (e.g. between two date values), clustered indexes are the way to go. Also, for fast access, the rows are sorted (either ascending or descending) by the index key value. Because the data is physically sorted by the index key, we cannot have more than one clustered index on a table; otherwise we would have the data arranged in two different orders. SQL Server creates a clustered index by default to match the primary key, but you can also define the clustered index on any column and then define the primary key on some different column(s). Also, keep in mind that clustered indexes should be defined on columns that are "ever-increasing", a point I elaborate on later.
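The clustered-index-apart-from-primary-key arrangement can be sketched like this (all table, column and index names here are illustrative):

```sql
-- Table with an identity column and a date column.
CREATE TABLE dbo.AuditLog
(
    LogID    INT IDENTITY(1,1) NOT NULL,
    LogDate  DATETIME          NOT NULL,
    Message  NVARCHAR(400)     NULL
);

-- Cluster on the date column (good for range searches, and
-- ever-increasing if older dates are never inserted) ...
CREATE CLUSTERED INDEX CIX_AuditLog_LogDate
ON dbo.AuditLog (LogDate ASC);

-- ... while the primary key lives on a different column
-- as a non-clustered index.
ALTER TABLE dbo.AuditLog
ADD CONSTRAINT PK_AuditLog PRIMARY KEY NONCLUSTERED (LogID);
```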

Non-clustered Indexes


The leaf levels of a non-clustered index contain the index key value and a pointer to the actual data row, so the query effectively takes one more step to gather the data. The structure of the pointer depends on whether it points to a clustered table or to a heap. If pointing to a clustered table, it stores the clustered index key and uses it to navigate to the correct data row; this is called a key lookup. If referencing a heap (a table without a clustered index), it points directly to the actual data row. That is, once SQL Server identifies the rows that are needed, it still has to retrieve the column information for those rows from the data pages of the table, navigating through the clustered index to get the data for all the rows that are required.

Key lookups increase as the number of rows in the result set increases, and, more importantly, so does the cost associated with them. At some point, this cost might overshoot the benefits of the non-clustered index. In such a case, the query optimizer might overlook the non-clustered index and go for a full table scan instead. You can force the query optimizer to go the other way and use the non-clustered index, but it is usually better not to, as the optimizer is far better at choosing plans than most people are.
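For completeness, this is what forcing a non-clustered index looks like with a table hint (names are illustrative; as noted above, you should rarely do this in practice):

```sql
-- A non-clustered index on a search column.
CREATE NONCLUSTERED INDEX IX_AuditLog_Message
ON dbo.AuditLog (Message);

-- Force the optimizer to use it via an index table hint, even if it
-- would otherwise prefer a scan. LogDate and LogID come back through
-- lookups, which is exactly the cost discussed above.
SELECT LogID, LogDate
FROM   dbo.AuditLog WITH (INDEX(IX_AuditLog_Message))
WHERE  Message = N'Login failed';
```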

You can create up to 999 non-clustered indexes on a table in SQL Server 2008. Also, you can add other columns to your non-clustered index, which I will explain in the "Covering Indexes" section.

Covering Indexes

To make the most of indexes and offset the cost of lookups, they should be designed to cover most of the queries expected to hit the table, i.e. they should contain the columns that are most likely to appear in those queries.

Non-clustered indexes have a few limitations when it comes to covering queries:

  •     A 900-byte size limit on indexes
  •     A maximum of 16 columns can be named in the index
  •     Columns with data types like nvarchar(max), text, and ntext cannot be used in indexes
Fortunately, Microsoft addressed most of these problems in SQL Server 2005 with included columns, which let an index cover queries. With a covering index, we can include non-key columns in the index; they are stored at the leaf level of the index. If you expect search queries on your table to be driven more by some columns than others, you can include those columns in your non-clustered index. SQL Server then does not need to look up the actual data rows from the index; it gets all the information in one step.

With a covering index, you can have:

  • A maximum of 1023 columns as the non-key columns in the index
  •  Columns with data types like nvarchar(max) in the non-key columns of the index
You should also be careful while using covering indexes. While it may be tempting to include a lot of columns in your index, keep in mind that they also affect I/O, cache efficiency and disk space just like the key columns of the index.
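A covering index sketch using the INCLUDE clause (table and column names are illustrative): the key column drives the seek, and the included columns are stored at the leaf level so the query below is answered entirely from the index, with no lookup.

```sql
-- Key column: LogDate (used for seeking/filtering).
-- Included non-key columns: Message (can even be NVARCHAR(MAX)).
CREATE NONCLUSTERED INDEX IX_AuditLog_LogDate_Covering
ON dbo.AuditLog (LogDate)
INCLUDE (Message);

-- This query is fully covered: both columns it touches live in the index.
SELECT LogDate, Message
FROM   dbo.AuditLog
WHERE  LogDate >= '2012-01-01' AND LogDate < '2012-02-01';
```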

Filtered Indexes

SQL Server 2008 came up with a number of new and useful features, filtered indexes being one of them. A filtered index is an optimized non-clustered index designed to cover queries over a subset of the data in the underlying table. Simply put, it is a non-clustered index with a WHERE clause that filters data based on some condition(s). A well-designed filtered index has many advantages. It can substantially improve query performance because it is smaller than a full table non-clustered index. It also has filtered statistics, which are more accurate than full table statistics, so a filtered index enhances plan quality too. Because a filtered index is smaller than a full table non-clustered index, the maintenance cost associated with it is lower (maintenance is required only when the data covered by the filtered index changes). There are times when we might not need a full table non-clustered index at all; in such cases, a filtered index also reduces the disk space required for the index.

To use these advantages effectively, you need to be aware of what data is in your table and what queries are expected on it. You need to know the subsets within your data and how they map to the queries fired against the table.
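For instance, if most queries only ever touch the small "open" subset of a large orders table, a filtered index over just that subset might look like this (table, column, and index names are illustrative):

```sql
-- Index only the open orders; rows with other statuses are not stored
-- in the index at all, keeping it small and cheap to maintain.
CREATE NONCLUSTERED INDEX FIX_Orders_Open
ON dbo.Orders (OrderDate, CustomerID)
WHERE Status = 'Open';

-- Queries whose predicate matches the filter can use this index:
SELECT OrderDate, CustomerID
FROM   dbo.Orders
WHERE  Status = 'Open' AND OrderDate >= '2012-04-01';
```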

How Data is Stored

An index consists of pages arranged in a B-tree. The root node sits at the top with leaf nodes at the bottom.

When a query is fired against any column in the index, the query engine starts from the root node. It keeps going down from node to node until it reaches the correct leaf node where the required data is. Now if the index is clustered, the engine has reached where the actual data is stored. If the index is non-clustered, the engine gets the pointer to the actual data page, which is where the data is stored, from the leaf node.

Fragmentation

Before moving on to what fragmentation is, you should know about page splits and fill factor. A page split occurs when data is inserted into a data page that is already full: SQL Server splits the page into two approximately equal halves to create space for the new data. Page splits hurt index performance, since creating a new page results in fragmentation, and the process itself is resource-intensive.

MSDN defines fill factor as "a percentage from 0 to 100 that specifies how much to fill the data pages after the index is created." Setting the fill factor to 100 means there will be no space left in the pages. This should only be done for read-only tables, where there will be no insertions and thus no page splits. Any lower value leaves some space in each page for rows arriving in the future. It might be tempting to make this value quite low, say 50, but that hurts performance, since each page is only half full and the server must read more pages for the same amount of data. So, be careful about this aspect while creating your indexes.
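The fill factor is set per index, at creation or rebuild time, like so (table and index names are illustrative):

```sql
-- Leave 20% free space in each leaf page to absorb future inserts
-- without immediate page splits.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
ON dbo.Orders (CustomerID)
WITH (FILLFACTOR = 80);
```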

Basically, fragmentation is the inefficient use of pages within the index. It may result from an improper logical ordering of pages or from incomplete usage of the space within data pages. In both cases, it can cause a drastic drop in the performance of your indexes. Index fragmentation can be either external or internal.

External Fragmentation

When an index is created, its pages are logically arranged in order. When new rows are inserted into the table, new keys might be inserted in between the existing keys. To accommodate them, SQL Server might need to create new data pages, and these new pages are generally not adjacent to the existing ones. So, the pages can end up with a physical order different from the logical order. This is called external fragmentation.

Let's understand this with an example. Suppose we initially have two full data pages whose keys are in order. If a new key, say 4, is now inserted between the existing keys, a new page will have to be created to accommodate it.

With a query returning values from 3 to 11, the query engine has to fetch another data page. For a few pages, this may not matter much to the performance of the query. However, imagine the same scenario with millions of pages and thousands of insertions every day. In a nutshell, external fragmentation can prove to be one of the biggest culprits in spoiling the performance of your database system.

Internal Fragmentation

Internal fragmentation is a direct consequence of a fill factor below 100 (and of page splits). It occurs when index pages are not used to their full capacity. It may or may not be bad for system performance: for a table with a large number of inserts, the free space will actually enhance performance, but a lot of internal fragmentation hurts reads, since more pages must be read for the same amount of data.
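Both kinds of fragmentation can be measured with the `sys.dm_db_index_physical_stats` DMV (the table name below is illustrative; `SAMPLED` mode is needed because the page-fullness column is not populated in the default `LIMITED` mode):

```sql
-- avg_fragmentation_in_percent   -> external (logical order) fragmentation
-- avg_page_space_used_in_percent -> internal (page fullness) fragmentation
SELECT i.name AS IndexName,
       s.avg_fragmentation_in_percent,
       s.avg_page_space_used_in_percent
FROM   sys.dm_db_index_physical_stats(
           DB_ID(), OBJECT_ID(N'dbo.Orders'), NULL, NULL, 'SAMPLED') AS s
JOIN   sys.indexes AS i
       ON i.object_id = s.object_id
      AND i.index_id  = s.index_id;
```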

Design Considerations for Clustered Indexes

By default, a clustered index is matched with the primary key on the table. But, you can define the clustered index on one column and then create the primary key on some other column(s). In this case, the primary key would be created as a non-clustered index.

Clustered indexes store the data in ranges, say 100-200 in one node and 200-300 in another. So, if you have a query searching for a range of data, it is better to define the clustered index on the column used to filter data by range. For instance, if you have to search an audit log by date range, it is better to create the clustered index on the date column. Non-clustered indexes, on the other hand, perform better for specific value searches.

Clustered indexes should be defined on "ever-increasing" columns. A date column into which older dates are never inserted can be an ever-increasing column; an identity column always is. This matters because if the column is not ever-increasing, SQL Server has to fit future records in between existing ranges rather than placing them at the end in new ranges. When a range fills up and a new in-between value arrives, SQL Server must do a page split to allocate the new data. Effectively we then have two pages instead of one, increasing the search time (and the resources it takes). You may want to approach this differently by decreasing the fill factor (say to 65%), leaving 35% free space for incoming values. The problem with that approach is the need for regular re-indexing to maintain the free space, which incurs a heavy processing cost on the server, since it has to move the data and rebuild the non-clustered indexes.
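The periodic maintenance mentioned above is done with ALTER INDEX; a common rule of thumb is to reorganize lightly fragmented indexes and rebuild heavily fragmented ones, optionally resetting the fill factor during the rebuild (index and table names are illustrative):

```sql
-- Light fragmentation: reorganize in place (online, cheaper).
ALTER INDEX CIX_Orders_OrderDate ON dbo.Orders REORGANIZE;

-- Heavy fragmentation: full rebuild, restoring 35% free space per page.
ALTER INDEX CIX_Orders_OrderDate ON dbo.Orders
REBUILD WITH (FILLFACTOR = 65);
```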