Lately I have
been working on Vertica - a column based database from HP and I have to admit
that I am amazed by its capabilities and the functionalities it offers,
especially when we are talking about relatively big data. The basic difference
between SQL Server and Vertica is the way they store data- SQL server being a
row store database and Vertica being a column store database. With Denali, SQL
Server has also come up with something on the similar lines with ColumnStore
indexes (codenamed Apollo). In this article I will talk about columnar databases,
how they are different from the row store databases that we use everyday and
why do we need them. Also, I will try to figure out what SQL Server has to
offer with Apollo and how it stands up to any other column store database.
Column store databases
A column store
database is a database which stores table data as sections of columns of data rather than as rows of data. That
means all the values for one particular column will be stored together and
these sets of data might be distributed for different columns. Different column
values for same row are linked by pointers within. When any query is run
against a row store database, it fetches data from all the columns for the rows
that match the query conditions. In contrast, when the same query is run
against a column store database, only those columns are read which are required
by the query. So if you consider a table with 30 columns and your query needs(
this is often less than 15% of the columns in a typical fact table, according
to a Microsoft whitepaper) only five of those fields the column store database
will at least give you 6x performance over its row store counterpart (provided
the data in each column is similar). Considering the fact that disk access is
still the slowest operation for our computers, this becomes significant because
of much lesser pages to be read from the disk. Also, the columns which are not
used in the query are not loaded at all in the memory which becomes essential
to avoid frequent PLE drops when dealing with large amount of data.
Also, most
of the column store databases use high values of compression for the data they
store. The extremely high values of compression are results of the redundancy
we usually see in columns. Again, the values in the same column are invariably
of the same type, unlike the data you will see in a row, thus facilitating even
more compression. Most of the column store database systems use more than one
type of compression schemes based on the type of values stored in the column.
For instance, Vertica uses Run Length Encoding schemes to store values in a
column that are highly repetitive. These encoding schemes allow the database
system to compress each column separately based on the underlying data values which
can’t be done in a row store database because of different types of values
stored in each row for each column.
So if
column store databases are so good, why do we use row store databases at all?
Moreover, why row store databases are ruling the market when they are so inept
when compared with column store databases? The point is that row store
databases have their own advantages. You can’t use a column store database for
a small table with hundreds of OLTP transactions per minute. A row store database
will be much faster than a column store database if you want to see all the
columns of the table in your resultset (because different columns are stored at
different places on the disk and need multiple disk seeks). In a nutshell,
there are specific use cases where you should use column store databases. The
rise of column store databases has a lot to do with the rise of BI and big
data. So generally column store databases should be used where traditional row
based databases fail in delivering a satisfactory performance. Usually, such
cases will be when you are trying to run analytic queries on significantly big
data or when you need only few columns of a large table with a number of
columns (a fact table for a star schema perfectly suits the description). Row
store databases fail because they are just not designed for the purpose.
ColumnStore indexes in SQL Server
Now that
you have a basic understanding of what a columnstore is, let’s talk about this
new feature of Denali which is known as ColumnStore Index (codenamed Apollo). A
columnstore index stores each column in a separate set of disk pages as
compared to a heap/ B-tree which store multiple rows per page.
It brings
the obvious advantages of a column store, easily compressible data and fetching
only the data that we actually need in the query, to a normal SQL Server table.
The columnstore index in SQL Server employs Microsoft’s proprietary VertiPaq
technology. SQL Server columnstore indexes are pure column stores. That means that the data is stored and
compressed in column-wise fashion and individual columns can be accessed
separately from other columns. Another type of column store is hybrid column stores which stores data
as set of rows, but within that set of rows, data is organized and compressed
in column-wise fashion.
Once you
create a columnstore index on a table the query optimizer will choose to use a
heap/B-tree or the index created by you based on the query and the cost of the
query. If the optimizer chooses the columnstore index, but you feel that the
underlying B-tree can perform better, you can always query hints to use the
B-tree instead of the columnstore index. Remember that SEEKS are not supported
by columnstore indexes. That means, if you use the table hint FORCESEEK, the SQL
query optimizer will not consider the columnstore index and will rather go for
the underlying B-tree or heap. Also, SQL Server columnstore indexes can use
batch mode processing to even better the performance of the query.
While the
performance benefits of columnstore index is quite nice, one can argue using
alternative methods to achieve the same functionality, namely pre-computed
summary tables or indexed views. These alternatives will surely work but they
don’t offer the flexibility that columnstore indexes do. Pre-computed summary
tables will approximate the effect of a columnstore but they will not work you
have to change your query. Say, even if you have to add just one aggregate
function to the query, you will need to change the structure of the summary
table. Having a columnstore index on relevant columns will give you the
flexibility to run any query with any aggregate function without changing
anything in the base table. Also, if you don’t have a fixed set of queries to
be run it can become very difficult to maintain pre-computed tables. Another
alternative can be to use covering indexes. In most cases the columnstore index
will be much more compressed that the covering index. If the query is too
selective the covering index might work faster than the columnstore index (keep
in mind that column stores are always better for big data and almost always
poorer for OLTP queries). Also, columnstore index can use batch mode of
processing to substantially speed up the query.
Limitations on using a columnstore index
Now that
you know the advantages of columnstore index in SQL Server, it’s high time you
know the flip side of the coin. SQL Server columnstore index comes with its own
set of limitations, and unfortunately there are a lot of them. Among the numerous
limitations the most glaring one is the inability to update a table with a
columnstore index. Microsoft actually proposes some workarounds to achieve this
in the whitepaper on columnstore indexes but this can definitely turn out to be
the single biggest factor that actually stops you from using a columnstore
index. A columnstore index can’t be clustered though Microsoft says that it
will change with the future releases of SQL Server. Also, you cannot modify a
columnstore index using ALTER INDEX statement. If you want to do so, you must drop and recreate the columnstore
index. You cannot do a custom sort in a columnstore index by using ASC/DESC
because columnstore indexes are ordered
according to the compression algorithms. You cannot combine columnstore indexes
with either replication or change data capture. As columnstore indexes already
use compression you cannot combine it with page or row level compression
offered by SQL Server. There are some more limitations that you can refer to in
the whitepaper or on the msdn page for columnstore indexes.
I agree
that the excitation created by use of column stores in a mainstream row store
database is somewhat dampened by the numerous limitations that comes bundles
with it but nonetheless it is a great initiative by Microsoft to try and come
up with something meant for big data. The future looks destined for huge growth
in the amount of data that we deal with and that will only prompt the growth of
columnar databases. Right now SQL Server columnstore indexes stand nowhere when
compared to full columnar database in terms of functionalities and performance
but at least the guys at Redmond know what they are doing and we can definitely
hope for a plethora of improvements in columnstore indexes in the future SQL
Server releases. Also, I am looking forward to a more direct and tighter
integration of columnstore indexes and SQL Server Analysis Services. I hope you
enjoyed reading this little article. Your comments and feedback are welcome as
always.