
Wednesday, 22 August 2012

Columnar Databases and SQL Server Columnstore Indexes


Lately I have been working on Vertica, a column-based database from HP, and I have to admit that I am amazed by its capabilities and the functionality it offers, especially when we are talking about relatively big data. The basic difference between SQL Server and Vertica is the way they store data: SQL Server is a row store database, while Vertica is a column store database. With Denali, SQL Server has also come up with something along similar lines in the form of columnstore indexes (codenamed Apollo). In this article I will talk about columnar databases, how they differ from the row store databases we use every day, and why we need them. I will also try to figure out what SQL Server has to offer with Apollo and how it stands up against other column store databases.

Column store databases
A column store database stores table data as sections of columns rather than as rows. That means all the values for one particular column are stored together, and these column sections may be distributed independently of one another; the column values belonging to the same row are linked internally by pointers. When a query runs against a row store database, it fetches data from all the columns of every row that matches the query conditions. In contrast, when the same query runs against a column store database, only the columns the query actually references are read. So if you consider a table with 30 columns and your query needs only five of those fields (a Microsoft whitepaper notes that a typical query touches fewer than 15% of the columns in a fact table), the column store can give you roughly 6x less I/O than its row store counterpart (provided the data volume in each column is similar). Considering that disk access is still the slowest operation our computers perform, this is significant because far fewer pages have to be read from disk. Also, the columns not used in the query are never loaded into memory at all, which helps avoid frequent PLE (Page Life Expectancy) drops when dealing with large amounts of data.
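As a minimal sketch (the table and column names here are hypothetical), this is the kind of analytic query that benefits: a wide fact table of which only a handful of columns are referenced, so a column store reads just those columns while a row store must read entire rows.

-- dbo.FactSales stands in for a wide fact table (say 30 columns);
-- the query below touches only five of them.
SELECT ProductKey, StoreKey, DateKey,
       SUM(SalesAmount) AS TotalSales,
       SUM(Quantity)    AS TotalQuantity
FROM dbo.FactSales
GROUP BY ProductKey, StoreKey, DateKey;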



Also, most column store databases achieve high compression ratios on the data they store. These extremely high ratios are a result of the redundancy we usually see within a column. Moreover, the values in the same column are invariably of the same type, unlike the mixed data you will see across a row, which facilitates even more compression. Most column store systems use more than one compression scheme, chosen per column based on the type of values stored in it. For instance, Vertica uses Run Length Encoding (RLE) for columns whose values are highly repetitive. These encoding schemes let the database compress each column separately based on its underlying values, which can't be done in a row store because each row mixes different types of values across its columns.
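To illustrate run length encoding (this is a conceptual sketch, not any engine's actual storage format):

-- A sorted, low-cardinality column such as
--   Country: IN, IN, IN, IN, US, US, FR
-- run-length encodes to (value, run count) pairs:
--   (IN, 4), (US, 2), (FR, 1)
-- Seven values collapse into three pairs; the longer the runs,
-- the better the compression.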

So if column store databases are so good, why do we use row store databases at all? Moreover, why are row store databases ruling the market if they are so inept compared with column stores? The point is that row store databases have their own advantages. You can't use a column store database for a small table with hundreds of OLTP transactions per minute. A row store database will also be much faster than a column store if you want all the columns of the table in your resultset (because in a column store the different columns live in different places on disk and need multiple disk seeks). In a nutshell, there are specific use cases where you should use a column store database. The rise of column stores has a lot to do with the rise of BI and big data, so generally they should be used where traditional row-based databases fail to deliver satisfactory performance. Usually that means analytic queries over significantly big data, or queries that need only a few columns of a large, wide table (a fact table in a star schema fits the description perfectly). Row store databases fall short here because they were simply not designed for the purpose.

ColumnStore indexes in SQL Server
Now that you have a basic understanding of what a column store is, let's talk about this new feature of Denali known as the columnstore index (codenamed Apollo). A columnstore index stores each column in a separate set of disk pages, as opposed to a heap or B-tree, which stores multiple rows per page.

 

It brings the obvious advantages of a column store, highly compressible data and fetching only the data the query actually needs, to a normal SQL Server table. The columnstore index in SQL Server employs Microsoft's proprietary VertiPaq technology. SQL Server columnstore indexes are pure column stores: the data is stored and compressed column-wise, and individual columns can be accessed separately from one another. The other type of column store is the hybrid column store, which stores data as sets of rows, but within each set organizes and compresses the data column-wise.
Once you create a columnstore index on a table, the query optimizer chooses between the heap/B-tree and your new index based on the query and its estimated cost. If the optimizer picks the columnstore index but you feel the underlying B-tree would perform better, you can always use query hints to force the B-tree instead. Remember that seeks are not supported by columnstore indexes; if you use the table hint FORCESEEK, the optimizer will not consider the columnstore index and will go for the underlying B-tree or heap instead. Also, SQL Server columnstore indexes can use batch mode processing to improve query performance even further.
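Here is a minimal sketch of both sides of this (the table and index names are hypothetical; IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX is the query hint Denali provides for steering the optimizer away from the columnstore index):

-- Create the columnstore index on the columns your analytic queries touch
CREATE NONCLUSTERED COLUMNSTORE INDEX csi_FactSales
ON dbo.FactSales (ProductKey, StoreKey, DateKey, SalesAmount, Quantity);

-- Let the optimizer choose; for big aggregations it will normally pick the columnstore
SELECT ProductKey, SUM(SalesAmount) AS TotalSales
FROM dbo.FactSales
GROUP BY ProductKey;

-- Force it back to the row store if you believe the B-tree will do better
SELECT ProductKey, SUM(SalesAmount) AS TotalSales
FROM dbo.FactSales
GROUP BY ProductKey
OPTION (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX);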
While the performance benefits of columnstore indexes are quite impressive, one could argue for alternative methods to achieve the same functionality, namely pre-computed summary tables or indexed views. These alternatives will surely work, but they don't offer the flexibility that columnstore indexes do. Pre-computed summary tables approximate the effect of a columnstore, but they break down as soon as you have to change your query: even adding just one aggregate function means changing the structure of the summary table. A columnstore index on the relevant columns gives you the flexibility to run any query with any aggregate function without changing anything in the base table. And if you don't have a fixed set of queries, maintaining pre-computed tables can become very difficult. Another alternative is a covering index, but in most cases the columnstore index will be compressed far more than the covering index. If the query is very selective the covering index might work faster (keep in mind that column stores are always better for big data and almost always worse for OLTP queries). And again, a columnstore index can use batch mode processing to substantially speed up the query.

Limitations on using a columnstore index
Now that you know the advantages of columnstore indexes in SQL Server, it's high time you saw the flip side of the coin. The SQL Server columnstore index comes with its own set of limitations, and unfortunately there are a lot of them. The most glaring one is the inability to update a table that has a columnstore index. Microsoft proposes some workarounds in its whitepaper on columnstore indexes, but this can easily turn out to be the single biggest factor that stops you from using one. A columnstore index can't be clustered, though Microsoft says that will change in future releases of SQL Server. You also cannot modify a columnstore index using the ALTER INDEX statement; to change it, you must drop and recreate the index. You cannot apply a custom sort order with ASC/DESC, because columnstore indexes are ordered according to the compression algorithms. You cannot combine columnstore indexes with replication or change data capture. And since columnstore indexes already use their own compression, you cannot combine them with the page or row level compression that SQL Server offers elsewhere. There are more limitations, which you can look up in the whitepaper or on the MSDN page for columnstore indexes.
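As a sketch of the most common workaround (names hypothetical): because the table becomes effectively read-only, the whitepaper's suggestions boil down to dropping the index, loading the data, and rebuilding it (or switching in a freshly loaded partition).

-- The table cannot be updated while the columnstore index exists, so:
DROP INDEX csi_FactSales ON dbo.FactSales;

-- Load the new data while the index is gone
INSERT INTO dbo.FactSales (ProductKey, StoreKey, DateKey, SalesAmount, Quantity)
SELECT ProductKey, StoreKey, DateKey, SalesAmount, Quantity
FROM staging.FactSalesNew;

-- Recreate the index once the load is done
CREATE NONCLUSTERED COLUMNSTORE INDEX csi_FactSales
ON dbo.FactSales (ProductKey, StoreKey, DateKey, SalesAmount, Quantity);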

I agree that the excitement created by having a column store inside a mainstream row store database is somewhat dampened by the numerous limitations that come bundled with it, but it is nonetheless a great initiative by Microsoft to come up with something meant for big data. The future looks destined for huge growth in the amount of data we deal with, and that will only drive the growth of columnar databases. Right now SQL Server columnstore indexes are nowhere near a full columnar database in terms of functionality and performance, but at least the guys at Redmond know what they are doing, and we can definitely hope for a plethora of improvements in columnstore indexes in future SQL Server releases. I am also looking forward to a more direct and tighter integration of columnstore indexes with SQL Server Analysis Services. I hope you enjoyed reading this little article. Your comments and feedback are welcome as always.

Wednesday, 25 April 2012

New Features of SQL Server Denali


We have just been greeted by another new version of SQL Server in the form of SQL Server 2012/Denali. This release comes with a number of new features, and in this article I talk about a few of them. AlwaysON is definitely one of the most exciting things Microsoft has added to SQL Server and seems set to change the high availability methods used by most organizations. Contained databases will solve a lot of problems for DBAs who regularly have to move databases from one instance to another. And finally I talk about a new function added in this release that is going to help developers a lot in the future.
AlwaysON/HADRON
While database mirroring has been one of the distinguished members of the SQL Server high availability club for a while now, with the release of SQL Server 2012, aka Denali, we will be witnessing another toddler crawling its way into this club. And if the forecasts are to be believed, this little guy is going to rock big time.
But why did I mention mirroring when I am supposedly bragging about AlwaysON? Because AlwaysON can be called Mirroring 2.0 for more than one reason. A lot of things have been upgraded to create this little toddler, so let's talk about what has changed in moving from mirroring to HADRON. (As a matter of fact, Microsoft says mirroring will be removed two releases from now.)
With mirroring we were not able to use the data on the mirror database. Though SQL Server itself sometimes used the mirror to accomplish automatic page repairs, nobody else could do anything useful with the mirror copy of the data. The best we could do was take a database snapshot of the mirror and query the snapshot, but that approach hardly helps if you are looking to take real load off the production server.
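For reference, querying a mirror meant something like this (database, logical file, and path names are hypothetical):

-- Create a static, point-in-time snapshot of the mirror database...
CREATE DATABASE MirrorDB_Snap
ON (NAME = MirrorDB_Data, FILENAME = 'D:\Snapshots\MirrorDB_Snap.ss')
AS SNAPSHOT OF MirrorDB;

-- ...and read from the snapshot, not from the mirror itself
SELECT COUNT(*) FROM MirrorDB_Snap.dbo.Orders;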
Database mirroring also didn't let us combine disaster recovery across data centers with high availability within the same data center; we could have one or the other, but not both. So mirroring was essentially HA or DR, but not HADR (now you see where the name comes from).
Moving on, mirroring failovers couldn't be coordinated by a DBA because they are database-level events. If an application depended on many databases, we could not use automatic failover.
With AlwaysON, a DBA can create availability groups: crudely, a set of databases that fail over together to support any application that needs them. So now you need not manually fail over each database your application depends on. With AlwaysON you can have up to four secondary replicas (known as mirrored servers in the pre-HADRON era). These replicas can be connected synchronously (HA) as well as asynchronously (DR), so you get increased availability and uptime. Moreover, these replicas are not passive; they are active and can provide read-only access, so you can easily take load off your production server wherever read access is all you need (a very popular scenario is moving Reporting Services off the production server onto one of the secondaries). Also, because the replicas are active, you can take copy-only full backups and transaction log backups on a secondary.
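A minimal sketch of creating an availability group (all group, server, endpoint, and database names here are hypothetical):

CREATE AVAILABILITY GROUP AppAG
FOR DATABASE OrdersDB, BillingDB   -- the databases that fail over together
REPLICA ON
    'SQLNODE1' WITH (
        ENDPOINT_URL = 'TCP://sqlnode1.example.com:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,   -- HA within the data center
        FAILOVER_MODE = AUTOMATIC),
    'SQLNODE2' WITH (
        ENDPOINT_URL = 'TCP://sqlnode2.example.com:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,  -- DR to the remote site
        FAILOVER_MODE = MANUAL,
        SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));  -- active, readable secondary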
The other side of HADRON
The biggest downside of AlwaysON is a setup process that is much more difficult than mirroring's. While mirroring won many hearts with its extremely easy setup, AlwaysON relies on Windows clustering, which is infamous for the hardships it presents during setup. But as they say, all good things come at a price, and for AlwaysON that price is Windows clustering. A dose of PowerShell will probably help the DBAs here; people can't just get scared and walk away from such a good gift from the guys at Redmond!
Contained Databases
Any DBA will be aware of the issues that arise when a database is backed up from one instance and restored to another. In other words, SQL Server databases are, as some people put it, not very 'portable'. That's true because when you restore or attach a SQL Server database that is part of some application, you miss a lot of things that belong to the application and must be ported along with the database. Some might argue that those objects are not part of the database, being 'outside' objects, but they definitely need to be shipped along with the database or the application will no longer be usable. These 'outside' objects include CLR objects, logins and roles, linked servers, Database Mail, Service Broker objects, replication, SQL Server Agent jobs, etc. Also, if the target instance has a different collation, comparisons between temp tables and the new database may break, and you might not be able to join across databases.
With Denali, Microsoft has presented a solution to these problems in the form of contained databases. A contained database can be seen as a black box that carries everything it needs to be used on any SQL Server instance. It includes all the required settings and metadata and operates in isolation from the instance hosting it; in other words, it has no functional dependency on the SQL Server instance, be it logins, collation settings, or metadata. So it can be moved to any instance without affecting the applications that use it.
In Denali, the first iteration of the contained databases feature (called a partially contained database) lets you create a database-specific user without a login (and you can create multiple such users with the same name in different databases). You can also use a DMV to show most of the objects or code you have that threaten containment.
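As a quick sketch of how you would get started (the database name matches the examples further down; the sp_configure option is the one Denali documents for this feature):

-- Containment must first be enabled at the instance level
EXEC sp_configure 'contained database authentication', 1;
RECONFIGURE;
GO

-- Then create (or alter) a database as partially contained
CREATE DATABASE ContainedDB CONTAINMENT = PARTIAL;
GO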
I will not give all the implementation details here, but I will nevertheless touch upon a few things.
Fully contained vs. Partially contained database
A fully contained database does not allow any objects or functions that cross the application boundary, whereas a partially contained database allows features that cross it. Inside the application boundary is the application model, where applications are developed and managed; examples of entities inside the application model include system tables like sys.tables, contained database users with passwords, and user tables in the current database referenced by a two-part name. Outside the application boundary is the management model, which pertains to instance-level functions and management; examples of entities outside the boundary include system tables like sys.endpoints, users mapped to logins, and user tables in another database referenced by a three-part name.
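To make the boundary concrete (table names hypothetical):

-- Inside the boundary: a two-part name in the current database
SELECT * FROM dbo.Orders;

-- Outside the boundary: a three-part name reaching into another database,
-- which a fully contained database would not allow
SELECT * FROM OtherDB.dbo.Orders;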
Users without login                                                   
You can now create a user not tied to any login.
USE [ContainedDB];
GO
CREATE USER [NewUser] WITH PASSWORD = 'DB1';
GO

USE [ContainedDB2];
GO
CREATE USER [NewUser] WITH PASSWORD = 'DB2';
GO

The question arises: if there are two users with the same name, how does SQL Server figure out which one you mean? Users are now authenticated at the database level first (the database being specified in the connection string), and SQL Server falls back to looking for a login with the same name only if no user is found at the database level. So if you specify a valid contained database user name but no database context, authentication will fail even though the password is correct.
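For example, with sqlcmd the database has to be named explicitly (the server name is hypothetical; the -d switch supplies the database context, and the user and password come from the example above):

sqlcmd -S MyServer -d ContainedDB -U NewUser -P DB1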
While contained users sound great if you want to move databases from one instance to another, they are even more fun with AlwaysON. Consider a scenario where your application fails over to the other server in your cluster: if you are using contained databases, you have nothing to worry about, because the users live inside the database and can connect to the new server without any logins having to be recreated there.
Threats to containment
Even if you are going to implement database containment, there may be parts of your application that can't be contained and that can make containment fail. Denali has come up with a new DMV, sys.dm_db_uncontained_entities, to trace out such objects. Not only will this DMV point out the objects, but for procedures, functions, views, and triggers it will also highlight the lines within those objects that are creating the problems. For instance, if a procedure uses dynamic SQL five times, this DMV will list the procedure five times, each time with the dynamic SQL statement that caused the problem. Also, if you have referenced any object that does not exist yet, it will be picked up by the DMV too, because the object might break containment after it is created.
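A minimal query against the DMV might look like this (the OBJECT_NAME translation only applies to rows whose class is an object or column):

SELECT ue.class_desc,
       OBJECT_NAME(ue.major_id) AS offending_object,   -- NULL for non-object classes
       ue.statement_line_number,                       -- line inside the module, if any
       ue.statement_type,                              -- e.g. dynamic SQL
       ue.feature_name                                 -- the uncontained feature used
FROM sys.dm_db_uncontained_entities AS ue;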
The Format() function
This string function is introduced in Denali to support formatting of data. In previous versions of SQL Server we used the CONVERT function for this, but now we have a deserving successor to it. The function takes three parameters (the third one optional): the first is the data value you want to format and the second specifies the format. The most interesting is the third parameter, 'culture', with which you can specify the country or language whose conventions the formatted output should follow. Let's look at some examples.
Syntax : FORMAT ( value, format [, culture ] )
Date in local language
SELECT FORMAT (GETDATE(), N'dddd MMMM dd, yyyy', 'hi') AS DateinHindi;
SELECT FORMAT (GETDATE(), N'dddd MMMM dd, yyyy', 'fr') AS DateinFrench;
SELECT FORMAT (GETDATE(), N'dddd MMMM dd, yyyy', 'en') AS DateinEnglish;

Fig 1. Date in local language
Date in location aware format
DECLARE @date DATETIME = '01/26/2012';
SELECT FORMAT ( @date, 'd', 'en-US' ) AS US_Date;
SELECT FORMAT ( @date, 'd', 'fr-FR' ) AS FR_Date;
Fig 2. Date in location aware format
Currency in local format

DECLARE @money INT = 100;
SELECT FORMAT ( @money, 'c', 'en-IN' ) AS India_Currency;
SELECT FORMAT ( @money, 'c', 'en-US' ) AS US_Currency;
SELECT FORMAT ( @money, 'c', 'fr-FR' ) AS France_Currency;

Fig 3. Currency in local format

Different currency formats
DECLARE @money INT = 100;
SELECT FORMAT(@money,'c') AS Currency_Default;
SELECT FORMAT(@money,'c1') AS Currency1;
SELECT FORMAT(@money,'c2') AS Currency2;
SELECT FORMAT(@money,'c3') AS Currency3;


Fig 4. Different currency formats


Different calculation formats
DECLARE @cal INT = 50;
SELECT FORMAT(@cal,'x') AS Hexadecimal;
SELECT FORMAT(@cal,'x3') AS HexadecimalAgain;
SELECT FORMAT(@cal,'p') AS Percentage;
SELECT FORMAT(@cal,'e') AS Scientific;


Fig 5. Different calculation formats