The Basics of Indexing for Memory Optimized Tables

As you have probably heard, SQL Server 2014 features a new In Memory OLTP engine, previously known by the code name Hekaton. One of the big features of In Memory OLTP is Memory Optimized Tables: tables that are loaded entirely into memory, rather than residing on disk and being cached based on use. As you can imagine, storing tables in memory means they are accessed differently, and therefore indexed differently. We’ll take a look at some of the basics of how indexing works for Memory Optimized Tables here.

Our Memory Optimized Tables have two options for types of indexes: Hash Indexes and Nonclustered BW-trees. We’ll take a look at the benefits of each in a moment, but first let’s consider the details that apply to both.

Memory Optimized Indexes:

– Exist only in memory. All indexes are rebuilt on database startup.

– Do not contain actual data rows. Instead, the data rows for a given table exist in only one place in memory, and indexes are made up of memory pointers to those rows.

– Do not duplicate data. All indexes on a table point to the same rows in memory; the data is not duplicated.

– Are all covering indexes. Because indexes are made up of pointers to full rows, they automatically cover all columns of a table.

– Are limited to 8 per table. This can be any combination of the types of indexes.

– Must be created along with the table. All memory optimized indexes are created inline in the CREATE TABLE statement.

Types of Memory Optimized Indexes:

Hash Indexes

Hash indexes are made up of hash buckets pointing to linked lists of rows. They are optimized for point lookups and inserts. They do not support range scans or inequality predicates, because the rows are arranged by hash value and are therefore not necessarily in order by the index key. For the same reason, they cannot be used for ordered scans. Let’s look at the DDL for a hash index. Here’s an example of creating a hash index at the column level:

CREATE TABLE [foo](
    -- durable (SCHEMA_AND_DATA) memory optimized tables require a primary key
    [bar] INT NOT NULL PRIMARY KEY NONCLUSTERED
        INDEX IX_bar HASH WITH (BUCKET_COUNT = 100),
    [boo] VARCHAR(200) NOT NULL
)
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

You can also create the hash index at the table level:

CREATE TABLE [foo](
    -- durable memory optimized tables require a primary key
    [bar] INT NOT NULL PRIMARY KEY NONCLUSTERED,
    INDEX IX_bar HASH ([bar]) WITH (BUCKET_COUNT = 100))
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

As you can see, this is actually table creation DDL. As mentioned above, all memory optimized indexes must be created along with their table.

You’ll notice that each example includes WITH (BUCKET_COUNT = #). The bucket count is an important part of defining a memory optimized hash index. The hash function SQL Server uses for hash indexes hashes each index key value and maps it to a hash bucket. Each hash bucket points to the first row of data with a corresponding index key, and that row points to the next row, and so on. It follows that for best performance we want each possible index key value to have its own bucket, and each bucket to point to as few rows as possible. For these reasons, Microsoft recommends defining hash indexes with double the number of buckets as unique index key values, and using a regular nonclustered memory optimized index for tables with many repeating index key values.

If we have too few buckets, the hash function will map multiple index key values to the same bucket. This is called a hash collision, and it can seriously degrade performance. It is important to avoid having too few buckets, so if we’re not sure, we should always go for the higher bucket count.
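You can check how well a bucket count fits your data using the hash index statistics DMV. Here’s a minimal sketch, run in the database that holds your memory optimized tables; long average chains suggest too few buckets (or many duplicate key values), while a very high proportion of empty buckets suggests the count is oversized:

-- bucket usage statistics for all hash indexes in the current database
SELECT OBJECT_NAME(hs.object_id) AS table_name,
       i.name AS index_name,
       hs.total_bucket_count,
       hs.empty_bucket_count,
       hs.avg_chain_length,
       hs.max_chain_length
FROM sys.dm_db_xtp_hash_index_stats hs
JOIN sys.indexes i ON i.object_id = hs.object_id
    AND i.index_id = hs.index_id;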

Nonclustered Indexes:

The second kind of memory optimized index is a nonclustered index. These special nonclustered indexes are based on the new BW-tree. The architecture was named with tongue firmly placed in cheek, as BW stands for “buzz word”. The name is fitting, however, as this new type of B-tree has some great functionality.

These nonclustered indexes support all operations and are the safer option to try when migrating to the new in memory architecture. They are optimized for range lookups and ordered scans, support point lookups (though not quite as fast as a good hash index), and also support searching on left-based subsets of the index key.

Let’s take a look at the DDL. Here is how we create a column level nonclustered index:

CREATE TABLE [foo](
    -- the primary key, required for durable memory optimized tables,
    -- doubles as our column level nonclustered index
    [bar] INT NOT NULL PRIMARY KEY NONCLUSTERED)
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

And here is how we create a table level nonclustered index:

CREATE TABLE [foo](
    -- durable memory optimized tables require a primary key
    [bar] INT NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 100),
    INDEX IX_bar NONCLUSTERED ([bar]))
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

Keep in mind, these indexes are unidirectional, meaning they can only be scanned in the direction in which they were created. So, if you are going to use an ORDER BY clause, you need to be sure the sort direction of the corresponding index key matches the sort order of your query.
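For example, here’s a minimal sketch of an index sorted descending; it can back ORDER BY [bar] DESC, but not ORDER BY [bar] ASC:

CREATE TABLE [foo](
    [bar] INT NOT NULL PRIMARY KEY NONCLUSTERED,
    -- sorted descending, so usable only for descending ordered scans
    INDEX IX_bar_desc NONCLUSTERED ([bar] DESC))
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

-- can use IX_bar_desc for an ordered scan:
SELECT [bar] FROM [foo] ORDER BY [bar] DESC;
-- cannot, because the index is unidirectional:
SELECT [bar] FROM [foo] ORDER BY [bar];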

 


Using Resource Governor to set MaxDOP for specific SQL Server jobs

I was recently reading a great post by Norman Chan about the use of maximum degree of parallelism, or MaxDOP, settings to decrease index rebuild times. Norman has some interesting information in his post and I suggest reading it. The specific case he is looking at is a system in which MaxDOP is set to a value lower than 8, which he finds has the best performance. This post is a quick rundown on an alternate way to implement a custom MaxDOP for specific SQL Server jobs, such as index maintenance.

This process requires the use of the Resource Governor. Keep in mind that this feature is available in Enterprise Edition only.

Per TechNet, “Resource Governor enables you to manage SQL Server workloads and resources by specifying limits on resource consumption by incoming requests”. Conveniently, one of the settings Resource Governor allows us to set for specific requests is MaxDOP. The reason Resource Governor is a good way to handle this is that changing the server-wide MaxDOP setting clears your entire procedure cache, whereas setting it for specific requests with Resource Governor does not.

The first step is to create our resource pool:

CREATE RESOURCE POOL [indexMaintenance] WITH(min_cpu_percent=0,
             max_cpu_percent=100,
             min_memory_percent=0,
             max_memory_percent=100,
             AFFINITY SCHEDULER = AUTO
)
GO

These are default values and should not affect performance.

We then need to create a workload group:

CREATE WORKLOAD GROUP [wg_indexMaintenance] WITH(group_max_requests=0, 
        importance=Medium, 
        request_max_cpu_time_sec=0, 
        request_max_memory_grant_percent=100, 
        request_memory_grant_timeout_sec=0, 
        max_dop=8) USING [indexMaintenance]
GO

Here, everything but the max_dop and request_max_memory_grant_percent settings is a default value. This is where our custom MaxDOP is set. Note that request_max_memory_grant_percent defaults to 25, meaning a single request could only consume 25% of the maximum memory available to the resource pool we associate the group with, indexMaintenance. I have set it to 100 here, as there will likely not be other requests sharing the pool. Bear in mind that these settings are effectively “wide open”; tweak them as your system requires.

Next, we need to create the login that will be associated with our workload group and give it permissions:

USE [master]
GO
CREATE LOGIN [indexMaintenance_user] WITH PASSWORD=N'ChangeThis', DEFAULT_DATABASE=[master], CHECK_EXPIRATION=OFF, CHECK_POLICY=OFF
GO
ALTER SERVER ROLE [sysadmin] ADD MEMBER [indexMaintenance_user]
GO

Please change your password to a strong password.

Next, we need to create a classifier function, which is the function SQL Server will run against new logins in order to determine their proper workload group:

CREATE FUNCTION fnRGClassifier()
RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @group sysname;
    IF (SUSER_NAME() = 'indexMaintenance_user')
    BEGIN
        SET @group = N'wg_indexMaintenance';
    END
    -- Use the default workload group if there is no name match
    ELSE
    BEGIN
        SET @group = N'default';
    END
    RETURN @group;
END;
GO

This is a basic function based on code I originally received from Jonathan Kehayias (blog | twitter). Note that if you have other workload groups in use on your system, you will need to modify this code to route their logins as well.

We then set Resource Governor to use this function:

ALTER RESOURCE GOVERNOR with (CLASSIFIER_FUNCTION = dbo.fnRGClassifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO    

We now have Resource Governor configured with a classifier function that routes all requests from our new login, indexMaintenance_user, to our workload group wg_indexMaintenance, which has a custom MaxDOP of 8. In other words, all requests from indexMaintenance_user will run with MaxDOP 8, without our having to change any server-wide settings!
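To confirm that sessions are being classified as expected, you can check which workload group each active session landed in. A quick sketch:

-- each user session's login and the workload group it was classified into;
-- sessions from indexMaintenance_user should show wg_indexMaintenance
SELECT s.session_id,
       s.login_name,
       wg.name AS workload_group
FROM sys.dm_exec_sessions s
JOIN sys.dm_resource_governor_workload_groups wg ON s.group_id = wg.group_id
WHERE s.is_user_process = 1;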

From here, we just need to make sure that our index maintenance processes run as indexMaintenance_user. Generally, this is as simple as adding the following to the beginning of our index maintenance job step:

EXECUTE AS LOGIN = 'indexMaintenance_user';

Here’s a sketch of what the complete job step body might look like (dbo.IndexMaintenance is a hypothetical stand-in for whatever index maintenance procedure you actually run):
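EXECUTE AS LOGIN = 'indexMaintenance_user';

-- dbo.IndexMaintenance is a placeholder for your index maintenance procedure
EXEC dbo.IndexMaintenance;

REVERT;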

The EXECUTE AS statement could also be added to the beginning of your index maintenance stored procedure.

Keep in mind that this method can be used to apply restrictions to any other kind of job as well.  I hope you find this tip useful. If you use Resource Governor for other interesting things in SQL Server, leave a comment!


Using Replication and Indexed Views to create a Reporting Server: Part 2

In my previous post, we looked at an overview of the two technologies we are combining to create a SQL Server reporting server: Indexed Views and Replication. As we saw, replication allows us to move our data from one server to another and indexed views allow us to prepare our data to be used by reports ahead of time, dramatically increasing speeds. Now that we have an idea of what each of these two technologies does, let’s take a look at how they come together. In this explanation, we’re assuming that both transactional replication and our indexed view have already been set up. We’ll cover how that is done in later posts.

Generally, if we are trying to set up a reporting server, it is because our primary OLTP (Online Transaction Processing) database isn’t delivering the data fast enough or we want to scale out by moving part of our workload. Our OLTP server is where we will begin tracking our data, where replication will pick it up and eventually move it to our reporting server.

Continuing with our widgets example, imagine that a shipment has come into our warehouse and someone updates the database with a new inventory count for one of our widgets, a sonic screwdriver. In a normal OLTP database, this data gets updated in memory, written to the logs and eventually written to the data file. With our reporting server setup, this all still happens, but some additional processes are also at play.

A Log Reader Agent will watch the log of our OLTP server (which is called the Publisher in replication, because it is the source of our data) for any committed changes to data that we are replicating. When it sees that our sonic screwdriver has had its inventory count changed, it will copy that transaction, and any others that are relevant to replication articles, in a batch to the distribution database.

The distribution database is a completely separate database which can run on the same instance as the publisher, the subscriber’s instance (our reporting server) or on its own instance. It houses the guts of the transaction process and acts as a staging point for any data moving between publishers and subscribers. The update for our sonic screwdriver inventory will wait there until the Distribution Agent moves it to the subscriber.
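If you’re ever curious about what is waiting at this staging point, you can peek at the distribution database’s tracking tables. A rough sketch; the table names below are the standard ones for transactional replication, though your distribution database may not use the default name of distribution:

USE distribution;
GO
-- the most recent replicated transactions and their queued commands
SELECT TOP (10)
       t.entry_time,
       t.xact_seqno,
       c.article_id
FROM dbo.MSrepl_transactions t
JOIN dbo.MSrepl_commands c ON c.publisher_database_id = t.publisher_database_id
    AND c.xact_seqno = t.xact_seqno
ORDER BY t.entry_time DESC;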

The distribution agent can run in one of two places, depending on how you’ve set up your system. It will run either at the distributor, in what is called a push subscription (data is pushed from the distributor to the subscriber), or at the subscriber, in what is called a pull subscription (data is pulled from the distributor by the subscriber). In either case, it will move data from the distribution database to the subscriber database. In our example, the distribution agent will move the update of our inventory count for our sonic screwdriver from the distribution database to the inventory table in the reporting database. This is the last step along the way that transactional replication handles. From here, our indexed view takes over.

At this point in the process, our inventory count has been updated by a user on our OLTP server, that transaction has been written to the logs where it was picked up by the log reader agent, which brought it to our distribution database. From the distribution database, it was pushed or pulled to our subscriber database by the distribution agent, where it is written into the replicated tables. As I stated in my last post, once we create an index on a view, that data will actually physically exist within our database. However, the underlying data will still exist in its respective tables.

When a table is updated, all of its indexes are updated as well (and in the case of a table with a clustered index, the index physically IS the table). This includes the indexes of indexed views. As you’ll see when we create our example indexed view, the data is processed by our view definition and saved in that state as it is inserted. Think of it as running the new data through our report’s query definition and saving the results into an index. So when the distribution agent moves the data to the subscriber (our report database) and inserts it, our indexed view is updated as well. Our reports are up to date the moment the data reaches our reporting server, ready to be read for that report your boss needs immediately! This is really the beauty of indexed views: having the data already processed and saved can increase report speeds dramatically. Reports can go from running for several minutes down to less than a few seconds.

Keep in mind that updating these indexes as the data is being inserted is extra overhead for our server, as extra pages need to be written. This is one place where having a separate reporting server really shines. We don’t have to worry so much about this write overhead because it is abstracted away from our primary OLTP server. Data is written quickly to the OLTP server and then carried over to our reporting server via replication, protecting our write heavy OLTP server from that overhead. The additional write overhead of the indexed views may slow the replication inserts slightly, but it is generally negligible and reports will be up to date within seconds of the data being updated on the OLTP server, in my experience.

In my next post, we will begin looking at how we actually set up our reporting server.


Using Replication and Indexed Views to create a Reporting Server: Part 1

Reports that require heavy aggregation and run often can become taxing on our database and perform poorly. Who wants that? Requiring our system to aggregate the data each time it is requested costs us time, and as any DBA knows, lower response time is better. Luckily, SQL Server provides ways to aggregate data ahead of time, and one of them is through indexed views. Coupled with SQL Server’s transactional replication technology, we can create a reporting server with pre-aggregated data that will improve the performance of reports by orders of magnitude. In this series we’ll look at how this works, starting with today’s overview.

Well, that sounds great, right? So, how do we do it? We’ll get to that in a later post; first, let’s get a basic idea of what each of the two technologies we’re combining does.

What is SQL Server Replication?

SQL Server’s replication technology, in a nutshell, lets us actively move data from one database to another. Those databases do not have to be on the same server. SQL Server will keep the data synchronized and consistent. This allows us to make our data available on one or more secondary servers in near real-time.  This can be useful for numerous reasons, not the least of which are off-loading workloads and reporting.

Replication can require a bit of work to set up but, in my experience, is very stable once configured.

So, we can move our data to a second server. That’s handy and all, but aside from moving the workload, how are we really gaining speed? That is accomplished through Indexed Views.

What is an Indexed View?

To understand an Indexed View, first we need to understand what a View is. A View is basically a way to store a query as an object within SQL Server. So, let’s imagine we have a database that stores data for our e-commerce page, and we need to track inventory. Say we have a simple report that finds the name, ID, and number available for the widgets we currently have in stock:

SELECT i.widgetID, i.numInStock, w.name
FROM inventory i
JOIN widgets w on i.widgetID = w.widgetID
WHERE i.numInStock > 0

We can create a View:

CREATE VIEW currentWidgets AS
SELECT i.widgetID, i.numInStock, w.name
FROM inventory i
JOIN widgets w on i.widgetID = w.widgetID
WHERE i.numInStock > 0

And then:

SELECT widgetID, numInStock, name
FROM currentWidgets

This will give us the same result as our original report query. There are a lot of uses for views, but what we primarily need to understand here is that we can use the text of a report’s query (or sometimes something slightly different, in order to meet some requirements) as the definition of a view to create a view that will give us the same results as that report.

Simply creating a view for a report does not give us any performance gain. Getting the data from a view still runs the underlying query every time we need to get the data, which means any aggregation or joining the reports needs to do is still done at run time. This is where the index part of indexed view comes into play.

When we create an index on a view, rather than simply storing the query we used in the definition of the view and running it when we query the view, the database actually stores the results of that query as their own object. You can think of this like a special table that is tailored to hold only the results of that specific query. This means any aggregation or joining is done ahead of time, as the data changes, leaving the results immediately available to you when you run your reports.
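To make our earlier currentWidgets view indexable, it needs a couple of small changes. Roughly, and assuming widgetID is unique in the inventory table:

-- indexed views require SCHEMABINDING and two-part object names
CREATE VIEW dbo.currentWidgets
WITH SCHEMABINDING
AS
SELECT i.widgetID, i.numInStock, w.name
FROM dbo.inventory i
JOIN dbo.widgets w ON i.widgetID = w.widgetID
WHERE i.numInStock > 0
GO

-- the unique clustered index is what materializes the view's results
CREATE UNIQUE CLUSTERED INDEX IX_currentWidgets
ON dbo.currentWidgets (widgetID)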

Note: You will see Indexed Views referred to as Materialized Views in some Database Management Systems.

In my next post, we’ll take a look at how we can combine these two technologies to create a reporting server. After that, we’ll get into the details of how it’s done.


Introduction

Hello, and welcome to Zen and the Art of Database Administration. My name is Alex Barnes and I am a professional Database Administrator. Database administration can be daunting and, at times, overwhelming. My goal is to present useful information about databases, primarily Microsoft SQL Server, regular administrative tasks and the technologies surrounding the systems in a succinct way. Just as Zen Buddhists emphasize enlightenment and insight of the self, we will seek understanding of our systems.
