Friday, 15 May 2020

Integration Services Perfromance Best Practices - Data Flow Optimization

In this article, we will present a couple of common best practices regarding the performance of Integration Services (SSIS) packages. These are general guidelines for package design and development which guarantee a good performance in most use cases. However, there are always exceptional cases and outliers. Thoroughly test any changes to your packages tp conclude that a change made a positive effect. This means not only running in the designer, but also on the server. Also, this article is not an exhaustive list of all possible performance improvements for SSIS packages. It merely represents a set of best practices that will guide you through the most common development patterns.

Let's get started with data flow performance

Use a SQL statement in the source component

Well, this only applies if  your source supports SQL. For a flat file you just need to select the columns you need. The idea behind this statement is that when you write a SQL statement that will already optimize the data for SSIS data flow:
  • You only select the columns needed.
  • You already perform the necessary data type conversions using CAST or CONVERT
  • You can already perform complex logic such as joining different tables together
  • You can filter the data
  • You can sort the data if needed
Whatever you do, don't use the dropdown box to select  a table. There's no logic you can add it, it will select all columns and all rows of the table and it's actually slower than using a SQL statement (as it uses OPENQUERY behind the scenes).
Chang the Data Access Mode to SQL Command or SQL Command from variable. The second option is useful if you want to dynamically build your query using expressions.

Get as many rows as you can into a buffer

The SSIS data flow uses the memory buffers to manage the data flowing through the data flow. It's very important that you can get as many rows into a single buffer. Imagine a line of people passing buckets to put a fire. The more water you can get into a bucket, the quicker you can out a fire. To achieve this goal, you can take actions:
  • Only select  the columns you need. If you bring unnecessary columns into the data flow, you waste precious buffer space.
  • Keep the columns as small as possible. If the total row length is smaller, you can fit more rows into a buffer. Try converting the columns to a smaller size in the source if possible. The data flow takes the maximum size of each column to calculate the total row length. For example, if a column has the size varchar(4000), the data flow will assume a length of 4000 bytes, even if the actual rows contain much less data.

Don't use the default buffer settings

While we are on the topic of buffers, let's look at some data flow properties:
  • DefaultBufferSize: the default size of a buffer which is set to 10MB.
  • DefaultBufferMaxRows: the default number of rows the data flow will try to put in a buffer, This is set to 10,000 rows.
When calculating the buffer size, the data flow uses these two properties. If one property is reached, the size of the buffer is not further enlarged. For example, if the row size is 1MB, the data flow will put 10 rows into the buffer, since the maximum buffer size will then be met. If the row size is only 1KB, the data will se the maximum 10,000 rows.

These two default stings haven't been changed since the release of SSIS in 2005. Apparently, a time where memory was still crazy expensive. The first thing you'll want to do is set these properties to a much higher value. There is no golden number that will guarantee optimal performance all the time. Sometimes a smaller buffer is better the source is really slow, sometimes a bigger buffer is better if you don't have many columns and very fast source. Keep in mind that buffers can be too big as well. Think about the line of people passing buckets. If the buckets are too big, it takes too long to fill them with water and the fire still rages on. In reality, it takes some testing to find the optimal settings for your data flow. Typically I run my package wit default settings and monitor the execution time. Then I enlarge the buffers to 30MB and 50,000 rows for example and see what effect it has. And then I try to determine if the package is better off with even larger buffers or if they need to be a bit smaller.

Luckily, all this testing can now largely be avoided by one new property introduced in SQL Server 2016: AutoAdjustBufferSize. When this property is set to True, it will automatically adjust the size of the buffer so that the DefaultBufferMaxRows setting is met. For example, if you have set it to 40,000  rows, the data flow will automatically scale the buffer size so that the buffers will contain 40,000 rows. Unfortunately this property is set to False by default.

In conclusion make those buffers bigger. This can really have a drastic impact on the performance.

Avoid blocking transformations

Even if you follow the previous tips to the letter, performance can still be terrible if you use a blocking transformation. These are the types of transformations in a data flow.
  • Non-blocking: The buffers are unchanged and flow through the transformation. The data flow performs the transformation on a memory buffer and moves on to the next. Examples are: derived column, data type conversion, conditional split etc.
  • Semi-blocking: The buffer can change in size; either in the number of columns or in the number of rows. However, there is no blocking behavior. Examples a re Union All, Merge and Merge Join
  • Blocking: The bane of SSIS performance. These transformations need to read all the buffers before they can output even one single buffer. This can lead to memory pressure which causes the SSIS package to spill to disk. In other words, your package runs now for hours instead of minutes or seconds. Examples are Sort and Aggregate
All transformations can be divided into two categories: synchronous and asynchronous. Synchronous means the buffer doesn't change in size at all. All non-blocking transformations are synchronous. Sometimes this isn't obvious: the Multicast component creates multiple output paths, which seems multiple buffers are being created. However, behind the scenes it's still the same memory buffer. Asynchronous components do change the size of the buffer Columns can be added or removed. Rows can be added or removed as well. All blocking and semi-blocking transformations are asynchronous.

The rule is simple: avoid asynchronous transformations. In most cases, these are blocking anyway. The Union All has the benefit of the doubt; it's the bad asynchronous transformation. However, if you can design your package so you can avoid it, it's recommended you do. The Merge and Merge Join are to avoided as well, because they require sorted input. You cam sort in the data flow, but the Sort component is a blocking transformation, so it needs to be avoided. An alternative is to sort in the source component, for example by adding an ORDER BY clause to the SQL statement. The data flow needs to know the data is sorted thought. You can do this by setting the IsSorted property to true on the output in the Advanced Editor.
You also need to indicate which columns are sorted. Keep in mind that this doesn't sort the data for you. This is just metadata information.

Don't use the OLE DB command transformation

Don't use it, Ever. Unless you have really small data sets and you are 100% sure they wont grow. The OLE DB command executes SQL Statement for  every row in the buffer. This means if 1 million rows pass through the transformation, the SSIS package will sent 1 million statements on the server. Most likely the transaction log will blow up nd it might even fill your disks completely.

A work around is to write the records to a staging table and then use a Execute SQL Task to do a set-based SQL statement. In other words, use the Execute SQL Task to run single UPDATE statement instead several thousand UPDATE statements.

Use the FastLoad option

This tip only applies when you are using the OLE DB Destination and if your destination supports the fast load option. It's very common through to have SQL Server as the destination. In that case, use the fast load option. If not, you are inserting records row by row which has a devastating impact on the transaction log.
The default settings work fine in general. You can tweak them a bit for a specific need though. A good use case is when loading data to a clustered column store index. Here you want that only batches of at least 102,400 rows (preferably over a million) are sent to SQL Server. In this case, the batches end up immediately in a compressed row group instead of the delta store (which is row based and thus slower).

In most  cases, the OLE DB Destination - when writing to SQL Server using the Native Client - outperforms all other alternatives. The SQL Server Destination is supposed to be faster, but it has some limitations and nowadays the OLE DB Destination is the best option. OLE DB is deprecated though by Microsoft. This means you can still use the OLE DB connection managers, but there are no new features being released. For example, AlwaysEncrypted is supported by ADO.NET but not by OLE DB.

Prepare your Destination

Just as you need to prepare your source (for example by putting the right indexes into place to speed up reading), you need to prepare your destination. Here are some examples of how you can make your destination go faster (most apply when the destination is SQL Server):
  • Put the destination on a fast disk. SSDs are preferred.
  • For large loads, you can disable constraints on the table and re-enable them after the load. The same applies for indexes: you can drop them first and recreate them after the SSIS package has finished. This avoids fragmentation on your indexes.
  • If you write to a database make sure the database files have the appropriate size. You don't want them to be automatically growing all the time (especially not with the default settings). Estimate what your total data volume is and how it is going to grow in the coming months. Then size your files accordingly.
  • If possible, try to load in parallel. You'll probably need to take precautions to avoid locking and blocking. Try to avoid a hot spot in your clustered index: all the load processes try to insert the last page of the clustered index at the same time. The Balanced Data Distributor can help you to parallelize you data flow. This component is a separate download for SSIS 2012 and 2014, but is built in into SSIS 2016
    • Go for as minimal logging as possible. Set the recovery model to Simple, if your disaster recovery policy allows it. Do bulk inserts (using the fast load options) and configure them to be minimally logged.
Remember if your destination is slow, your source will be slow as well due to backpressure in the data flow. In other words, if your destination can't keep up the with the source, SSIS will tell the source to slow down. You can easily test this by removing the destination and running the package. If suddenly your data flow is very fast, the destination is the problem.

Know the difference between ETL and ELT

Integration Services is a great tool and if designed correctly, performance can be extraordinary. However, you don't need to approach all problems with a hammer. For example, if the source and the destination are on the same destination, ELT probably will outperform SSIS. With ELT, the data is loaded into the destination (with SSIS or SP) and the transformation are done with stored procedures. Especially when SSIS resides on another server, ELT will be preferred because it will save you the network round trip between the servers. When you use the SSIS data flow, all data needs to be read into the memory of the server where package is running. If the server is another ne than the database server, you need to transfer all the data over the network, Twice.

Integration Services excels in transferring data from Server A to Server B. However, once the data in the destination, ELT probably is the best option to move forward. This doesn't mean you don't need to use SSIS anymore. The SSIS control flow is ideal for visualizing and orchestrating the flows of your data. Furthermore, it has the additional advantage of easily running tasks in parallel (which is not do easy in stored procedure).

Aside from data movement , some tasks are better done using SQL statements instead of using transformations in a data flow. The most obvious examples are sorting and aggregating data, as discussed in the section blocking components. Updating data is also better done outside the data flow in an Execute SQL Task. Think about the transformations you want to do and where they will be most efficient in the data flow or in the database engine.

Conclusion

If you follow a couple of design principles, Integration Services packages will perform just fine in most scenarios. This article presented to you these design principles along with some extra performance tips. It's not an exhaustive list, there are other performance tricks out there.

SQL Server Query Execution Plans for beginners – Clustered Index Operators

In this article we will continue discussing the various execution plan operators related to clustered indexes and what they do, when do the...