Friday, 15 May 2020

Integration Services Performance Best Practices - Data Flow Optimization

In this article, we will present a couple of common best practices regarding the performance of Integration Services (SSIS) packages. These are general guidelines for package design and development which ensure good performance in most use cases. However, there are always exceptional cases and outliers. Thoroughly test any changes to your packages to conclude that a change had a positive effect. This means not only running in the designer, but also on the server. Also, this article is not an exhaustive list of all possible performance improvements for SSIS packages. It merely represents a set of best practices that will guide you through the most common development patterns.

Let's get started with data flow performance.

Use a SQL statement in the source component

Well, this only applies if your source supports SQL. For a flat file you just need to select the columns you need. The idea is that when you write a SQL statement, you can already optimize the data for the SSIS data flow:
  • You only select the columns needed.
  • You already perform the necessary data type conversions using CAST or CONVERT.
  • You can already perform complex logic, such as joining different tables together.
  • You can filter the data.
  • You can sort the data if needed.
Whatever you do, don't use the dropdown box to select a table. There's no logic you can add to it, it will select all columns and all rows of the table, and it's actually slower than using a SQL statement (as it uses OPENQUERY behind the scenes).
Change the Data Access Mode to SQL Command or SQL Command from variable. The second option is useful if you want to dynamically build your query using expressions.
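For example, a source query along these lines (the table and column names are purely illustrative) already trims, converts, joins and filters the data before it ever reaches the data flow:

SELECT  c.CustomerID,
        CAST(c.CustomerName AS varchar(100)) AS CustomerName, -- shrink the column to its actual size
        o.OrderDate,
        o.TotalAmount
FROM    dbo.Orders AS o
        INNER JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
WHERE   o.OrderDate >= '20200101'; -- filter at the source instead of in the data flow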

Get as many rows as you can into a buffer

The SSIS data flow uses memory buffers to manage the data flowing through the data flow. It's very important that you get as many rows as possible into a single buffer. Imagine a line of people passing buckets to put out a fire. The more water you can get into a bucket, the quicker you can put out the fire. To achieve this goal, you can take the following actions:
  • Only select the columns you need. If you bring unnecessary columns into the data flow, you waste precious buffer space.
  • Keep the columns as small as possible. If the total row length is smaller, you can fit more rows into a buffer. Try converting the columns to a smaller size in the source if possible. The data flow takes the maximum size of each column to calculate the total row length. For example, if a column has the size varchar(4000), the data flow will assume a length of 4000 bytes, even if the actual rows contain much less data.

Don't use the default buffer settings

While we are on the topic of buffers, let's look at some data flow properties:
  • DefaultBufferSize: the default size of a buffer, which is set to 10MB.
  • DefaultBufferMaxRows: the default number of rows the data flow will try to put in a buffer. This is set to 10,000 rows.
When calculating the buffer size, the data flow uses these two properties. If either limit is reached, the size of the buffer is not enlarged further. For example, if the row size is 1MB, the data flow will put 10 rows into the buffer, since the maximum buffer size will then be met. If the row size is only 1KB, the data flow will use the maximum of 10,000 rows.

These two default settings haven't been changed since the release of SSIS in 2005, apparently a time when memory was still crazy expensive. The first thing you'll want to do is set these properties to a much higher value. There is no golden number that will guarantee optimal performance all the time. Sometimes a smaller buffer is better if the source is really slow; sometimes a bigger buffer is better if you don't have many columns and a very fast source. Keep in mind that buffers can be too big as well. Think about the line of people passing buckets. If the buckets are too big, it takes too long to fill them with water and the fire still rages on. In reality, it takes some testing to find the optimal settings for your data flow. Typically I run my package with the default settings and monitor the execution time. Then I enlarge the buffers, to 30MB and 50,000 rows for example, and see what effect it has. Then I try to determine if the package is better off with even larger buffers or if they need to be a bit smaller.

Luckily, all this testing can now largely be avoided thanks to a new property introduced in SQL Server 2016: AutoAdjustBufferSize. When this property is set to True, it will automatically adjust the size of the buffer so that the DefaultBufferMaxRows setting is met. For example, if you have set it to 40,000 rows, the data flow will automatically scale the buffer size so that the buffers contain 40,000 rows. Unfortunately, this property is set to False by default.

In conclusion, make those buffers bigger. This can really have a drastic impact on performance.

Avoid blocking transformations

Even if you follow the previous tips to the letter, performance can still be terrible if you use a blocking transformation. These are the three types of transformations in a data flow:
  • Non-blocking: The buffers are unchanged and flow through the transformation. The data flow performs the transformation on a memory buffer and moves on to the next. Examples are Derived Column, Data Conversion, Conditional Split, etc.
  • Semi-blocking: The buffer can change in size, either in the number of columns or in the number of rows. However, there is no blocking behavior. Examples are Union All, Merge and Merge Join.
  • Blocking: The bane of SSIS performance. These transformations need to read all the buffers before they can output even one single buffer. This can lead to memory pressure, which causes the SSIS package to spill to disk. In other words, your package now runs for hours instead of minutes or seconds. Examples are Sort and Aggregate.
All transformations can also be divided into two categories: synchronous and asynchronous. Synchronous means the buffer doesn't change in size at all. All non-blocking transformations are synchronous. Sometimes this isn't obvious: the Multicast component creates multiple output paths, which makes it seem as if multiple buffers are being created. However, behind the scenes it's still the same memory buffer. Asynchronous components do change the size of the buffer: columns can be added or removed, and rows can be added or removed as well. All blocking and semi-blocking transformations are asynchronous.

The rule is simple: avoid asynchronous transformations. In most cases, these are blocking anyway. The Union All gets the benefit of the doubt; it's the least bad of the asynchronous transformations. However, if you can design your package so you can avoid it, it's recommended you do. The Merge and Merge Join are to be avoided as well, because they require sorted input. You can sort in the data flow, but the Sort component is a blocking transformation, so it needs to be avoided too. An alternative is to sort in the source component, for example by adding an ORDER BY clause to the SQL statement. The data flow needs to know the data is sorted, though. You can do this by setting the IsSorted property to True on the output in the Advanced Editor.
You also need to indicate which columns are sorted. Keep in mind that this doesn't sort the data for you. This is just metadata information.
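As a small illustration (again with hypothetical table and column names), the source query below pushes the sort into the database engine; afterwards you still have to set IsSorted to True and give the CustomerID output column a SortKeyPosition of 1 in the Advanced Editor:

SELECT CustomerID, CustomerName, City
FROM   dbo.Customers
ORDER BY CustomerID; -- the engine sorts the data; no blocking Sort transformation needed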

Don't use the OLE DB command transformation

Don't use it. Ever. Unless you have really small data sets and you are 100% sure they won't grow. The OLE DB Command executes a SQL statement for every row in the buffer. This means that if 1 million rows pass through the transformation, the SSIS package will send 1 million statements to the server. Most likely the transaction log will blow up, and it might even fill your disks completely.

A workaround is to write the records to a staging table and then use an Execute SQL Task to run a set-based SQL statement. In other words, use the Execute SQL Task to run a single UPDATE statement instead of several thousand UPDATE statements.
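For example, once the data flow has written the changed rows to a staging table, the Execute SQL Task can run a single set-based statement like the sketch below (dbo.Customers and stg.Customers are illustrative names):

UPDATE  c
SET     c.CustomerName = s.CustomerName,
        c.City         = s.City
FROM    dbo.Customers AS c
        INNER JOIN stg.Customers AS s ON s.CustomerID = c.CustomerID; -- one statement instead of thousands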

Use the FastLoad option

This tip only applies when you are using the OLE DB Destination and if your destination supports the fast load option. It's very common though to have SQL Server as the destination. In that case, use the fast load option. If not, you are inserting records row by row, which has a devastating impact on the transaction log.
The default settings work fine in general, but you can tweak them a bit for a specific need. A good use case is loading data into a clustered columnstore index. Here you want only batches of at least 102,400 rows (preferably over a million) to be sent to SQL Server. In this case, the batches end up immediately in a compressed row group instead of in the delta store (which is row based and thus slower).

In most cases, the OLE DB Destination - when writing to SQL Server using the Native Client - outperforms all other alternatives. The SQL Server Destination is supposed to be faster, but it has some limitations, and nowadays the OLE DB Destination is the best option. OLE DB has been deprecated by Microsoft, though. This means you can still use the OLE DB connection managers, but no new features are being released for them. For example, Always Encrypted is supported by ADO.NET but not by OLE DB.

Prepare your Destination

Just as you need to prepare your source (for example by putting the right indexes into place to speed up reading), you need to prepare your destination. Here are some examples of how you can make your destination go faster (most apply when the destination is SQL Server):
  • Put the destination on a fast disk. SSDs are preferred.
  • For large loads, you can disable constraints on the table and re-enable them after the load. The same applies to indexes: you can drop them first and recreate them after the SSIS package has finished. This also avoids fragmentation of your indexes (see the sketch after this list).
  • If you write to a database make sure the database files have the appropriate size. You don't want them to be automatically growing all the time (especially not with the default settings). Estimate what your total data volume is and how it is going to grow in the coming months. Then size your files accordingly.
  • If possible, try to load in parallel. You'll probably need to take precautions to avoid locking and blocking. Try to avoid a hot spot in your clustered index, where all the load processes try to insert into the last page of the clustered index at the same time. The Balanced Data Distributor can help you parallelize your data flow. This component is a separate download for SSIS 2012 and 2014, but is built into SSIS 2016.
  • Go for as minimal logging as possible. Set the recovery model to Simple, if your disaster recovery policy allows it. Do bulk inserts (using the fast load option) and configure them to be minimally logged.
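A minimal sketch of the drop/disable pattern mentioned in the list above; the table, index and constraint names are hypothetical, the first batch runs before the data flow and the second one after it:

-- before the load
ALTER INDEX IX_Sales_OrderDate ON dbo.Sales DISABLE;   -- nonclustered index
ALTER TABLE dbo.Sales NOCHECK CONSTRAINT ALL;          -- disable foreign key and check constraints

-- after the load
ALTER INDEX IX_Sales_OrderDate ON dbo.Sales REBUILD;
ALTER TABLE dbo.Sales WITH CHECK CHECK CONSTRAINT ALL; -- re-enable and re-validate the constraints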
Remember, if your destination is slow, your source will be slow as well due to backpressure in the data flow. In other words, if your destination can't keep up with the source, SSIS will tell the source to slow down. You can easily test this by removing the destination and running the package. If your data flow is suddenly very fast, the destination is the problem.

Know the difference between ETL and ELT

Integration Services is a great tool and, if designed correctly, performance can be extraordinary. However, you don't need to approach every problem with a hammer. For example, if the source and the destination are on the same server, ELT will probably outperform SSIS. With ELT, the data is loaded into the destination (with SSIS or with stored procedures) and the transformations are done with stored procedures. Especially when SSIS resides on another server, ELT is preferred because it saves you the network round trip between the servers. When you use the SSIS data flow, all data needs to be read into the memory of the server where the package is running. If this is a different server than the database server, you need to transfer all the data over the network. Twice.

Integration Services excels at transferring data from server A to server B. However, once the data is in the destination, ELT is probably the best option to move forward. This doesn't mean you don't need SSIS anymore. The SSIS control flow is ideal for visualizing and orchestrating the flows of your data. Furthermore, it has the additional advantage of easily running tasks in parallel (which is not so easy in stored procedures).

Aside from data movement, some tasks are better done using SQL statements instead of transformations in a data flow. The most obvious examples are sorting and aggregating data, as discussed in the section on blocking transformations. Updating data is also better done outside the data flow, in an Execute SQL Task. Think about the transformations you want to do and where they will be most efficient: in the data flow or in the database engine.
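For instance, instead of an Aggregate transformation the database engine can produce the same result with a plain set-based statement (the table and column names are illustrative):

SELECT   CustomerID,
         SUM(TotalAmount) AS TotalSales
FROM     dbo.Orders
GROUP BY CustomerID; -- aggregation done by the engine, not in the data flow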

Conclusion

If you follow a couple of design principles, Integration Services packages will perform just fine in most scenarios. This article presented these design principles to you, along with some extra performance tips. It's not an exhaustive list; there are other performance tricks out there.

Thursday, 14 May 2020

Create a linked server to an Azure SQL database via SSMS

This article will explain how to create and configure a linked server to retrieve data from an Azure SQL database. It will also explain how to resolve common problems/issues that occur during the process of creating a linked server to an Azure SQL database.

To create a linked server via SSMS, go to Object Explorer, and under the Server Objects folder right-click the Linked Servers folder and, from the context menu, choose the New Linked Server command.
The New Linked Server window will open.
In this window, the first thing that needs to be done is to enter a name for the linked server in the Linked server box and to choose the server type by clicking the SQL Server or Other data source radio button under the General tab.
Under the Security tab, select the Be made using this security context radio button and enter user credentials that exist on the Azure server.
After entering the user credentials, press the OK button to create a linked server to an Azure SQL database. Now, under the Linked Servers folder, the Azure linked server that we created will appear, and in the Catalogs folder all available databases will be listed.
But when a particular database (e.g. TestDatabase) is expanded in order to see its tables, the following error message will appear.
This error occurs because Azure does not allow altering the master details. To resolve this, you need to connect directly to the Azure database you are going to use.

Delete the SQL Azure linked server that we created and create a new one.
This time, under the Server type section of the General tab, choose the Other data source radio button. The name for the linked server in the Linked server text box can, this time, be whatever you like (e.g. AZURE SQL DATABASE). From the Provider drop-down box, choose the Microsoft OLE DB Provider for SQL Server item. In the Data source text box, enter the name of the Azure SQL Server (e.g. server.database.windows.net). The most important setting for correctly creating a linked server to an Azure SQL database is to enter, in the Catalog text box, the name of the Azure SQL database (e.g. TestDatabase) for which you want to create the linked server. Otherwise, if this field is left empty, we will encounter the 40515 error when trying to get a list of the tables under the Catalogs folder.
Under the Security tab, use the same settings that we used in the previous example and press the OK button. This will create a linked server to an Azure SQL database (the TestDatabase database), and when the plus (+) sign next to the Tables folder is pressed, the Tables folder will expand and show all tables of the TestDatabase database:
To retrieve data from the SQL Azure table (e.g. CustomerAddress), type the following code:
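A statement along the following lines should do the trick; the SalesLT schema is an assumption and may differ in your database:

SELECT *
FROM [AZURE SQL DATABASE].[TestDatabase].[SalesLT].[CustomerAddress]; -- four-part name via the linked server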
If everything goes well, the data from the CustomerAddress table will appear in the Results grid.
Now if you try to execute a stored procedure from the AZURE SQL DATABASE linked server, the following message may appear:
This is because the RPC and RPC Out (Remote Procedure Call) options, which allow remote procedures to be called from the linked server or to be called to the linked server, are set to false by default.

To fix that, right-click the AZURE SQL DATABASE linked server and choose the Properties option:
Under the Server Options tab, set the RPC and RPC Out options to True:
Or in a query editor, paste and execute the following code:
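A sketch of the equivalent T-SQL, assuming the linked server is named AZURE SQL DATABASE as above:

EXEC master.dbo.sp_serveroption @server = N'AZURE SQL DATABASE', @optname = N'rpc',     @optvalue = N'true';
EXEC master.dbo.sp_serveroption @server = N'AZURE SQL DATABASE', @optname = N'rpc out', @optvalue = N'true';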

OPENQUERY Function

The OPENQUERY function can be used to get data from the linked server by executing code like this:
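For example (the SalesLT schema is again an assumption):

SELECT *
FROM OPENQUERY([AZURE SQL DATABASE], 'SELECT * FROM SalesLT.CustomerAddress'); -- query is executed remotely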
The result will be the same as from the example above.

This is an ad hoc method of connecting to a remote server using the linked server and querying data from the remote server. If the connection to the remote server is used frequently, then using the linked server directly is a better solution than using the OPENQUERY function.

This function can be used in the FROM clause of the INSERT, SELECT, DELETE or UPDATE statement.

Create a linked server to an Azure SQL database using Transact-SQL

In order to create a linked server to an Azure SQL database, type the following code in a query editor:
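The following is a minimal sketch; the linked server name, server address, database name and credentials are the example values used earlier in this article and should be replaced with your own:

EXEC master.dbo.sp_addlinkedserver
     @server     = N'AZURE SQL DATABASE',
     @srvproduct = N'',
     @provider   = N'SQLNCLI',
     @datasrc    = N'server.database.windows.net',
     @catalog    = N'TestDatabase';

EXEC master.dbo.sp_addlinkedsrvlogin
     @rmtsrvname  = N'AZURE SQL DATABASE',
     @useself     = N'FALSE',
     @locallogin  = NULL,
     @rmtuser     = N'your_user',     -- placeholder Azure SQL login
     @rmtpassword = N'your_password'; -- placeholder password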

Connecting to Azure SQL database and querying data using the distributed queries

To connect to an Azure SQL database and access data without first creating a linked server, use the T-SQL OPENROWSET or OPENDATASOURCE functions.

To open a connection and query data from the Azure SQL database using the OPENROWSET function, type the following code in a query editor:
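A sketch using the provider string form; the server name, database name, schema and credentials are placeholders:

SELECT a.*
FROM OPENROWSET('SQLNCLI',
     'Server=server.database.windows.net;Database=TestDatabase;UID=your_user;PWD=your_password;',
     'SELECT * FROM SalesLT.CustomerAddress') AS a;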
If, for some reason, the above code does not work, use the code below to connect to and query data from the Azure SQL database.
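An alternative sketch using the datasource; user; password form of OPENROWSET (same placeholder values):

SELECT a.*
FROM OPENROWSET('SQLNCLI',
     'server.database.windows.net'; 'your_user'; 'your_password',
     'SELECT * FROM SalesLT.CustomerAddress') AS a;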
Another way of connecting and querying data from Azure SQL database is by using the OPENDATASOURCE function.

In a query editor, paste and execute one of the following codes:
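For example (connection details are placeholders):

SELECT a.*
FROM OPENDATASOURCE('SQLNCLI',
     'Data Source=server.database.windows.net;User ID=your_user;Password=your_password')
     .TestDatabase.SalesLT.CustomerAddress AS a;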
A common error may occur when using the T-SQL OPENROWSET and OPENDATASOURCE functions:
To resolve this, the Ad Hoc Distributed Queries option should be enabled. To enable it, use the sp_configure procedure: in a query editor, paste and execute the following code.
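EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
RECONFIGURE;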

Wednesday, 31 October 2018

Steps for Installing SQL Server AlwaysOn Availability Groups

An availability group supports a failover environment for a discrete set of user databases, known as availability databases, that fail over together. An availability group supports one set of primary databases and one to eight sets of corresponding secondary databases. Secondary databases are not backups; continue to back up your databases and their transaction logs on a regular basis.

Prerequisites required to enable the SQL Server 2012 AlwaysOn Availability Groups feature

  1. Get the OS installed, patched and configured on all participating nodes.
  2. Use either Windows Update or an internal Windows Server Update Services (WSUS) server to get all the required Windows updates downloaded and installed.
  3. A dedicated domain user account should be created for use by the SQL Server service. This should just be a regular domain account.
  4. Have separate accounts for the SQL Server Agent service, SSAS, SSIS & SSRS. Having separate accounts is more secure and resilient, since a problem with one account won't affect all the SQL Server services.
  5. Ensure the password is not temporary and that a complex password is used.
  6. Both the SQL Server and OS editions and versions should be at the same level on all participating nodes.
  7. All replicas in your AlwaysOn group must be in the same Windows domain. They must be able to communicate with each other.
  8. AlwaysOn Availability Groups are only supported in Enterprise edition starting from SQL Server 2012 (except that SQL Server 2016 supports basic availability groups in Standard edition).
  9. It is recommended to have the same collation on all replicas.
  10. SQL Server ports must be opened at the firewall level for communication between replicas.
  11. Create a shared network folder accessible to all participating nodes.
  12. .NET Framework 3.5.1 or greater needs to be installed on all participating nodes.
  13. Make sure the databases are in the FULL recovery model.
  14. Databases included in an AlwaysOn group must be user databases. System databases cannot participate in an AlwaysOn Availability Group.
  15. Read-only databases cannot belong to an AlwaysOn group.
  16. Databases may only belong to one availability group at a time.
  17. Make sure full backups of each database are made prior to installing AlwaysOn.
  18. No cluster shared volume is required for AlwaysOn; it can be configured on local disks.
  19. Make sure to have separate NICs for public and private communication.
  20. An additional NIC is required if you want to isolate AlwaysOn replication traffic to a dedicated NIC.
  21. Make sure to have two free IP addresses: one for the Windows cluster IP and one for the AlwaysOn listener IP.
AlwaysOn Availability Groups require a Windows Server Failover Cluster, so we first need to add the Windows Failover Clustering feature to all the nodes running the SQL Server instances that we will configure as replicas.

We have a two-node Windows failover cluster (SQL1 & SQL2) already set up, as shown in the screenshot below.

Once the failover cluster is installed, we can proceed with enabling the AlwaysOn Availability Groups feature in SQL Server 2012. This needs to be done on all the SQL Server instances that you will configure as replicas in your Availability Group.


How to Enable the SQL Server 2012 AlwaysOn Availability Groups Feature

Default standalone SQL Server instances are installed on nodes SQL1 & SQL2.

Step 1
Open SQL Server Configuration Manager. Double-click the SQL Server (MSSQLSERVER) service to open the properties dialog box. On the AlwaysOn High Availability tab, tick the Enable AlwaysOn Availability Groups checkbox.

In the screenshot below, AONTESTWFCS01 is the Windows cluster name; SQL1 and SQL2 are the nodes.

 
Step 2
Restart the SQL Server service.

Step 3
To create and configure a SQL Server 2012 AlwaysOn Availability Group, open SQL Server Management Studio and connect to the SQL Server instance.

In Object Explorer, expand the AlwaysOn High Availability folder, right-click the Availability Groups folder and select the New Availability Group Wizard... option. This will launch the New Availability Group Wizard.

Step 4
In the Introduction page, click Next.

In the Specify Availability Group Name page, enter the name of the Availability Group in the Availability group name field. Click Next.

 
Step 5
In the Select Databases page, select the checkbox beside the database that you want to include in your Availability Group. The databases have to be in Full recovery model prior to joining them in the Availability group. Click Next.
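If a database still needs to be switched to the FULL recovery model and backed up (prerequisites 13 and 17), a couple of T-SQL statements like this sketch will do; the database name and backup path are placeholders:

ALTER DATABASE [TestDB] SET RECOVERY FULL;
BACKUP DATABASE [TestDB]
TO DISK = N'\\SQL1\Backup\TestDB.bak' -- placeholder network share
WITH INIT;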

 
 Step 6
In the Specify Replicas page, under the Replicas tab, click the Add Replicas button and connect to the other SQL Server instances that you joined as nodes in your Windows Server Failover Cluster. 

 Configure the following options:
  1. Automatic Failover
  2. Synchronous commit
  3. Readable secondary
The screenshot below shows the specified replicas, SQL1 & SQL2.

Step 7
In the Endpoints tab, verify that the port number value is 5022 and that the endpoint name is Hadr_endpoint.
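You can also double-check the endpoint name and port on each replica with a query such as:

SELECT e.name, te.port, e.state_desc
FROM   sys.database_mirroring_endpoints AS e
       INNER JOIN sys.tcp_endpoints AS te ON te.endpoint_id = e.endpoint_id;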



Step 8
In the Backup Preferences tab, select your backup preferences, as shown in the screenshot below.

Step 9
In the Listener tab, select the Create an availability group listener option. Enter the following details.

Listener DNS name: AGLIS01
Port: 1633
Listener IP Address: 192.168.35.18

Click the Add... button to provide an IP address. In the Add IP Address dialog box, enter the preferred virtual IP address in the IPv4 Address field. Click OK. Click Next.
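If you prefer T-SQL, an equivalent listener can also be added after the wizard finishes with a statement along these lines; the availability group name and the subnet mask are assumptions, while the DNS name, port and IP address are the values listed above:

ALTER AVAILABILITY GROUP [AG01]                       -- hypothetical availability group name
ADD LISTENER N'AGLIS01'
(
    WITH IP ( (N'192.168.35.18', N'255.255.255.0') ), -- subnet mask is an assumption
    PORT = 1633
);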


Step 11
In the Validation page, verify that all validation checks return successful results. Click Next.

Step 12
In the Summary page, verify all configuration settings and click Finish. This will create and configure the AlwaysOn Availability Group and join the databases. 

In this article we will continue discussing the various execution plan operators related to clustered indexes and what they do, when do the...