Data Points

SqlConnectionStringBuilder, DataView, and More

John Papa

Code download available at: DataPoints0601.exe (130 KB)

Contents

Binary Serialization
DataView Enhancements
Building Connections
What Are Batch Updates?
Choose Carefully
Wrapping It Up

Several significant enhancements have been made to ADO.NET 2.0 in the areas of performance, flexibility, and features. In my last column (Data Points: DataSet and DataTable in ADO.NET 2.0), I discussed the improvements made to the DataTable class, the effect of the powerful new LoadOption enumeration, how to change a row's state, and the new DataTableReader class. In this installment of Data Points, I will continue to explore significant enhancements introduced with ADO.NET 2.0, such as the new SqlConnectionStringBuilder, improvements to the DataView, batch updates, and improved DataSet serialization. I will also demonstrate how some of these new features perform when they are put to the test.

Binary Serialization

ADO.NET 1.x provides the means to serialize a DataSet. The bad news is that the XML format used for DataSets can cause performance problems when a large DataSet is serialized and passed over the network. In ADO.NET 2.0, you can serialize a DataSet or a DataTable in a true binary format, reducing both memory consumption and network bandwidth.

Let's take a look at how to implement binary serialization of a DataTable. The RemotingFormat property of the DataTable is a new property introduced in ADO.NET 2.0 that can be used to set the serialization format. The RemotingFormat property expects one of the SerializationFormat enumeration values: SerializationFormat.Xml or SerializationFormat.Binary. Once RemotingFormat has been set, you can create an instance of the BinaryFormatter class to serialize the DataTable. The example in Figure 1 loads a list of customers into a DataTable and serializes it into the binary format.

Figure 1 Binary Serialization of a DataTable

string cnStr = @"server=.;database=northwind;integrated security=true;";
using (SqlConnection cn = new SqlConnection(cnStr))
{
    string sql = "SELECT o.OrderID, o.CustomerID, o.OrderDate, " +
                 "od.ProductID, p.ProductName, od.UnitPrice, " +
                 "od.Quantity FROM Orders o " +
                 "INNER JOIN [Order Details] od " +
                 "ON o.OrderID = od.OrderID INNER JOIN Products p ON " +
                 " p.ProductID = od.ProductID ";
    SqlCommand cmd = new SqlCommand(sql, cn);
    SqlDataAdapter adpt = new SqlDataAdapter(cmd);
    DataTable dtCustomers = new DataTable("Customers");
    adpt.Fill(dtCustomers);
    BinaryFormatter bin = new BinaryFormatter();
    using (StreamWriter sw = new StreamWriter(@"c:\customers.bin"))
    {
        dtCustomers.RemotingFormat = SerializationFormat.Binary;
        bin.Serialize(sw.BaseStream, dtCustomers);
    }
}
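The same pattern applies to an entire DataSet, since the DataSet class exposes the same RemotingFormat property. As a minimal sketch (assuming a filled DataSet named ds and a hypothetical file path), only the object being serialized changes:

ds.RemotingFormat = SerializationFormat.Binary;
BinaryFormatter bin = new BinaryFormatter();
using (FileStream fs = File.Create(@"c:\dataset.bin"))
{
    bin.Serialize(fs, ds);
}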

A DataTable that is serialized in binary format is generally smaller in size than the same DataTable serialized in XML format. But how much smaller?

Being the skeptic that I am, I tried a few tests to see how different the size of a DataSet would be when serialized in XML versus binary format. I tried several different sample sets using several different powers of 10. For my tests, I created a table called MyOrders and I added over 100,000 rows to it. Then I ran the code in Figure 2, which produced the results shown in Figure 3. (Your results may vary depending on your data.)

Figure 3 XML vs. Binary Serialization Payloads

Rows      Binary KB   XML KB   Binary KB/Row   XML KB/Row
1         9.1         2.15     9.1             2.15
10        9.67        5.38     0.967           0.538
25        10.5        10.6     0.42            0.424
100       14.9        37.3     0.149           0.373
1,000     68.2        389      0.0682          0.389
10,000    602         3,596    0.0602          0.3596
100,000   5,934       36,137   0.05934         0.36137

Figure 2 Serializing a DataTable

string cnStr = @"server=.;database=northwind;integrated security=true;";
for (int power = 0; power <= 5; power++)
{
    using (SqlConnection cn = new SqlConnection(cnStr))
    {
        double top = Math.Pow(10, power);
        string sql = "SELECT TOP " + top +
                     " OrderID, CustomerID, OrderDate, " +
                     "ProductID, ProductName, UnitPrice, Quantity " +
                     "FROM MyOrders ";
        SqlCommand cmd = new SqlCommand(sql, cn);
        SqlDataAdapter adpt = new SqlDataAdapter(cmd);
        DataTable dtMyOrders = new DataTable("MyOrders");
        adpt.Fill(dtMyOrders);
        BinaryFormatter bin = new BinaryFormatter();
        using (StreamWriter sw = new StreamWriter(
            @"c:\myorders" + top + ".bin"))
        {
            dtMyOrders.RemotingFormat = SerializationFormat.Binary;
            bin.Serialize(sw.BaseStream, dtMyOrders);
        }
        using (StreamWriter sw = new StreamWriter(
            @"c:\myorders" + top + ".xml"))
        {
            dtMyOrders.RemotingFormat = SerializationFormat.Xml;
            bin.Serialize(sw.BaseStream, dtMyOrders);
        }
    }
}
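Figure 2 writes the files but does not itself measure them. A small sketch (hypothetical code, not part of the original figure) placed inside the same loop after both files are written can report the sizes that feed a table like Figure 3:

// Inside the loop in Figure 2, after both files have been written
long binBytes = new FileInfo(@"c:\myorders" + top + ".bin").Length;
long xmlBytes = new FileInfo(@"c:\myorders" + top + ".xml").Length;
Console.WriteLine("{0} rows: binary {1:N1} KB, XML {2:N1} KB",
    top, binBytes / 1024.0, xmlBytes / 1024.0);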

One thing that stands out is that for 1 and for 10 rows, the binary files are actually larger than the XML files. To find the row count where the crossover occurs, I ran the test again for every row count from 1 through 100. For my data sampling, the binary file was smaller than the XML file whenever there were 25 or more rows, so I added the results for 25 rows to the grid in Figure 3. The bottom line is that the binary format can significantly decrease the network bandwidth and time needed to pass a DataTable or a DataSet across a network.

Taking this one step further, I added two additional columns to the table in Figure 3 to show the amount of space consumed per row. Looking at the XML serialization data from 100 rows on up, notice how the space required per row doesn't vary all that much (0.373KB at 100 rows versus 0.389KB at 1,000). However, the space required per row by the binary serialization method decreases significantly as the number of rows increases (0.149KB at 100 rows all the way down to 0.059KB at 100,000 rows).

It is important to keep in mind that it is still good practice to pass the fewest rows necessary in your application. If you are passing data to a data layer, you should only pass the changed data (inserts, updates, and deletes) using the GetChanges method of the DataTable, since there is rarely a need to pass down data that did not change. However, if your application requires that you pass an entire stateful set of data between tiers or possibly to another business system, binary serialization can come in handy.
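To make that concrete, here is a minimal sketch of the hand-off (assuming the dtCustomers table from Figure 1 after a user has made some edits, and a hypothetical c:\deltas.bin path): extract only the deltas with GetChanges, then serialize just those rows in binary format.

// Sending tier: serialize only the changed rows
DataTable deltas = dtCustomers.GetChanges();
if (deltas != null) // GetChanges returns null when nothing has changed
{
    deltas.RemotingFormat = SerializationFormat.Binary;
    BinaryFormatter bin = new BinaryFormatter();
    using (FileStream fs = File.Create(@"c:\deltas.bin"))
    {
        bin.Serialize(fs, deltas);
    }
}

// Receiving tier: rebuild the DataTable from the stream
using (FileStream fs = File.OpenRead(@"c:\deltas.bin"))
{
    BinaryFormatter bin = new BinaryFormatter();
    DataTable received = (DataTable)bin.Deserialize(fs);
}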

DataView Enhancements

A DataView is a great tool for filtering, sorting, and in general presenting data to the user in a more friendly format. In ADO.NET 2.0, the DataView class provides several new features that truly enhance its value. Let's assume, for example, that in ADO.NET 1.x you have a DataTable filled with a list of customers that came from a database. A DataView is then layered on top of the DataTable to present those customers to the user in a sorted grid; here the DataView acts as a filtered, sorted presentation of the data. Going a step further, you might want to grab a distinct list of cities from the customer data to load another grid or a dropdown list. But if you wanted to persist the DataView of customers or the DataView of cities as its own table, you could not do so directly from the DataView. That is no longer the case in ADO.NET 2.0.

Through the DataView.ToTable method, the DataView now allows you to create a DataTable directly from the DataView. The new DataTable is a separate and distinct object that is not linked to the original DataTable. For example, let's assume I have a DataTable called dtCust and I create a DataView called view that displays a filtered and sorted list of the customers in dtCust. If I then invoke the ToTable method on view to create a second DataTable called dtCities, I have two DataTable objects that each store their own separate copy of the data. So if I change a customer in dtCust, it does not affect the data in dtCities whatsoever.

The code in Figure 4 creates a DataTable called dtCust and loads it with a list of customers from a database. It then creates a DataView called view that contains only the USA customers, sorted by city. The DataView's ToTable method is then used to create a new DataTable called dtCities that holds a distinct list of those customers' cities. This example uses the ToTable overload that accepts the name of the new DataTable to create, a Boolean value indicating whether or not distinct values should be grabbed, and a string array containing the list of columns to copy from the DataView into the new DataTable.

Figure 4 Using ToTable To Create a DataTable from a DataView

string cnStr = @"server=.;database=northwind;integrated security=true;";
string sql = "SELECT CustomerID, CompanyName, City, Region, Country " +
             "FROM Customers ORDER BY CustomerID";
DataTable dtCust = new DataTable();
using (SqlConnection cn = new SqlConnection(cnStr))
{
    SqlCommand cmd = new SqlCommand(sql, cn);
    SqlDataAdapter adpt = new SqlDataAdapter(cmd);
    adpt.Fill(dtCust);
}
DataView view = new DataView(dtCust);
view.RowFilter = "Country='USA'";
view.Sort = "City";
DataTable dtCities = view.ToTable("CustomerCities", true, "City", "Country");
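The two tables really are independent copies. As a quick sketch using the objects from Figure 4 (the output line is hypothetical), editing the original table leaves the snapshot produced by ToTable untouched:

// Edit the original table...
dtCust.Rows[0]["City"] = "Elsewhere";
// ...the distinct city list created by ToTable is unaffected
Console.WriteLine(dtCities.Rows[0]["City"]);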

Building Connections

ADO.NET 2.0 introduces a new class called SqlConnectionStringBuilder that helps you build a connection string. All you have to do is instantiate it and set the appropriate properties, and it builds the connection string for you (exposed through its ConnectionString property). The following code sample shows how the SqlConnectionStringBuilder works:

SqlConnectionStringBuilder cnBldr = new SqlConnectionStringBuilder();
cnBldr.DataSource = "MyServer";
cnBldr.InitialCatalog = "Northwind";
cnBldr.IntegratedSecurity = true;
using (SqlConnection cn = new SqlConnection(cnBldr.ConnectionString))
{
    ...
}

In this example, I set the server name via the DataSource property and the database name via the InitialCatalog property, and I tell it to use integrated security. IntelliSense® makes this approach much easier than attempting to hand-code a connection string, and the strongly typed nature of the class helps eliminate harmful typos. There is no longer a need to try to remember the name of each connection string attribute. The Microsoft® .NET Framework also comes with the System.Data.OleDb.OleDbConnectionStringBuilder, System.Data.Odbc.OdbcConnectionStringBuilder, and System.Data.OracleClient.OracleConnectionStringBuilder classes. All of these are strongly typed classes that inherit from the DbConnectionStringBuilder class. Of course, if your database provider does not offer a strongly typed connection builder class, you can create your own by inheriting from DbConnectionStringBuilder.
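As a minimal sketch of that last point (the class name and the "Server" and "Database" keywords here are hypothetical, not from any real provider), a custom builder only needs to wrap the base class's string-keyed indexer with typed properties:

using System.Data.Common;

// Hypothetical builder for a provider with Server and Database keywords
public class MyProviderConnectionStringBuilder : DbConnectionStringBuilder
{
    public string Server
    {
        get { return ContainsKey("Server") ? (string)this["Server"] : ""; }
        set { this["Server"] = value; }
    }

    public string Database
    {
        get { return ContainsKey("Database") ? (string)this["Database"] : ""; }
        set { this["Database"] = value; }
    }
}

Setting Server and Database then yields a ConnectionString such as "Server=MyServer;Database=Northwind", with the base class handling the delimiting and escaping of values.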

What Are Batch Updates?

One of the best new features of ADO.NET 2.0 is batch updating. Batch updating is easy to implement and reduces the number of network round-trips between your data layer and a database. To demonstrate the value of batch updates, I'll walk through a common update scenario that uses ADO.NET 1.x without batch updates and then discuss one that uses ADO.NET 2.0 with the new batch updating functionality.

First, I'll create a DataTable, load it with 100 rows from a database, and present it to a user via a grid. I'll then add 12 rows, modify 9 rows, and delete 11 rows from the DataTable. At this point I have a DataTable with 80 unchanged rows, 9 modified rows, 11 deleted rows, and 12 added rows. I then use the GetChanges method to grab a new DataTable containing only the deltas (the modified, added, and deleted rows). I only want the deltas because I see no reason to pass the unchanged rows through the layers of my application to my data layer. In the data layer, I then create a DataAdapter and assign the appropriate SQL stored procedures to its InsertCommand, UpdateCommand, and DeleteCommand properties. When the DataAdapter's Update method is executed against the delta DataTable, the UpdateCommand is executed for each modified row it finds, using that row's data. Likewise, when it sees a deleted row or an added row, the DeleteCommand or InsertCommand is used, respectively. The DataAdapter.Update method executes 32 stored procedure calls (1 for each dirty row) and makes 32 separate round-trips between the data layer and the database.

In this example, using ADO.NET 1.x techniques, each execution of a stored procedure packages the stored procedure call, sends it to the database, and waits for any returning information before the next stored procedure executes. If my data layer and my database are on separate physical servers, I have just introduced 32 round-trips across the network. If I had 100 changed rows, there would be 100 round-trips. You get the idea.

This same example with ADO.NET 2.0 reduces the network round-trips by using batch updates. Just prior to executing the DataAdapter.Update method, I set the UpdateBatchSize property of the DataAdapter to 32. This tells the DataAdapter to send 32 command executions at a time to the database. This time, when the DataAdapter.Update method executes and iterates through the delta rows, it prepares the stored procedure calls until it has 32 of them, and then sends all 32 commands to the database in a single RPC request. All 32 stored procedure calls cross the network together and are then executed individually on the database server. Thus, only one network round-trip is required, which streamlines this scenario.

For demonstration purposes, I built a consolidated example in a single batch of code (shown in Figure 5) that emulates the scenario I just described. In a real-world application, I would not recommend putting all of this code into a single method, of course. In this example, I retrieve some customers, modify one of them, and add a new customer to the DataTable. Then I set up the DataAdapter and its commands using SQL statements. At this point I set the UpdateBatchSize to 10, which in effect tells the DataAdapter to pool its commands and send them in groups of 10 (or fewer, if there are fewer than 10 operations to perform).

Figure 5 Implementing Batch Updates

string cnStr = @"server=.;database=northwind;integrated security=true;";
string sql = "SELECT CustomerID, CompanyName, City, Region, Country " +
             "FROM Customers ORDER BY CustomerID";
using (SqlConnection cn = new SqlConnection(cnStr))
{
    SqlCommand selCmd = new SqlCommand(sql, cn);
    SqlDataAdapter adpt = new SqlDataAdapter(selCmd);
    DataTable dtCustomers = new DataTable("Customers");
    adpt.Fill(dtCustomers);
    dtCustomers.PrimaryKey =
        new DataColumn[] { dtCustomers.Columns["CustomerID"] };

    // Add a customer row
    DataRow newRow = dtCustomers.NewRow();
    newRow["CustomerID"] = "FOO";
    newRow["CompanyName"] = "The Foo Company";
    newRow["City"] = "Fooville";
    dtCustomers.Rows.Add(newRow);

    // Change a customer row
    DataRow row = dtCustomers.Rows.Find("ALFKI");
    row["City"] = "Here"; // was Berlin

    // Set up the Insert Command
    string insSql = "INSERT Customers (CustomerID, CompanyName, City) " +
                    "VALUES (@CustomerID, @CompanyName, @City)";
    SqlCommand insCmd = new SqlCommand(insSql, cn);
    insCmd.UpdatedRowSource = UpdateRowSource.None;
    insCmd.Parameters.Add("@CustomerID", SqlDbType.NChar, 10, "CustomerID");
    insCmd.Parameters.Add("@CompanyName", SqlDbType.NChar, 80, "CompanyName");
    insCmd.Parameters.Add("@City", SqlDbType.NChar, 30, "City");
    adpt.InsertCommand = insCmd;

    // Set up the Update Command
    string updSql = "UPDATE Customers SET CompanyName = @CompanyName, " +
                    "City = @City WHERE CustomerID = @CustomerID";
    SqlCommand updCmd = new SqlCommand(updSql, cn);
    updCmd.UpdatedRowSource = UpdateRowSource.None;
    updCmd.Parameters.Add("@CompanyName", SqlDbType.NChar, 80, "CompanyName");
    updCmd.Parameters.Add("@City", SqlDbType.NChar, 30, "City");
    updCmd.Parameters.Add("@CustomerID", SqlDbType.NChar, 10, "CustomerID");
    adpt.UpdateCommand = updCmd;

    // Set the batch size. 0 means all; the default is 1.
    adpt.UpdateBatchSize = 10;

    // Send the new rows and the changed rows to the database
    adpt.Update(dtCustomers.GetChanges());
}

So the idea is to group multiple commands and package them for transmission across the network to the database server. By default, UpdateBatchSize is set to 1, which means the adapter acts just as it did in ADO.NET 1.x: each delta row causes a single command to be sent to the database on its own, one at a time. With an explicit value such as 10, the DataAdapter packages 10 commands and sends them in a single RPC call to the database server. If there are 15 modified, added, and deleted rows in total and the UpdateBatchSize is set to 10, there will be 2 separate packages transmitted: the first with 10 rows and the second with the remaining 5. If the UpdateBatchSize is set to 0, the DataAdapter will attempt to send all of the command executions in a single package. If batch updates are not supported by the provider, the UpdateBatchSize automatically reverts to 1.

Batch updates are implemented by packaging everything into a single RPC request. So if the batch contains six commands to execute against the database, they will all be sent in a single request, with each SQL command separated by an RPC marker. This is important to understand because if you watch SQL Profiler after running the example in Figure 5, you will see the two commands executing against the database individually even though they traveled together. This technique is also better than alternatives that concatenate commands into one SQL string, which run into limits on the number of parameters or the length of the SQL command. The first time I tried batch updates, I thought everyone was nuts because I watched SQL Profiler and saw all of the separate SQL commands being executed. It wasn't until I used a network sniffer that I could clearly see that only one round-trip was taking place.

Choose Carefully

There is a time and place for everything, and batch updates are no exception. Batch updates do not support return or output parameters, so avoid them when you need either. Batch updates also affect the RowUpdated event: the event fires only once per batch, no matter how many updates the batch contains. Figure 6 shows how the number of times the RowUpdated event handler executes varies with the UpdateBatchSize when 25 rows are updated.

Figure 6 RowUpdated Event Execution on 25 Rows

UpdateBatchSize   Number of Times RowUpdated Fires   Comments
0                 1                                  Once for the entire batch
1                 25                                 Once per row
4                 7                                  Once for every 4 rows: Ceiling(25/4)
10                3                                  Once for every 10 rows: Ceiling(25/10)
25                1                                  Once for the entire batch
30                1                                  Once for the entire batch

When using batch updates, keep in mind that the Row property exposed by the RowUpdated event arguments represents the last row in the batch. If the number of rows actually in the batch is smaller than the UpdateBatchSize, the Row property will return null.
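To watch this behavior yourself, here is a minimal sketch (assuming the adpt SqlDataAdapter from Figure 5) that hooks the RowUpdated event and uses the RowCount property, new to the event arguments in ADO.NET 2.0, to report how many rows each batch carried:

// Wire this up before calling adpt.Update in Figure 5
adpt.RowUpdated += new SqlRowUpdatedEventHandler(OnRowUpdated);

static void OnRowUpdated(object sender, SqlRowUpdatedEventArgs e)
{
    // With batching enabled, this fires once per batch rather than per row;
    // RowCount reports the number of rows the batch contained
    Console.WriteLine("Rows in this batch: " + e.RowCount);
}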

Wrapping It Up

The release of ADO.NET 2.0 brings several enhancements to its core classes along with entirely new classes, some of which improve performance while others add functionality and flexibility. With all the changes you'll find in ADO.NET 2.0, it's a good idea to try the features out a little at a time so you can refine your database toolbox and be ready to get the best performance at the best price for your data-driven apps. Now go out and reap the benefits of ADO.NET 2.0.

Send your questions and comments for John to mmdata@microsoft.com.

John Papa is a Senior .NET Consultant with ASPSOFT (aspsoft.com) and a baseball fanatic who spends most of his summer nights rooting for the Yankees with his family and his faithful dog, Kadi. He has authored several books on ADO, XML, and SQL Server, and can often be found speaking at industry conferences, such as VSLive, or blogging at codebetter.com/blogs/john.papa.