Data Points

DataSet and DataTable in ADO.NET 2.0

John Papa

Contents

DataTable Enhancements
Lightweight and Fast Looping
Loading Data
LoadOption Enumerator
Changing RowState
0 to 60
Wrapping It Up

ADO.NET 2.0 sports some exciting enhancements to the core classes found in ADO.NET 1.x and introduces a variety of new classes, all of which promise to improve performance, flexibility, and efficiency. There have even been some major changes throughout the lifecycle of the pre-beta and beta versions of ADO.NET 2.0, such as the improvements to the new batch updating process. With the final release of ADO.NET 2.0 fast approaching and the feature set becoming more stable, it's time to take a closer look at what's in store.

This month I will begin by exploring the improvements to the DataSet and DataTable classes, what they mean to you, and when you might want to use them. Sometimes in ADO.NET 1.x, especially when working with large rowsets, you can experience performance problems. I'll discuss how some of these performance issues have been addressed in ADO.NET 2.0 by changes to the indexing engine. I'll review the features that have been added to the DataTable class, as well as the loading options through the new Load method and the new methods that change a row's state. In the next installment of this column, I will discuss other improvements, such as the ability to perform batch updates, compressing DataSets for transport using binary serialization, and more.

DataTable Enhancements

In ADO.NET 1.x the DataSet got all the glory leaving the DataTable in the shadows. This is not to say that the DataTable was not a useful class on its own. The DataTable is the container for rows and columns and could be considered the focal point of all disconnected data. However, the DataSet gets most of the press because it can contain DataRelation and the DataTable objects.

However useful, the DataTable does have some limitations in ADO.NET 1.x that the DataSet does not. For example, the DataSet exposes a Merge method that can merge two DataTable objects within a DataSet, but the DataTable itself does not expose a Merge method. So if you had a DataTable on its own (not contained within a DataSet) and you wanted to merge it with another DataTable object, you would have to first create a DataSet object, put the first DataTable into it, and invoke the DataSet.Merge method, as I did in Figure 1.

Figure 1 Merging Two DataTable Objects in ADO.NET 1.x

string sqlAllCustomers = "SELECT * FROM Customers"; string cnStr = @"Data Source=.;Initial Catalog=northwind;Integrated Security=True"; using (SqlConnection cn = new SqlConnection(cnStr)) { cn.Open(); SqlCommand cmd = new SqlCommand(sqlAllCustomers, cn); SqlDataAdapter adpt = new SqlDataAdapter(cmd); DataTable dtCust1 = new DataTable("Customers"); adpt.Fill(dtCust1); dtCust1.PrimaryKey = new DataColumn[]{dtCust1.Columns["CustomerID"]}; DataTable dtCust2 = dtCust1.Clone(); DataRow row1 = dtCust2.NewRow(); row1["CustomerID"] = "ALFKI"; row1["CompanyName"] = "Some Company"; dtCust2.Rows.Add(row1); DataRow row2 = dtCust2.NewRow(); row2["CustomerID"] = "FOO"; row2["CompanyName"] = "Some Other Company"; dtCust2.Rows.Add(row2); DataSet ds = new DataSet("MySillyDataSet"); ds.Tables.Add(dtCust1); ds.Merge(dtCust2); dgTest.DataSource = dtCust1; }

It's not difficult by any means, but it is annoying. In ADO.NET 2.0, the DataTable object now has a Merge method so you can merge two DataTable objects, like so:

dtCust1.Merge(dtCust2);

Another inconvenience in ADO.NET 1.x is that you cannot perform the basic XML operations on a DataTable without first associating it with a DataSet. For example, if you want to write a DataTable to XML, you need to load the DataTable into a DataSet and use the DataSet's WriteXml method. However, in ADO.NET 2.0, the DataTable has a WriteXml method so the problem is solved. In addition to the WriteXml method, the DataTable in ADO.NET 2.0 also exposes the ReadXml, ReadXmlSchema, and WriteXmlSchema methods.

The DataSet also has some new methods and properties. In fact, both the DataSet and the DataTable now expose the RemotingFormat property as well as the Load and CreateDataReader methods. The RemotingFormat property is used to indicate whether to serialize the DataTable or DataSet in binary or XML format. The Load method can be used to load data into a DataTable or DataSet in a variety of ways, which I'll discuss shortly.

Lightweight and Fast Looping

The DataTable's CreateDataReader method (named GetDataReader in previous Beta versions) creates an instance of the ADO.NET 2.0 DataTableReader. A DataTableReader created using DataTable.CreateDataReader will expose the same rows and columns as the DataTable. When a DataTableReader is created from a DataSet or a DataTable's CreateDataReader method, the DataTableReader will contain all of the rows from the container object, with the exception of deleted rows.

The DataTableReader is a lighter weight object than the DataTable and, unlike the DataReader (SqlDataReader), the DataTableReader is disconnected. This is a great feature because you get the lightweight object that you can iterate through quickly (like the DataReader) and it is disconnected from any data source (unlike the DataReader). A DataTableReader can be thought of as an iterator over the underlying table's rows, similar to a foreach enumeration of the contents of the table. However, unlike enumerating a table's rows (which will cause an exception when a row is added or deleted from the collection during the enumeration), a DataTableReader is resilient to changes being made to the underlying table and will correctly position itself in light of any modifications that occur.

The following example shows how you can create a DataTableReader and bind it to a DataGridView:

using (SqlConnection cn = new SqlConnection(cnStr)) { SqlCommand cmd = new SqlCommand(sqlAllCustomers, cn); SqlDataAdapter adpt = new SqlDataAdapter(cmd); DataTable dtCustomers = new DataTable("Customers"); adpt.Fill(dtCustomers); DataTableReader dtRdr = ds.CreateDataReader(); dgvCustomers.DataSource = dtRdr; }

The DataTableReader can only be traversed in a forward manner, just like the DataReader. Again, like the DataReader, you move to the first row by using the Read method of the DataTableReader. If the DataTableReader was created from a DataSet that contained multiple DataTables, the DataTableReader will contain multiple resultsets (one per DataTable). Each subsequent resultset can be accessed using the DataTableReader by invoking the NextResult method (again, similar to the DataReader).

Figure 2 shows how to create a DataTableReader from a DataSet that contains two DataTable objects. Since the DataSet has two DataTables, the DataTableReader will contain two resultsets. You can loop through both resultsets in the DataTableReader using the Read and NextResult methods, as shown in Figure 2. Keep in mind that the DataTableReader only moves forward. So if you want to access the DataTableReader twice, because you want to loop through it again, you'll have to reload it after reading through all of its records the first time.

Figure 2 Looping Through a DataTableReader

using (SqlConnection cn = new SqlConnection(cnStr)) { // Create the Command and Adapter SqlCommand cmd = new SqlCommand(sqlAllCustomers, cn); SqlDataAdapter adpt = new SqlDataAdapter(cmd); // Create a DataTable and fill it DataTable dtCustomers = new DataTable("Customers"); adpt.Fill(dtCustomers); DataSet ds = new DataSet(); ds.Tables.Add(dtCustomers); adpt.SelectCommand = new SqlCommand("SELECT * FROM Orders", cn); adpt.Fill(ds, "Orders"); // Create the DataTableReader (it is disconnected) using(DataTableReader dtRdr = dtCustomers.CreateDataReader()) { do { Console.WriteLine("******************************"); while (dtRdr.Read()) { Console.WriteLine(dtRdr.GetValue(0).ToString()); } } while (dtRdr.NextResult()); } }

In Figure 2 the DataSet creates a DataTableReader using the CreateDataReader method. The order in which the DataTable's resultsets are added to the DataTableReader is the same as the order in which they appear in the DataSet. If you want to specify the order of the DataTable's resultsets, there is an alternate overloaded CreateDataReader method you can use.

Loading Data

A DataTableReader can also be used to populate a DataTable or a DataSet. In fact, using the new Load method of a DataSet or a DataTable, you can pass in a DataTableReader or any reader class that implements the IDataReader interface. The following example assumes there is a DataTable called dt1 that contains a schema and some rows. It creates a DataTableReader from the DataTable called dt1 and then turns right around and loads a second DataTable (called dt2) with the same data:

DataTableReader dtRdr = dt1.CreateDataReader(); DataTable dt2 = new DataTable(); dt2.Load(dtRdr);

In ADO.NET 1.x you could populate a DataSet or DataTable using the DataAdapter's Fill method. Alternatively, you could populate a DataSet from XML using the DataSet's ReadXml method. The introduction of the Load method in ADO.NET 2.0 makes it possible to load a DataSet or a DataTable from a class that implements IDataReader (like a DataTableReader or the SqlDataReader). When using the Load method to load several rows you can turn off notifications, index maintenance, and constraint checking by first invoking the BeginLoadData method, and then turn them back on by invoking the EndLoadData method. These methods, which are available in ADO.NET 1.x as well, can make the loading of data faster since ADO.NET does not have to stop after each row and recompute its indices, invoke notifications, or check constraints. Just make sure you remember to turn these things back on.

LoadOption Enumerator

The Load and the Fill methods have an overload that accepts one of the values from the LoadOptions enumeration. These settings introduce powerful new abilities which can be used to indicate how rows should replace existing rows during a Fill or Load operation into a DataSet or DataTable. This process assumes that a primary key has been set since it uses the primary key to determine how to replace or append the rows. These enumerator values help determine whether or not the current value and/or the original value get overwritten with the incoming rows' values. Figure 3 shows the three options and brief descriptions of each.

Figure 3 LoadOption Enumerator Settings

Setting Description
PreserveChanges This is the default setting. Keeps the current values. Overwrites the original values with the incoming rows.
Upsert Overwrites the current values with the incoming rows. Keeps the original values.
OverwriteChanges Overwrites the current values with the incoming rows. Overwrites the original values with the incoming rows.

Each of these options has its place in an application given an appropriate circumstance. The OverwriteChanges option works well if you have a DataTable that contains data but you want to get any changed values that may exist in the database. Using this option will overwrite both the original and the current values in the DataTable with the values that come from the database. The key here is that when using OverwriteChanges, any data that has been modified (original version or current version) in the first DataTable will be overwritten while new rows will be added.

DataSet columns store an original and a current value. PreserveChanges will keep the current value intact while overwriting the original value. Upsert does the opposite of this as it keeps the original value intact while overwriting the current value.

Here is an example of when PreserveChanges might come in handy. Suppose a user named Peggy has opened a screen and loaded a DataGrid with customers from a DataSet. Peggy modifies the city of the customer with CustomerID ALFKI from Berlin to New York, but doesn't click the Save button. She then goes off for a cup of coffee. Meanwhile, Katherine modifies the same customer's city from Berlin to Miami. You'll now have a data concurrency issue if Peggy comes back from her break and saves the record. So in this situation, the original value for Peggy's customer record was Berlin and since she changed it to New York the current value is New York. Meanwhile, in the database the city is now Miami. If you want to reset the original values of Peggy's DataSet to what is in the database, you could get the data from the database into a DataTableReader and then load it into the DataSet using LoadOptions.PreserveChanges. Figure 4 illustrates how this works.

Figure 4 Testing LoadOptions

using (SqlConnection cn = new SqlConnection(cnStr)) { SqlCommand cmd = new SqlCommand(sqlAllCustomers, cn); SqlDataAdapter adpt = new SqlDataAdapter(cmd); DataTable dtCustomers = new DataTable("Customers"); adpt.Fill(dtCustomers); dtCustomers.PrimaryKey = new DataColumn[] { dtCustomers.Columns["CustomerID"] }; // Pause here to execute this SQL directly against the database: // UPDATE Customers SET City = 'Miami' WHERE CustomerID = 'ALFKI' System.Diagnostics.Debugger.Break(); // Change the CURRENT values in the DataTable DataRow row = dtCustomers.Rows.Find("ALFKI"); row["City"] = "Somewhere"; // ORIGINAL city == Berlin // CURRENT city == New York DisplayDataRowVersions(row, "Immediately after I change the City" + " from Berlin to New York", "City"); // Load another DataTable with customer data. DataTable dtCustomers2 = new DataTable("Customers"); adpt.Fill(dtCustomers2); DataTableReader dtRdrCustomers = dtCustomers2.CreateDataReader(); LoadOption opt = LoadOption.PreserveChanges; switch (opt) { case LoadOption.OverwriteChanges: // Overwrite ORIGINAL and Overwrite CURRENT values dtCustomers.Load(dtRdrCustomers, LoadOption.OverwriteChanges); // ORIGINAL city == Miami // CURRENT city == Miami row = dtCustomers.Rows.Find("ALFKI"); ShowVersions (row, "Immediately after LoadOptions.OverwriteChanges", "City"); break; case LoadOption.Upsert: // Keep ORIGINAL and Overwrite CURRENT values dtCustomers.Load(dtRdrCustomers, LoadOption.Upsert); // ORIGINAL city == Berlin /// CURRENT city == Miami row = dtCustomers.Rows.Find("ALFKI"); ShowVersions (row, "Immediately after LoadOptions.Upsert", "City"); break; case LoadOption.PreserveChanges: // Overwrite ORIGINAL and Keep CURRENT values dtCustomers.Load(dtRdrCustomers, LoadOption.PreserveChanges); // ORIGINAL city == Miami // CURRENT city == New York row = dtCustomers.Rows.Find("ALFKI"); ShowVersions (row, "Immediately after LoadOptions.PreserveChanges", "City"); break; } }

To get a first-hand feel for how these settings can be used, try stepping through this code in the debugger. You can try the different LoadOption enumerators by changing the opt variable that appears immediately before the switch-case block. The code shown in Figure 4 invokes the ShowVersions method which simply displays to the output window the original version and current version of a given column.

Changing RowState

A row's state is the major factor that helps determine which rows to update, insert, or delete when the DataAdapter's Update method is invoked. The RowState is also examined by the GetChanges method, as well, to determine which rows to grab. When you make changes to values in a DataSet, ADO.NET handles setting the RowState for you, indirectly setting it to one of the RowState values, such as Modified, Added, or Unchanged.

Sometimes it would be really helpful to be able to set a row's state directly. This is where the DataRow's new SetAdded and SetModified methods make things a little easier. For example, let's assume that you are faced with the situation in which you need to copy a handful of rows from one database to another using ADO.NET. Using ADO.NET 2.0 you could fill a DataTable from a database using the DataAdapter's Fill method, change the rows' RowState settings to Added, and send them down to the second database (assuming it uses the same schema) to be added using a second DataAdapter. In this type of situation you could change the RowState of the rows from Unchanged to Added by invoking each row's SetAdded method.

To demonstrate how these methods work, I have included another example, shown in the code in Figure 5. This code retrieves a rowset of customers and sets two of the row's RowState settings to Modified and another row's setting to Added. Then, I use the DataTable's GetChanges method to create a DataTable containing the rows with a RowState of Modified and store the number of rows. I then get the number of rows that were added and display them using a MessageBox.

Figure 5 Changing a Row's State

DataTable dtCustomers = new DataTable("Customers"); using (SqlConnection cn = new SqlConnection(cnStr)) { SqlCommand cmd = new SqlCommand(sqlAllCustomers, cn); SqlDataAdapter adpt = new SqlDataAdapter(cmd); adpt.Fill(dtCustomers); dtCustomers.PrimaryKey = new DataColumn[] { dtCustomers.Columns["CustomerID"] }; } // Change the RowState of a few rows DataRow row = dtCustomers.Rows.Find("ALFKI"); row.SetModified(); row = dtCustomers.Rows.Find("BOLID"); row.SetModified(); row = dtCustomers.Rows.Find("ANTON"); row.SetAdded(); int modRows = dtCustomers.GetChanges( DataRowState.Modified).Rows.Count; int addRows = dtCustomers.GetChanges(DataRowState.Added).Rows.Count; StringBuilder bldr = new StringBuilder(); bldr.Append(modRows.ToString()); bldr.Append(" row(s) were modified."); bldr.Append(Environment.NewLine); bldr.Append(addRows.ToString()); bldr.Append(" row(s) were added"); MessageBox.Show(bldr.ToString());

The SetAdded and SetModified methods of the DataRow only work on rows that are unchanged. Another situation in which these come in handy is when you receive a DataSet or a DataTable from a Web service and the rows are all marked as Unchanged. If you intend to make additions or updates to a database based on the DataTable, you could set the RowState using these new methods. Otherwise, if you leave the RowState as Unchanged, the DataAdpater's Udpate method would not send the rows to either the UpdateCommand or the InsertCommand.

0 to 60

One of the best new features of ADO.NET 2.0 is neither a new method nor a new class but rather a focused performance improvement. One of the big knocks on the DataSet and the DataTable has been how slow they can be to load, especially when the number of rows gets large (100, 1,000, 10,000, or more). It can be just as bad, if not worse, when trying to traverse a large DataTable. To address this practical limitation, a lot of effort went into the creation of a faster indexing engine in ADO.NET 2.0. The rewrite of the indexing engine in ADO.NET has resulted in a huge performance boost in all areas including loading and merging DataTables. Figure 6 shows some sample code that can be run in either Visual Studio® .NET 2003 (using ADO.NET 1.1) or Visual Studio 2005 (using ADO.NET 2.0). I ran a benchmark comparison of this code in both environments with varying numbers of rows.

Figure 6 Speed Test

DataTable dt = new DataTable("foo"); DataColumn pkCol = new DataColumn("ID", Type.GetType("System.Int32")); pkCol.AutoIncrement = true; pkCol.AutoIncrementSeed = 1; pkCol.AutoIncrementStep = 1; dt.Columns.Add(pkCol); dt.PrimaryKey = new DataColumn[] { pkCol }; dt.Columns.Add("SomeNumber", Type.GetType("System.Int32")); dt.Columns["SomeNumber"].Unique = true; int limit = 1000000; int someNumber = limit; DateTime startTime = DateTime.Now; for (int i = 1; i <= limit; i++) { DataRow row = dt.NewRow(); row["SomeNumber"] = someNumber—; dt.Rows.Add(row); } TimeSpan elapsedTime = DateTime.Now - startTime; MessageBox.Show(dt.Rows.Count.ToString() + " rows loaded in " + elapsedTime.TotalSeconds + " seconds.");

The code in Figure 6 creates a DataTable, adds two columns to it, and then loops 1,000,000 times adding a row to the DataTable in each iteration. Once the loop is completed, the elapsed number of seconds is displayed to the user using the MessageBox method. I ran this test for a few different iterations in both ADO.NET 1.1 and ADO.NET 2.0. My results are shown in Figure 7.

Figure 7 Speed Test Results

Iterations ADO.NET 1.1 ADO.NET 2.0
10,000 0.20 0.20
100,000 7.91 3.89
1,000,000 1831.01 23.78

The speed is even better if there are no constraints on the DataTable. For example, when I removed the Unique constraint, I was able to load a million rows in just over a second in both versions of the environment. It is also important to note that I only loaded two columns and the value for the SomeNumber column was a decrementing but still sequential integer. While your results will vary from mine, the key point to take away is that the indexing engine for ADO.NET 2.0 performs significantly faster— so much faster that it is now completely realistic to at least consider using a DataTable to contain a million rows.

Wrapping It Up

In ADO.NET 2.0, there's better performance in areas that were lagging in the previous version, such as loading large numbers of rows. Several new features have been added to make development easier. The DataTable class has gained several methods that already existed on the DataSet class and there is even a new DataTableReader class. In the next installment of the Data Points column, I will continue the discussion of ADO.NET 2.0 by examining how binary serialization improves performance, how to take advantage of batch updates, the new DataView features, the new SqlConnectionBuilder class, and much more.

Send your questions and comments for John to  mmdata@microsoft.com.

John Papa is a Senior .NET Consultant with ASPSOFT and a baseball fanatic who spends most of his summer nights rooting for the Yankees with his family and his faithful dog, Kadi. He has authored several books on ADO, XML, and SQL Server, and can often be found speaking at industry conferences such as VSLive or blogging at codebetter.com/blogs/john.papa.