An Application-Oriented Model for Relational Data

Article
01/14/2009

by Michael Pizzo

Summary: Most applications deal with data of some type or another, where the source of the data often resides in a database. Yet for a variety of reasons the shape of the data in the database is often different from the model that the application interacts with. This article describes working with data in a database through a virtual "Conceptual Model" more appropriate for the application. (12 printed pages)

Application Models vs. Storage Schemas
Database Schema Normalization
Representing Inheritance
Schema Changes
ADO.NET Entities
Enter the ADO.NET Entity Framework
An Application Model
Querying the Conceptual Model
Nested Results
Different Views for Different Applications
Modeling Results as Objects
Conclusions
Resources

Application Models vs. Storage Schemas

Modern applications, Web applications in particular, are fundamentally about exposing and manipulating data of one type or another. The data may be in the form of search results, inventory catalog, user profile, account information, financial information, personnel information, map coordinates, weather, and so on, but it's all data, and it's typically stored in a database somewhere.

However, the data stored in the database is generally not in the most appropriate form for the application to manipulate or expose to the user. The relational data model of the database schema is typically (and rightly) optimized around storage and integrity concerns, not for application usage.

As Dr. Peter Chen explains in his groundbreaking paper introducing the Entity-Relationship Model, "The relational model…can achieve a high degree of data independence, but it may lose some important semantic information about the real world." His paper goes on to describe an alternate Entity-Relationship Model that "...adopts the more natural view that the real world consists of entities and relationships." (See Resources.)

Simply put, today's object-oriented applications, not to mention end-users, tend to reason about data in richer terms than the flat rows and columns of a relational database. The "real world" includes a strong notion of the type of the object, its identity, and its relationships with other objects.

Putting aside the expressivity issues for a moment, even if all of the application concepts could be represented through the relational model, the application writer often doesn't have control over the database schema. Even worse, the schema could change over time in order to optimize different usage patterns, invalidating hard-coded access paths, mappings, and implicit assumptions.

Small applications typically start out directly embedding the logic to map relational schema to application data objects. As the application grows, or for applications that are built from the start as part of a larger enterprise framework, the data-access logic is commonly broken out into a separate Data Abstraction Layer, or DAL. Whether part of the application or a separate component, hard-coding implicit assumptions about database schema, relationships, usage conventions, and access patterns make it increasingly difficult to maintain or extended this data-access code over time, especially in light of changes to the underlying schema.

Let's take a look at some of these problems in a little more detail.

Database Schema Normalization

Data in a database is generally laid out in a "normalized" view as described by Dr. Codd, with separate, homogenous (rectangular) tables containing columns of single scalar values. Redundancy is reduced in order to improve integrity of the data through moving non-row specific values into a separate table. Data from these individual tables is combined through joins, based on implicit application knowledge about what different values within the columns represent. Foreign keys may or may not be used between tables of related information as a means of further enforcing data integrity, but do not themselves define navigation paths or join conditions.

Let's take an example. Suppose you ran a marina that sold used watercraft, and you wanted to track your inventory in a database, exposed to customers through a Web application. The information you have to store for each boat is: Registration, Make, Year, Length and Beam (width). For boats with engines you want to store Make, Model, Year, Horsepower, type of fuel, and SerialNumber of the engine(s). A fully normalized schema might break the engine information into three separate tables, one containing the information for a particular type of motor (Make, Model, Horsepower and fuel type, with Make and Model comprising a Composite Key), one for each actual engine (SerialNumber, Year, Make, and Model) and one that associated boats and engines. The resulting schema might look something like Figure 1.

Figure 1. Fully normalized schema (Click on the picture for a larger image)

In order to show a relatively simple inventory page containing boat registration number, year, and make, along with associated engine information, for all motor boats, your Web application may use the following query:

SELECT Boats.RegNum, Boats.Year AS Boat_Year,
  Boats.Make AS Boat_Make, BoatEngine.SerialNumber,
  BoatEngine.Year AS Engine_Year, BoatEngine.Make
  AS Engine_Make, BoatEngine.Model AS Engine_Model,
  BoatEngine.HP
FROM Boats
  INNER JOIN (
   SELECT EngineType.Make, BoatEngines.BoatID,
                                  EngineTypes.HP
  FROM EngineTypes
    INNER JOIN (Engines
     INNER JOIN BoatEngines
     ON Engines.SerialNumber =
                       BoatEngines.EngineSerialNum)
   ON EngineTypes.Model = Engines.Model
    AND EngineTypes.Make = Engines.Make
 ) AS BoatEngine
ON Boats.RegNum = BoatEngine.BoatID

This query is not only fairly complex, but requires (and embeds) an implicit understanding of how the tables are related; that the Make and Model columns of the Engines table relate to the Make and Model columns of the EngineTypes table, and that the EngineSerialNum and BoatID columns of the BoatEngines table relate to the SerialNum and RegNum columns of the Engines and Boats tables, respectively.

Also, the query attempts to return only motorboats (and not sailboats) by performing an inner join between boats and motors. Of course, this query would miss any motor boats being sold without an engine, and while the query might exclude small sailboats, larger boats (including our 25-foot Hunter sailboat) generally have motors. So, while the assumption may be valid when the application is written, based on the schema and data at that time, baking this type of implicit logic into queries within the application introduces subtle dependencies on the data and schema that are hard to track and maintain.

Clearly, although the schema is nicely normalized from a database perspective, it is not in a very convenient form for the application. What's worse, in order to retrieve the desired information (not to mention making updates such as moving an engine to a different boat) implicit knowledge about the relationships between the tables must be baked into the application. Changes to the schema, for example combining the Engines and BoatEngines tables (a reasonable degree of denormalization a DBA may consider in order to improve performance) causes the application to break in ways that are hard to anticipate and resolve.

Representing Inheritance

Now let's say you want to want to add additional information to the schema. First you make explicit the difference between sailboats and different types of motorboats by adding a "Style" column to the table. Then, for sailboats, you add the type of Keel (Fixed, Swing, or Dagger) and number of sails. For ski boats, you add whether it has a ski pylon and/or a tower, and for MotorYachts, you want to add whether or not it has a flybridge. This becomes challenging with the relational data model because each piece of additional information only applies to a subset of the rows within the Boats table. One way to represent this is by extending the base table ("Boats") with members for each derived type, using Nulls for rows in which they do not apply. For example, this information could be expressed in a single sparse "Boats" table, as shown in Figure 2.

Figure 2. Representing hierarchy in a single sparse "Boats" table (Click on the picture for a larger image)

As we see, the more properties we add for each derived type, the more the schema of the overall table grows and the more we fill in non-relevant fields with Null values. An alternate way to represent this same information would be to break the additional information for Sailboats, Ski Boats, and Motor Yachts into separate tables, as shown in Figure 3.

Figure 3. Storing extended information in separate tables (Click on the picture for a larger image)

This layout avoids the need to add sparse columns to the base table for each property of the derived type, but querying becomes even more complex as each query has to join the additional tables in order to include full information for each of the derived types. For example, to return boat registration, year, make, and associated engine information for all motor boats without a tower, we might write something like the following query:

SELECT Boats.RegNum, Boats.Year AS Boat_Year,
  Boats.Make AS Boat_Make, BoatEngine.SerialNumber,
    BoatEngine.Year AS Engine_Year, BoatEngine.Make
  AS Engine_Make, BoatEngine.Model AS Engine_Model,
  BoatEngine.HP
FROM (Boats
  LEFT OUTER JOIN (
    SELECT EngineTypes.Make, BoatEngines.BoatID,
                                   EngineTypes.HP
   FROM EngineTypes
     INNER JOIN (Engines
      INNER JOIN BoatEngines
      ON Engines.SerialNumber =
                          BoatEngines.EngineSerialNum)
    ON EngineTypes.Model = Engines.Model
       AND EngineTypes.Make = Engines.Make
   ) AS BoatEngine
  ON Boats.RegNum = BoatEngine.BoatID )
  LEFT JOIN SkiBoats
         ON Boats.RegNum = SkiBoats.RegNum
WHERE Boats.Style In ("Ski","Motor","PWC","Yacht")
  AND (SkiBoats.Tower=False OR SkiBoats.Tower IS NULL)

Note that the Boats table must be joined with the SkiBoats table in order to get the Tower information, and we must account for the NULL value in our predicate for rows which are not Ski boats.

Schema Changes

As the tables grow, the DBA may decide to refactor the schema such that all of the information for a particular type of boat is in one table, as shown in Figure 4.

Figure 4. Completely separate tables for each boat type (Click on the picture for a larger image)

This schema optimizes for queries against a single type of boat at the expense of queries across types of boats. And, of course, it means that the queries within your application have to change. Our query for boat and engine information becomes the following:

SELECT Boats.RegNum, Boats.Year AS Boat_Year,
 Boats.Make AS Boat_Make, BoatEngine.SerialNumber,
   BoatEngine.Year AS Engine_Year, BoatEngine.Make
   AS Engine_Make,BoatEngine.Model AS Engine_Model,
   BoatEngine.HP
FROM (
  (SELECT RegNum, Year, Make, Tower FROM SkiBoats
  UNION ALL SELECT RegNum, Year, Make, NULL As
                         Tower FROM MotorBoats
  UNION ALL SELECT RegNum, Year, Make, Null AS
                         Tower FROM PWC
  UNION ALL SELECT RegNum, Year, Make, Null AS
                        Tower FROM MotorYachts
) AS Boats
LEFT OUTER JOIN (
  SELECT EngineTypes.Make, BoatEngines.BoatID,
                            EngineTypes.HP

 FROM EngineTypes
   INNER JOIN (Engines
    INNER JOIN BoatEngines
    ON Engines.SerialNumber =
                      BoatEngines.EngineSerialNum)
    ON EngineTypes.Model = Engines.Model AND
                     EngineTypes.Make = Engines.Make
    ) AS BoatEngine
 ON Boats.RegNum = BoatEngine.BoatID )
WHERE (Boats.Tower=False OR Boats.Tower IS NULL)

Note that in order to query across all types of boats including the Tower column specific to SkiBoats, we must explicitly project a Null value for that field for the other tables within the UNION ALL.

ADO.NET Entities

Each of these examples highlights some of the challenges that application developers face today when trying to expose, manipulate, and persist interesting real-world models into a flat relational schema. The fact that the schema may not be owned by the application developer, and may change over time, helps explain why the data-access "goo" can occupy such a disproportionate share of an application or framework in terms of code, development, and maintenance cost.

Enter the ADO.NET Entity Framework

The ADO.NET Entity Framework will ship in the first half of 2008 as an extension to the .NET Framework that is part of Microsoft Visual Studio code name "Orcas" and represents the first deliverable in a new Microsoft Entity Data Platform for working with data in terms of a rich, common Entity Data Model. The ADO.NET Entity Framework, is an implementation of Dr. Chen's Entity-Relationship model on top of relational data . Instead of letting storage representation dictate the application model, writing to a common Conceptual Model allows applications greater expressivity in modeling data using real-world concepts which can be flexibly mapped to a variety of storage representations. (For more on the ADO.NET Entity Framework, see Resources.)

The Entity Framework uses a Client View mechanism to expand queries and updates written against the conceptual model into queries against the storage schema. The expanded queries are evaluated entirely within the database; there is no client-side query processing. These Client Views may be compiled into your application for performance, or generated at runtime from mapping metadata provided in terms of XML files, allowing deployed applications to work against different or evolving storage schemas without recompilation.

An Application Model

Figure 5 shows a more appropriate application-oriented data model for our Web application. Note that, although I use a class diagram to represent the model, objects are just one way to expose the conceptual model to the application in the Entity Framework. The same conceptual model could be targeted directly using an extended SQL grammar and returned as polymorphic, hierarchical records.

An application-oriented conceptual model

Figure 5. An application-oriented conceptual model

The first thing we notice about this model is that this doesn't look anything like any of the previous storage schemas. For example:

1. The Engine class contains information from both the EngineTypes table (Make, Horsepower, and Fuel) and the Engines table (Year and SerialNumber).

2. Instead of a BoatID property, the Engine class contains a reference to a Boat.

3. Boats contain a collection of zero or more engines.

4. Instead of exposing a Style property on Boat, or having distinct tables for each, we've used the more natural inheritance concept to distinguish the different types of boats.

The Entity Framework allows you to expose this conceptual model to the application, using application concepts such as strong typing, inheritance, and relationships, over any of the previously described database schema. The fact that the mapping is done declaratively, outside of the application, means that if the database schema evolves over time to optimize for different access patterns, only the mapping has to change; the application can continue using the same queries and retrieve the same results against the same conceptual model.

Let's see how working with this conceptual model simplifies our application patterns.

Querying the Conceptual Model

Given this conceptual model, querying a single polymorphic set of boats becomes much simpler. For example, the following code uses this conceptual schema to query the registration year, make, and engine information for all motorboats without a tower.

SELECT boat.RegNum, boat.Year, boat.Make,
                              boat.Engines 
FROM Boats AS boat
WHERE boat IS OF (Motorboat)
   AND (Boat IS NOT OF (SkiBoat)
         OR TREAT(boat AS SkiBoat).Tower = False)

Note that no joins are required in the query; entities are strongly typed, relationships are traversed through properties, and collections can be filtered according to types within the hierarchy.

Nested Results

In each of the first three queries written directly against the database schema, the results would look something like Figure 6. Note that the last boat (the 1996 BayLiner) appears twice. From looking at the data, we see that the 1996 Bayliner is a MotorYacht with two Hino 310 engines. Because relational data is flat, there is no good way to represent multiple engines in a single row of the result; so, two rows are returned for the same boat—one for each engine.

Figure 6. Nested results returned as a rectangular table (Click on the picture for a larger image)

The query against the conceptual model returns an "Engines" column with a single row for each boat containing the collection of engines as shown in Figure 7.

Figure 7. Results returned as a nested table (Click on the picture for a larger image)

Alternatively, if only a subset of the data for each engine is desired (for example, the Make and HP) that information can be projected out as follows:

SELECT boat.RegNum, boat.Year, boat.Make,
        SELECT engine.Make, engine.HP
        FROM boat.Engines AS engine 
FROM Boats AS boat
WHERE boat IS OF (Motorboat)
  AND (Boat IS NOT OF (SkiBoat)
        OR TREAT(boat AS SkiBoat).Tower = False)

Note that this query still does not require the developer to write any joins; the relevant fields from the engine are projected out as a nested column using boat.Engines as the source for the subquery.

Different Views for Different Applications

Web application frameworks in particular often expose different views of the same data through different Web applications. For example, the data you expose to unauthenticated Web clients may be a subset of the data exposed to preferred members, which may be modeled differently than the data exposed to internal administration and reporting applications. Similarly, the schema of the data you work with within your application framework may differ significantly from the schema of the data you exchange in Business to Business transactions. The ADO.NET Entity Framework facilitates these types of scenarios by allowing multiple conceptual models to be mapped to the same database schema.

Modeling Results as Objects

The previous example shows returning results as records. In the case of the ADO.NET Entity Framework, this means returning results as a DataReader, which has been extended to support type information, polymorphism, nesting, and complex values. With Entities, it is also possible to write queries against the same conceptual model and return results as strongly typed business objects. When modeling results as business objects, relationships may be navigated and updated through typed properties on the objects rather than manipulating scalar foreign key values. The business objects may optionally be identity resolved and change tracked.

Use of the conceptual model through business objects is illustrated by the following code example. This example shows querying against the conceptual model to return boats as objects, navigate through properties to the collection of engines, and remove any engines that are not the same year as the boat. Changes are saved to the database through the call to SaveChanges().

  // Specify query as an eSQL string
  string eSql =
     "SELECT VALUE boat FROM Boats AS boat " +
    "WHERE EXISTS(" +
      "SELECT engine from boat.Engines AS engine " +
      "WHERE engine.Year != boat.Year)";

  BoatInventory inventory = new BoatInventory();
  ObjectQuery<Boat> motorizedBoats =
                  inventory.CreateQuery<Boat>(eSql);

  // Include Engines in results  motorizedBoats.Span.Include("Engines");
  // Loop through Engines for each Boat
  foreach(Boat boat in motorizedBoats) {
    foreach(Engine engine in boat.Engines) {
      if(engine.Year!= boat.Year)
        boat.Engines.Remove(engine);
        // alternatively
        // engine.Boat = null;
        
    }
  }
  inventory.SaveChanges();

Note that, as in the previous conceptual query examples, no joins are required in the query. The object results are strongly typed and updatable, and navigating and modifying the relationship between different types is done through properties and methods, rather than updating scalar key values.

The same query could be written using the new Language Integrated Query ("LINQ") extensions being introduced in the next version of Microsoft Visual Studio and the .NET Framework (code-named "Orcas") as shown here:

  BoatInventory inventory = new BoatInventory();
  // Include Engines in queries for Boats
  inventory.Boats.Span.Include("Engines");
  // Specify query through LINQ
  var motorizedBoats =
     from boat in inventory.Boats
    where boat.Engines.Any(e => e.Year != boat.Year)
    select boat;
  // Loop through Engines for each Boat
  foreach(Boat boat in motorizedBoats) {
    foreach(Engine engine in boat.Engines) {
      if(engine.Year != boat.Year)
        boat.Engines.Remove(engine);
        // alternatively
        // engine.Boat = null;
    }
  }
  inventory.SaveChanges();

Conclusions

In summary, there are at least six reasons why writing applications directly to the database storage schema may be problematic:

1. You may not have control over either the application object model or the storage schema.

2. The degree of database schema normalization may make it cumbersome to consume directly from an application.

3. Certain real-world modeling concepts can't be directly represented in a relational schema.

4. The application may be forced to embed implicit knowledge of how fields in the database schema are used that is difficult to track and brittle to maintain.

5. Different applications may want to expose different views of the same data.

6. The database schema may change over time, breaking applications that directly write to that schema.

The ADO.NET Entity Framework allows your applications to target a conceptual model using application concepts such as strong typing, inheritance, and relationships. This conceptual model can be mapped to a variety of storage schemas. Because the mapping is done declaratively, outside of the application, changes to the database schema over time to optimize for different access patterns only requires that the mapping change; the application can continue using the same queries, retrieve the same results, and make changes against the same conceptual model.

Resources

"Next-Generation Data Access: Making the Conceptual Level Real," J. Blakeley, D. Campbell, J. Gray, S. Muralidhar, and A. Nori (MSDN, June 2006).

"The ADO.NET Entity Framework Overview" (MSDN, June 2006).

"The LINQ Project" (MSDN, January 2007).

"Visual Studio Future Versions" (MSDN, January 2007).

"Dr. Peter Chen: Entity Relationship Model—Past, Present, and Future," April 2007.

"The Entity-Relationship Model: Toward a Unified View of Data." Peter P.S. Chen. ACM Transactions on Database Systems (TODS), 1976.

"A Relational Model of Data for Large Shared Data Banks." E.F. Codd. Communications of the ACM, 1970.

About the author

Michael Pizzo has worked for over 17 years in the design and delivery of data-access solutions and APIs at Microsoft. Michael got started in data access as a program manager for Microsoft Excel in 1989, and was involved in the design and delivery of ODBC, along with the ODBC-based Microsoft Query Tool shipped with Microsoft Office. He has been active in the standards organizations, sitting as Chair for the SQL Access Group, working with X/Open on the CAE specification for "Data Management: SQL-Call Level Interface (CLI)", serving as Microsoft's representative to the ANSI X3H2 Database Committee, and as an elected ANSI representative to the ISO committee meetings that defined and adopted Part 3 of the ANSI/ISO SQL specification for a call-Level Interface (SQL/CLI). Michael was a key designer and driver of Microsoft's OLE DB API, and later owned the design and delivery of ADO.NET version 1.0. He is currently a software architect in the Data Programmability Team at Microsoft, contributing to the architecture and design of the next version of ADO.NET and core building block for Microsoft's new Entity Data Platform, the ADO.NET Entity Framework.

This article was published in the Architecture Journal, a print and online publication produced by Microsoft. For more articles from this publication, please visit the Architecture Journal Web site.