A Flexible Model for Data Integration

Article
01/14/2009

by Tim Ewald and Kimberly Wolk

Summary: There are many challenges in systems integration for architects and developers, and the industry has focused on XML, Web services, and SOA for solving integration problems by concentrating on communication protocols, particularly in regard to adding advanced features that support message flow in complex network topologies. However, this concentration on communication protocols has taken the focus away from the problem of integrating data. Flexible models for combining data across disparate systems are essential for successful integration. These models are expressed in XML schema (XSD) in Web service–based systems, and instances of the model are represented as XML transmitted in SOAP messages. In our work on the architecture of the MSDN TechNet Publishing System (MTPS), we addressed three pitfalls. We'll look at what those pitfalls are and our solutions to them, in the context of a more general problem—that of integrating customer information.

The Essential Data-Integration Problem
Demanding Too Much Information
No Effective Versioning Strategy
No Support for System-Level Extension
Mitigating the Risks
About the Authors
Resources

Systems integration presents a wide range of challenges to architects and developers. Over the last several years, the industry has focused on using XML, Web services, and service-oriented architecture (SOA) to solve integration problems. Much of the work done in this space has concentrated on communication protocols, especially on adding advanced features designed to support messages flowing in complex network topologies. While there is undoubtedly some value in this approach, all of this work on communication protocols has taken focus away from the problem of integrating data.

Having flexible models for combining data across disparate systems is essential to a successful integration effort. In Web service–based systems, these models are expressed in XSD. Instances of the model are represented as XML that is transmitted between systems in SOAP messages. Some systems map the XML data into relational databases; some do not. From an integration perspective, the structure of those relational database models is not important. What matters is the shape of the XML data model defined in XSD.

There are three pitfalls that Web service–based data-integration projects typically fall into. All three are related to how they define their XML schemas. We confronted all three in our work on the architecture of the MSDN TechNet Publishing System (MTPS), the next-generation XML-based system that serves as the foundation for MSDN2. We will look at our solutions in the context of integrating customer information.

The Essential Data-Integration Problem

Imagine you work at a large company. Your company has many outward-facing systems that are used by customers to accomplish a variety of tasks. For instance, one system offers customized product information to registered users who have expressed particular interests. Another system provides membership-management tools for customers in your partner program. A third system tracks customers who have registered to come to upcoming events. Unfortunately, the systems were all developed separately—one of them, by a separate company that your company acquired a year ago. Each of these systems stores customer-related information in different formats and locations.

This setup presents a critical problem for the business: It doesn't have a unified view of a given customer. This problem has two effects. First, the customer's experience suffers, because the single company with which they are doing business treats them as different people when they use different systems. For example, customers who have expressed a desire to receive information about a given product through e-mail whenever it becomes available have to express their interest in that product a second time, when they register for particular talks at an upcoming company-sponsored event. Their experiences would be more seamless if the system that registered them for an upcoming event already knew about their interest in a particular product.

Second, the business suffers, because it does not have an integrated understanding of its customers. How many customers who are members of the partner program are also receiving information about products through e-mail? In both cases, the divisions between the systems with which the customer works are limiting how well the company can respond to its customers' needs.

Whether this situation arose because systems were designed and developed individually, without any thought to the larger context within which they operate, or because different groups of integrated systems were collected through mergers and acquisitions is irrelevant. The problem remains: The business must integrate the systems to improve the customer experience and its own knowledge of who the customer is and how best to serve them.

The most common approach to solving this problem is to mandate that all systems adopt a single canonical model for a customer. A group of architects gets together and designs the company's single format for representing customer data as XML. The format is defined using a schema written in XSD. To enable systems to share data in the new format, a central team builds a new store that supports it. The XSD data model team and store team deliver their solution to all the teams responsible for systems that interact with customers in some way and require that they adopt it. The essential change is shown in Figures 1 and 2.

Each system is modified to use the underlying customer-data store through its Web service interface. They store and retrieve customer information as XML that conforms to the customer schema. All of the systems share the same service instance, the same XSD data model, and the same XML information.

This solution appears to be simple, elegant, and good, but a naïve implementation will typically fail for one of three reasons: demanding too much information, no effective versioning strategy, and no support for system-level extension.

Demanding Too Much Information

The first potential cause for failure is a schema and store that require too much information. When people build a simple Web service for point-to-point integration, they tend to think of the data their particular service needs. They define a contract that requires that particular data be provided. When a contract is generated from source code, this can happen implicitly. Most of the tools that map from source code to a Web service contract treat fields of simple value types as required data elements, insisting that a client send them. Even when a contract is created by hand, there is still a tendency to treat all data as required. As soon as the service determines (by schema validation or code) that some required data is not present, it rejects the request. The client gets a service fault.

This approach to defining Web service contracts is too rigid and leads to systems that are very tightly coupled. Any change in the service's requirements forces a change in the contract and in the clients that consume it. To loosen this coupling, you need to separate the definition of the shape of the data a service expects from a service's current processing requirements. More concretely, the data formats defined by your contract should treat everything beyond identity data as optional. The implementation of your service should enforce occurrence requirements internally at run time (either using a dedicated validation schema or code). It should be as forgiving as possible when data is not present in a client request and degrade gracefully.

In the customer-information example, it is easy to think of cases where some systems want to work with customers, but do not have complete customer information available. For instance, the system that records a customer's interest in a particular product might only collect a customer's name and preferred e-mail address. The event registration system, in contrast, might capture address and credit-card information, too. If a common customer-data model requires that every valid customer record include name, e-mail, address, and credit-card information, neither system can adopt it without either collecting more data than it needs or providing bogus data. Making all the data other than the identity (ID number, e-mail address, and so forth) optional eases adoption of the data model, because systems can simply supply the information they have.
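As an illustration of keeping occurrence requirements out of the shared contract, the following is a minimal Python sketch, not the MTPS implementation. The element names (id, name, email, address, creditCard), the system names, and the helper functions are all hypothetical. Only the identity element is treated as mandatory by the shared model; each system declares what it needs in its own code.

```python
import xml.etree.ElementTree as ET

# Hypothetical names: the shared contract requires only the identity element.
IDENTITY_FIELD = "id"

# Hypothetical per-system occurrence requirements, enforced in code at run time
# instead of in the shared schema.
SYSTEM_REQUIREMENTS = {
    "product-interest": {"name", "email"},
    "event-registration": {"name", "email", "address", "creditCard"},
}

def present_fields(customer_xml):
    """Return the names of child elements that carry text or nested elements."""
    root = ET.fromstring(customer_xml)
    return {
        child.tag
        for child in root
        if (child.text or "").strip() or len(child) > 0
    }

def missing_for(system, customer_xml):
    """Return the fields this system still needs; fail only if identity is absent."""
    fields = present_fields(customer_xml)
    if IDENTITY_FIELD not in fields:
        raise ValueError("customer record has no identity")
    return SYSTEM_REQUIREMENTS[system] - fields

# A record produced by the product-interest system: it supplies only what it has.
record = (
    "<customer><id>42</id><name>Ada</name>"
    "<email>ada@example.com</email></customer>"
)
```

With this record, the product-interest system has everything it needs, while the event-registration system learns that it must still collect an address and credit-card details. Neither system had to supply bogus data to satisfy the shared model.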

Figure 1. Three separate data stores, one per system

By separating the shape of data from occurrence requirements, you make it easier to manage change in the implementation of a single service. It is also critical when you are defining a common XML schema to be used by multiple services and clients. If too much information is mandatory, every system that wants to use the data model may be missing some required piece of information. That leaves each system with the choice of not adopting the shared model and store, or providing bogus data (often, the default value of a simple programming-language type). Either option can be considered a failure.

Figure 2. A single data store and format

You gain a lot of flexibility for systems to adopt the model by loosening the schema's occurrence requirements nearly completely. Each system can contribute as much data as it has available, which makes a common XML schema much easier to adopt. The price is that systems receiving data must be careful to check that the data they really need is present. If it is not present, they should respond accordingly by getting more data from the user or some other store, by downgrading their behavior, or—only in the worst case—generating a fault. What you are really doing is shifting some of the constraints you might normally put in an XML schema into your code, where they will be checked at run time. This shifting gives you room to change those constraints without revising the shared schema.
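The receiving side of this trade-off can be sketched as follows. This is a hypothetical Python example: the Action names, the decide function, and the split into "essential" and "nice to have" fields are assumptions made for illustration, not part of any real toolkit. It shows a consumer that asks for more data, downgrades its behavior, or faults only as a last resort.

```python
from enum import Enum

class Action(Enum):
    PROCEED = "proceed normally"
    ASK_USER = "ask the user for the missing data"
    DEGRADE = "continue with reduced behavior"
    FAULT = "generate a fault"

def decide(record, essential, nice_to_have, interactive):
    """Choose a response based on which fields of a parsed record are present.

    record       -- dict of field name to value (empty or None counts as missing)
    essential    -- fields this system cannot work without
    nice_to_have -- fields whose absence only reduces functionality
    interactive  -- True if a user is available to supply more data
    """
    def missing(fields):
        return {f for f in fields if not record.get(f)}

    if missing(essential):
        # Faulting is the worst case, used only when nobody can fill the gap.
        return Action.ASK_USER if interactive else Action.FAULT
    if missing(nice_to_have):
        return Action.DEGRADE
    return Action.PROCEED

customer = {"id": "42", "email": "ada@example.com"}
```

Because these rules live in the consuming system's code, they can change without any revision to the shared schema.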

No Effective Versioning Strategy

The second potential cause for failure is the lack of a versioning strategy. No matter how much time and effort is put into defining an XML schema up front, it will need to change over time. If a schema, the shared store that supports it, and every system that uses it have to move to a new version all at once, you cannot succeed. Some systems will have to wait for necessary changes, because other systems are not at a point where they can adopt a revision. Conversely, some systems will be forced to do extra, unexpected work, because other systems need to adopt a new revision. This approach is untenable.

To solve this problem, you need to embrace a versioning strategy that allows the schema and store to move ahead independent of the rate at which other systems adopt their revisions. This solution sounds simple, and it is, as long as you think about XML schemas the right way.

Figure 3. A combination of stores (see Figures 1 and 2)

Systems that integrate using a common XML schema view it as a contract. Lowering the bar for required data by making elements optional makes a contract easier to agree to, because systems are committing to less. For versioning, systems also need to be allowed to do more without changing the schema namespace. What this means in practical terms is that a system should always produce XML data based on the version of the schema with which it was developed. It should always consume data based on that same version, while tolerating additional information that it does not recognize. This definition is a variation on Postel's Law: "Be liberal in what you accept, conservative in what you send." Arguably, this idea underlies all successful distributed systems technologies, and certainly all loosely coupled ones. If you take this approach, you can extend a schema without updating clients.

In the customer example, an update to the schema and store might add support for an additional optional element that captures the user's mother's maiden name for security purposes. If systems working with the old version generate customer records without this information, it's okay, because the element is optional. If they send those records to other systems that require this information, the request may fail, and that is okay, too. If new systems send customer data including the mother's maiden name to old systems, that is okay also, because they are designed to ignore it.
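The behavior of an "old" system in this example can be sketched in a few lines of Python. The element names and both functions are hypothetical, and this is only an illustration of the idea, not how any particular toolkit implements it. The system writes only the elements defined by its own schema version and, when reading, picks out the elements it knows and ignores everything else.

```python
import xml.etree.ElementTree as ET

# Hypothetical: the elements defined by the schema version this system was built against.
KNOWN_V1_FIELDS = ("id", "name", "email")

def read_customer_v1(customer_xml):
    """Liberal in what it accepts: unknown elements are simply skipped."""
    root = ET.fromstring(customer_xml)
    # findtext returns None when an optional element is absent.
    return {field: root.findtext(field) for field in KNOWN_V1_FIELDS}

def write_customer_v1(customer):
    """Conservative in what it sends: only elements from its own schema version."""
    root = ET.Element("customer")
    for field in KNOWN_V1_FIELDS:
        if customer.get(field) is not None:
            ET.SubElement(root, field).text = customer[field]
    return ET.tostring(root, encoding="unicode")

# A record from a newer system that already knows about the added optional element.
v2_record = (
    "<customer><id>42</id><email>ada@example.com</email>"
    "<motherMaidenName>Byron</motherMaidenName></customer>"
)
old_view = read_customer_v1(v2_record)
```

Here the old system sees the id and e-mail address, finds no name (which is acceptable, because it is optional), and never notices the mother's maiden name.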

Happily, many Web service toolkits support this feature directly in their schema-driven marshaling plumbing. Certainly, the .NET object-XML mappers (both the trusty XmlSerializer and the new XmlFormatter/DataContract) handle extra data gracefully. Some Java toolkits do, too, and frameworks that support the new JAX-WS 2.0 and JAXB 2.0 specifications will as well. Given that, adopting this approach is pretty easy.

The only real problem with this model is that it introduces multiple definitions of a given schema, each representing a different version. Given a piece of data—an XML fragment captured from a message sent on the wire, for instance—it is impossible to answer the question: "Is this data valid?" The question of validity can only be answered relative to a particular version of the schema. The inability to state definitively whether a given piece of data is valid presents an issue for debugging and, possibly, also for security. With data models described with XML schema, it is possible to answer a different and more interesting question: "Is this data valid enough?"

This question is really what systems care about. You can answer it using the validity information that most XML schema validators provide. This is a reasonable path to take in cases where schema validation is required, and it can be implemented with today's schema validation frameworks.
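To make the idea of "valid enough" concrete, here is a rough Python sketch. It is an approximation under stated assumptions: the Python standard library does not include an XSD validator, so the per-element checks in CHECKS are hand-written stand-ins for the validity information a real schema validator would report, and all element names are hypothetical. Each element is classified as valid, invalid, or unknown, and a system asks only whether the elements it needs are valid.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical per-element checks, standing in for the types a schema would define.
CHECKS = {
    "id": lambda text: text.isdigit(),
    "name": lambda text: len(text) > 0,
    "email": lambda text: re.fullmatch(r"[^@\s]+@[^@\s]+", text) is not None,
}

def assess(customer_xml):
    """Classify each child element as 'valid', 'invalid', or 'unknown'."""
    root = ET.fromstring(customer_xml)
    report = {}
    for child in root:
        check = CHECKS.get(child.tag)
        text = (child.text or "").strip()
        if check is None:
            # Probably from a newer schema version; not an error for this system.
            report[child.tag] = "unknown"
        else:
            report[child.tag] = "valid" if check(text) else "invalid"
    return report

def valid_enough(customer_xml, needed):
    """True if every element this system needs is present and valid."""
    report = assess(customer_xml)
    return all(report.get(tag) == "valid" for tag in needed)

record = (
    "<customer><id>42</id><email>ada@example.com</email>"
    "<motherMaidenName>Byron</motherMaidenName></customer>"
)
```

The same record can be valid enough for a system that needs only an id and an e-mail address, and not valid enough for one that also needs a name.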

No Support for System-Level Extension

The third potential cause for failure is lack of support for system-specific extensions to a schema. The versioning strategy based on the notion that a schema's definition changes over time is necessary to promote adoption, but it is not sufficient. While it frees systems from having to adopt the latest schema revision immediately, it does nothing to help systems that are waiting for specific schema updates. Delays in revisions also can make a schema too expensive for a system to adopt. The solution to this last problem is to allow systems that adopt a common schema to extend it with additional information of their own. The extension information can be stored locally in a system-level store (see Figure 3).

Figure 4. The same store (see Figure 3), with data formats in a shared store

In this case, each system is modified to write customer data both to its dedicated store using its own data model and to the shared store using the canonical schema. It is also modified to read customer data from both its dedicated store and the shared store. Depending on what it finds, it knows whether a customer is already known to the company and to the system. Table 1 summarizes the three possibilities.

The system can use this information to decide how much information it needs to gather about a customer. If the customer is new to the company, the system will add as much information as it can to the canonical store. That information becomes available for other systems that work with customers. It may also store data in its dedicated store to meet its own needs. This model can be further expanded so that system-specific data is stored in the shared store, too (see Figure 4).

This solution makes it possible for systems to integrate with one another, using extension data that is beyond the scope of the canonical schema. To work successfully, the store and other systems need to have visibility into the extension data; in other words, it cannot be opaque. The easiest way to solve this problem is to make the extension data itself XML. The system providing the data defines a schema for the extension data, so other systems can process it reliably. The shared store keeps track of extension schemas, so it can ensure that the extension data is valid, even if it does not know explicitly what the extension contains. In the most extreme case, a system might choose to store an entire customer record in a system-specific format as XML extension data. Other systems that understand that format can use it. Systems that do not understand it rely on the canonical representation, instead.
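One way to picture typed, non-opaque extension data is sketched below in Python. Everything named here is an assumption for illustration: the namespace URIs, the element names, and the EXTENSION_REGISTRY, which uses a simple set of allowed element names as a stand-in for a registered extension schema. Extension elements live in their own XML namespace; the shared store checks them against the registered definition without understanding their meaning, and only systems that know the namespace read them.

```python
import xml.etree.ElementTree as ET

CUSTOMER_NS = "urn:example:customer"        # hypothetical canonical namespace
EVENTS_NS = "urn:example:events-extension"  # hypothetical extension namespace

# Stand-in for registered extension schemas: namespace -> allowed element names.
EXTENSION_REGISTRY = {EVENTS_NS: {"dietaryPreference", "badgeName"}}

def split_tag(tag):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag

def check_extensions(customer_xml):
    """Shared-store side: list problems found in extension elements."""
    problems = []
    for child in ET.fromstring(customer_xml):
        namespace, local = split_tag(child.tag)
        if namespace == CUSTOMER_NS:
            continue  # canonical data is handled elsewhere
        allowed = EXTENSION_REGISTRY.get(namespace)
        if allowed is None:
            problems.append(f"unregistered extension namespace: {namespace}")
        elif local not in allowed:
            problems.append(f"{local} is not defined by {namespace}")
    return problems

def read_extension(customer_xml, namespace):
    """System side: read only the extension data from a namespace it understands."""
    return {
        split_tag(child.tag)[1]: child.text
        for child in ET.fromstring(customer_xml)
        if split_tag(child.tag)[0] == namespace
    }

record = (
    f'<c:customer xmlns:c="{CUSTOMER_NS}" xmlns:e="{EVENTS_NS}">'
    "<c:id>42</c:id><e:badgeName>Ada L.</e:badgeName></c:customer>"
)
```

A system that does not know the events namespace never calls read_extension for it and works from the canonical elements alone.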

When systems are independent, each controls its own destiny. They can capture and store whatever information they need, in whatever format and location they prefer. The move to a single common schema and store changes that. If adopting a common XML data format restricts a system's freedom to deliver required functionality, you are doomed to fail.

Using a combination of typed XML extension data in either a system-level or the shared store adds complexity, because you have to keep data synchronized. But it also provides tremendous flexibility. You can align systems around whatever combination of the canonical schema and alternate schemas you want. You can drive toward one XML format over time, but you always have the flexibility to deviate from that format to meet new requirements. This extra freedom is worth a lot in the uncertain world of the enterprise.

Table 1. The three possible cases of customer data

Record in shared store	Record in system store	Meaning
No	No	Customer is new to the company. Collect all common and system-specific data.
Yes	No	Customer is known to the company, but new to the system. Collect system-specific data.
Yes	Yes	Customer is known to the company and to the system.
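The decision in Table 1 can be written directly as code. The following Python sketch uses a hypothetical function name and return strings. The fourth combination (a record in the system store but not in the shared store) is not covered by Table 1, so treating it as an error is an assumption made here.

```python
def customer_status(in_shared_store, in_system_store):
    """Map the two lookups from Table 1 to what the system should do next."""
    if not in_shared_store and not in_system_store:
        return "new to company: collect all common and system-specific data"
    if in_shared_store and not in_system_store:
        return "known to company, new to system: collect system-specific data"
    if in_shared_store and in_system_store:
        return "known to company and to system"
    # Assumption: Table 1 does not list this case, so it is flagged here.
    raise ValueError("record in system store but not in shared store")

status = customer_status(in_shared_store=True, in_system_store=False)
```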

A further, subtle benefit of this model is that it allows the team defining the common schema to slow its work on revisions. Systems can use extension data to meet new requirements between revisions. The team working on the canonical model can mine those extensions for input into its revision process. This feedback loop helps ensure that model changes are driven by real system requirements.

Mitigating the Risks

Lots of organizations are working on integrating systems, using XML data described using XML schema and exchanged through Web services. In this discussion, we presented three common causes for failure in these data-centric integration projects: demanding too much information, no effective versioning strategy, and no support for system-level extensions. To mitigate these risks:

Make schema elements optional, and encode system-specific occurrence requirements as part of that system's implementation.
Build systems that produce data according to their version of a shared schema but consume data according to that version plus any additional information, so that systems can adopt schema revisions at different rates without changing the schema namespace.
Allow systems to extend shared schemas with their own data to meet new requirements independent of data-model revisions.

All of these solutions are based on one core idea: To integrate successfully without sacrificing agility, systems need to be able to agree on as little as possible and still get things done. So, does it all work? The answer is: Yes. These techniques are core to the design of the MSDN/TechNet Publishing System, which underlies MSDN2.

About the Authors

Tim Ewald is a principal architect at Foliage Software Systems, where he helps customers design and build applications ranging from enterprise IT to medical devices. Prior to joining Foliage, Tim worked at Mindreef, a leading supplier of Web services diagnostics tools. Before that, Tim was a program manager lead at MSDN, where he and Kim Wolk were co-architects of MTPS, the XML- and Web service–based publishing engine behind MSDN2. Tim is an internationally recognized speaker and author.

Kim Wolk is the development manager for MSDN and the driving force behind MTPS, the XML- and Web service–based publishing engine behind MSDN2. Previously, she worked as MTPS co-architect and lead developer. Before joining MSDN, Kim spent many years as a consultant, both independently and with Microsoft Consulting Services, where she worked on a wide range of mission-critical enterprise systems.

Resources

MSDN Magazine

MSDN2 Library

This article was published in the Architecture Journal, a print and online publication produced by Microsoft. For more articles from this publication, please visit the Architecture Journal website.