The WinFS Files: Divide et Impera

 

Sean Grimaldi
Microsoft Corporation

December 2005

Summary: WinFS enables developers to think about data as data, not as specific data formats. Sean Grimaldi explores the implications of being able to unify your data.

Divide and conquer, or unify?

I wasn't immediately sure how to approach this column, because it deals with an abstract architectural problem. The problem is that data is not unified, which denies programmers and end users the ability to use most of their data the way they can use data in a database.

Often, you can solve a difficult problem by dividing it into two or more simpler problems. You then solve the sub-problems and combine their solutions to solve the original problem. For example, merge sort is an old sorting algorithm, and like binary search it uses a divide-and-conquer approach. The merge sort algorithm splits a list into halves. You sort each half and then merge the two sorted halves. The result is a sorted list. This approach generally takes less time than simple methods that repeatedly scan the whole list, although the difference depends on how you implement it and the data you sort.
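To make the divide-and-conquer idea concrete, here is a minimal merge sort sketch. The class and method names (MergeSortSketch, Sort, Merge) are chosen for illustration only and it sorts integers for simplicity; treat it as one possible implementation rather than a definitive one.

```csharp
using System;
using System.Collections.Generic;

public static class MergeSortSketch
{
    // Divide: split the list in half, sort each half, then merge the halves.
    public static List<int> Sort(List<int> items)
    {
        // A list of zero or one elements is already sorted.
        if (items.Count <= 1)
        {
            return new List<int>(items);
        }

        int middle = items.Count / 2;
        List<int> left = Sort(items.GetRange(0, middle));
        List<int> right = Sort(items.GetRange(middle, items.Count - middle));
        return Merge(left, right);
    }

    // Combine: merge two sorted lists into one sorted list.
    private static List<int> Merge(List<int> left, List<int> right)
    {
        List<int> merged = new List<int>(left.Count + right.Count);
        int i = 0;
        int j = 0;
        while (i < left.Count && j < right.Count)
        {
            // Taking from the left list on ties keeps equal items in order.
            if (left[i] <= right[j])
            {
                merged.Add(left[i++]);
            }
            else
            {
                merged.Add(right[j++]);
            }
        }

        // One list is exhausted; append whatever remains of the other.
        while (i < left.Count) merged.Add(left[i++]);
        while (j < right.Count) merged.Add(right[j++]);
        return merged;
    }
}
```

Calling MergeSortSketch.Sort on a list containing 5, 2, 9, 1 returns a new list containing 1, 2, 5, 9 and leaves the input untouched. The hard part, as with data, is picking a split where the halves can be solved independently.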

Divide and conquer is such a natural way of problem solving that I suspect most people seldom reflect on it. For me, the difficulty lies less in the general approach than in finding a way to divide the problem orthogonally, so that the sub-problems do not overlap.

Data has been divided.

It seems this divide-and-conquer approach is what we've done with data. Data is a rich concept. By data, I mean numbers, characters, and other artifacts, like images, on which programs operate. Note that this definition of data encompasses the data in file system files as well as the data in relational databases.

A huge amount of data is stored in files. The computer I am writing on has 608,903 files on it, including vacation photos, C# files, Excel workbooks, and many file types for which I don't even recognize the extension.

I also have some SQL Server database files, like Northwnd.mdf. A lot of data is stored in these database files, but by database standards the largest database on this computer is tiny: less than 1 GB.

There are other ways to store data, but the bulk of the data I am interested in manipulating programmatically is either in files or in databases.

Divided data is bad.

Although dividing data into these many formats, each with different expectations and behaviors, has enabled programs and end users to manipulate data, dividing data also has negative effects.

One negative effect is the amazing variety of data types and data formats. I like to think of this as the full name problem. The concept of a person is simple. Babies can distinguish between people and non-people shortly after birth. People have names. These concepts are about as simple and fundamental as it gets, and yet there are literally thousands of formats to express person and name in data structures. For example, you can define full name as first name plus surname with an optional middle initial. Some programs use this data structure, but others do not. In many cases, full name is a single string property unrelated to any first name or surname property. Sometimes a person can have multiple full names. For example, my friend has a given name and a nickname he uses with the surname under which he operates his business. Both identifiers refer to the same person in different contexts. Can't we just agree that people have a first name and a surname and that some people have nicknames and zero or more middle names? Do we really need hundreds of ways of representing a person's name as a data structure? There is a cost to redefining the full name concept so many times.

Querying is another problem with diverse data structures. Because a lot of data is stored in files and a lot of data is stored in databases, to query your data you often need a query engine that can return results for data stored as a file or in the database. But this is not a common feature. So, without WinFS, you often have to query and merge the result sets from the file system, databases, and other data stores. This in itself can be a tough programming task. Consider one example of how this affects me. I use Word as my e-mail editor and Outlook as my e-mail client, so sometimes I forget whether I wrote something directly in Word or indirectly through Word in an e-mail. If I used Word, it is probably in a file, but if I used Outlook it is stored in the Exchange server database. This matters, because Exchange and Outlook and the Search Companion toolbar do not search the same things and do not support the same query capabilities.
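A rough sketch of what "query each store, then merge" looks like in practice follows. Everything here is hypothetical: SearchHit, DividedSearch, and the in-memory dictionary standing in for a mail database are assumptions made for illustration, and a real mail store would need its own client API and query syntax.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical common shape for a hit, since each store returns something different.
public class SearchHit
{
    public string Source;    // which store produced the hit
    public string Location;  // file path or message subject

    public SearchHit(string source, string location)
    {
        Source = source;
        Location = location;
    }
}

public static class DividedSearch
{
    // Store 1: text files on disk, searched by reading each file.
    public static List<SearchHit> SearchFiles(string folder, string term)
    {
        List<SearchHit> hits = new List<SearchHit>();
        foreach (string path in Directory.GetFiles(folder, "*.txt"))
        {
            string text = File.ReadAllText(path);
            if (text.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                hits.Add(new SearchHit("file", path));
            }
        }
        return hits;
    }

    // Store 2: a stand-in for a mail database, mapping subject to body.
    public static List<SearchHit> SearchMessages(
        IDictionary<string, string> bodiesBySubject, string term)
    {
        List<SearchHit> hits = new List<SearchHit>();
        foreach (KeyValuePair<string, string> message in bodiesBySubject)
        {
            if (message.Value.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                hits.Add(new SearchHit("message", message.Key));
            }
        }
        return hits;
    }

    // The caller has to know about every store and merge the results by hand.
    public static List<SearchHit> SearchEverywhere(
        string folder, IDictionary<string, string> bodiesBySubject, string term)
    {
        List<SearchHit> all = SearchFiles(folder, term);
        all.AddRange(SearchMessages(bodiesBySubject, term));
        return all;
    }
}
```

Even in this toy version, each store needs its own search routine, and adding a third store means touching SearchEverywhere again. Ranking, paging, and richer predicates would each have to be written once per store.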

Even if you can query across physical data stores, it is unfeasible to write meaningful queries without understanding the semantics of each data structure. You need to know how to write a query for a name that can be null, and you need to understand what it means to retrieve a person's name that has a null value. Otherwise, you don't understand the data returned from the query.

A necessary condition for rich query capabilities is that the data must have some semantically equivalent attributes. For example, if a person does not have a name, birth date, gender, mother, location, or other attributes, it becomes very difficult to write a meaningful query for persons. Usually the data needs to be evaluated and, if feasible, transformed and mapped to enable programmers to identify and manipulate equivalent attributes.

There is a cost in transforming among this huge variety of data structures. In perhaps the simplest scenario, it takes some code to transform a full name property to a string and some more code to transform a string to a full name property. In some cases, even this simple transform is impossible without data loss. For example, if full name is a collection of full name structs, there is no way to convert them to a string full name without the possible loss of some of the items in the collection. Full name might be in one table and the person it applies to in another table, so you have to join to get the one or more full names for a given person. For each format, this bridging code has to be specified, coded, tested, and documented, and sometimes it comes out wrong—for example, when the specification did not anticipate that a person does not have a middle name. Between data loss and a high cost of bridging, the variety of data structures is a real problem.
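The bridging code described above might look something like the following sketch. The FullName struct and the NameBridge helper are hypothetical names invented for this illustration; the parsing rule (first token is the first name, last token is the surname) is an assumption, and a deliberately naive one.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical structured name, as one application might define it.
public struct FullName
{
    public string First;
    public string Middle;   // null when the person has no middle name
    public string Surname;

    public FullName(string first, string middle, string surname)
    {
        First = first;
        Middle = middle;
        Surname = surname;
    }
}

public static class NameBridge
{
    // Structured name to single string. Must cope with a missing middle name.
    public static string ToSingleString(FullName name)
    {
        return string.IsNullOrEmpty(name.Middle)
            ? name.First + " " + name.Surname
            : name.First + " " + name.Middle + " " + name.Surname;
    }

    // Single string to structured name. This is only a guess: the string
    // does not say where the first name ends or the surname begins.
    public static FullName FromSingleString(string text)
    {
        string[] parts = text.Split(new char[] { ' ' },
            StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length == 0)
        {
            throw new ArgumentException("Name is empty.", "text");
        }

        string first = parts[0];
        string surname = parts.Length > 1 ? parts[parts.Length - 1] : "";
        string middle = parts.Length > 2
            ? string.Join(" ", parts, 1, parts.Length - 2)
            : null;
        return new FullName(first, middle, surname);
    }

    // Collection of names to single string. Only the first name survives,
    // so nicknames and business names are silently dropped.
    public static string ToSingleString(IList<FullName> names)
    {
        return names.Count == 0 ? "" : ToSingleString(names[0]);
    }
}
```

A person whose first name is "Mary Ann" and who has no middle name converts to the string "Mary Ann Smith", but converting that string back yields a first name of "Mary" and a middle name of "Ann". The round trip quietly changes the data, which is exactly the kind of error that a specification tends not to anticipate.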

Perhaps the largest cost, though, is in rewriting behaviors for each data structure. Consider the save concept, for example. The concept is straightforward: You want to persist some data to disk for possible use later. The data could be an object, a file, or a database row—all data. For example:

Data               One method of saving
Excel documents    File -> Save
XML                System.Xml.XmlWriter
Objects            System.Xml.Serialization.XmlSerializer
Rows               Lazy writer writes to disk
E-mail             File -> Save

In each of these cases, the code for the save operation is different.

Even if you want to represent full name in a common way across these different data manifestations, you still have to write code for the save operation in different ways. Word is an example of an application in which the same data in the same application can be saved in multiple formats. In Word, you can save a document as .doc, .xml, .html, .txt, etc. This affects end users, because save means something different to most applications; but it particularly affects developers. Because data is divided into so many manifestations, you have to specify, code, test, and document the save operation multiple times. This is expensive even under the best-case scenario that the applications that perform this save operation all agree on what a full name looks like. Usually they don't, even at the same company.

So let's review. So far, I've said that data is divided and that there are some negatives to this situation. The data is divided into file, database, and other data. The negatives are that programmers have to recreate data types, bridge between data types, and rewrite behaviors repeatedly. If this seems normal to you, it may be because you have operated in this fractured data environment for years and have come to expect nothing more.

WinFS unifies data, solving the problem of divided data.

Until I started working on the WinFS team, I had forgotten the splendor that is data—data that you can search, save, update, share, synchronize, respond to notifications on, treat as, well, data. Some people call this unified data, and it might seem abstract.

Once you begin treating data as data, and not as these various formats, you can do some interesting things. For example, you can think about information in new ways. You could write code that shows, for everyone attending a given meeting, a picture of the person, their role, and their contact information, along with every document they contributed to that you have not already read. On a personal note, it could also show whether they have already given you some money for the charity bicycle race you are participating in next month. Today, all this data is available on my computer, but without WinFS I have no feasible way to query it. I plainly can't get at a lot of my data.

How WinFS solves this problem:

WinFS resolves this issue by defining common item types. For example, think about never having to recreate a data type for common concepts such as audio record, meeting, contact, person, document, photo, message, video clip, etc.

The following are the WinFS namespaces in which these common entity types reside:

  • System.Storage.Audio
  • System.Storage.Calendar
  • System.Storage.Contacts
  • System.Storage.Documents
  • System.Storage.Image
  • System.Storage.Media
  • System.Storage.Messages
  • System.Storage.Video

With WinFS, a person is a person is a person, no matter where the data comes from. Think of all the time you have spent coming up with various objects to represent a person. I can recall defining about 10 variations of person as I moved from one team to another. In almost all of these cases, the person type varied in trivial ways. WinFS does not require one schema to rule them all, however. You can choose to extend the WinFS types to suit your goals using either subclassing or an extension mechanism. Let me know if you are interested in an article about extending WinFS types.

Once you have a definition for item types, like person, you don't need to write so much code to bridge between types. For example, if your application and my application both use the WinFS person type, no transformation is necessary. Even if I extend the WinFS person type, we may be able to share the data without transforming it. For example, if I write an extension to the person type that adds a property for their medical insurance member ID, and your application is a greeting card application, I can pass a person between our applications with just a cast to person.
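The sharing scenario can be sketched with ordinary C# inheritance. The classes below (SharedPerson, InsuredPerson, GreetingCardApp) are hypothetical stand-ins written for this illustration; they are not the WinFS Person type or its extension mechanism, and they only show why a consumer that knows the common type needs no transformation code.

```csharp
using System;

// Stand-in for a person type that both applications agree on.
// This is NOT the WinFS Person class.
public class SharedPerson
{
    public string DisplayName;
}

// One application's extension: adds a property the other never sees.
public class InsuredPerson : SharedPerson
{
    public string InsuranceMemberId;
}

// The other application knows only the shared type.
public static class GreetingCardApp
{
    public static string Greet(SharedPerson person)
    {
        return "Happy birthday, " + person.DisplayName + "!";
    }
}

public static class SharingDemo
{
    public static string Run()
    {
        InsuredPerson patient = new InsuredPerson();
        patient.DisplayName = "Sean Grimaldi";
        patient.InsuranceMemberId = "A-12345";   // made-up value

        // A cast to the shared type is all the bridging that is needed.
        SharedPerson person = (SharedPerson)patient;
        return GreetingCardApp.Greet(person);
    }
}
```

The greeting card code never learns about the insurance ID, and the extended object is passed along unchanged rather than copied into a different structure.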

By far the biggest advantage is common behaviors. Think about this in relation to the save operation. You can save a person instance with a line of code, as in the following snippet:

C#

// Access the WinFS store.
using (WinFSData wd = new WinFSData())
{
    // Create a person.
    Person person = new Person(wd.GetRootItem());
    person.DisplayName = "Sean Grimaldi";
    
    // Save the person.
    wd.SaveChanges();
}

The save operation is a part of the WinFS API, so you don't need to rewrite it. It applies to all common entity types.

C#

// Access the WinFS store.
using (WinFSData wd = new WinFSData())
{
    // Create a document.
    Document document = new Document(wd.GetRootItem());
    document.DisplayName = "MyDoc1";
    
    // Save the document.
    wd.SaveChanges();
}

It doesn't matter if this is a special sort of document, such as one I derive from Document; the pattern is the same. Here I derive a Gift type of document from the Document type.

Figure 1. Deriving a Gift type from the Document type.

The usage pattern is the same as it is for Person and Document, because they are all data.

C#

// Access the WinFS store.
using (WinFSData wd = new WinFSData())
{
    // Create a gift.
    Gift gift = new Gift(wd.GetRootItem());
    gift.DisplayName = "MyGift1";
    
    // Save the gift.
    wd.SaveChanges();
}

Actually, I want to do all the typical things one does with data, such as create, retrieve, update, delete, save, share, synchronize, copy, move, restrict access, respond to notifications, etc. Well, with WinFS you get these common behaviors by using the WinFS API.

Unifying data has big payback to programmers!

Because applications can build on WinFS-defined data types, applications can share data—assuming the correct permissions. This is good for you as a programmer, as well as for the end user of your application.

As a programmer, you can reuse data the end user already has in the WinFS store. For example, assume your application advertises events that a user could attend, like a bike race or a rodeo.

Before presenting an event to the user, your application can check the user's calendar. If the user already has a meeting scheduled that would conflict with the event, the application can display the event differently, such as by making the border of the invitation red.
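The conflict check itself is a small piece of logic once the calendar data is reachable. The sketch below assumes the meetings have already been read into simple TimeSlot objects; TimeSlot, ConflictCheck, and the border color names are all hypothetical, and how the meetings are retrieved from the store is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical start/end pair for an event or a meeting.
public class TimeSlot
{
    public DateTime Start;
    public DateTime End;

    public TimeSlot(DateTime start, DateTime end)
    {
        Start = start;
        End = end;
    }
}

public static class ConflictCheck
{
    // Two slots overlap when each one starts before the other ends.
    // Slots that merely touch (one ends as the other starts) do not conflict.
    public static bool Overlaps(TimeSlot a, TimeSlot b)
    {
        return a.Start < b.End && b.Start < a.End;
    }

    // True if the candidate event collides with any existing meeting.
    public static bool HasConflict(TimeSlot candidate, IEnumerable<TimeSlot> meetings)
    {
        foreach (TimeSlot meeting in meetings)
        {
            if (Overlaps(candidate, meeting))
            {
                return true;
            }
        }
        return false;
    }

    // Picks how to draw the invitation; the color names are placeholders.
    public static string BorderColorFor(TimeSlot candidate, IEnumerable<TimeSlot> meetings)
    {
        return HasConflict(candidate, meetings) ? "Red" : "Gray";
    }
}
```

With a race from 10:00 to 12:00 and a meeting from 11:00 to 11:30 on the same day, BorderColorFor returns "Red". A meeting that starts at 12:00 exactly does not count as a conflict under this rule.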

You can also populate data into the WinFS store to be used by other programs. For example, if the end user accepts the invitation to the bicycle race, their personal information manager (PIM) application, such as their e-mail client, could show that date as reserved in their calendar. This also gives the end user a rich user experience in both applications.

Programmers are tired of recreating the same types, like full name, and users are tired of populating the data into each variation of full name. For example, the Human Resources enterprise resource planning (ERP) software at work tracks my emergency contact, but I have to enter this into the eCommerce site I use when I register for events, like a bicycle race. In addition, I had to enter the same information again as my financial beneficiary field at my stock trading site. This is a pain point for end users, and often when you are designing an application, you request less information than you would like so that your application does not burden the user with data input. This comes up all the time with registration forms, which often end up with an address of "asdf".

As a developer, from a data mining point of view, it is annoying when the user types slightly different names and there is no way to know if these refer to the same person or not.

Once you stop thinking of data as all these formats, and just think of it as data, you begin to have higher expectations of it. With WinFS you can just think of this as data, and manipulate it as you would with any data, such as putting it on a USB drive to take with you.

Complex software generally has many settings. Windows, for example, enables you to view a folder's contents in several ways, such as thumbnails, tiles, icons, lists, or detail views. You can select the view you prefer and this setting is persisted from one session to the next. With WinFS, organization can be persisted from one application to the next. For example, if I store person objects in WinFS representing my family members and college friends, an e-mail application should be able to present information from my family members differently than information from my college friends. I would prefer that if I synchronize data, the e-mail from my family members is prioritized higher than e-mail from my college friends, so that if disk space is limited the family e-mail gets synchronized first.

Common behaviors enable new scenarios. One that occurs to me is a cached mode. Outlook has a cached mode where you don't need to be connected to the network to work with e-mail. Many network-based applications, like eCommerce sites, don't have this facility. When I travel, I'd like to shop at an eCommerce site without a network connection. If these eCommerce sites stored data in WinFS, it would be simple to write a local cache manager application that synchronized the results to the eCommerce site when the network connection was restored. This is because the WinFS API includes sync functionality. There is also a performance aspect of caching. Caching data locally enables the great hardware available in PCs, such as multi-core processors, to manipulate the data rather than performing the crunching on a remote server. For example, when you perform a complex group-by operation or large sort, doing this locally can be significantly more responsive than performing the manipulation on a remote server and then moving the sorted data locally.
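To give a feel for the cached-mode idea, here is a minimal sketch of a local queue that holds orders while offline and sends them once a connection is available. It does not use the WinFS sync functionality; OfflineCart and its members are hypothetical, and the send step is passed in as a delegate so that the example makes no assumption about any particular site's API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical local cache of orders placed while disconnected.
public class OfflineCart
{
    private readonly Queue<string> pending = new Queue<string>();

    public int PendingCount
    {
        get { return pending.Count; }
    }

    // Record the order locally; no network is needed.
    public void Add(string order)
    {
        pending.Enqueue(order);
    }

    // When online, send queued orders in the order they were placed.
    // Returns the number of orders sent.
    public int Synchronize(bool isOnline, Action<string> send)
    {
        if (!isOnline)
        {
            return 0;
        }

        int sent = 0;
        while (pending.Count > 0)
        {
            // Peek first so the order stays queued if send throws.
            send(pending.Peek());
            pending.Dequeue();
            sent++;
        }
        return sent;
    }
}
```

While the connection is down, Synchronize returns 0 and the orders stay in the queue. A real cache manager would also have to persist the queue to disk and reconcile changes made on the server, which is the kind of work a shared sync facility is meant to take off your hands.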

Once you store data in WinFS and work with WinFS types, you don't have to rewrite code for each type of data.

Right now, most data is owned by the application, not the user. Once data is freed from the format, the user can get at a lot more data. Users need richer ways to organize, manage, share, synchronize, and generally work with their data. WinFS provides these facilities as well. The next column in this series will cover various ways of organizing data with WinFS. Stay tuned as the WinFS adventure continues!

Summary

This article covers the idea of unifying data. It describes how data is not unified today and some of the issues with this. Although a few aspects of WinFS are undocumented, you can read more about it here:

The WinFS Files:

Sean Grimaldi is currently most interested in data access, WinFS, and racing cyclocross. Sean has worked on WinFS since 2002. You can reach Sean at sgrimald@microsoft.com.