Say What?

Make Sense of Your Web Feedback using SQL Server 2005

Peter Pyungchul Kim

This article is based on a prerelease version of SQL Server 2005. All information contained herein is subject to change.

This article discusses:

  • How to automate classification of text-based data
  • How to use SQL Server 2005 components to perform manual and automatic text classification
  • Performing term extraction and data preparation
  • Mining model development and refinement
This article uses the following technologies:
SQL Server 2005

Contents

SQL Server 2005 Data Mining
Preparing Data for Manual Classification
Mining Models Development
Using and Refining the Mining Models
Getting the Best Result
Defining the Model
Testing and Further Refinement
Deploying the Model
Conclusion

The interactive nature of the Web has made it a great medium for data collection. As Web sites have evolved beyond click and buy, the solicitation of feedback has held the promise of greater insight into the customer and her needs, necessary product improvements, and future directions. But collecting reams of text-based data is not the same as gleaning useful information from it. In a typical customer feedback Web page, users enter comments about your product (see Figure 1). Some fields present preset answers users can choose from. Other fields let users type in any comments they like. This sort of free-form comment is especially difficult to put to good use because it must be divided into classifications if it is to be analyzed, quantified, and used for decision making.

Figure 1 Customer Feedback Web Page

The task of dividing text into categories is called text classification—a process that involves identifying the relationship between the categories and the text (words and phrases). Text classification can be manual or automatic. In the manual approach, a group of domain experts analyze the feedback and assign it to categories manually. Various software tools and technologies can help with this task. In the automatic approach, patterns between the feedback and its category are extracted from previously categorized results and used to classify new feedback without human intervention.

The SQL Server™ 2005 data mining functionality can be used for both manual and automatic classification. In this article, I'll cover both and show you how similar practices can be applied to any type of text classification problem, including document classification and blog grouping.

A general Web feedback system typically involves more than just text classification. It consists of a Web feedback front end that collects user feedback and stores it in a database, a GUI-driven classification component that lets a reviewer manage a hierarchy of categories and manually or automatically classify the new feedback, and a reporting component that generates various trend analysis reports. In this article, I'll focus specifically on how the SQL Server 2005 data mining features can be used to develop the core functionality of the text classification component.

SQL Server 2005 Data Mining

Data mining is a technology that enables you to uncover patterns in your data. Detecting relationships between the categories and the text in the Web feedback is a basic data mining task. For instance, if a piece of feedback contains the words "virus" and "Windows Update," a data mining model can classify the feedback in the security category with a high degree of confidence.

Microsoft introduced its first data mining features in SQL Server 2000 as part of Analysis Services. There has since been a huge investment made to equip SQL Server 2005 with a full range of data mining functionality for enterprise customers. Just looking at the surface, you can see that five new algorithms and more than a dozen visualizations have been added, along with additions to the Data Mining Extensions to SQL (DMX) language. You'll also find other great improvements such as programmability models and integration with other components such as online analytical processing (OLAP), Integration Services (formerly known as Data Transformation Services, or DTS), and Reporting Services.

In particular, it should be noted that Integration Services has come of age and is ready for data mining. The new data flow environment is equipped with dozens of building blocks (called transforms) that can help prepare data for data mining. These transforms perform such tasks as sampling, fuzzy grouping/lookup, split/merge/join, adding a derived column, sorting, as well as extraction and loading between data sources. Among the transforms, term extraction and term lookup are the two that I'll be using extensively in this article to prepare data for mining based on text such as Web feedback.

Preparing Data for Manual Classification

To perform manual classification, a group of domain experts must analyze feedback (data) and assign it to categories manually. SQL Server 2005 data mining techniques can then be applied to the feedback to obtain deeper insights from all those comments. In order to perform SQL Server 2005 data mining on the text, you first need to transform the text into term vectors. Producing term vectors from feedback involves two steps: building a dictionary and generating term vectors. In a term vector, text is represented as a set of terms and their frequencies. The term vector preparation can best be performed through an Integration Services project inside Business Intelligence Development Studio.

In order to identify terms in the text so that you can analyze the feedback, you need a dictionary of meaningful terms (words and phrases) for the problem domain. You can build a dictionary from existing Web feedback by using the Term Extraction transform. This transform parses language-specific grammar and extracts words and phrases that are potentially representative. The user can control the behavior of the transform by specifying the minimum frequency and the maximum length of a term (in characters), and by optionally supplying user-defined stop words to ignore certain terms.
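The dictionary produced by the transform is simply a table of terms. Here's a minimal sketch of such a table, based on the Term and Score columns the Term Extraction transform outputs (the table name TermDictionary is illustrative):

CREATE TABLE TermDictionary (
    Term  NVARCHAR(50) NOT NULL PRIMARY KEY, -- word or phrase extracted from the feedback
    Score FLOAT        NOT NULL              -- frequency (or TFIDF) computed by the transform
)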

Although the same dictionary can be reused throughout the classification process, it is certainly possible that the dictionary will lose some relevance over time as new issues are introduced in the Web feedback. Therefore the dictionary must be rebuilt accordingly. This rebuild process can be as simple as rerunning the Integration Services package against a new set of Web feedback.

The term vector table generated from the dictionary terms contains tuples of <CaseKey, Term, Frequency>. CaseKey is the unique identifier of the feedback, and it will be used to pivot terms for each feedback item. The Frequency column contains the number of occurrences of each term in the customer feedback.
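A minimal sketch of the term vector table follows; the column types are assumptions and should match the key type of your WebFeedback table:

CREATE TABLE TermVectors (
    CaseKey   INT          NOT NULL, -- identifies the feedback item
    Term      NVARCHAR(50) NOT NULL, -- dictionary term found in the text
    Frequency INT          NOT NULL, -- occurrences of the term in this feedback item
    PRIMARY KEY (CaseKey, Term)
)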

The term vector table can be populated by using the Term Lookup transform. An Integration Services package can feed the Description fields from the WebFeedback table into the Term Lookup transform to get back the aforementioned tuples of <CaseKey, Term, Frequency> and then save them to the TermVectors table. The Term Lookup transform is configured to use the dictionary created in the previous step to identify terms.

Mining Models Development

Mining models can best be developed through an Analysis Services project inside Business Intelligence Development Studio. The process involves creating a data source, a data source view from the data source, and a mining structure that may contain several mining models (for information on creating mining models programmatically, see Liu Tang and Paul Bradley's article on the subject in this issue of MSDN® Magazine).

In order to build data mining models, you need to identify and model the case. A case is an entity from which you want to derive patterns; in this example, it's a piece of Web feedback. In SQL Server 2005 data mining, a case is represented by a set of columns similar to those in any relational table. Unlike a relational table, however, a case can also have a nested table column to model one-to-many relationships. For instance, a shopping basket can be a case with a nested table column containing a list of products purchased. The text of Web feedback, such as the case in this example, can be viewed as a set of terms. That is, you can use a nested table column to model the text of the feedback.

Developing mining models in SQL Server 2005 typically takes at least three steps: creating a mining structure, defining multiple mining models inside the mining structure, and processing the models. A mining structure not only provides a structural definition of the cases being modeled, but is also a container for multiple models that share the same cases. A mining structure can be created based on tables defined in a data source view.

Figure 2 Data Source View

A data source view in an Analysis Services project provides a high-level view of tables and views from a data source. Typically, a data source view is used to explore relationships among the tables and views and perform simple exploratory analysis such as obtaining global distribution, pivoting, and so on. A high-level object such as a cube or a mining structure can be created on top of tables and views from a data source view. Figure 2 shows a data source view created for the data mining of manual classification of Web feedback. It contains only the TermVectors table prepared in the previous step because you'll be using it as both a case-level and nested-level table. Note that this is possible only when there is no case-level column other than the key. You will see a data source view that contains separate tables for case level and nested level in the automatic classification task, which I'll explore later.

The first screen in Figure 3 shows a dialog from the Data Mining Wizard where you select TermVectors as both the case table and the nested table. In the next dialog, the Wizard lets you define a mining structure by selecting the usage type of each column. As shown in the second screen in Figure 3, CaseKey of the TermVectors table is being used as the key of the case and Term as the key of a nested column (Term is unique in the rows pivoted by CaseKey from the TermVectors table). For the sake of simplicity, the mining structure doesn't include the Frequency column.

Figure 3 Data Mining Wizard Dialogs

After you're finished with the Wizard, you can continue to add more candidate mining models to the mining structure. The SQL Server 2005 data mining component provides two unsupervised algorithms to help with manual text classification: Microsoft_Clustering and Microsoft_Association_Rules. Figure 4 shows the two models created inside the mining structure. Note that the first model (WFeedback_CL) was created as part of the Data Mining Wizard, and the second model (WFeedback_AR) was added to the mining structure later.

Figure 4 Data Mining Models for Manual Classification of Web Feedback
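For reference, here is a sketch of what the two models amount to in DMX. (In the designer, both models share a single mining structure; each CREATE MINING MODEL statement below would create its own implicit structure, so treat this as an illustration of the column layout rather than an exact equivalent of the project.)

CREATE MINING MODEL WFeedback_CL (
    CaseKey LONG KEY,
    TermVectors TABLE ( -- nested table: each feedback item is a set of terms
        Term TEXT KEY
    )
) USING Microsoft_Clustering

CREATE MINING MODEL WFeedback_AR (
    CaseKey LONG KEY,
    TermVectors TABLE (
        Term TEXT KEY
    )
) USING Microsoft_Association_Rules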

Deploying the project will create and process the mining structure in the specified Analysis Services server. Note that the processing time is highly dependent on the complexity of the models, as defined by the number of pieces of feedback, the number of distinct terms, and the set of parameters used in the models.

Using and Refining the Mining Models

The Microsoft_Clustering model, WFeedback_CL, helps the reviewer who is performing the manual classification to automatically identify clusters (groups) of cases that are similar to each other based on the terms. The clusters are often good candidates for categories. Using the cluster viewer, the reviewer can analyze how each cluster is different from the others based on the terms used. For instance, Figure 5 shows discrimination information for Cluster 2 and other clusters. While terms like SCSI, motherboard, monitor, controller, and hard drive favor Cluster 2, terms like image and expo favor other clusters. You can apply similar analysis to other clusters. Using this technique, you or the reviewer may decide that Cluster 2 is a category under hardware issues (or create a new one if necessary) and then assign the cases in the cluster to that category.

Figure 5 Cluster Viewer for WFeedback_CL

Assigning the cases in a cluster to a category can be accomplished in three steps. First, enable the DRILLTHROUGH option of the model using the property editor and reprocess the model. Second, obtain the CaseKeys of the feedback in the cluster using the following statement in DMX (the SQL Server data mining language):

SELECT CaseKey FROM WFeedback_CL.CASES WHERE IsInNode('Cluster 2')

Third, update the category column of the feedback selected using the SQL statement shown in the following code:

UPDATE WebFeedback SET Category = 'hardware issues' WHERE CaseKey IN (<list of CaseKeys selected in the previous step>)

The last two steps can be performed in a single SQL statement by using OPENQUERY after setting up a linked server to the Analysis Services server, as shown below. Refer to SQL Server Books Online for detailed information on how to set up a linked server.
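For example, assuming a linked server named DMServer that points to the Analysis Services instance (the name is an assumption), the combined statement might look like this:

UPDATE wf
SET    wf.Category = 'hardware issues'
FROM   WebFeedback wf
JOIN   OPENQUERY(DMServer,
         'SELECT CaseKey FROM WFeedback_CL.CASES WHERE IsInNode(''Cluster 2'')') dm
  ON   wf.CaseKey = dm.CaseKey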

Figure 6 WFeedback_AR in the Dependency Network Viewer

The Microsoft_Association_Rules model, WFeedback_AR, can be used to analyze associations among the terms. Strong association among terms often indicates a distinctive subject in the feedback. For instance, Figure 6 shows a portion of the dependency network of the model WFeedback_AR. It shows groups of terms that appear together in the same feedback with a certain degree of confidence. From the groups, the reviewer may learn that there are cases about security issues, hardware issues, and My MSN access issues. The reviewer may decide to make each group a separate category, or may want to narrow the scope to only the cases that contain the terms in a group in the dependency network. You can easily identify the cases that contain certain terms using the SQL command shown in the following code snippet:

SELECT DISTINCT wf.* FROM WebFeedback wf JOIN TermVectors tv ON wf.CaseKey = tv.CaseKey WHERE tv.Term IN (<list of terms>)

Getting the Best Result

Often the initial models don't yield the best results because the default parameters may not suit the particular data set. An iterative refinement process is almost always required to produce the best model. The refinement process is a loop of steps such as browsing the model content, adjusting model parameters, and reprocessing the model until you obtain the most accurate model. Since the two models built in this section are unsupervised (rather than accuracy-driven), deciding which one is best is a subjective call; as shown in the previous section, a model that gives good insights into the feedback should be considered a good model. Each algorithm provides a different set of parameters to adjust. Figure 7 shows a partial list of parameters for the Microsoft_Clustering and Microsoft_Association_Rules algorithms.

Figure 7 Some Algorithm Parameters

Microsoft_Clustering

CLUSTER_COUNT: Specifies the maximum number of clusters to be built by the algorithm. If set to 0, the algorithm automatically determines the best number of clusters to build for predictive analysis; however, this may result in too many clusters or clusters too large for analysis.

MAXIMUM_INPUT_ATTRIBUTES: Limits the number of input attributes (such as distinct terms) considered by the model. The default is 255, which may be too small for Web feedback classification. It may have to be adjusted to a larger number or to 0 to disable the limitation. Note, however, that increasing or removing this limitation will lengthen the time needed to process the model.

Microsoft_Association_Rules

MINIMUM_ITEM_SIZE and MAXIMUM_ITEM_SIZE: Specify the minimum and maximum number of items allowed in an itemset. Be sure to set these parameters to include itemsets of size 1 and size 2, since the dependency network is generated from such itemsets.

MINIMUM_SUPPORT and MAXIMUM_SUPPORT: Specify the minimum and maximum number of cases in which an itemset must appear before a rule is generated. Setting this window too high will result in too few rules, while setting it too low will result in too many rules, longer processing time, and more memory consumption.

MINIMUM_PROBABILITY and MAXIMUM_PROBABILITY: Specify the minimum and maximum probability that a particular rule is true. Too high a probability often yields only obvious rules, while too low a probability yields too many trivial rules.
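In the designer, these parameters are set in the model's property editor; in DMX, they can be supplied in the USING clause. Here is a sketch with illustrative (not recommended) values and a hypothetical model name:

CREATE MINING MODEL WFeedback_CL_Tuned (
    CaseKey LONG KEY,
    TermVectors TABLE (
        Term TEXT KEY
    )
) USING Microsoft_Clustering (CLUSTER_COUNT = 20, MAXIMUM_INPUT_ATTRIBUTES = 0)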

Data mining technology can also be used to automate the text classification process completely. To start, you need a set of Web feedback that has already been categorized manually. A mining model built on such feedback can uncover patterns between the text (along with any demographic information that accompanies the feedback) and the categories. The patterns can then be applied in order to classify new feedback automatically.

Data preparation for automatic classification is similar to that for manual classification except that it requires another step: sample preparation. That is, in addition to preparing the dictionary and term vectors, you also need to prepare multiple samples for different purposes in supervised (prediction-driven) modeling. Note that you do not need all of the feedback to train a model. A relatively small random sample is good enough as long as it contains all the significant patterns that could be found in the feedback as a whole. Although there is no strict rule for determining the right sample size, a sample of 5,000 cases is sufficient in many text classification applications. Alternatively, you can start with a small sample and keep increasing its size until the accuracy ceases to improve.

As in the manual classification, you can stage the Web feedback to a SQL Server database using the Integration Services Import/Export wizards, or a manually created Integration Services package that loads all of the feedback, performs sampling, and saves the samples to a staging database.

You'll need to have multiple samples of cases to use for different purposes throughout the data mining process. Typically, they should include train and test sample cases. Train sample cases are used to train and refine models. Test sample cases are typically set aside to be used to test and validate models. It is strongly recommended that test sample cases don't overlap with train sample cases. Otherwise, the model tends to learn too much detail that may be specific only to the train sample cases. This is called overfitting.

Figure 8 Integration Services Package

Note that sampling should be applied to cases rather than terms. Sampling cases doesn't distort the patterns, as long as the same or similar patterns appear across the cases that are sampled. Sampling terms, however, effectively changes the content (behavior) of the feedback. Figure 8 shows an Integration Services package used to create the two samples. Seventy percent of the total cases collected at the initial data staging step are used for training; the remaining thirty percent are used for testing.
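If you'd rather perform the split in T-SQL than in the package, a minimal sketch follows (it assumes the staged WebFeedback table and produces the same 70/30 split; the NOT EXISTS clause guarantees that the two samples don't overlap):

SELECT CaseKey, Description, Category
INTO   TrainSamples
FROM   WebFeedback
WHERE  ABS(CHECKSUM(NEWID())) % 100 < 70  -- roughly 70 percent of the cases

SELECT CaseKey, Description, Category
INTO   TestSamples
FROM   WebFeedback wf
WHERE  NOT EXISTS (SELECT 1 FROM TrainSamples t WHERE t.CaseKey = wf.CaseKey)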

Figure 9 Mining Data Source View

Figure 9 shows a data source view created for the data mining of automatic classification of Web feedback. The TermVectors table is the same as the one shown in the manual classification section except that it is now generated from the selected samples. Each sample row consists of a key (CaseKey), the feedback text (Description), and a category (Category). The data source view contains the two samples (TrainSamples and TestSamples, as you see in the figure).

Defining the Model

Unlike the mining structure in the manual classification section, where the TermVectors table was used as both the case and nested tables, here I use TrainSamples as the case table and TermVectors as the nested table (see the first screen in Figure 10), because this time a non-key column (Category) from the case table is needed. In the next step, the Data Mining Wizard presents a dialog where you define a mining model inside the structure by selecting the usage type for each column. As shown in the second screen in Figure 10, you are using CaseKey as the key of the case, TermVectors as input, and Category as output (here it's predictable, meaning it's the column that you will predict from the model).

Figure 10 Data Mining Wizard Dialogs

After you are finished with the Wizard, you can continue to add more candidate mining models to the mining structure. The SQL Server 2005 data mining component provides multiple algorithms to address automatic text classification problems. The algorithm that yields the best accuracy depends on the data set. In general, the following four algorithms should be considered as candidates for the best model: Microsoft_Neural_Network, Microsoft_Naive_Bayes, Microsoft_Decision_Trees, and Microsoft_Clustering.

Figure 11 shows the candidate models created inside the mining structure. Deploying the project will create and process the mining structure on the specified Analysis Services server. Note that the processing time is highly dependent on the complexity of the models, which is determined by the number of pieces of feedback and the number of distinct terms.

Figure 11 Candidate Models for Automatic Classification
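The wizard and deployment steps also have DMX equivalents. Here's a sketch for the neural network model, assuming a relational data source named WebData registered in the Analysis Services database; the INSERT INTO statement corresponds to processing (training) the model, and the ORDER BY clauses keep the case and nested rowsets aligned on CaseKey:

CREATE MINING MODEL WFeedback_NN (
    CaseKey  LONG KEY,
    Category TEXT DISCRETE PREDICT, -- the output column to be predicted
    TermVectors TABLE (
        Term TEXT KEY
    )
) USING Microsoft_Neural_Network

INSERT INTO WFeedback_NN (CaseKey, Category, TermVectors (SKIP, Term))
SHAPE {
    OPENQUERY(WebData, 'SELECT CaseKey, Category FROM TrainSamples ORDER BY CaseKey')
} APPEND ({
    OPENQUERY(WebData, 'SELECT CaseKey, Term FROM TermVectors ORDER BY CaseKey')
} RELATE CaseKey TO CaseKey) AS TermVectors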

Testing and Further Refinement

As I noted earlier, to get the most accurate model, you may have to continually refine the results. Figure 12 presents a partial list of parameters for each algorithm that may need to be adjusted for the Web feedback classification task.

Figure 12 Parameters to Tweak for Better Accuracy

Microsoft_Neural_Network

HIDDEN_NODE_RATIO: Controls the number of nodes in the hidden layer of the network. Setting this parameter to 0 makes the model behave as a pure logistic regression algorithm, which tends to produce good accuracy for the text classification task.

MAXIMUM_INPUT_ATTRIBUTES: Limits the number of input attributes (such as distinct terms) considered by the model. The default is 255, which may be too small for Web feedback classification. It may have to be adjusted to a larger number or to 0 to disable the limitation. Increasing or removing this limitation, however, will lengthen the time needed to process the model.

Microsoft_Naive_Bayes

MAXIMUM_INPUT_ATTRIBUTES: Same as for Microsoft_Neural_Network.

Microsoft_Decision_Trees

COMPLEXITY_PENALTY: Controls the complexity (depth) of the tree produced by the model; it is a number between 0.0 and 1.0. This parameter may need to be adjusted to a lower penalty (in other words, to learn more fine-grained patterns), because the patterns between the category and the terms in Web feedback tend to be more fine-grained than those in other, non-text data mining applications.

MINIMUM_LEAF_CASES: Indicates the minimum number of cases that a leaf node in the tree must contain. For the same reason described above, it may need to be changed to a number smaller than the default.

MAXIMUM_INPUT_ATTRIBUTES: Same as for Microsoft_Neural_Network.

Microsoft_Clustering

MAXIMUM_INPUT_ATTRIBUTES: Same as for Microsoft_Neural_Network.

Measuring the accuracy of the models can be achieved by using the test samples of Web feedback that were set aside in the data preparation step. SQL Server 2005 provides a lift chart for accuracy measurement. (Lift is a measure of the accuracy of a predictive model.) Figure 13 shows a dialog where the TestSamples table (for the case) and the TermVectors table (for the nested column) are used to produce a lift chart. Figure 14 shows the lift chart for all four models developed in the previous step, using the test samples. The lift chart is created as outlined in the following paragraphs.

Figure 13 Test Samples and Their Term Vectors

The data mining engine feeds each test sample to each mining model as input and obtains a prediction. The model not only predicts the most likely Category, but also returns a probability indicating how confident that prediction is. The higher the probability, the more likely the prediction is correct (that is, the same as the category recorded in the test sample).
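In DMX terms, this amounts to a batch prediction query like the following sketch (same WebData data source assumption as before):

SELECT t.CaseKey,
       Predict(Category)            AS PredictedCategory,
       PredictProbability(Category) AS Confidence
FROM   WFeedback_NN
PREDICTION JOIN
SHAPE {
    OPENQUERY(WebData, 'SELECT CaseKey FROM TestSamples ORDER BY CaseKey')
} APPEND ({
    OPENQUERY(WebData, 'SELECT CaseKey, Term FROM TermVectors ORDER BY CaseKey')
} RELATE CaseKey TO CaseKey) AS TermVectors AS t
ON WFeedback_NN.TermVectors.Term = t.TermVectors.Term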

Figure 14 Lift Chart Showing Accuracy of Models

Once all prediction results and their probabilities are obtained, they are sorted in decreasing order of probability. The graph then shows the percentage of correct predictions as the population grows toward 100 percent of the prediction results.

An ideal model would predict everything correctly; that is, the percentage of correct predictions equals the percentage of the population covered at any point. A random-guess model, on the other hand, would always show 1/N of the population, where N is the number of distinct categories, because the chance of a random prediction being correct is 1/N. Obviously, a model whose graph is closer to the ideal model's is better. Figure 14 shows that WFeedback_NN outperforms the other three models.

Deploying the Model

SQL Server 2005 data mining employs a client-server architecture in which all data mining models are hosted on the server. Deploying a model is as simple as placing the model on the server. It could even be done by developing the model on the deployment server directly. The model can also be migrated easily to a different server simply by using the DMX IMPORT and EXPORT commands.
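For example (the file path is illustrative; the IMPORT statement runs on the target server):

EXPORT MINING MODEL WFeedback_NN TO 'C:\Models\WFeedback_NN.abf'

IMPORT FROM 'C:\Models\WFeedback_NN.abf'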

Applications that perform Web feedback classification can be developed in two different ways. An operational classification system can set up an Integration Services package that produces a term vector for each new piece of Web feedback using a Term Lookup transform. The term vector can be saved in a temporary file and used in a DMX query that performs prediction for the category. Alternatively, a batch classification system can be implemented entirely within the Integration Services package, where a Term Lookup transform produces term vectors for the new Web feedback and a Data Mining Prediction transform produces a category for each piece of feedback.
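For the operational case, a singleton DMX query can supply the term vector inline instead of reading it from a staged table. Here's a sketch, where the two terms stand in for the Term Lookup output of one new feedback item:

SELECT Predict(Category)            AS PredictedCategory,
       PredictProbability(Category) AS Confidence
FROM   WFeedback_NN
NATURAL PREDICTION JOIN
(SELECT (SELECT 'virus' AS Term
         UNION SELECT 'Windows Update' AS Term) AS TermVectors) AS t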

Conclusion

Organizations can analyze Web feedback to help make business decisions and reach business goals such as increasing customer satisfaction and enhancing the quality of their products and services. The use of the techniques I've described is not limited to the text classification problems I've outlined; they can be applied to everything from document classification and patent categorization to memo grouping. The SQL Server 2005 data mining component is ready for prime time. In particular, the two Integration Services transforms (Term Extraction, Term Lookup) are very useful for producing term vectors from text. Multiple data mining algorithms combined with the nested table column functionality and lift chart from Analysis Services are the tools you need to develop the best models for making sense of your text-based data.

Peter Pyungchul Kim, Ph.D., is a developer on the SQL Server Data Mining team at Microsoft. He is a primary developer of the data mining language, DMX, and has architected several data mining algorithms, including decision trees, Naive Bayes, and neural networks, in SQL Server 2000 and SQL Server 2005.