The Microsoft Decision Trees algorithm builds a data mining model by creating a series of splits in the tree. These splits are represented as nodes. The algorithm adds a node to the model every time that an input column is found to be significantly correlated with the predictable column. The way that the algorithm determines a split is different depending on whether it is predicting a continuous column or a discrete column.
The Microsoft Decision Trees algorithm uses feature selection to guide the selection of the most useful attributes. Feature selection is used by all Analysis Services data mining algorithms to improve performance and the quality of analysis. Feature selection is important to prevent unimportant attributes from using processor time. If you use too many input or predictable attributes when you design a data mining model, the model can take a very long time to process, or even run out of memory. Methods used to determine whether to split the tree include industry-standard metrics for entropy and Bayesian networks. For more information about the methods used to select meaningful attributes and then score and rank the attributes, see Feature Selection in Data Mining.
A common problem in data mining models is that the model becomes too sensitive to small differences in the training data, in which case it said to be over-fitted or over-trained. An overfitted model cannot be generalized to other data sets. To avoid overfitting on any particular set of data, the Microsoft Decision Trees algorithm uses techniques for controlling the growth of the tree. For a more in-depth explanation of how the Microsoft Decision Trees algorithm works, see Microsoft Decision Trees Algorithm Technical Reference.
Predicting Discrete Columns
The way that the Microsoft Decision Trees algorithm builds a tree for a discrete predictable column can be demonstrated by using a histogram. The following diagram shows a histogram that plots a predictable column, Bike Buyers, against an input column, Age. The histogram shows that the age of a person helps distinguish whether that person will purchase a bicycle.
The correlation that is shown in the diagram would cause the Microsoft Decision Trees algorithm to create a new node in the model.
As the algorithm adds new nodes to a model, a tree structure is formed. The top node of the tree describes the breakdown of the predictable column for the overall population of customers. As the model continues to grow, the algorithm considers all columns.
Predicting Continuous Columns
When the Microsoft Decision Trees algorithm builds a tree based on a continuous predictable column, each node contains a regression formula. A split occurs at a point of non-linearity in the regression formula. For example, consider the following diagram.
The diagram contains data that can be modeled either by using a single line or by using two connected lines. However, a single line would do a poor job of representing the data. Instead, if you use two lines, the model will do a much better job of approximating the data. The point where the two lines come together is the point of non-linearity, and is the point where a node in a decision tree model would split. For example, the node that corresponds to the point of non-linearity in the previous graph could be represented by the following diagram. The two equations represent the regression equations for the two lines.