· DM algorithms that come with SSAS
o Decision Trees Algorithm: Uses the values, or states, of the designated “input columns” to predict the states of the column designated as “predictable”. It identifies the attribute tree that best predicts the result, allows for interplay between attributes, and provides a hierarchy of attribute definitions that can be used to make a decision.
o Clustering Algorithm: Groups cases that have similar characteristics. Identifies how the data forms subgroups and how these subgroups differ from each other. Finds patterns without a specific target result.
o Naive Bayes Algorithm: Identifies the attribute that is most likely to predict the result. Less computationally intense than the other algorithms, so it is useful for quickly generating a DMM to discover relationships between input columns and predictable columns. Use it for initial exploration of the data, and then apply the results to create additional DMMs with other algorithms that are more computationally intense and more sophisticated.
o Association Algorithm: Association models are built on datasets that contain identifiers both for individual cases and for the items that the cases contain. A group of items in a case is an item set. An association model is made up of a series of item sets and the rules that describe how those items are grouped together within the cases. The rules that the algorithm identifies can be used to predict a customer's likely future purchases, based on the items that already exist in the customer's shopping cart. In short, it identifies the groups of items that occur together in a transaction.
o Sequence Clustering Algorithm: Identifies the event that is likely to happen next. Takes a sequence of events as input and is well suited to clickstream analysis. Similar to the Clustering Algorithm; however, instead of finding clusters of cases that contain similar attributes, this algorithm finds clusters of cases that contain similar paths in a sequence.
o Time Series Algorithm: Used for predicting continuous columns such as product sales. While other Microsoft algorithms create models that rely on input columns to predict the predictable column, a time series model bases its forecast only on the trends that the algorithm derives from the original dataset. In short, it identifies the trends in the current data and projects them into the future.
o Neural Network Algorithm: Similar to the Decision Trees algorithm in that it calculates probabilities for each possible state of the input attribute when given each state of the predictable attribute. It does not build an attribute tree, however; it combines the states of many input attributes in a network, so more than 2 attributes are analyzed at a time.
o Logistic Regression Algorithm: A variation of the Neural Network algorithm in which the HIDDEN_NODE_RATIO parameter is set to 0. This setting creates a neural network model that has no hidden layer and is therefore equivalent to logistic regression.
o Linear Regression Algorithm: A variation of the Decision Trees algorithm in which the MINIMUM_LEAF_CASES parameter is set to be greater than or equal to the total number of cases in the dataset that the algorithm uses to train. With this setting the algorithm never creates a split, and therefore performs a linear regression.
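The Linear Regression entry above can be illustrated with a small Python sketch. It is not the SSAS implementation: the function names, the median split rule, and the parameter name `minimum_leaf_cases` (borrowed from MINIMUM_LEAF_CASES) are assumptions made for illustration. It only shows why a regression tree that requires every leaf to hold at least as many cases as the whole training set can never split, and so reduces to one fitted line.

```python
from statistics import mean


def fit_line(cases):
    """Ordinary least squares for y = intercept + slope * x on (x, y) pairs."""
    xs = [x for x, _ in cases]
    ys = [y for _, y in cases]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in cases)
    slope = 0.0 if sxx == 0 else sxy / sxx
    return {"slope": slope, "intercept": y_bar - slope * x_bar}


def build_tree(cases, minimum_leaf_cases):
    """Hypothetical regression tree: split at the median x only when
    both children would keep at least minimum_leaf_cases cases."""
    ordered = sorted(cases)
    mid = len(ordered) // 2
    left, right = ordered[:mid], ordered[mid:]
    if len(left) < minimum_leaf_cases or len(right) < minimum_leaf_cases:
        # No legal split: this node is a leaf holding one regression formula.
        return {"leaf": True, "n": len(cases), **fit_line(cases)}
    return {
        "leaf": False,
        "threshold": right[0][0],
        "left": build_tree(left, minimum_leaf_cases),
        "right": build_tree(right, minimum_leaf_cases),
    }


# Made-up training data lying exactly on y = 2x + 1.
cases = [(float(x), 2.0 * x + 1.0) for x in range(10)]

# Threshold >= total number of cases: the root can never split.
regression_model = build_tree(cases, minimum_leaf_cases=len(cases))

# A smaller threshold allows splits, giving an ordinary tree.
tree_model = build_tree(cases, minimum_leaf_cases=3)
```

With the threshold equal to the number of cases, `regression_model` is a single leaf whose slope and intercept describe the whole dataset; with the smaller threshold, `tree_model` has a split at the root.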
DM strategies - 2 main kinds of models: predictive & descriptive
- Predictive Models - classification, regression, time series analysis, prediction. Can be used to forecast explicit values, based on patterns determined from known results.
o Classification algorithms - predict one or more discrete variables, based on the other attributes in the dataset. E.g. Decision Trees Algorithm.
o Regression algorithms - predict one or more continuous variables, based on other attributes in the dataset. E.g. Linear Regression Algorithm.
o Time Series algorithms - forecast patterns based on the current set of continuous predictable attributes. E.g. Time Series Algorithm.
o Prediction - the estimation of future outcomes. Works on a continuous attribute set. E.g. Time Series and Decision Trees Algorithms.
- Descriptive Models - clustering, summarization, association rules, sequence discovery. Describe patterns in existing data, and are generally used to create meaningful subgroups such as demographic clusters.
o Segmentation algorithms - divide data into groups, or clusters, of items that have similar properties. E.g. Clustering Algorithm.
o Summarization algorithms - similar to clustering algorithms, but instead of only grouping the data they quantify the members of each group, for example that group 1 contains more line items than the other groups and is the most likely to occur. E.g. Clustering Algorithm.
o Association algorithms - find correlations between different attributes in a dataset. Most commonly used to create association rules, which can be used in a market basket analysis. E.g. Association Algorithm.
o Sequence analysis algorithms - summarize frequent sequences or episodes in data, such as a Web path flow. E.g. Sequence Clustering Algorithm.
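The association rules used for market basket analysis can be sketched in Python as follows. This is a minimal illustration, not the Microsoft Association algorithm: the transactions are made up, the function names and the `min_confidence` threshold are hypothetical, and it scores rules only with the standard support and confidence measures.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(antecedent | consequent, transactions) / base


def recommend(cart, transactions, min_confidence=0.5):
    """Suggest items not yet in the cart, strongest rule first."""
    candidates = sorted(set().union(*transactions) - cart)
    scored = [(confidence(cart, {item}, transactions), item) for item in candidates]
    scored.sort(key=lambda pair: -pair[0])
    return [item for score, item in scored if score >= min_confidence]


# Made-up cases: each set is the item set of one transaction.
transactions = [
    {"bike", "helmet", "gloves"},
    {"bike", "helmet"},
    {"helmet", "gloves"},
    {"bike", "bottle"},
]

suggestions = recommend({"bike"}, transactions)
```

Here `suggestions` holds the items whose rule "cart -> item" meets the confidence threshold, which is the shopping-cart use of the rules described above.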
- Choosing the right algorithm to use for a specific business task can be a challenge. While you can use different algorithms to perform the same business task, each algorithm produces a different result, and some algorithms can produce more than one type of result. E.g. you can use the Microsoft Decision Trees algorithm not only for prediction, but also as a way to reduce the number of columns in a dataset, because the decision tree can identify columns that do not affect the final DMM.
- Combining algorithms - Because different algorithms produce different results for the same business task, build DMMs with more than one algorithm and compare them; lift charts check the accuracy of the DMMs once they are built on the input data. Choose the algorithm based on accuracy and on the business need. Use algorithms together: use some algorithms to explore data, and then use other algorithms to predict a specific outcome based on that data. E.g. you can use a clustering algorithm, which recognizes patterns, to break data into groups that are more or less homogeneous, and then use the results to create a better decision tree model. Use multiple algorithms within one solution to perform separate tasks: e.g. a regression tree algorithm can be used to obtain financial forecasting information, and a rule-based algorithm to perform a market basket analysis.
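The comparison of models by lift can be sketched in Python. This is an illustrative calculation under stated assumptions, not how SSAS draws its lift chart: the scored cases are made up, and `lift_at` is a hypothetical helper that uses the common definition of lift (response rate among the top-scored fraction of cases divided by the overall response rate).

```python
def lift_at(scored_cases, fraction):
    """Lift for the top `fraction` of cases ranked by predicted probability.

    scored_cases: list of (predicted_probability, actual_outcome),
    where actual_outcome is 1 for the target state and 0 otherwise.
    """
    ranked = sorted(scored_cases, key=lambda case: case[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    overall_rate = sum(actual for _, actual in ranked) / len(ranked)
    if overall_rate == 0:
        raise ValueError("no positive cases, lift is undefined")
    top_rate = sum(actual for _, actual in ranked[:top_n]) / top_n
    return top_rate / overall_rate


# Made-up predictions from two hypothetical models on the same 10 test
# cases, 4 of which actually have the target outcome.
model_a = [(0.9, 1), (0.8, 1), (0.7, 1), (0.6, 0), (0.5, 1),
           (0.4, 0), (0.3, 0), (0.2, 0), (0.1, 0), (0.05, 0)]
model_b = [(0.9, 0), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
           (0.4, 1), (0.3, 0), (0.2, 1), (0.1, 0), (0.05, 0)]

lift_a = lift_at(model_a, 0.3)
lift_b = lift_at(model_b, 0.3)
better_model = "model_a" if lift_a > lift_b else "model_b"
```

A lift above 1 means the model concentrates the target cases in its top-ranked group better than random selection would; accuracy is only one input to the choice, alongside the business need.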
- The bottom line - Task & algorithms to use
o Predicting a sequence. E.g. to perform a clickstream analysis of a company's Web site. Use: Sequence Clustering
o Finding groups of common items in transactions. E.g. to use market basket analysis to suggest additional products to a customer for purchase. Use: Association, Decision Trees
o Finding groups of similar items. E.g. to segment demographic data into groups to better understand the relationships between attributes. Use: Clustering, Sequence Clustering
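For the "predicting a sequence" task, the idea of finding the event most likely to happen next in a clickstream can be sketched in Python. This is a deliberately simplified illustration that only counts page-to-page transitions; it leaves out the clustering of similar paths that Sequence Clustering performs, and the page names and function names are made up.

```python
from collections import Counter, defaultdict


def transition_counts(paths):
    """Count how often each page is followed directly by each other page."""
    counts = defaultdict(Counter)
    for path in paths:
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, page):
    """Most frequently observed next page, or None if the page was never left."""
    if page not in counts:
        return None
    return counts[page].most_common(1)[0][0]


# Made-up click paths, one list per site visit.
paths = [
    ["home", "products", "cart", "checkout"],
    ["home", "products", "reviews"],
    ["home", "search", "products", "cart"],
    ["home", "products", "cart", "checkout"],
]

counts = transition_counts(paths)
next_after_products = predict_next(counts, "products")
next_after_checkout = predict_next(counts, "checkout")
```

The counts act as an estimate of transition probabilities, so the prediction for a page is simply its most common successor in the recorded visits.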