Hi. Last week, we learned how to write

descriptive analytical SQL queries against our data warehouse.

This week, we will learn the process of data mining, its typical architecture,

and the impact of data mining on business intelligence, along with some algorithms.

Nowadays, large amounts of data from

everywhere flow at speeds and volumes never seen before,

and companies that have successfully used

analytics are obtaining extraordinary business results.

However, how can we best start analyzing such amounts of data

to gain the greatest business advantage?

Well, this is where predictive analytics,

data mining, machine learning,

and decision management come into play.

Data mining is a process of discovering patterns in

large data sets involving methods at the intersection of machine learning,

statistics and database systems.

As we have learned, data mining is used to improve OLAP analysis.

Data mining is also used within the process of knowledge discovery in databases (KDD).

Remember that there are several approaches to

data analysis: basic descriptive analysis,

often obtained through SQL programming;

descriptive statistical analysis, such as mean,

mode, standard deviation, et cetera;

inferential statistical analysis with models,

inferences and predictions, using correlation,

regression, variance, et cetera;

and analysis with data mining,

which involves artificial intelligence and machine learning.

Predictive analytics and data mining

help to evaluate what will happen in the future.

Data mining searches for hidden patterns in the data

that can be used to predict future behavior through machine learning.

Businesses, scientists and governments

have used this approach for years to transform data into proactive knowledge.

Decision management converts that knowledge into

actions that are used in their operational processes.

So while the same approaches can be applied today,

they need to run more quickly and at a larger scale,

using the most modern techniques currently available.

However, what are the benefits of discovering patterns or models through data mining?

Innovative organizations use data mining and predictive analytics to,

among other things: detect fraud and cybersecurity problems,

manage risk, spot sales trends,

develop smarter marketing campaigns,

predict customer loyalty, improve medical treatments, et cetera.

In the case of automated analysis,

the quick implementation of knowledge obtained from predictive analysis

ensures that the value of the analytical models

is not lost due to slow processes, such as rewriting the code for each environment,

revalidating the rewritten models,

or any other manual process.

One of the most important steps in an innovative system implementation

is to ask, even before you start the project,

what benefits the decision support system will bring.

So, in order to calculate the return on investment,

we need to ask: what will the investment be?

What are the expected returns?

We then establish the corresponding resources and implement the project accordingly.
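The ROI question above comes down to simple arithmetic. Here is a minimal sketch; the figures and variable names are hypothetical, not from the lecture.

```python
# Hypothetical figures for illustration only.
investment = 50_000.0        # cost of implementing the decision support system
expected_returns = 65_000.0  # benefits expected over the evaluation period

# Return on investment: net gain relative to what was invested.
roi = (expected_returns - investment) / investment
print(f"ROI: {roi:.0%}")  # ROI: 30%
```

Tracking this ratio against the project milestones is what lets us verify whether the expected return was actually achieved.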

Plan the project according to milestones,

keep track of these in order to control the project,

and verify that we have achieved the expected return.

For example, have we obtained the expected results?

Measure the results and implement an action plan

to correct course in case it is necessary.

What can we do to correct the situation if we have not achieved the results?

Now, I will present a typical architecture for data mining.

As I said before, there are several ways to

implement retrospective analysis with basic SQL,

OLAP queries, visualization and basic statistics.

All these allow us to know what happened.

In the case of data mining,

there are plenty of tasks and techniques that help establish descriptive and

prospective analysis, through data patterns or models that let us know

why things happen and what comes next.

A common use of data mining and machine learning techniques

is the automatic segmentation of customers by behavior,

demographics or attitudes, to better understand the needs

of specific groups and address them in a more targeted manner.

This analytical segmentation, or unsupervised modelling,

helps to identify groups of clients that are similar and that could

react to certain offers or activities in a similar way.

Another important use of data mining and machine learning is fraud detection,

which grows in importance as fraudsters develop more sophisticated tactics.

Data mining provides tangible benefits such as cost reduction,

income generation, and reduced time for different business activities.

It provides intangible benefits such as better decision-making

and an improved competitive position,

and it also provides strategic benefits.

For example, it facilitates the formulation of strategy,

that is, deciding which clients

and markets to pursue, and with which products.

There are two main objectives in data mining. In the first place

comes prediction, which often refers to supervised data mining.

In the second place we have description,

which includes the unsupervised and visualization aspects of data mining.

We speak of supervised methods when we start from prior knowledge of the data.

If we don't have prior knowledge of the data, we use unsupervised methods,

where groups of values

are automatically searched for.

So, users try to find correspondences between these automatically selected groups

and the categories that may be of interest.

Now we will see that predictive data mining tasks come up with a model from

an available data set to predict unknown or future values of another data set.

For instance a medical practitioner trying to

diagnose a disease based on the medical test results of a patient.

And descriptive data mining tasks

find patterns describing the data and derive

new significant information from the available data sets.

For instance, a retailer trying to identify products that are purchased together.

There are a number of data mining tasks such as classification, prediction,

time series analysis, association,

clustering, summarization, et cetera.

All these tasks are either predictive or descriptive data mining tasks.

A data mining system can execute one or

more of these specific tasks as part of data mining.

I will explain some of these data mining tasks as follows.

In the case of predictive analysis,

a classification task derives a model to

determine the class of an object based on its attributes.

A collection of records needs to be available.

One of the attributes of the record will be a class attribute.

The goal of the classification task is to assign

a class attribute to a new set of records as accurately as possible.

For instance, classification can be used in

direct marketing to know which customers purchased similar products,

and then promotion mails can be sent to them directly.

We can see in the figure that classification starts from a set of data,

which is prepared for the data mining task and divided into training and testing sets.

The training set will be the input to

the classification algorithm in order to create a predictive model.

When the model is ready,

it will be evaluated by using the testing set as

input and verifying whether the outcomes were as expected.

Remember that in the input file,

we have the data and the corresponding classification already identified,

because we start from previous knowledge.
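The train-then-evaluate workflow just described can be sketched with a minimal 1-nearest-neighbor classifier; the records and class labels below are made up for illustration, not taken from the lecture's figure.

```python
import math

# Toy labeled records: (feature vector, class). Made-up data for illustration.
training = [((1.0, 1.0), "buyer"), ((1.2, 0.8), "buyer"),
            ((5.0, 5.0), "non-buyer"), ((5.5, 4.5), "non-buyer")]
testing = [((0.9, 1.1), "buyer"), ((5.2, 5.1), "non-buyer")]

def predict(point, train):
    """1-nearest-neighbor: return the class of the closest training record."""
    nearest = min(train, key=lambda rec: math.dist(point, rec[0]))
    return nearest[1]

# Evaluate the model on the held-out testing set, as in the figure's workflow.
correct = sum(predict(x, training) == label for x, label in testing)
accuracy = correct / len(testing)
print(f"accuracy: {accuracy:.0%}")
```

The same train/test split applies whatever classification algorithm replaces the nearest-neighbor step.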

Prediction task predicts the possible values of missing or future data.

Prediction involves developing a model based on the available data,

and this model is used in predicting future values of a new data set of interest.

For example, a model can predict the income of an employee based on education,

experience, and other demographic factors like place of stay, gender, etc.

Also, prediction analysis is used in

different areas including medical diagnosis, fraud detection, etc.
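The income-prediction example above can be sketched as a least-squares fit; for simplicity this uses a single predictor (years of experience) instead of the multiple demographic factors mentioned, and all figures are invented.

```python
# Made-up training data: years of experience vs. income in thousands.
experience = [1, 3, 5, 7, 9]
income = [32, 38, 45, 50, 58]

# Ordinary least-squares for a single predictor: y = intercept + slope * x.
n = len(experience)
mean_x = sum(experience) / n
mean_y = sum(income) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(experience, income)) \
        / sum((x - mean_x) ** 2 for x in experience)
intercept = mean_y - slope * mean_x

# Predict the income of a new employee with 6 years of experience.
predicted = intercept + slope * 6
print(round(predicted, 1))  # 47.8
```

A production model would of course use more predictors and a proper validation set, as described for the classification task.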

In the predictive task of time series analysis, a time series

is a sequence of events where the next event is

determined by one or more of the preceding events.

Time series reflect the process being measured,

and there are certain components that affect the behavior of a process.

Time series analysis includes methods to analyze

time-series data in order to extract useful patterns,

trends, rules and statistics.

Stock market prediction is an important application of time-series analysis.
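One of the simplest ways to extract a trend from a time series, as described above, is a moving average; the price series here is invented for illustration.

```python
# Made-up daily closing prices.
prices = [10, 12, 11, 13, 15, 14, 16]
window = 3

# Simple moving average: average each run of `window` consecutive values
# to smooth out short-term fluctuations and expose the trend.
trend = [sum(prices[i:i + window]) / window
         for i in range(len(prices) - window + 1)]
print(trend)  # [11.0, 12.0, 13.0, 14.0, 15.0]
```

The smoothed sequence rises steadily, revealing the upward trend hidden by the day-to-day noise.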

The descriptive task association discovers

associations or connections among a set of items.

Association identifies the relationships between objects.

Association analysis is used for commodity management,

advertising, catalog design, direct marketing, etc.

A retailer can identify the products that customers normally purchase together,

or even find the customers who respond to promotions of the same kind of products.
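The "purchased together" idea can be sketched by counting item pairs across transactions; the baskets below are invented for illustration.

```python
from itertools import combinations
from collections import Counter

# Made-up market-basket transactions.
baskets = [{"bread", "milk"}, {"bread", "butter", "milk"},
           {"beer", "chips"}, {"bread", "milk"}]

# Count how often each pair of items appears in the same transaction.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pair, and its support (fraction of transactions with both).
pair, count = pair_counts.most_common(1)[0]
print(pair, count / len(baskets))  # ('bread', 'milk') 0.75
```

Full association-rule algorithms such as Apriori build on exactly this kind of co-occurrence counting, adding thresholds for support and confidence.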

The descriptive task clustering is similar to classification,

except that the groups are not predefined.

Clustering is used to identify data objects that are similar to one another.

The similarity can be decided based on a number of factors like purchase behavior,

responsiveness to certain actions,

geographical locations, and so on.

For example, an insurance company can cluster

its customers based on age, residence, income, etc.

This group information will be helpful to understand the customers better,

and hence provide better customized services.

The summarization task is descriptive and is the generalization of data.

A set of relevant data is summarized,

and the result is a smaller set that gives aggregated information of the data.

For example, the shopping done by a customer can be summarized into total products,

total spending, offers used, etc.

Such high-level summarized information can be useful for

sales or customer relationship team for retail customer and purchase behavior analysis.

Data can be summarized in different abstraction levels and from different angles.
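The customer-shopping summary described above can be sketched as a simple aggregation; the purchase records are invented for illustration.

```python
# Made-up purchase records for one customer.
purchases = [
    {"product": "shampoo", "amount": 4.5, "offer": True},
    {"product": "bread",   "amount": 1.2, "offer": False},
    {"product": "cheese",  "amount": 3.8, "offer": True},
]

# Summarize the detailed records into a smaller set of aggregates.
summary = {
    "total_products": len(purchases),
    "total_spending": round(sum(p["amount"] for p in purchases), 2),
    "offers_used":    sum(p["offer"] for p in purchases),
}
print(summary)
```

Grouping such summaries by month, store, or region gives the different abstraction levels and angles the lecture mentions.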

Here we can see a number of tasks and techniques.

We will learn some of the most representative and widely used tasks and techniques.

In the case of the classification task,

we will learn the ID3

decision tree algorithm, Naive Bayes,

and K-nearest neighbors, which can also be used as a regression technique.

In the case of clustering task,

we will learn the K-means algorithm.

Now we will see how the K-means algorithm works.

K-means clustering is a type of unsupervised learning,

which is used when you have unlabeled data.

That is, data without defined categories or groups.

The goal of this algorithm is to find groups in the data,

with a number of groups represented by the variable K. The algorithm works

iteratively to assign each data point to one

of K groups based on the features that are provided.

Data points are clustered based on the feature similarity.

The results of the K-means clustering algorithm are: first,

the centroids of the K clusters,

which can be used to label new data;

second, labels for the training data,

where each data point is assigned to a single cluster.

Each centroid of a cluster is a collection of

feature values which define the resulting group.

Examining the centroid feature weights can be used to

qualitatively interpret what kind of group each cluster represents.

The K-means pseudocode is as follows: first,

select K points as initial centers.

Second, repeat.

Third, form K clusters,

assigning each point to its nearest center.

Fourth, recalculate the center of each cluster.

Fifth, until the centers do not change.
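Those five steps translate almost directly into code. This is a minimal sketch, assuming 2-D points and random initialization from the data itself; the sample points are invented.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-means following the steps above: pick initial centers,
    then alternate assignment and recalculation until the centers are stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                    # 1. select K initial centers
    for _ in range(max_iters):                         # 2. repeat...
        clusters = [[] for _ in range(k)]
        for p in points:                               # 3. assign each point to
            idx = min(range(k),                        #    its nearest center
                      key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        new_centers = [                                # 4. recalculate centers as
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centers[i]                 #    the mean of each cluster
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:                     # 5. until centers do not change
            break
        centers = new_centers
    return centers, clusters

# Two well-separated groups of 2-D points.
data = [(1.0, 1.0), (1.5, 2.0), (1.2, 1.1), (8.0, 8.0), (8.5, 9.0), (9.0, 8.2)]
centers, clusters = kmeans(data, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

On well-separated data like this, the algorithm recovers the two natural groups regardless of which data points were picked as initial centers.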

To assign the points to the nearest centers,

a measure of proximity is used to determine how "close" the data is to the centers.

The most used measure of proximity is the Euclidean distance,

but other measures of proximity can be used,

such as the Manhattan distance or the cosine distance.

The latter is usually used to measure similarity between documents.
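The three proximity measures just mentioned can be sketched in a few lines; the sample vectors are arbitrary.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus cosine similarity; often used to compare documents,
    since it depends on the angle between vectors, not their length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

print(euclidean((0.0, 0.0), (3.0, 4.0)))       # 5.0
print(manhattan((0.0, 0.0), (3.0, 4.0)))       # 7.0
print(cosine_distance((1.0, 0.0), (0.0, 1.0))) # 1.0 (orthogonal vectors)
```

Swapping the distance function inside K-means changes which points the algorithm considers "close", and therefore the shape of the resulting clusters.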

To ensure that each point is assigned to

its cluster center and that the quality of the clustering is good,

an objective function is used that tries to

minimize the distance between the points and their centers.

This objective function is the sum of the squared error,

which is defined as:
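The transcript ends before showing the formula; the standard sum-of-squared-errors objective for K-means, consistent with the description above, is:

```latex
\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}(c_i, x)^2
```

where C_i is the i-th cluster and c_i is its center; minimizing the SSE yields compact clusters whose points lie close to their centers.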