An Overview of Data Mining Techniques
Excerpted from the book Building Data Mining Applications for CRM by Alex Berson, Stephen Smith, and Kurt Thearling
This overview provides a description of some of the most common data mining algorithms in use today. We have broken the discussion into two sections, each with a specific theme:
Classical Techniques: Statistics, Neighborhoods and Clustering
Next Generation Techniques: Trees, Networks and Rules
Each section will describe a number of data mining algorithms at a high level, focusing on the "big picture" so that the reader will be able to understand how each algorithm fits into the landscape of data mining techniques. Overall, six broad classes of data mining algorithms are covered. Although there are a number of other algorithms and many variations of the techniques described, one of the algorithms from this group of six is almost always used in real world deployments of data mining systems.
These two sections have been broken up based on when the data mining technique was developed and when it became technically mature enough to be used for business, especially for aiding in the optimization of customer relationship management systems. Thus this section contains descriptions of techniques that have classically been used for decades, while the next section represents techniques that have only been widely used since the early 1980s.
This section should help the user to understand the rough differences between the techniques, and provide at least enough information to be dangerous and well armed enough not to be baffled by the vendors of different data mining tools.
The main techniques that we will discuss here are the ones that are used 99.9% of the time on existing business problems. There are certainly many other ones, as well as proprietary techniques from particular vendors, but in general the industry is converging on those techniques that work consistently and are understandable and explainable.
By strict definition "statistics" or statistical techniques are not data mining. They were being used long before the term data mining was coined to apply to business applications. However, statistical techniques are driven by the data and are used to discover patterns and build predictive models. And from the user's perspective you will be faced with a conscious choice when solving a "data mining" problem as to whether you wish to attack it with statistical methods or other data mining techniques. For this reason it is important to have some idea of how statistical techniques work and how they can be applied.
I flew the Boston to
He explained to me that they were now not only storing the information on the flies but were also doing "data mining", adding as an aside "which seems to be very important these days whatever that is". I mentioned that I had written a book on the subject and he was interested in knowing what the difference was between "data mining" and statistics. There was no easy answer.
The techniques used in data mining, when successful, are successful for precisely the same reasons that statistical techniques are successful (e.g. clean data, a well defined target to predict and good validation to avoid overfitting). And for the most part the techniques are used in the same places for the same types of problems (prediction, classification, discovery). In fact some of the techniques that are classically defined as "data mining", such as CART and CHAID, arose from statisticians.
So what is the difference? Why aren't we as excited about "statistics" as we are about data mining? There are several reasons. The first is that the classical data mining techniques such as CART, neural networks and nearest neighbor techniques tend to be more robust both to messier real world data and to being used by less expert users. But that is not the only reason. The other reason is that the time is right. Because of the use of computers for closed loop business data storage and generation, there now exist large quantities of data that are available to users. If there were no data, there would be no interest in mining it. Likewise the fact that computer hardware has dramatically upped the ante by several orders of magnitude in storing and processing the data makes some of the most powerful data mining techniques feasible today.
The bottom line though, from an academic standpoint at least, is that there is little practical difference between a statistical technique and a classical data mining technique. Hence we have included a description of some of the most useful in this section.
Statistics is a branch of mathematics concerning the collection and the description of data. Usually statistics is considered to be one of those scary topics in college right up there with chemistry and physics. However, statistics is probably a much friendlier branch of mathematics because it really can be used every day. Statistics was in fact born from very humble beginnings of real world problems from business, biology, and gambling!
Knowing statistics in your everyday life will help the average business person make better decisions by allowing them to figure out risk and uncertainty when all the facts either aren't known or can't be collected. Even with all the data stored in the largest of data warehouses, business decisions still just become more informed guesses. The more and better the data and the better the understanding of statistics, the better the decision that can be made.
Statistics has been around for a long time, easily a century and arguably many centuries, dating back to when the ideas of probability began to gel. It could even be argued that the data collected by the ancient Egyptians, Babylonians, and Greeks were all statistics long before the field was officially recognized. Today data mining has been defined independently of statistics, though "mining data" for patterns and predictions is really what statistics is all about. Some of the techniques that are classified under data mining, such as CHAID and CART, really grew out of the statistical profession more than anywhere else, and the basic ideas of probability, independence, causality and overfitting are the foundation on which both data mining and statistics are built.
One thing that is always true about statistics is that there is always data involved, and usually enough data so that the average person cannot keep track of all the data in their head. This is certainly more true today than it was when the basic ideas of probability and statistics were being formulated and refined early this century. Today people have to deal with up to terabytes of data and have to make sense of it and glean the important patterns from it. Statistics can help greatly in this process by helping to answer several important questions about your data:
What patterns are there in my database?
What is the chance that an event will occur?
Which patterns are significant?
What is a high level summary of the data that gives me some idea of what is contained in my database?
Certainly statistics can do more than answer these questions but for most people today these are the questions that statistics can help answer. Consider for example that a large part of statistics is concerned with summarizing data, and more often than not, this summarization has to do with counting. One of the great values of statistics is in presenting a high level view of the database that provides some useful information without requiring every record to be understood in detail. This aspect of statistics is the part that people run into every day when they read the daily newspaper and see, for example, a pie chart reporting the number of US citizens of different eye colors, or the average number of annual doctor visits for people of different ages. Statistics at this level is used in the reporting of important information from which people may be able to make useful decisions. There are many different parts of statistics but the idea of collecting data and counting it is often at the base of even the more sophisticated techniques. The first step then in understanding statistics is to understand how the data is collected into a higher level form - one of the most notable ways of doing this is with the histogram.
One of the best ways to summarize data is to provide a histogram of the data. In the simple example database shown in Table 1.1 we can create a histogram of eye color by counting the number of occurrences of different colors of eyes in our database. For this example database of 10 records this is fairly easy to do and the results are only slightly more interesting than the database itself. However, for a database of many more records this is a very useful way of getting a high level understanding of the database.
| ID | Name  | Prediction | Age | Balance | Income | Eyes  | Gender |
| 1  | Amy   | No         | 62  | $0      | Medium | Brown | F      |
| 2  | Al    | No         | 53  | $1,800  | Medium | Green | M      |
| 3  | Betty | No         | 47  | $16,543 | High   | Brown | F      |
| 4  | Bob   | Yes        | 32  | $45     | Medium | Green | M      |
| 5  | Carla | Yes        | 21  | $2,300  | High   | Blue  | F      |
| 6  | Carl  | No         | 27  | $5,400  | High   | Brown | M      |
| 7  | Donna | Yes        | 50  | $165    | Low    | Blue  | F      |
| 8  | Don   | Yes        | 46  | $0      | High   | Blue  | M      |
| 9  | Edna  | Yes        | 27  | $500    | Low    | Blue  | F      |
| 10 | Ed    | No         | 68  | $1,200  | Low    | Blue  | M      |
Table 1.1 An Example Database of Customers with Different Predictor Types
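As a minimal sketch of the counting behind a histogram, the Eyes column of Table 1.1 can be tallied in a few lines of Python. The variable names and the text-bar rendering are illustrative assumptions, not something the book prescribes.

```python
from collections import Counter

# Eyes column of Table 1.1, in record order (IDs 1 through 10).
eye_colors = ["Brown", "Green", "Brown", "Green", "Blue",
              "Brown", "Blue", "Blue", "Blue", "Blue"]

# A histogram of a categorical predictor is just a count per distinct value.
histogram = Counter(eye_colors)

# Print one text bar per eye color, most frequent first.
for color, count in histogram.most_common():
    print(f"{color:<6} {'#' * count} ({count})")
```

The same counting works unchanged for 100 records or 100 million, since the number of bars depends only on the number of distinct values.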
The histogram shown in Figure 1.1 depicts a simple predictor (eye color) which will have only a few different values no matter if there are 100 customer records in the database or 100 million. There are, however, other predictors that have many more distinct values and can create a much more complex histogram. Consider, for instance, the histogram of ages of the customers in the population. In this case the histogram can be more complex but can also be enlightening. Consider if you found that the histogram of your customer data looked as it does in Figure 1.2.

Figure 1.1 This histogram shows the number of customers with various eye colors. This summary can quickly show important information about the database such as that blue eyes are the most frequent.
Figure 1.2 This histogram shows the number of customers of different ages and quickly tells the viewer that the majority of customers are over the age of 50.
By looking at this second histogram the viewer is in many ways looking at all of the data in the database for a particular predictor or data column. By looking at this histogram it is also possible to build an intuition about other important factors, such as the average age of the population and the maximum and minimum age, all of which are important. These values are called summary statistics. Some of the most frequently used summary statistics include:
Max - the maximum value for a given predictor.
Min - the minimum value for a given predictor.
Mean - the average value for a given predictor.
Median - the value for a given predictor that divides the database as nearly as possible into two databases of equal numbers of records.
Mode - the most common value for the predictor.
Variance - the measure of how spread out the values are from the average value.
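The summary statistics listed above can be computed directly for the Age column of Table 1.1. This is only a sketch using Python's standard `statistics` module; the dictionary layout is an assumption made for readability.

```python
import statistics

# Age column of Table 1.1, in record order (IDs 1 through 10).
ages = [62, 53, 47, 32, 21, 27, 50, 46, 27, 68]

summary = {
    "max": max(ages),
    "min": min(ages),
    "mean": statistics.mean(ages),          # arithmetic average
    "median": statistics.median(ages),      # middle value of the sorted ages
    "mode": statistics.mode(ages),          # most common value
    "variance": statistics.variance(ages),  # sample variance (divides by n - 1)
}

for name, value in summary.items():
    print(f"{name:<8} {value}")
```

For these ten records the mean age is 43.3 and the median is 46.5, so even this tiny sample shows how the two can differ.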
When there are many values for a given predictor the histogram begins to look smoother and smoother (compare the difference between the two histograms above). Sometimes the shape of the distribution of data can be calculated by an equation rather than just represented by the histogram. This is what is called a data distribution. Like a histogram, a data distribution can be described by a variety of statistics. In classical statistics the belief is that there is some "true" underlying shape to the data distribution that would be formed if all possible data were collected. The shape of the data distribution can be calculated for some simple examples. The statistician's job then is to take the limited data that may have been collected and from that make their best guess at what the "true" or at least most likely underlying data distribution might be.
Many data distributions are well described by just two numbers, the mean and the variance. The mean is something most people are familiar with; the variance, however, can be problematic. The easiest way to think about it is that it measures the average distance of each predictor value from the mean value over all the records in the database. If the variance is high it implies that the values are all over the place and very different. If the variance is low most of the data values are fairly close to the mean. To be precise, the actual definition of the variance uses the square of the distance rather than the actual distance from the mean, and the average is taken by dividing the sum of the squared distances by one less than the total number of records. In terms of prediction a user could make some guess at the value of a predictor without knowing anything else just by knowing the mean and also gain some basic sense of how variable the guess might be based on the variance.
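Written out, the precise definition described above looks as follows. The symbols are assumptions introduced here for illustration: n is the number of records, x_i is the predictor value in record i, x-bar is the mean and s^2 is the variance.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
```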
In this book the term "prediction" is used for a variety of types of analysis that may elsewhere be more precisely called regression. We have done so in order to simplify some of the concepts and to emphasize the common and most important aspects of predictive modeling. Nonetheless regression is a powerful and commonly used tool in statistics and it will be discussed here.
In statistics prediction is usually synonymous with regression of some form. There are a variety of different types of regression in statistics but the basic idea is that a model is created that maps values from predictors in such a way that the lowest error occurs in making a prediction. The simplest form of regression is simple linear regression, which contains just one predictor and a prediction. The relationship between the two can be mapped on a two dimensional space and the records plotted with the prediction values along the Y axis and the predictor values along the X axis. The simple linear regression model then could be viewed as the line that minimizes the error between the actual prediction value and the point on the line (the prediction from the model). Graphically this would look as it does in Figure 1.3. The simplest form of regression seeks to build a predictive model that is a line that maps each predictor value to a prediction value. Of the many possible lines that could be drawn through the data the one that minimizes the distance between the line and the data points is the one that is chosen for the predictive model.
On average if you guess the value on the line it should represent an acceptable compromise amongst all the data at that point giving conflicting answers. Likewise if there is no data available for a particular input value the line will provide the best guess at a reasonable answer based on similar data.
Figure 1.3 Linear regression is similar to the task of finding the line that minimizes the total distance to a set of data.
The predictive model is the line shown in Figure 1.3. The line will take a given value for a predictor and map it into a given value for a prediction. The actual equation would look something like: Prediction = a + b * Predictor, which is just the equation for a line, Y = a + bX. As an example, for a bank the predicted average consumer bank balance might equal $1,000 + 0.01 * customer's annual income. The trick, as always with predictive modeling, is to find the model that best minimizes the error. The most common way to calculate the error is the square of the difference between the predicted value and the actual value. Calculated this way, points that are very far from the line will have a great effect on moving the choice of line towards themselves in order to reduce the error. The values of a and b in the regression equation that minimize this error can be calculated directly from the data relatively quickly.
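The direct calculation of a and b can be sketched with the standard least-squares formulas for a single predictor. The function name `fit_line` and the income and balance figures below are hypothetical; the data was made exactly linear so that the result is easy to check against the bank example in the text.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing the sum of squared errors of y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: forces the fitted line through the point (mean_x, mean_y).
    a = mean_y - b * mean_x
    return a, b


# Hypothetical customers: annual income and bank balance.
incomes = [20000, 40000, 60000, 80000]
balances = [1200, 1400, 1600, 1800]

a, b = fit_line(incomes, balances)
print(f"Prediction = {a:.2f} + {b:.4f} * Predictor")
print("Predicted balance at an income of 50000:", a + b * 50000)
```

With real, noisy data the fitted line would not pass through every point, and squaring the errors means that records far from the line pull the fit toward themselves, as the paragraph above describes.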
Regression can become more complicated than the simple linear regression we've introduced so far. It can get more complicated in a variety of different ways in order to better model particular database problems. There are, however, four main modifications that can be made:
1. More predictors than just one can be used.
2. Transformations can be applied to the predictors.
3. Predictors can be multiplied together and used as terms in the equation.
4. Modifications can be made to accommodate response predictions that just have yes/no or 0/1 values.
Adding more predictors to the linear equation can produce more complicated lines that take more information into account and hence make a better prediction. This is called multiple linear regression and might have an equation like the following if 5 predictors were used (X1, X2, X3, X4, X5):
Y = a + b1(X1) + b2(X2) + b3(X3) + b4(X4) + b5(X5)
This equation still describes a linear relationship, but it is now a line (strictly speaking, a hyperplane) in a 6 dimensional space rather than a two dimensional space.
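One way to fit such an equation is to solve the least-squares "normal equations". The sketch below does this in plain Python with Gauss-Jordan elimination. The function name, the two-predictor data and the choice of solver are all assumptions made for illustration; in practice a numerical library would normally be used.

```python
def fit_multiple_linear_regression(rows, targets):
    """Return [a, b1, ..., bk] minimizing the squared error of a + sum(b_i * x_i)."""
    # Prepend a constant 1 so the intercept a is fitted like any other coefficient.
    design = [[1.0] + [float(v) for v in row] for row in rows]
    n_coef = len(design[0])

    # Build the normal equations (X^T X) coef = X^T y as an augmented matrix.
    aug = []
    for i in range(n_coef):
        row = [sum(d[i] * d[j] for d in design) for j in range(n_coef)]
        row.append(sum(d[i] * t for d, t in zip(design, targets)))
        aug.append(row)

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n_coef):
        pivot = max(range(col, n_coef), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n_coef):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]

    return [aug[i][n_coef] / aug[i][i] for i in range(n_coef)]


# Hypothetical records with two predictors, generated from Y = 2 + 3*X1 - 1*X2.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
targets = [3, 7, 6, 11, 9]

coefficients = fit_multiple_linear_regression(rows, targets)
print("a, b1, b2 =", [round(c, 6) for c in coefficients])
```

The same function accepts five predictors per record, which would correspond to the five-predictor equation in the text.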
By transforming the predictors by squaring, cubing or taking their square root it is possible to use the same general regression methodology and now create much more complex models that are no longer simply shaped like lines. This is called non-linear regression. A model of just one predictor might look like this: Y = a + b1(X1) + b2(X1^2). In many real world cases analysts will perform a wide variety of transformations on their data just to try them out. If they do not contribute to a useful model their coefficients in the equation will tend toward zero and then they can be removed. The other transformation of predictor values that is often performed is combining them, for example by multiplying them together. For instance a new predictor created by dividing hourly wage by the minimum wage might be a much more effective predictor than hourly wage by itself.
When trying to predict a customer response that is just yes or no (e.g. they bought the product or they didn't, or they defaulted or they didn't) the standard form of a line doesn't work. Since there are only two possible values to be predicted it is relatively easy to fit a line through them. However, that model would be the same no matter what predictors were being used or what particular data was being used. Typically in these situations a transformation of the prediction values is made in order to provide a better predictive model. This type of regression is called logistic regression and because so many business problems are response problems, logistic regression is one of the most widely used statistical techniques for creating predictive models.
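A rough sketch of the idea follows. The text does not spell out the transformation, so this example assumes the usual logistic (sigmoid) function, which squeezes a + b * X into a probability between 0 and 1, and fits a and b by plain gradient descent. The data, the function names and the learning settings are all hypothetical.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, steps=20000):
    """Fit P(yes) = sigmoid(a + b * x) by gradient descent on the log loss."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Error of the current predicted probability for every record.
        errors = [sigmoid(a + b * x) - y for x, y in zip(xs, ys)]
        a -= learning_rate * sum(errors) / n
        b -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return a, b


# Hypothetical data: one predictor and whether the customer responded (1) or not (0).
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

a, b = fit_logistic(xs, ys)
for x in xs:
    print(x, round(sigmoid(a + b * x), 3))
```

The output of the model is a probability of a yes response rather than a point on a straight line, which is what makes it suitable for the response problems described above.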
Clustering and the Nearest Neighbor prediction technique are among the oldest techniques used in data mining. Most people have an intuition that they understand what clustering is - namely that like records are grouped or clustered together. Nearest neighbor is a prediction technique that is quite similar to clustering - its essence is that in order to predict what a prediction value is in one record, look for records with similar predictor values in the historical database and use the prediction value from the record that is "nearest" to the unclassified record.
A simple example of clustering would be the clustering that most people perform when they do the laundry - grouping the permanent press, dry cleaning, whites and brightly colored clothes is important because they have similar characteristics. And it turns out they have important attributes in common about the way they behave (and can be ruined) in the wash. To "cluster" your laundry most of your decisions are relatively straightforward. There are of course difficult decisions to be made about which cluster your white shirt with red stripes goes into (since it is mostly white but has some color and is permanent press). When clustering is used in business the clusters are often much more dynamic - even changing weekly to monthly - and many more of the decisions concerning which cluster a record falls into can be difficult.
A simple example of the nearest neighbor prediction algorithm comes from looking at the people in your neighborhood (in this case those people who are in fact geographically near to you). You may notice that, in general, you all have somewhat similar incomes. Thus if your neighbor has an income greater than $100,000 chances are good that you too have a high income. Certainly the chances that you have a high income are greater when all of your neighbors have incomes over $100,000 than if all of your neighbors have incomes of $20,000. Within your neighborhood there may still be a wide variety of incomes possible among even your "closest" neighbors but if you had to predict someone's income based on only knowing their neighbors, your best chance of being right would be to predict the incomes of the neighbors who live closest to the unknown person.
The nearest neighbor prediction algorithm works in very much the same way except that "nearness" in a database may consist of a variety of factors, not just where the person lives. It may, for instance, be far more important to know which school someone attended and what degree they attained when predicting income. The better definition of "near" might in fact be other people that you graduated from college with rather than the people that you live next to.
Nearest Neighbor techniques are among the easiest to use and understand because they work in a way similar to the way that people think - by detecting closely matching examples. They also perform quite well in terms of automation, as many of the algorithms are robust with respect to dirty data and missing data. Lastly they are particularly adept at performing complex ROI calculations because the predictions are made at a local level where business simulations could be performed in order to optimize ROI. Because they achieve levels of accuracy similar to those of other techniques, their measures of accuracy, such as lift, are as good as those of any other technique.
One of the essential elements underlying the concept of clustering is that one particular object (whether it be a car, a food or a customer) can be closer to another object than can some third object. It is interesting that most people have an innate sense of ordering placed on a variety of different objects. Most people would agree that an apple is closer to an orange than it is to a tomato and that a Toyota Corolla is closer to a Honda Civic than to a Porsche. This sense of ordering on many different objects helps us place them in time and space and to make sense of the world. It is what allows us to build clusters - both in databases on computers as well as in our daily lives. This definition of nearness that seems to be ubiquitous also allows us to make predictions.
The nearest neighbor prediction algorithm simply stated is:
Objects that are "near" to each other will have similar prediction values as well. Thus if you know the prediction value of one of the objects you can predict it for its nearest neighbors.
One of the classical places that nearest neighbor has been used for prediction has been in text retrieval. The problem to be solved in text retrieval is one where the end user defines a document (e.g. Wall Street Journal article, technical conference paper etc.) that is interesting to them and they solicit the system to "find more documents like this one", effectively defining a target of "this is the interesting document" or "this is not interesting". The prediction problem is that only a very few of the documents in the database actually have values for this prediction field (namely only the documents that the reader has had a chance to look at so far). The nearest neighbor technique is used to find other documents that share important characteristics with those documents that have been marked as interesting.
As with almost all prediction algorithms, nearest neighbor can be used in a variety of places. Its successful use is mostly dependent on the pre-formatting of the data so that nearness can be calculated and individual records can be defined. In the text retrieval example this was not too difficult - the objects were documents. It is not always this easy. Consider what it might be like in a time series problem - say for predicting the stock market. In this case the input data is just a long series of stock prices over time without any particular record that could be considered to be an object. The value to be predicted is just the next value of the stock price.
The way that this problem is solved for both nearest neighbor techniques and for some other types of prediction algorithms is to create training records by taking, for instance, 10 consecutive stock prices and using the first 9 as predictor values and the 10th as the prediction value. Doing things this way, if you had 100 data points in your time series you could create 10 different training records.
You could create even more training records than 10 by creating a new record starting at every data point. For instance you could take the first 10 data points and create a record. Then you could take the 10 consecutive data points starting at the second data point, then the 10 consecutive data points starting at the third data point. Even though some of the data points would overlap from one record to the next the prediction value would always be different. In our example of 100 initial data points, 91 different training records could be created this way (one starting at each of the first 91 data points) as opposed to the 10 training records created via the other method.
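Both ways of cutting a time series into training records can be sketched as follows. The function names and the synthetic price series are hypothetical. Note that a window of 10 sliding over 100 points has 100 - 10 + 1 = 91 possible starting positions.

```python
def non_overlapping_records(series, window=10):
    """Cut the series into back-to-back windows: first 9 values predict the 10th."""
    records = []
    for start in range(0, len(series) - window + 1, window):
        chunk = series[start:start + window]
        records.append((chunk[:-1], chunk[-1]))
    return records


def sliding_records(series, window=10):
    """Start a new window at every data point that leaves room for a full window."""
    records = []
    for start in range(len(series) - window + 1):
        chunk = series[start:start + window]
        records.append((chunk[:-1], chunk[-1]))
    return records


# Hypothetical series of 100 stock prices.
prices = [100.0 + 0.5 * i for i in range(100)]

print(len(non_overlapping_records(prices)))  # 10 training records
print(len(sliding_records(prices)))          # 91 training records
```

Each record is a pair: a list of 9 predictor values and the single prediction value that follows them.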
One of the improvements that is usually made to the basic nearest neighbor algorithm is to take a vote from the "K" nearest neighbors rather than just relying on the sole nearest neighbor to the unclassified record. In Figure 1.4 we can see that unclassified example C has a nearest neighbor that is a defaulter and yet is surrounded almost exclusively by records that are good credit risks. In this case the nearest neighbor to record C is probably an outlier - which may be incorrect data or some non-repeatable idiosyncrasy. In either case it is more than likely that C is a non-defaulter yet would be predicted to be a defaulter if the sole nearest neighbor were used for the prediction.
Figure 1.4 The nearest neighbors are shown graphically for three unclassified records: A, B, and C.
In cases like these a vote of the 9 or 15 nearest neighbors would provide a better prediction accuracy for the system than would just the single nearest neighbor. Usually this is accomplished by simply taking the majority or plurality of predictions from the K nearest neighbors if the prediction column is binary or categorical, or by taking the average value of the prediction column from the K nearest neighbors if it is numeric.
Another important aspect of any system that is used to make predictions is that the user be provided with not only the prediction but also some sense of the confidence in that prediction (e.g. the prediction is defaulter with the chance of being correct 60% of the time). The nearest neighbor algorithm provides this confidence information in a number of ways:
The distance to the nearest neighbor provides a level of confidence. If the neighbor is very close or an exact match then there is much higher confidence in the prediction than if the nearest record is a great distance from the unclassified record.
The degree of homogeneity amongst the predictions within the K nearest neighbors can also be used. If all the nearest neighbors make the same prediction then there is much higher confidence in the prediction than if half the records made one prediction and the other half made another prediction.
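The K nearest neighbor vote and the two confidence signals just described can be sketched together. The records below are hypothetical two-dimensional points loosely modeled on the situation of record C (one nearby defaulter surrounded by good credit risks). The function names, the plain Euclidean distance and the result layout are assumptions for illustration.

```python
import math
from collections import Counter


def k_nearest(query, records, k):
    """Return the k (distance, label) pairs closest to the query point."""
    scored = [(math.dist(query, features), label) for features, label in records]
    return sorted(scored, key=lambda pair: pair[0])[:k]


def predict(query, records, k=5):
    """Majority vote of the k nearest neighbors, plus two confidence signals."""
    neighbors = k_nearest(query, records, k)
    votes = Counter(label for _, label in neighbors)
    label, count = votes.most_common(1)[0]
    return {
        "prediction": label,
        "agreement": count / k,                # homogeneity of the neighbors
        "nearest_distance": neighbors[0][0],   # closeness of the best match
    }


# Hypothetical historical records: (predictor values, known outcome).
records = [
    ((1.0, 1.0), "good"), ((1.2, 0.8), "good"), ((0.8, 1.2), "good"),
    ((1.1, 1.3), "good"), ((1.0, 1.05), "default"),   # an outlier among good risks
    ((5.0, 5.0), "default"), ((5.2, 4.9), "default"),
]
record_c = (1.0, 1.1)

print(predict(record_c, records, k=1))  # follows the single outlier
print(predict(record_c, records, k=5))  # the vote overrules the outlier
```

For a numeric prediction column the vote would be replaced by the average of the neighbors' values.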
Clustering is the method by which like records are grouped together. Usually this is done to give the end user a high level view of what is going on in the database. Clustering is sometimes used to mean segmentation - which most marketing people will tell you is useful for coming up with a bird's eye view of the business. Two of these clustering systems are the PRIZM™ system from Claritas corporation and MicroVision™ from Equifax corporation. These companies have grouped the population by demographic information into segments that they believe are useful for direct marketing and sales. To build these groupings they use information such as income, age, occupation, housing and race collected in the US Census. Then they assign memorable "nicknames" to the clusters. Some examples are shown in Table 1.2.
| Name                 | Income      | Age                 | Education      | Vendor               |
| Blue Blood Estates   | Wealthy     | 35-54               | College        | Claritas Prizm™      |
| Shotguns and Pickups | Middle      | 35-64               | High School    | Claritas Prizm™      |
|                      | Poor        | Mix                 | Grade School   | Claritas Prizm™      |
| Living Off the Land  | Middle-Poor | School Age Families | Low            | Equifax MicroVision™ |
| University           | Very low    | Young - Mix         | Medium to High | Equifax MicroVision™ |
| Sunset Years         | Medium      | Seniors             | Medium         | Equifax MicroVision™ |
Table 1.2 Some Commercially Available Cluster Tags
This clustering information is then used by the end user to tag the customers in their database. Once this is done the business user can get a quick high level view of what is happening within the cluster. Once the business user has worked with these codes for some time they also begin to build intuitions about how these different customer clusters will react to the marketing offers particular to their business. For instance some of these clusters may relate to their business and some of them may not. But given that their competition may well be using these same clusters to structure their business and marketing offers it is important to be aware of how your customer base behaves in regard to these clusters.
Sometimes clustering is performed not so much to keep records together as to make it easier to see when one record sticks out from the rest. For instance:
Most wine distributors selling inexpensive wine in
A sale on men's suits is being held in all branches of a department store for southern
The nearest neighbor algorithm is basically a refinement of clustering in the sense that they both use distance in some feature space to create either structure in the data or predictions. The nearest neighbor algorithm is a refinement since part of the algorithm usually is a way of automatically determining the weighting of the importance of the predictors and how the distance will be measured within the feature space. Clustering is one special case of this where the importance of each predictor is considered to be equivalent.
To see clustering and nearest neighbor prediction in use let's go back to our example database and now look at it in two ways. First let's try to create our own clusters - which if useful we could use internally to help to simplify and clarify large quantities of data (and maybe, if we did a very good job, sell these new codes to other business users). Secondly let's try to create predictions based on the nearest neighbor.
First take a look at the data. How would you cluster the data in Table 1.3?
| ID | Name | Prediction | Age | Balance | Income | Eyes | Gender |
| 1 | Amy | No | 62 | $0 | Medium | Brown | F |
| 2 | Al | No | 53 | $1,800 | Medium | Green | M |
| 3 | Betty | No | 47 | $16,543 | High | Brown | F |
| 4 | Bob | Yes | 32 | $45 | Medium | Green | M |
| 5 | Carla | Yes | 21 | $2,300 | High | Blue | F |
| 6 | Carl | No | 27 | $5,400 | High | Brown | M |
| 7 | Donna | Yes | 50 | $165 | Low | Blue | F |
| 8 | Don | Yes | 46 | $0 | High | Blue | M |
| 9 | Edna | Yes | 27 | $500 | Low | Blue | F |
| 10 | Ed | No | 68 | $1,200 | Low | Blue | M |
Table 1.3 A Simple Example Database
If these were your friends rather than your customers (hopefully they could be both) and they were single, you might cluster them based on their compatibility with each other, creating your own mini dating service. If you were a pragmatic person you might think that marital happiness is mostly dependent on financial compatibility and so cluster your database into the three clusters shown in Table 1.4.
| ID | Name | Prediction | Age | Balance | Income | Eyes | Gender |
| 3 | Betty | No | 47 | $16,543 | High | Brown | F |
| 5 | Carla | Yes | 21 | $2,300 | High | Blue | F |
| 6 | Carl | No | 27 | $5,400 | High | Brown | M |
| 8 | Don | Yes | 46 | $0 | High | Blue | M |
| 1 | Amy | No | 62 | $0 | Medium | Brown | F |
| 2 | Al | No | 53 | $1,800 | Medium | Green | M |
| 4 | Bob | Yes | 32 | $45 | Medium | Green | M |
| 7 | Donna | Yes | 50 | $165 | Low | Blue | F |
| 9 | Edna | Yes | 27 | $500 | Low | Blue | F |
| 10 | Ed | No | 68 | $1,200 | Low | Blue | M |
If on the other hand you are more of a romantic you might note some incompatibilities between 46-year-old Don and 21-year-old Carla (even though they both make very good incomes). You might instead consider age and some physical characteristics to be most important in creating clusters of friends. Another way you could cluster your friends would be based on their ages and on the color of their eyes. This is shown in Table 1.5. Here three clusters are created where each person in the cluster is about the same age and some attempt has been made to keep people of like eye color together in the same cluster.
| ID | Name | Prediction | Age | Balance | Income | Eyes | Gender |
| 5 | Carla | Yes | 21 | $2,300 | High | Blue | F |
| 9 | Edna | Yes | 27 | $500 | Low | Blue | F |
| 6 | Carl | No | 27 | $5,400 | High | Brown | M |
| 4 | Bob | Yes | 32 | $45 | Medium | Green | M |
| 8 | Don | Yes | 46 | $0 | High | Blue | M |
| 7 | Donna | Yes | 50 | $165 | Low | Blue | F |
| 10 | Ed | No | 68 | $1,200 | Low | Blue | M |
| 3 | Betty | No | 47 | $16,543 | High | Brown | F |
| 2 | Al | No | 53 | $1,800 | Medium | Green | M |
| 1 | Amy | No | 62 | $0 | Medium | Brown | F |
There is no best way to cluster.
This example, though simple, points up some important questions about clustering. For instance: is it possible to say whether the first clustering that was performed above (by financial status) was better or worse than the second clustering (by age and eye color)? Probably not, since the clusters were constructed for no particular purpose except to note similarities between some of the records and to show that the view of the database could be somewhat simplified by using clusters. But even the differences between the two clusterings were driven by slightly different motivations (financial vs. romantic). In general the reasons for clustering are just this ill defined because clusters are used for exploration and summarization at least as much as they are used for prediction.
Notice that for the first clustering example there was a pretty simple rule by which the records could be broken up into clusters - namely by income. In the second clustering example there were less clear dividing lines since two predictors were used to form the clusters (age and eye color). Thus the first cluster is dominated by younger people with somewhat mixed eye colors whereas the latter two clusters have a mix of older people where eye color has been used to separate them out (the second cluster is entirely blue-eyed people). In this case these tradeoffs were made arbitrarily, but when clustering much larger numbers of records these tradeoffs are explicitly defined by the clustering algorithm.
In the best possible case clusters would be built where all records within the cluster had identical values for the particular predictors that were being clustered on. This would be the optimum in creating a high-level view since knowing the predictor values for any member of the cluster would mean knowing the values for every member of the cluster no matter how large the cluster was. Creating homogeneous clusters where all values for the predictors are the same is difficult to do when there are many predictors and/or the predictors have many different values (high cardinality).
It is possible to guarantee that homogeneous clusters are created by breaking apart any cluster that is inhomogeneous into smaller clusters that are homogeneous. In the extreme, though, this usually means creating clusters with only one record in them, which usually defeats the original purpose of the clustering. For instance in our 10 record database above 10 perfectly homogeneous clusters could be formed of 1 record each, but not much progress would have been made in making the original database more understandable.
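The tension between homogeneity and the number of clusters can be seen directly on the ten records of Table 1.3. The following is only an illustrative sketch: the helper name `cluster_by` and the tuple layout are assumptions made for this example, not part of any particular clustering product. Grouping on income alone yields three clusters that are homogeneous in income, while demanding homogeneity on every predictor degenerates into one cluster per record.

```python
from collections import defaultdict

# Records from Table 1.3: (id, name, prediction, age, balance, income, eyes, gender)
RECORDS = [
    (1, "Amy", "No", 62, 0, "Medium", "Brown", "F"),
    (2, "Al", "No", 53, 1800, "Medium", "Green", "M"),
    (3, "Betty", "No", 47, 16543, "High", "Brown", "F"),
    (4, "Bob", "Yes", 32, 45, "Medium", "Green", "M"),
    (5, "Carla", "Yes", 21, 2300, "High", "Blue", "F"),
    (6, "Carl", "No", 27, 5400, "High", "Brown", "M"),
    (7, "Donna", "Yes", 50, 165, "Low", "Blue", "F"),
    (8, "Don", "Yes", 46, 0, "High", "Blue", "M"),
    (9, "Edna", "Yes", 27, 500, "Low", "Blue", "F"),
    (10, "Ed", "No", 68, 1200, "Low", "Blue", "M"),
]

# Position of each predictor inside a record tuple (layout is an assumption).
FIELDS = {"age": 3, "balance": 4, "income": 5, "eyes": 6, "gender": 7}

def cluster_by(records, predictors):
    """Group records so that every cluster is homogeneous on the given predictors."""
    clusters = defaultdict(list)
    for rec in records:
        key = tuple(rec[FIELDS[p]] for p in predictors)
        clusters[key].append(rec[1])  # keep just the name for readability
    return dict(clusters)

by_income = cluster_by(RECORDS, ["income"])
by_everything = cluster_by(RECORDS, list(FIELDS))

print(len(by_income))      # 3 clusters, one per income level
print(len(by_everything))  # 10 clusters, one per record
```

The second result is perfectly homogeneous but summarizes nothing, which is the point made in the text.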
The second important constraint on clustering is then that a reasonable number of clusters be formed. Here, again, reasonable is defined by the user and is difficult to quantify beyond saying that just one cluster is unacceptable (too much generalization) and that as many clusters as original records is also unacceptable. Many clustering algorithms either let the user choose the number of clusters that they would like to see created from the database or they provide the user a "knob" by which they can create fewer or greater numbers of clusters interactively after the clustering has been performed.
The main distinction between clustering and the nearest neighbor technique is that clustering is what is called an unsupervised learning technique, whereas nearest neighbor is generally used for prediction and is a supervised learning technique. Unsupervised learning techniques are unsupervised in the sense that when they are run there is no particular reason for the creation of the models the way there is for supervised learning techniques that are trying to perform prediction. In prediction, the patterns that are found in the database and presented in the model are always the most important patterns in the database for performing some particular prediction. In clustering there is no particular sense of why certain records are near to each other or why they all fall into the same cluster. Some of the differences between clustering and nearest neighbor prediction are summarized in Table 1.6.
| Nearest Neighbor | Clustering |
| Used for prediction as well as consolidation. | Used mostly for consolidating data into a high-level view and general grouping of records into like behaviors. |
| Space is defined by the problem to be solved (supervised learning). | Space is defined as default n-dimensional space, or is defined by the user, or is a predefined space driven by past experience (unsupervised learning). |
| Generally only uses distance metrics to determine nearness. | Can use other metrics besides distance to determine nearness of two records - for example linking two points together. |
Table 1.6 Some of the Differences Between the Nearest-Neighbor Data Mining Technique and Clustering
When people talk about clustering or nearest neighbor prediction they will often talk about a "space" of "N" dimensions. What they mean is that in order to define what is near and what is far away it is helpful to have a "space" defined where distance can be calculated. Generally these spaces behave just like the three dimensional space that we are familiar with, where distance between objects is defined by Euclidean distance (just like figuring out the length of a side in a triangle).
What goes for three dimensions works pretty well for more dimensions as well, which is a good thing since most real world problems consist of many more than three dimensions. In fact each predictor (or database column) that is used can be considered to be a new dimension. In the example above the five predictors age, income, balance, eyes and gender can all be construed to be dimensions in an n-dimensional space where n, in this case, equals 5. It is sometimes easier to think about these and other data mining algorithms in terms of n-dimensional spaces because it allows for some intuitions to be used about how the algorithm is working.
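To make the idea of distance in n dimensions concrete, here is a minimal sketch of Euclidean distance that works for any number of dimensions. The function name and the sample coordinates are made up for illustration. How categorical predictors such as eye color or gender are turned into numbers is a separate modeling decision that this sketch leaves out.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of dimensions."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two dimensions: the hypotenuse of a 3-4-5 right triangle.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0

# Five dimensions work the same way; these coordinates are invented.
p = (1, 0, 2, 0, 1)
q = (0, 0, 0, 0, 1)
print(euclidean_distance(p, q))  # square root of 1 + 4
```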
Moving from three dimensions to five dimensions is not too large a jump but there are also spaces in real world problems that are far more complex. In the credit card industry credit card issuers typically have over one thousand predictors that could be used to create an n-dimensional space. For text retrieval (e.g. finding useful Wall Street Journal articles from a large database, or finding useful web sites on the internet) the predictors (and hence the dimensions) are typically words or phrases that are found in the document records. In just one year of the Wall Street Journal there are more than 50,000 different words used - which translates to a 50,000 dimensional space in which nearness between records must be calculated.
For clustering the n-dimensional space is usually defined by assigning one predictor to each dimension. For the nearest neighbor algorithm predictors are also mapped to dimensions but then those dimensions are literally stretched or compressed based on how important the particular predictor is in making the prediction. The stretching of a dimension effectively makes that dimension (and hence predictor) more important than the others in calculating the distance.
For instance if you are a mountain climber and someone told you that you were 2 miles from your destination, the distance is the same whether it's 1 mile north and 1 mile up the face of the mountain or 2 miles north on level ground, but clearly the former route is much different from the latter. The distance traveled straight upward is the most important in figuring out how long it will really take to get to the destination and you would probably like to consider this "dimension" to be more important than the others. In fact you, as a mountain climber, could "weight" the importance of the vertical dimension in calculating some new distance by reasoning that every mile upward is equivalent to 10 miles on level ground.
If you used this rule of thumb to weight the importance of one dimension over the other it would be clear that in one case you were much "further away" from your destination ("11 miles") than in the second ("2 miles"). In the next section we'll show how the nearest neighbor algorithm uses distance measures that similarly weight the important dimensions more heavily when calculating a distance.
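The mountain climber's arithmetic can be reproduced with a weighted distance. Note one interpretation made here: the "2 miles" in the example adds the northward and upward legs together, so the sketch uses a city-block (sum of legs) distance rather than the straight-line Euclidean one. The function name and the `(north, up)` coordinate layout are assumptions for illustration.

```python
def weighted_distance(a, b, weights):
    """City-block distance where each dimension's difference is scaled by a weight."""
    return sum(w * abs(x - y) for x, y, w in zip(a, b, weights))

start = (0, 0)          # (miles north, miles up)
summit_route = (1, 1)   # 1 mile north and 1 mile up
flat_route = (2, 0)     # 2 miles north on level ground

unweighted = (1, 1)
climber = (1, 10)       # every mile upward counts as 10 level miles

print(weighted_distance(start, summit_route, unweighted))  # 2
print(weighted_distance(start, flat_route, unweighted))    # 2
print(weighted_distance(start, summit_route, climber))     # 11
print(weighted_distance(start, flat_route, climber))       # 2
```

Stretching the vertical dimension by a factor of 10 is what separates the two routes that looked equally far away before weighting.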
There are two main types of clustering techniques, those that create a hierarchy of clusters and those that do not. The hierarchical clustering techniques create a hierarchy of clusters from small to big. The main reason for this is that, as was already stated, clustering is an unsupervised learning technique, and as such, there is no absolutely correct answer. For this reason and depending on the particular application of the clustering, fewer or greater numbers of clusters may be desired. With a hierarchy of clusters defined it is possible to choose the number of clusters that are desired. At the extreme it is possible to have as many clusters as there are records in the database. In this case the records within the cluster are optimally similar to each other (since there is only one) and certainly different from the other clusters. But of course such a clustering technique misses the point in the sense that the idea of clustering is to find useful patterns in the database that summarize it and make it easier to understand. Any clustering algorithm that ends up with as many clusters as there are records has not helped the user understand the data any better. Thus one of the main points about clustering is that there be many fewer clusters than there are original records. Exactly how many clusters should be formed is a matter of interpretation. The advantage of hierarchical clustering methods is that they allow the end user to choose from either many clusters or only a few.
The hierarchy of clusters is usually viewed as a tree where the smallest clusters merge together to create the next highest level of clusters and those at that level merge together to create the next highest level of clusters. Figure 1.5 below shows how several clusters might form a hierarchy. When a hierarchy of clusters like this is created the user can determine what the right number of clusters is that adequately summarizes the data while still providing useful information (at the other extreme a single cluster containing all the records is a great summarization but does not contain enough specific information to be useful).
This hierarchy of clusters is created through the algorithm that builds the clusters. There are two main types of hierarchical clustering algorithms:
Agglomerative - Agglomerative clustering techniques start with as many clusters as there are records where each cluster contains just one record. The clusters that are nearest each other are merged together to form the next largest cluster. This merging is continued until a hierarchy of clusters is built with just a single cluster containing all the records at the top of the hierarchy.
Divisive - Divisive clustering techniques take the opposite approach from agglomerative techniques. These techniques start with all the records in one cluster and then try to split that cluster into smaller pieces and then in turn try to split those smaller pieces.
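The agglomerative procedure described above can be sketched in a few lines. This is a deliberately naive illustration, not an efficient or canonical implementation: the function name is hypothetical, and it measures the nearness of two clusters by their closest pair of records (the single-link rule), which is only one of several possible choices.

```python
def agglomerative(points, distance):
    """Merge the two nearest clusters repeatedly; return the merges in order."""
    clusters = [(p,) for p in points]  # start with one cluster per record
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link: cluster distance is the distance of the closest pair.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

ages = [21, 27, 27, 32, 46, 47, 50, 53, 62, 68]  # ages from Table 1.3, sorted
merges = agglomerative(ages, lambda a, b: abs(a - b))
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

Each printed line is one level of the hierarchy. Stopping the merging early, or cutting the list of merges at some distance, gives more or fewer clusters, which is the flexibility the text attributes to hierarchical methods.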
Of the two the agglomerative techniques are the most commonly used for clustering and have more algorithms developed for them. We'll talk about these in more detail in the next section. The non-hierarchical techniques in general are faster to create from the historical database but require that the user make some decision about the number of clusters desired or the minimum "nearness" required for two records to be within the same cluster. These non-hierarchical techniques are often run multiple times, starting off with some arbitrary or even random clustering and then iteratively improving the clustering by shuffling some records around. Alternatively these techniques sometimes create clusters with only one pass through the database, adding records to existing clusters when a suitable one exists and creating new clusters when no existing cluster is a good candidate for the given record. Because the clusters that are formed can depend on these initial choices of starting clusters, or even of how many clusters to form, these techniques can be less repeatable than the hierarchical techniques and can sometimes create either too many or too few clusters because the number of clusters is predetermined by the user, not determined solely by the patterns inherent in the database.

Figure 1.5 Diagram showing a hierarchy of clusters. Clusters at the lowest level are merged together to form larger clusters at the next level of the hierarchy.
There are two main non-hierarchical clustering techniques. Both of them are very fast to compute on the database but have some drawbacks. The first are the single pass methods. They derive their name from the fact that the database must only be passed through once in order to create the clusters (i.e. each record is only read from the database once). The other class of techniques are called reallocation methods. They get their name from the movement or "reallocation" of records from one cluster to another in order to create better clusters. The reallocation techniques do use multiple passes through the database but are relatively fast in comparison to the hierarchical techniques.
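A single pass method can be sketched as follows. The details are assumptions chosen for illustration: each cluster is represented by the running mean of its members, and a record joins the nearest cluster only if it lies within a user-chosen `threshold`; otherwise it starts a new cluster. Real tools may use different representatives and nearness rules.

```python
def single_pass_clusters(values, threshold):
    """Read each value once; join the nearest cluster if close enough, else start a new one."""
    clusters = []  # each cluster is a list of values
    for v in values:
        best, best_dist = None, None
        for cluster in clusters:
            centre = sum(cluster) / len(cluster)
            dist = abs(v - centre)
            if best_dist is None or dist < best_dist:
                best, best_dist = cluster, dist
        if best is not None and best_dist <= threshold:
            best.append(v)
        else:
            clusters.append([v])
    return clusters

ages = [62, 53, 47, 32, 21, 27, 50, 46, 27, 68]  # Table 1.3 ages in ID order

in_id_order = single_pass_clusters(ages, threshold=10)
in_sorted_order = single_pass_clusters(sorted(ages), threshold=10)

print(len(in_id_order), in_id_order)
print(len(in_sorted_order), in_sorted_order)
```

Running it on the same ages in two different orders produces a different number of clusters, which illustrates why such methods can be less repeatable than hierarchical ones.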
Some techniques allow the user to request the number of clusters that they would like to be pulled out of the data. Predefining the number of clusters rather than having them driven by the data might seem to be a bad idea as there might be some very distinct and observable clustering of the data into a certain number of clusters which the user might not be aware of.
For instance the user may wish to see their data broken up into 10 clusters but the data itself partitions very cleanly into 13 clusters. These non-hierarchical techniques will try to shoehorn these extra three clusters into the existing 10 rather than creating the 13 which best fit the data. The saving grace for these methods, however, is that, as we have seen, there is no one right answer for how to cluster, so it is rare that arbitrarily predefining the number of clusters would leave you with the wrong answer. One of the advantages of these techniques is that often the user does have some predefined level of summarization that they are interested in (e.g. "25 clusters is too confusing, but 10 will help to give me an insight into my data"). The fact that greater or fewer numbers of clusters would better match the data is actually of secondary importance.
Hierarchical clustering has the advantage over non-hierarchical techniques in that the clusters are defined solely by the data (not by the users predetermining the number of clusters) and that the number of clusters can be increased or decreased by simply moving up and down the hierarchy.
The hierarchy is created by starting either at the top (one cluster that includes all records) and subdividing (divisive clustering) or by starting at the bottom with as many clusters as there are records and merging (agglomerative clustering). Usually the merging and subdividing are done two clusters at a time.
The main distinction between the techniques is their ability to favor long, scraggly clusters that are linked together record by record, or to favor the detection of the more classical, compact or spherical cluster that was shown at the beginning of this section. It may seem strange to want to form these long, snaking, chain-like clusters, but in some cases they are the patterns that the user would like to have detected in the database. These are the times when the underlying space looks quite different from the spherical clusters and the clusters that should be formed are not based on the distance from the center of the cluster but instead based on the records being "linked" together. Consider the example shown in Figure 1.6 or in Figure 1.7. In these cases there are two clusters that are not very spherical in shape but could be detected by the single link technique.
When looking at the layout of the data in Figure 1.6 there appear to be two relatively flat clusters running parallel to each other along the income axis. Neither the complete link nor Ward's method would, however, return these two clusters to the user. These techniques rely on creating a "center" for each cluster and picking these centers so that the average distance of each record from this center is minimized. Points that are very distant from these centers would necessarily fall into a different cluster.
What makes these clusters "visible" in this simple two dimensional space is the fact that each point in a cluster is tightly linked to some other point in the cluster. For the two clusters we see the maximum distance between the nearest two points within a cluster is less than the minimum distance of the nearest two points in different clusters. That is to say that for any point in this space, the nearest point to it is always going to be another point in the same cluster. Now the center of gravity of a cluster could be quite distant from a given point, but every point is linked to every other point by a series of small distances.
Figure 1.6 An example of elongated clusters which would not be recovered by the complete link or Ward's methods but would be by the single-link method.
Figure 1.7 An example of nested clusters which would not be recovered by the complete link or Ward's methods but would be by the single-link method.
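The "linked together by a series of small distances" idea can be sketched as follows. This is an assumed, simplified form of single-link clustering: two records end up in the same cluster whenever a chain of links no longer than `max_link` joins them. The function name, the threshold and the two synthetic bands of points are all invented for illustration and are not the data behind Figure 1.6.

```python
import math

def single_link_clusters(points, max_link):
    """Cluster points that are joined by a chain of links no longer than max_link."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the representative of i's cluster.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps chains short
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= max_link:
                parent[find(i)] = find(j)  # link the two clusters

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())

# Two made-up flat bands running parallel along the x axis, 5 units apart,
# with neighbouring points inside a band only 1 unit apart.
lower = [(x, 0) for x in range(10)]
upper = [(x, 5) for x in range(10)]

clusters = single_link_clusters(lower + upper, max_link=1.5)
print(len(clusters))                 # 2
print(sorted(clusters[0]) == lower)  # True
```

The two ends of a band are 9 units apart, farther than the 5-unit gap between the bands, yet each band is recovered whole because every point has a near neighbor in its own band.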
There is no particular rule that would tell you when to choose a particular technique over another one. Sometimes those decisions are made relatively arbitrarily based on the availability of data mining analysts who are most experienced in one technique over another. And even choosing classical techniques over some of the newer techniques is more dependent on the availability of good tools and good analysts. Whichever techniques are chosen, whether classical or next generation, all of the techniques presented here have been available and tried for more than two decades. So even the next generation is a solid bet for implementation.
The data mining techniques in this section represent the most often used techniques that have been developed over the last two decades of research. They also represent the vast majority of the techniques that are being spoken about when data mining is mentioned in the popular press. These techniques can be used for either discovering new information within large databases or for building predictive models. Though the older decision tree techniques such as CHAID are currently highly used, the new techniques such as CART are gaining wider acceptance.
A decision tree is a predictive model that, as its name implies, can be viewed as a tree. Specifically each branch of the tree is a classification question and the leaves of the tree are partitions of the dataset with their classification. For instance if we were going to classify customers who churn (don't renew their phone contracts) in the cellular telephone industry a decision tree might look something like that found in Figure 2.1.
Figure 2.1 A decision tree is a predictive model that makes a prediction on the basis of a series of decisions, much like the game of 20 questions.
You may notice some interesting things about the tree:
It divides up the data on each branch point without losing any of the data (the number of total records in a given parent node is equal to the sum of the records contained in its two children).
The number of churners and non-churners is conserved as you move up or down the tree.
It is pretty easy to understand how the model is being built (in contrast to the models from neural networks or from standard statistics).
It would also be pretty easy to use this model if you actually had to target those customers that are likely to churn with a targeted marketing offer.
You may also build some intuitions about your customer base, e.g. "customers who have been with you for a couple of years and have up-to-date cellular phones are pretty loyal".
From a business perspective decision trees can be viewed as creating a segmentation of the original dataset (each segment would be one of the leaves of the tree). Segmentation of customers, products, and sales regions is something that marketing managers have been doing for many years. In the past this segmentation has been performed in order to get a high-level view of a large amount of data - with no particular reason for creating the segmentation except that the records within each segment were somewhat similar to each other.
In this case the segmentation is done for a particular reason - namely for the prediction of some important piece of information. The records that fall within each segment fall there because they have similarity with respect to the information being predicted - not just because they are similar in some way that is not well defined. These predictive segments that are derived from the decision tree also come with a description of the characteristics that define the predictive segment. Thus, although the decision trees and the algorithms that create them may be complex, the results can be presented in an easy to understand way that can be quite useful to the business user.
Because of their tree structure and ability to easily generate rules, decision trees are the favored technique for building understandable models. Because of this clarity they also allow for more complex profit and ROI models to be added easily on top of the predictive model. For instance once a customer population is found with high predicted likelihood to attrite, a variety of cost models can be used to see if an expensive marketing intervention should be used because the customers are highly valuable or a less expensive intervention should be used because the revenue from this sub-population of customers is marginal.
Because of their high level of automation and the ease of translating decision tree models into SQL for deployment in relational databases, the technology has also proven to be easy to integrate with existing IT processes, requiring little preprocessing and cleansing of the data, or extraction of a special purpose file specifically for data mining.
Decision trees are a data mining technology that has been around in a form very similar to the technology of today for almost twenty years now, and early versions of the algorithms date back to the 1960s. Often these techniques were originally developed for statisticians to automate the process of determining which fields in their database were actually useful or correlated with the particular problem that they were trying to understand. Partially because of this history, decision tree algorithms tend to automate the entire process of hypothesis generation and then validation much more completely and in a much more integrated way than any other data mining techniques. They are also particularly adept at handling raw data with little or no pre-processing. Perhaps also because they were originally developed to mimic the way an analyst interactively performs data mining, they provide a simple to understand predictive model based on rules (such as "90% of the time credit card customers of less than 3 months who max out their credit limit are going to default on their credit card loan.").
Because decision trees score so highly on so many of the critical features of data mining they can be used in a wide variety of business problems for both exploration and for prediction. They have been used for problems ranging from credit card attrition prediction to time series prediction of the exchange rate of different international currencies. There are also some problems where decision trees will not do as well. Some very simple problems where the prediction is just a simple multiple of the predictor can be solved much more quickly and easily by linear regression. Usually the models to be built and the interactions to be detected are much more complex in real world problems and this is where decision trees excel.
The decision tree technology can be used for exploration of the dataset and business problem. This is often done by looking at the predictors and values that are chosen for each split of the tree. Often these predictors provide usable insights or propose questions that need to be answered. For instance if you ran across the following in your database for cellular phone churn you might seriously wonder about the way your telesales operators were making their calls and maybe change the way that they are compensated: "IF customer lifetime < 1.1 years AND sales channel = telesales THEN chance of churn is 65%."
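A rule of this kind translates almost word for word into code. The sketch below is illustrative only: the `Customer` class and its field names are hypothetical stand-ins for whatever columns the real database holds, and only the thresholds (1.1 years, telesales, 65%) come from the rule quoted above.

```python
from dataclasses import dataclass

@dataclass
class Customer:
    lifetime_years: float  # hypothetical field name
    sales_channel: str     # hypothetical field name

def churn_rule(customer):
    """Return the chance of churn stated by the rule, or None if the rule does not apply."""
    if customer.lifetime_years < 1.1 and customer.sales_channel == "telesales":
        return 0.65
    return None

print(churn_rule(Customer(0.5, "telesales")))  # 0.65
print(churn_rule(Customer(0.5, "direct")))     # None
print(churn_rule(Customer(3.0, "telesales")))  # None
```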
Another way that the decision tree technology has been used is for preprocessing data for other prediction algorithms. Because the algorithm is fairly robust with respect to a variety of predictor types (e.g. numeric, categorical etc.) and because it can be run relatively quickly, decision trees can be used on the first pass of a data mining run to create a subset of possibly useful predictors that can then be fed into neural networks, nearest neighbor and normal statistical routines - which can take a considerable amount of time to run if there are large numbers of possible predictors to be used in the model.
Some forms of decision trees were initially developed as exploratory tools to refine and preprocess data for more standard statistical techniques like logistic regression, but they have also been used, and are increasingly being used, for prediction. This is interesting because many statisticians will still use decision trees for exploratory analysis, effectively building a predictive model as a by-product, but then ignore the predictive model in favor of techniques that they are most comfortable with. Sometimes veteran analysts will do this even when the predictive model they exclude is superior to that produced by other techniques. With a host of new products and skilled users now appearing, this tendency to use decision trees only for exploration now seems to be changing.
The first step in the process is growing the tree. Specifically, the algorithm seeks to create a tree that works as perfectly as possible on all the data that is available. Most of the time it is not possible to have the algorithm work perfectly. There is always noise in the database to some degree (there are variables that are not being collected that have an impact on the target you are trying to predict).
The name of the game in growing the tree is finding the best possible question to ask at each branch point of the tree. At the bottom of the tree you will come up with nodes that you would like to be all of one type or the other. Thus the question “Are you over 40?” probably does not sufficiently distinguish between those who are churners and those who are not - let’s say it is 40%/60%. On the other hand, there may be a series of questions that do quite a nice job of distinguishing those cellular phone customers who will churn from those who won’t. Maybe the series of questions would be something like: “Have you been a customer for less than a year, do you have a telephone that is more than two years old and were you originally landed as a customer via telesales rather than direct sales?” This series of questions defines a segment of the customer population in which 90% churn. These are then relevant questions to be asking in relation to predicting churn.
The difference between a good question and a bad question has to do with how much the question can organize the data - or in this case, change the likelihood of a churner appearing in the customer segment. If we started off with our population being half churners and half non-churners, then a question that didn’t organize the data to some degree into one segment that was more likely to churn than the other wouldn’t be a very useful question to ask. On the other hand, a question that was very good at distinguishing between churners and non-churners - say one that split 100 customers into one segment of 50 churners and another segment of 50 non-churners - would be considered a good question. In fact, it would have decreased the “disorder” of the original segment as much as was possible.
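The idea of a question reducing “disorder” can be made concrete with a small sketch. It assumes entropy (in bits) as the disorder measure, which is one common choice and not the only one; the function names and the size-weighted averaging are illustrative assumptions rather than anything prescribed by the text.

```python
import math

def entropy(churners, non_churners):
    """Shannon entropy (in bits) of a two-class segment."""
    total = churners + non_churners
    result = 0.0
    for count in (churners, non_churners):
        if count > 0:  # an empty class contributes no disorder
            p = count / total
            result -= p * math.log2(p)
    return result

def disorder_after_split(segments):
    """Size-weighted average entropy of the segments a question produces.

    Each segment is a (churners, non_churners) pair.
    """
    total = sum(c + n for c, n in segments)
    return sum((c + n) / total * entropy(c, n) for c, n in segments)

# The original segment: 100 customers, half churners and half non-churners.
before = entropy(50, 50)

# A useless question leaves both new segments as mixed as the original.
useless = disorder_after_split([(25, 25), (25, 25)])

# A perfect question puts all churners in one segment, all others in the other.
perfect = disorder_after_split([(50, 0), (0, 50)])

print(before, useless, perfect)
```

Under this measure the useless question leaves the disorder unchanged, while the perfect question removes it entirely.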
Decision tree algorithms follow a very similar process when they build trees. These algorithms look at all the distinguishing questions that could break up the original training dataset into segments that are nearly homogeneous with respect to the different classes being predicted. Some decision tree algorithms may use heuristics to pick the questions, or even pick them at random. CART picks the questions in a very unsophisticated way: it tries them all. After it has tried them all, CART picks the best one, uses it to split the data into two more organized segments, and then again asks all possible questions on each of those new segments individually.
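The “try them all” search can be sketched as follows. This is a simplified illustration, not the CART implementation: it assumes numeric predictors only, questions of the form “is predictor <= value?”, and the Gini measure of disorder. The record layout and the names `gini` and `best_split` are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(records, target):
    """Try every 'predictor <= value' question and return the best one.

    Returns (weighted_impurity, predictor, value), or None if no question
    separates the records into two non-empty segments.
    """
    best = None
    for predictor in records[0]:
        if predictor == target:
            continue
        for value in sorted({r[predictor] for r in records}):
            left = [r[target] for r in records if r[predictor] <= value]
            right = [r[target] for r in records if r[predictor] > value]
            if not left or not right:
                continue  # this question does not split the segment
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(records)
            if best is None or score < best[0]:
                best = (score, predictor, value)
    return best

# Made-up training records for illustration only.
records = [
    {"tenure_years": 0.5, "age": 45, "churned": 1},
    {"tenure_years": 0.8, "age": 30, "churned": 1},
    {"tenure_years": 3.0, "age": 42, "churned": 0},
    {"tenure_years": 5.0, "age": 28, "churned": 0},
]
print(best_split(records, "churned"))
```

A full tree builder would call `best_split` again on each of the two resulting segments, as the paragraph above describes.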
If the decision tree algorithm just continued growing the tree like this, it could conceivably create more and more questions and branches in the tree until eventually there was only one record in each segment. Letting the tree grow to this size is both computationally expensive and unnecessary. Most decision tree algorithms stop growing the tree when one of three criteria is met:
The segment contains only one record. (There is no further question that you could ask which could further refine a segment of just one.)
All the records in the segment have identical characteristics. (There is no reason to continue asking further questions since all the remaining records are the same.)
The improvement is not substantial enough to warrant making the split.
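The three stopping criteria above can be expressed as a single check. This is a minimal sketch: the record layout, the function name `should_stop` and the `min_improvement` threshold are assumptions for illustration, and real algorithms choose their own improvement measure and cutoff.

```python
def should_stop(records, target, best_improvement, min_improvement=0.01):
    """Return True if a segment should not be split any further."""
    # Criterion 1: the segment contains only one record.
    if len(records) <= 1:
        return True
    # Criterion 2: all records have identical predictor values.
    predictors = [{k: v for k, v in r.items() if k != target} for r in records]
    if all(p == predictors[0] for p in predictors):
        return True
    # Criterion 3: the best available split does not improve things enough.
    return best_improvement < min_improvement

identical = [
    {"age": 27, "eyes": "Blue", "salary": 80000, "churned": True},
    {"age": 27, "eyes": "Blue", "salary": 80000, "churned": False},
]
different = [
    {"age": 27, "eyes": "Blue", "salary": 80000, "churned": True},
    {"age": 52, "eyes": "Brown", "salary": 45000, "churned": False},
]
print(should_stop(identical, "churned", best_improvement=0.5))
print(should_stop(different, "churned", best_improvement=0.5))
```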
Consider the following example, shown in Table 2.1, of a segment that we might want to split further and which has just two examples. It has been created out of a much larger customer database by selecting only those customers aged 27 with blue eyes and salaries between $80,000 and $81,000.
Name | Age | Eyes | Salary | Churned?
Steve | 27 | Blue | $80,000 | Yes
Alex | 27 | Blue | $80,000 | No
Table 2.1 Decision tree algorithm segment. This segment cannot be split further except by using the predictor "name".
In this case all of the possible questions that could be asked about the two customers turn out to have the same value (age, eyes, salary) except for name. It would then be possible to ask a question like “Is the customer’s name Steve?” and create segments that would be very good at breaking apart those who churned from those who did not.
The problem is that we all have an intuition that the name of the customer is not going to be a very good indicator of whether that customer churns or not. It might work well for this particular two-record segment, but it is unlikely that it will work for other customer databases or even the same customer database at a different time. This particular example has to do with overfitting the model - in this case fitting the model too closely to the idiosyncrasies of the training data. This can be fixed later on, but clearly stopping the building of the tree short of either one-record segments or very small segments in general is a good idea.
After the tree has been grown to a certain size (depending on the particular stopping criteria used in the algorithm), the CART algorithm has still more work to do. The algorithm then checks to see if the model has been overfit to the data. It can do this in several ways, using a cross validation approach or a test set validation approach. Basically, it uses the same mind-numbingly simple approach it used to find the best questions in the first place - namely trying many different simpler versions of the tree on a held-aside test set. The tree that does the best on the held-aside data is selected by the algorithm as the best model. The nice thing about CART is that this testing and selection is all an integral part of the algorithm, as opposed to the after-the-fact approach that other techniques use.
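The selection step can be sketched as follows, assuming test set validation. This is not CART's actual pruning procedure, only the final "pick the version that does best on held-aside data" idea. The candidate trees are stand-in functions and the held-aside records are made up for illustration.

```python
def accuracy(predict, records):
    """Fraction of held-aside records that a candidate tree predicts correctly."""
    correct = sum(1 for features, label in records if predict(features) == label)
    return correct / len(records)

def select_tree(candidates, held_out):
    """Return the name of the candidate that does best on the held-aside data."""
    return max(candidates, key=lambda name: accuracy(candidates[name], held_out))

# Hypothetical candidates: each maps a customer record to "will churn?".
candidates = {
    # Overfit tree that memorized the name from the two-record segment.
    "full_tree": lambda c: c["name"] == "Steve",
    # Simpler version that asks only about customer tenure.
    "pruned_tree": lambda c: c["tenure_years"] < 1.0,
    # Simplest version: always predicts "no churn".
    "stump": lambda c: False,
}

# Held-aside records: (features, churned?)
held_out = [
    ({"name": "Maria", "tenure_years": 0.4}, True),
    ({"name": "Steve", "tenure_years": 6.0}, False),
    ({"name": "Priya", "tenure_years": 0.7}, True),
    ({"name": "Chen", "tenure_years": 3.2}, False),
]

best = select_tree(candidates, held_out)
print(best)
```

The tree that relied on the customer's name does worst on the held-aside records, which is the overfitting effect described above.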
In the late 1970s J. Ross Quinlan introduced a decision tree algorithm named ID3. It was one of the first decision tree algorithms, yet at the same time it built solidly on work that had been done on inference systems and concept learning systems from that decade as well as the preceding decade. Initially ID3 was used for tasks such as learning good game playing strategies for chess end games. Since then ID3 has been applied to a wide variety of problems in both academia and industry and has been modified, improved and borrowed from many times over.
ID3 picks predictors and their splitting values based on the gain in information that the split or splits provide. Gain represents the difference between the amount of information that is needed to correctly make a prediction before a split is made and after the split has been made. If the amount of information required is much lower after the split is made, then that split has decreased the disorder of the original single segment. Gain is defined as the difference between the entropy of the original segment and the accumulated entropies of the resulting split segments.
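Written out, the gain described above is commonly formulated as shown here. This is the standard textbook form rather than a quotation from the source. The symbols are assumptions introduced for illustration: S is the original segment, S_i are the segments produced by the split, p_c is the fraction of records in S belonging to class c, and the "accumulated" entropies are taken to be weighted by segment size.

```latex
H(S) = -\sum_{c} p_c \log_2 p_c
\qquad
\mathrm{Gain}(S) = H(S) - \sum_{i} \frac{|S_i|}{|S|}\, H(S_i)
```

With these definitions, a split that leaves every segment as mixed as the original has a gain of zero, and a split into perfectly pure segments has a gain equal to H(S).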
ID3 was later enhanced in the version called C4.5. C4.5 improves on ID3 in several important areas:
predictors with missing values can still be used
predictors with continuous values can be used
pruning is introduced
rule derivation
Many of these techniques, plus some others, appear in the CART algorithm, so we will introduce them in the discussion of CART.
CART stands for Classification and Regression Trees and is a data exploration and prediction algorithm developed by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone. It is nicely detailed in their 1984 book “Classification and Regression Trees” ([Breiman, Friedman, Olshen and Stone 1984]). These researchers from
Predictors are picked as they decrease the disorder of the data.
In building the CART tree each predictor is picked based on how well it teases apart the records with different predictions. For instance, one measure that is used to determine whether a given split point for a given predictor is better than another is the entropy metric. The measure originated from the work done by Claude Shannon and Warren Weaver on information theory in 1949. They were concerned with how information could be efficiently communicated over telephone lines. Interestingly, their results also prove useful in creating decision trees.
One of the great advantages of CART is that the algorithm has the validation of the model and the discovery of the optimally general model built deeply into the algorithm. CART accomplishes this by building a very complex tree and then pruning it back to the optimally general tree based on the results of cross validation or test set validation. The tree is pruned back based on the performance of the various pruned versions of the tree on the test set data. The most complex tree rarely fares the best on the held-aside data, as it has been overfitted to the training data. By using cross validation, the tree that is most likely to do well on new, unseen data can be chosen.
The CART algorithm is relatively robust with respect to missing data. If the value is missing for a particular predictor in a particular record, that record will not be used in making the determination of the optimal split when the tree is being built. In effect, CART utilizes as much information as it has on hand in order to make the decision for picking the best possible split.
When CART is being used to predict on new data, missing values can be handled via surrogates. Surrogates are split values and predictors that mimic the actual split in the tree and can be used when the data for the preferred predictor is missing. For instance, though shoe size is not a perfect predictor of height, it could be used as a surrogate to try to mimic a split based on height when that information was missing from the particular record being predicted with the CART model.
Another decision tree technology, as popular as CART, is CHAID, or Chi-Square Automatic Interaction Detector. CHAID is similar to CART in that it builds a decision tree, but it differs in the way that it chooses its splits. Instead of the entropy or Gini metrics for choosing optimal splits, the technique relies on the chi square test used in contingency tables to determine which categorical predictor is furthest from independence with the prediction values.
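The chi square statistic behind this test can be computed directly from a contingency table. The sketch below shows only the statistic (Pearson's chi square), not the full CHAID procedure with its significance testing and category merging. The counts are made up for illustration, and the code assumes every row and column total is nonzero.

```python
def chi_square(table):
    """Pearson chi square statistic for a contingency table (list of rows).

    Larger values mean the predictor is further from independence with
    the prediction value. Assumes no row or column sums to zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if predictor and target were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Rows: sales channel (telesales, direct). Columns: churned, did not churn.
by_channel = [[65, 35], [35, 65]]
# Rows: eye color (blue, brown). Columns: churned, did not churn.
by_eye_color = [[50, 50], [50, 50]]

print(chi_square(by_channel), chi_square(by_eye_color))
```

In this made-up data, sales channel is far from independent of churn while eye color is exactly independent, so a chi square based method would prefer to split on sales channel.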
Because CHAID relies on the contingency tables to form its test of significance for each predictor, all predictors must either be categorical or be coerced into a categorical form via binning (e.g. break up possible people ages into 10 bins from 0-9, 10-19, 20-29 etc.). Though this binning can have deleterious consequences, the actual accuracy performances of CART and CHAID have been shown to be comparable in real world direct marketing response models.
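The binning step in the example above might look like the following sketch. The function name `age_bin` and the fixed-width scheme are illustrative assumptions; other binning schemes, such as equal-frequency bins, are also possible.

```python
def age_bin(age, width=10):
    """Map a numeric age to a categorical bin label such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

ages = [9, 27, 40, 63]
print([age_bin(a) for a in ages])
```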
When data mining algorithms are talked about these days, most of the time people are talking about either decision trees or neural networks. Of the two, neural networks have probably been of greater interest through the formative stages of data mining technology. As we will see, neural networks do have disadvantages that can be limiting in their ease of use and ease of deployment, but they do also have some significant advantages. Foremost among these advantages is their highly accurate predictive models that can be applied across a large number of different types of problems.
To be more precise with the term “neural network” one might better speak of an “artificial neural network”. True neural networks are biological systems (a.k.a. brains) that detect patterns, make predictions and learn. The artificial ones are computer programs implementing sophisticated pattern detection and machine learning algorithms on a computer to build predictive models from large historical databases. Artificial neural networks derive their name from their historical development, which started off with the premise that machines could be made to “think” if scientists found ways to mimic the structure and functioning of the human brain on the computer. Thus historically neural networks grew out of the community of Artificial Intelligence rather than from the discipline of statistics. Despite the fact that scientists are still far from understanding the human brain, let alone mimicking it, neural networks that run on computers can do some of the things that people can do.
It is difficult to say exactly when the first “neural network” on a computer was built. During World War II a seminal paper was published by McCulloch and Pitts which first outlined the idea that simple processing units (like the individual neurons in the human brain) could be connected together in large networks to create a system that could solve difficult problems and display behavior that was much more complex than the simple pieces that made it up. Since that time much progress has been made in finding ways to apply artificial neural networks to real world prediction problems and in improving the performance of the algorithm in general. In many respects the greatest breakthroughs in neural networks in recent years have been in their application to more mundane real world problems like customer response prediction or fraud detection, rather than the loftier goals that were originally set out for the techniques such as overall human learning and computer speech and image understanding.
Because of the origins of the techniques and because of some of their early successes, the techniques have enjoyed a great deal of interest. To understand how neural networks can detect patterns in a database, an analogy is often made that they “learn” to detect these patterns and make better predictions in a similar way to the way that human beings do. This view is encouraged by the way the historical training data is often supplied to the network - one record (example) at a time. Neural networks do “learn” in a very real sense, but under the hood the algorithms and techniques that are being deployed are not truly different from the techniques found in statistics or other data mining algorithms. It is, for instance, unfair to assume that neural networks could outperform other techniques because they “learn” and improve over time while the other techniques are static. The other techniques in fact “learn” from historical examples in exactly the same way, but often the examples (historical records) to learn from are processed all at once in a more efficient manner than in neural networks, which often modify their model one record at a time.
A common claim for neural networks is that they are automated to a degree where the user does not need to know much about how they work, about predictive modeling, or even about the database in order to use them. The implicit claim is also that most neural networks can be unleashed on your data straight out of the box without having to rearrange or modify the data very much to begin with.
Just the opposite is often true. There are many important design decisions that need to be made in order to effectively use a neural network, such as:
How should the nodes in the network be connected?
How many neuron-like processing units should be used?
When should “training” be stopped in order to avoid overfitting?
There are also many important steps required for preprocessing the data that goes into a neural network - most often there is a requirement to normalize numeric data between 0.0 and 1.0, and categorical predictors may need to be broken up into virtual predictors that are 0 or 1 for each value of the original categorical predictor. And, as always, understanding what the data in your database means and a clear definition of the business problem to be solved are essential to ensuring eventual success. The bottom line is that neural networks provide no shortcuts.
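The two preprocessing steps mentioned above can be sketched as follows. Min-max scaling is one common way to normalize to the 0.0 to 1.0 range and is an assumption here, as are the function names and the sample values.

```python
def min_max_normalize(values):
    """Rescale numeric values so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(value, categories):
    """Break a categorical value into virtual 0/1 predictors, one per category."""
    return [1 if value == category else 0 for category in categories]

incomes = [20, 40, 60]  # made-up incomes, in thousands of dollars
colors = ["blue", "white", "black"]

print(min_max_normalize(incomes))
print(one_hot("white", colors))
```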
Neural networks are very powerful predictive modeling techniques, but some of the power comes at the expense of ease of use and ease of deployment. As we will see in this section, neural networks create very complex models that are almost always impossible to fully understand, even by experts. The model itself is represented by numeric values in a complex calculation that requires all of the predictor values to be in the form of a number. The output of the neural network is also numeric and needs to be translated if the actual prediction value is categorical (e.g. predicting the demand for blue, white or black jeans for a clothing manufacturer requires that the values blue, black and white for color be converted to numbers).
Because of the complexity of these techniques, much effort has been expended in trying to increase the clarity with which the model can be understood by the end user. These efforts are still in their infancy but are of tremendous importance, since most data mining techniques including neural networks are being deployed against real business problems where significant investments are made based on the predictions from the models (e.g. consider trusting the predictive model from a neural network that dictates which one million customers will receive a $1 mailing).
There are two ways that these shortcomings in understanding the meaning of the neural network model have been successfully addressed:
The neural network is packaged up into a complete solution such as fraud prediction. This allows the neural network to be carefully crafted for one particular application, and once it has been proven successful it can be used over and over again without requiring a deep understanding of how it works.
The neural network is packaged up with expert consulting services. Here the neural network is deployed by trusted experts who have a track record of success. Either the experts are able to explain the models or they are trusted that the models do work.
The first tactic has seemed to work quite well because when the technique is used for a well defined problem, many of the difficulties in preprocessing the data can be automated (because the data structures have been seen before) and interpretation of the model is less of an issue, since entire industries begin to use the technology successfully and a level of trust is created. There are several vendors who have deployed this strategy (e.g. HNC’s Falcon system for credit card fraud prediction and Advanced Software Applications’ ModelMAX package for direct marketing).
Packaging up neural networks with expert consultants is also a viable strategy that avoids many of the pitfalls of using neural networks, but it can be quite expensive because it is human intensive. One of the great promises of data mining is, after all, the automation of the predictive modeling process. These neural network consulting teams are little different from the analytical departments many companies already have in house. Since there is not a great difference in the overall predictive accuracy of neural networks over standard statistical techniques, the main difference becomes the replacement of the statistical expert with the neural network expert. With either statistics or neural network experts, the value of putting easy to use tools into the hands of the business end user is still not achieved.
Neural networks are used in a wide variety of applications. They have been used in all facets of business, from detecting the fraudulent use of credit cards and credit risk prediction to increasing the hit rate of targeted mailings. They also have a long history of application in other areas, ranging from military uses such as the automated driving of an unmanned vehicle at 30 miles per hour on paved roads to biological simulations such as learning the correct pronunciation of English words from written text.
Neural networks of various kinds can be used for clustering and prototype creation. The Kohonen network described in this section is probably the most common network used for clustering and segmentation of the database. Typically the networks are used in an unsupervised learning mode to create the clusters. The clusters are created by forcing the system to compress the data by creating prototypes, or by algorithms that steer the system toward creating clusters that compete against each other for the records that they contain, thus ensuring that the clusters overlap as little as possible.
Sometimes clustering is performed not so much to keep records together as to make it easier to see when one record sticks out from the rest. For instance:
Most wine distributors selling inexpensive wine in
A sale on men’s suits is being held in all branches of a department store for southern
One of the important problems in all of data mining is that of determining which predictors are the most relevant and the most important in building models that are most accurate at prediction. These predictors may be used by themselves or they may be used in conjunction with other predictors to form “features”. A simple example of a feature in problems that neural networks are working on is the feature of a vertical line in a computer image. The predictors, or raw input data, are just the colored pixels that make up the picture. Recognizing that the predictors (pixels) can be organized in such a way as to create lines, and then using the line as the input predictor, can dramatically improve the accuracy of the model and decrease the time to create it.
Some features, like lines in computer images, are things that humans are already pretty good at detecting; in other problem domains it is more difficult to recognize the features. One novel way that neural networks have been used to detect features is the idea that features are a sort of compression of the training database. For instance, you could describe an image to a friend by rattling off the color and intensity of each pixel at every point in the picture, or you could describe it at a higher level in terms of lines and circles - or maybe even at a higher level of features such as trees, mountains etc. In either case your friend eventually gets all the information that they need in order to know what the picture looks like, but certainly describing it in terms of high level features requires much less communication of information than the “paint by numbers” approach of describing the color on each square millimeter of the image.
If we think of features in this way, as an efficient way to communicate our data, then neural networks can be used to automatically extract them. The neural network shown in Figure 2.2 is used to extract features by requiring the network to learn to recreate the input data at the output nodes by using just 5 hidden nodes. Consider that if you were allowed 100 hidden nodes, recreating the data for the network would be rather trivial - simply pass the input node value directly through the corresponding hidden node and on to the output node. But as there are fewer and fewer hidden nodes, that information has to be passed through the hidden layer in a more and more efficient manner, since there are fewer hidden nodes to help pass along the information.
Figure 2.2 Neural networks can be used for data compression and feature extraction.
In order to accomplish this, the neural network tries to have the hidden nodes extract features from the input nodes that efficiently describe the record represented at the input layer. This forced “squeezing” of the data through the narrow hidden layer forces the neural network to extract only those predictors and combinations of predictors that are best at recreating the input record. The link weights used to create the inputs to the hidden nodes are effectively creating features that are combinations of the input node values.
A neural network is loosely based on how some people believe that the human brain is organized and how it learns. Given that, there are two main structures of consequence in the neural network:
The node - which loosely corresponds to the neuron in the human brain.
The link - which loosely corresponds to the connections between neurons (axons, dendrites and synapses) in the human brain.
In Figure 2.3 there is a drawing of a simple neural network. The round circles represent the nodes and the connecting lines represent the links. The neural network functions by accepting predictor values at the left and performing calculations on those values to produce new values in the node at the far right. The value at this node represents the prediction from the neural network model. In this case the network takes in values for the predictors age and income and predicts whether the person will default on a bank loan.
Figure 2.3 A simplified view of a neural network for prediction of loan default.
In order to make a prediction, the neural network accepts the values for the predictors on what are called the input nodes. These become the values for those nodes. Those values are then multiplied by values that are stored in the links (sometimes called weights, and in some ways similar to the weights that were applied to predictors in the nearest neighbor method). These values are then added together at the node at the far right (the output node), a special thresholding function is applied, and the resulting number is the prediction. In this case, if the resulting number is 0 the record is considered to be a good credit risk (no default); if the number is 1 the record is considered to be a bad credit risk (likely default).
A simplified version of the calculations made in Figure 2.3 might look like what is shown in Figure 2.4. Here the age of 47 is normalized to fall between 0.0 and 1.0 and has the value 0.47, and the income is normalized to the value 0.65. This simplified neural network makes the prediction of no default for a 47 year old making $65,000. The links are weighted at 0.7 and 0.1, and the resulting value after multiplying the node values by the link weights and adding the results together is 0.39. The network has been trained to learn that an output value of 1.0 indicates default and that 0.0 indicates non-default. The output value calculated here (0.39) is closer to 0.0 than to 1.0, so the record is assigned a non-default prediction.
Figure 2.4 The normalized input values are multiplied by the link weights and added together at the output.
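The arithmetic in this example can be reproduced in a few lines. The weights 0.7 and 0.1 and the inputs come from the text; the scaling constants (dividing age by 100 and income by 100,000) and the 0.5 cutoff for "closer to 0.0 than to 1.0" are assumptions chosen to match the numbers given, and the function name is hypothetical.

```python
def predict_default(age, income, w_age=0.7, w_income=0.1, threshold=0.5):
    """Weighted sum of normalized inputs, thresholded to a 0/1 prediction."""
    age_normalized = age / 100.0             # 47 -> 0.47 (assumed scaling)
    income_normalized = income / 100_000.0   # 65,000 -> 0.65 (assumed scaling)
    output = age_normalized * w_age + income_normalized * w_income
    prediction = 1 if output >= threshold else 0  # 1 = default, 0 = no default
    return output, prediction

output, prediction = predict_default(47, 65_000)
print(round(output, 2), prediction)
```

The weighted sum is 0.47 x 0.7 + 0.65 x 0.1 = 0.329 + 0.065 = 0.394, which rounds to the 0.39 quoted in the text and falls on the non-default side of the cutoff.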
The neural network model is created by presenting it with many examples of the predictor values from records in the training set (in this example age and income are used) and the prediction value from those same records. By comparing the correct answer obtained from the training record and the predicted answer from the neural network, it is possible to slowly change the behavior of the neural network by changing the values of the link weights. In some ways this is like having a grade school teacher ask questions of her student (a.k.a. the neural network) and, if the answer is wrong, verbally correct the student. The greater the error, the harsher the verbal correction, so that large errors are given greater attention at correction than are small errors.
For the actual neural network it is the weights of the links that actually control the prediction value for a given record. Thus the particular model that is being found by the neural network is in fact fully determined by the weights and the architectural structure of the network. For this reason it is the link weights that are modified each time an error is made.
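A minimal sketch of this kind of error-driven weight correction follows. It assumes the simple delta rule for a network with no hidden layer and no thresholding during training, which is one common formulation and not necessarily the one any particular product uses. The learning rate of 0.5 and the function name are arbitrary illustrative choices.

```python
def update_weights(weights, inputs, target, learning_rate=0.5):
    """Nudge each link weight in proportion to the error and to its input.

    Returns the new weights and the error that prompted the correction.
    """
    output = sum(w * x for w, x in zip(weights, inputs))
    error = target - output  # larger errors produce larger corrections
    new_weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
    return new_weights, error

weights = [0.7, 0.1]     # link weights from the loan default example
inputs = [0.47, 0.65]    # normalized age and income
target = 0.0             # the training record did not default

new_weights, error_before = update_weights(weights, inputs, target)
_, error_after = update_weights(new_weights, inputs, target)
print(error_before, error_after)
```

After one correction the same record produces a smaller error, and repeating the step over many records is what the text describes as training.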
The models shown in the figures above have been designed to be as simple as possible in order to make them understandable. In practice no networks are as simple as these. Networks with many more links and many more nodes are possible. This was the case in the architecture of a neural network system called NETtalk that learned how to pronounce written English words. Each node in this network was connected to every node in the level above it and below it, resulting in 18,629 link weights that needed to be learned in the network.
In this network there was a row of nodes in between the input nodes and the output nodes. These are called hidden nodes or the hidden layer because the values of these nodes are not visible to the end user the way that the values of the output nodes (which contain the prediction) and the input nodes (which just contain the predictor values) are. There are even more complex neural network architectures that have more than one hidden layer. In practice, however, one hidden layer seems to suffice.
The meanings of the input nodes and the output nodes are usually pretty well understood - and are usually defined by the end user based on the particular problem to be solved and the nature and structure of the database. The hidden nodes, however, do not have a predefined meaning and are determined by the neural network as it trains. This poses two problems:
It is difficult to trust the prediction of the neural network if the meaning of these nodes is not well understood.
Since the prediction is made at the output layer and the difference between the prediction and the actual value is calculated there, how is this error correction fed back through the hidden layers to modify the link weights that connect them?
The meaning of these hidden nodes is not necessarily well understood, but sometimes after the fact they can be examined to see when they are active and when they are not, and some meaning can be derived from them.
The learning procedure for the neural network has been defined to work for the weights in the links connecting the hidden layer. A good metaphor for how this works is to think of a military operation in some war where there are many layers of command, with a general ultimately responsible for making the decisions on where to advance and where to retreat. The general probably has several lieutenant generals advising him, and each lieutenant general probably has several major generals advising him. This hierarchy continues downward through colonels, with privates at the bottom of the hierarchy.
This is not too far from the structure of a neural network with several hidden layers and one output node. You can think of the inputs coming from the hidden nodes as advice. The link weight corresponds to the trust that the general has in his advisors. Some trusted advisors have very high weights and some advisors may not be trusted and in fact have negative weights. The other part of the advice from the advisors has to do with how competent the particular advisor is for a given situation. The general may have a trusted advisor, but if that advisor has no expertise in aerial invasion and the question at hand has to do with a situation involving the air force, this advisor may be very well trusted but may not have any strong opinion one way or another.
In this analogy the link weight of a neural network to an output unit is like the trust or confidence that a commander has in his advisors, and the actual node value represents how strong an opinion this particular advisor has about this particular situation. To make a decision the general considers how trustworthy and valuable the advice is and how knowledgeable and confident each advisor is in making their suggestion, and then, taking all of this into account, the general makes the decision to advance or retreat.
In the same way the output node will make a decision (a prediction) by taking into account all of the input from its advisors (the nodes connected to it). In the case of the neural network this decision is reached by multiplying the link weight by the output value of the node and summing these values across all nodes. If the prediction is incorrect, the nodes that had the most influence on making the decision have their weights modified so that the wrong prediction is less likely to be made the next time.
This learning in the neural network is very similar to what happens when the wrong decision is made by the general. The confidence that the general has in all of those advisors who gave the wrong recommendation is decreased - and all the more so for those advisors who were very confident and vocal in their recommendation. On the other hand, any advisors who were making the correct recommendation but whose input was not taken as seriously would be taken more seriously the next time. Likewise, any advisor who was reprimanded for giving the wrong advice to the general would then go back to his own advisors and determine which of them he had trusted more than he should have in making his recommendation and whom he should have listened to more closely.
This feedback can continue in this way down throughout the organization - at each level giving increased emphasis to those advisors who had advised correctly and decreased emphasis to those who had advised incorrectly. In this way the entire organization becomes better and better at supporting the general in making the correct decision more of the time.
A very similar method of training takes place in the neural network. It is called “back propagation” and refers to the propagation of the error backwards from the output nodes (where the error is easy to determine as the difference between the actual prediction value from the training database and the prediction from the neural network) through the hidden layers and to the input layers. At each level the link weights between the layers are updated so as to decrease the chance of making the same mistake again.
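The feedback described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of any particular product: the network has two inputs, two hidden nodes and one output node, each node squashes its weighted sum with a sigmoid function, there are no bias terms, and the starting weights, the training record and the learning rate `lr` are made up for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Each node sums (link weight * incoming value) and squashes the result."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output

def backprop_step(x, target, w_hidden, w_out, lr=0.5):
    """One gradient descent update of all link weights for a single record."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error at the output node, where prediction and actual value are compared.
    delta_out = (output - target) * output * (1.0 - output)
    # Blame passed back to each hidden node in proportion to its link weight.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    new_w_out = [w - lr * delta_out * h for w, h in zip(w_out, hidden)]
    new_w_hidden = [[w - lr * d * xi for w, xi in zip(ws, x)]
                    for ws, d in zip(w_hidden, delta_hidden)]
    return new_w_hidden, new_w_out

# Hypothetical record and starting weights (assumptions for illustration).
x, target = [1.0, 0.0], 1.0
w_hidden = [[0.2, -0.3], [0.4, 0.1]]
w_out = [-0.5, 0.3]

_, before = forward(x, w_hidden, w_out)
for _ in range(20):
    w_hidden, w_out = backprop_step(x, target, w_hidden, w_out)
_, after = forward(x, w_hidden, w_out)
```

After the repeated updates the prediction `after` lies closer to the target than `before`, which is the sense in which the same mistake becomes less likely.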
There are literally hundreds of variations on the back propagation feedforward neural networks that have been briefly described here. Most have to do with changing the architecture of the neural network to include recurrent connections, where the output from the output layer is connected back as input into the hidden layer. These recurrent nets are sometimes used for sequence prediction, where the previous outputs from the network need to be stored someplace and then fed back into the network to provide context for the current prediction. Recurrent networks have also been used for decreasing the amount of time that it takes to train the neural network.
Another twist on the neural net theme is to change the way that the network learns. Back propagation is effectively utilizing a search technique called gradient descent to search for the best possible improvement in the link weights to reduce the error. There are, however, many other ways of doing search in a high dimensional space, including Newton’s method and conjugate gradient, as well as simulating the physics of cooling metals in a process called simulated annealing, or simulating the search process that goes on in biological evolution and using genetic algorithms to optimize the weights of the neural networks. It has even been suggested that creating a large number of neural networks with randomly weighted links and picking the one with the lowest error rate would be the best learning procedure.
Despite all of these choices, the back propagation learning procedure is the most commonly used. It is well understood, relatively simple, and seems to work in a large number of problem domains. There are, however, two other neural network architectures that are used relatively often. Kohonen feature maps are often used for unsupervised learning and clustering, and Radial Basis Function networks are used for supervised learning and in some ways represent a hybrid between nearest neighbor and neural network classification.
Kohonen feature maps were developed in the 1970’s and as such were created to simulate certain brain function. Today they are used mostly to perform unsupervised learning and clustering.
Kohonen networks are feedforward neural networks generally with no hidden layer. The networks generally contain only an input layer and an output layer, but the nodes in the output layer compete amongst themselves to display the strongest activation to a given record. This is sometimes called “winner take all”.
The networks originally came about when some of the puzzling yet simple behaviors of real neurons were taken into account, namely that the physical locality of the neurons seems to play an important role in the behavior and learning of neurons.
When these networks were run in order to simulate the real world visual system, it became apparent that the organization that was automatically being constructed on the data was also very useful for segmenting and clustering the training database. Each output node represented a cluster, and nearby clusters were nearby in the two dimensional output layer. Each record in the database would fall into one and only one cluster (the most active output node), but the other clusters in which it might also fit would be shown and would be likely to be next to the best matching cluster.
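The “winner take all” competition and the role of locality can be sketched as follows. This is a simplified illustration and not the original Kohonen formulation: the output layer is one dimensional rather than two dimensional, the “most active” node is assumed to be the one whose weight vector is closest to the record, and the learning rate `lr`, the neighborhood `radius` and the sample values are assumptions.

```python
def best_matching_node(record, node_weights):
    """Index of the output node whose weight vector is closest to the record."""
    def squared_distance(weights):
        return sum((w - r) ** 2 for w, r in zip(weights, record))
    return min(range(len(node_weights)),
               key=lambda i: squared_distance(node_weights[i]))

def train_step(record, node_weights, lr=0.5, radius=1):
    """Move the winning node and its neighbors in the layer toward the record."""
    winner = best_matching_node(record, node_weights)
    for i, weights in enumerate(node_weights):
        if abs(i - winner) <= radius:  # physical locality in the output layer
            node_weights[i] = [w + lr * (r - w) for w, r in zip(weights, record)]
    return winner

# Three output nodes laid out in a line, each with a two-value weight vector.
nodes = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
winner = train_step([0.9, 0.8], nodes)
```

Here the third node wins the record, and it and its immediate neighbor move toward the record while the distant first node is left unchanged, which is what causes nearby nodes to end up representing similar clusters.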
Since the inception of the idea of neural networks the ultimate goal for these techniques has been to have them recreate human thought and learning. This has once again proved to be a difficult task - despite the power of these new techniques and the similarities of their architecture to that of the human brain. Many of the things that people take for granted are difficult for neural networks - like avoiding overfitting and working with real world data without a lot of preprocessing required. There have also been some exciting successes.
As with all predictive modeling techniques, some care must be taken to avoid overfitting with a neural network. Neural networks can be quite good at overfitting training data with a predictive model that does not work well on new data. This is particularly problematic for neural networks because it is difficult to understand how the model is working. In the early days of neural networks the predictive accuracy that was often mentioned first was the accuracy on the training set, and the accuracy on the vaulted or validation set database was reported as a footnote.
This is in part due to the fact that unlike decision trees or nearest neighbor techniques, which can quickly achieve 100% predictive accuracy on the training database, neural networks can be trained forever and still not be 100% accurate on the training set. While this is an interesting fact, it is not terribly relevant since the accuracy on the training set is of little interest and can have little bearing on the validation database accuracy.
Perhaps because overfitting was more obvious for decision trees and nearest neighbor approaches, more effort was placed earlier on to add pruning and editing to these techniques. For neural networks, generalization of the predictive model is accomplished via rules of thumb and sometimes more methodically by using cross validation, as is done with decision trees.
One way to control overfitting in neural networks is to limit the number of links. Since the number of links represents the complexity of the model that can be produced, and since more complex models have the ability to overfit while less complex ones cannot, overfitting can be controlled by simply limiting the number of links in the neural network. Unfortunately there are no good theoretical grounds for picking a certain number of links.
Test set validation can be used to avoid overfitting by building the neural network on one portion of the training database and using the other portion of the training database to detect what the predictive accuracy is on vaulted data. This accuracy will peak at some point in the training, and then as training proceeds it will decrease while the accuracy on the training database will continue to increase. The link weights for the network can be saved when the accuracy on the held aside data peaks. The NeuralWare product, and others, provide an automated function that saves out the network when it is best performing on the test set and even continues to search after the minimum is reached.
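The idea of saving the link weights at the peak of held aside accuracy can be sketched generically. This is not how NeuralWare or any other product implements it; the function names, the `patience` parameter (how many further passes to keep searching after the best point) and the toy stand-ins for training and scoring are all assumptions made for illustration.

```python
import copy

def train_with_early_stopping(weights, train_one_pass, held_aside_accuracy,
                              max_passes=100, patience=2):
    """Keep the weights from the pass with the best held aside accuracy."""
    best_weights = copy.deepcopy(weights)
    best_accuracy = held_aside_accuracy(weights)
    passes_since_best = 0
    for _ in range(max_passes):
        weights = train_one_pass(weights)
        accuracy = held_aside_accuracy(weights)
        if accuracy > best_accuracy:
            best_weights = copy.deepcopy(weights)  # save out the network
            best_accuracy = accuracy
            passes_since_best = 0
        else:
            passes_since_best += 1  # keep searching a little past the peak
            if passes_since_best >= patience:
                break
    return best_weights, best_accuracy

# Toy stand-ins: "training" nudges one weight upward on every pass, and
# held aside accuracy peaks when that weight reaches 3.0.
def toy_training_pass(weights):
    return [weights[0] + 1.0]

def toy_held_aside_accuracy(weights):
    return 1.0 - 0.1 * abs(weights[0] - 3.0)

best_weights, best_accuracy = train_with_early_stopping(
    [0.0], toy_training_pass, toy_held_aside_accuracy)
```

With these stand-ins training runs two passes beyond the peak and then stops, and the weights returned are the ones saved at the peak rather than the last ones produced.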
One of the indictments against neural networks is that it is difficult to understand the model that they have built and also how the raw data affects the output predictive answer. With nearest neighbor techniques prototypical records are provided to “explain” why the prediction is made, and decision trees provide rules that can be translated into English to explain why a particular prediction was made for a particular record. The complex models of the neural network are captured solely by the link weights in the network, which represent a very complex mathematical equation.
There have been several attempts to alleviate these basic problems of the neural network. The simplest approach is to actually look at the neural network and try to create plausible explanations for the meanings of the hidden nodes. Sometimes this can be done quite successfully. In the example given at the beginning of this section the hidden nodes of the neural network seemed to have extracted important distinguishing features in predicting the relationship between people by extracting information like country of origin - features that it would seem a person would also extract and use for the prediction. But there were also many other hidden nodes, even in this particular example, that were hard to explain and didn’t seem to have any particular purpose, except that they aided the neural network in making the correct prediction.
Rule induction is one of the major forms of data mining and is perhaps the most common form of knowledge discovery in unsupervised learning systems. It is also perhaps the form of data mining that most closely resembles the process that most people think about when they think about data mining, namely “mining” for gold through a vast database. The gold in this case would be a rule that is interesting - that tells you something about your database that you didn’t already know and probably weren’t able to explicitly articulate (aside from saying “show me things that are interesting”).
Rule induction on a database can be a massive undertaking where all possible patterns are systematically pulled out of the data and then an accuracy and significance are added to them that tell the user how strong the pattern is and how likely it is to occur again. In general these rules are relatively simple; for a market basket database of items scanned in a consumer market basket, you might find interesting correlations such as:
If bagels are purchased then cream cheese is purchased 90% of the time and this pattern occurs in 3% of all shopping baskets.
If live plants are purchased from a hardware store then plant fertilizer is purchased 60% of the time and these two items are bought together in 6% of the shopping baskets.
The rules that are pulled from the database are extracted and ordered to be presented to the user based on the percentage of times that they are correct and how often they apply.
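A minimal sketch of this extraction and ordering, restricted to rules with a single item on each side, might look like the following. The five baskets are invented for illustration, and coverage is taken here as the share of baskets to which the antecedent applies (some systems instead report the share of baskets containing both items); real rule induction systems use far more efficient search than this brute-force loop.

```python
from itertools import permutations

def single_item_rules(baskets):
    """Return (antecedent, consequent, accuracy, coverage) for each item pair."""
    total = len(baskets)
    items = sorted(set().union(*baskets))
    rules = []
    for antecedent, consequent in permutations(items, 2):
        with_antecedent = sum(1 for b in baskets if antecedent in b)
        with_both = sum(1 for b in baskets
                        if antecedent in b and consequent in b)
        if with_both == 0:
            continue  # the pattern never occurs, so there is no rule to report
        accuracy = with_both / with_antecedent  # how often the rule is correct
        coverage = with_antecedent / total      # how often the rule applies
        rules.append((antecedent, consequent, accuracy, coverage))
    # Present the most accurate rules first, breaking ties by coverage.
    rules.sort(key=lambda r: (r[2], r[3]), reverse=True)
    return rules

# Hypothetical shopping baskets (assumed data).
baskets = [
    {"bagels", "cream cheese"},
    {"bagels", "cream cheese", "coffee"},
    {"bagels", "coffee"},
    {"coffee"},
    {"milk"},
]
rules = single_item_rules(baskets)
```

Even this tiny database yields six rules, which hints at how quickly the number of rules grows with tens of thousands of items.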
The bane of rule induction systems is also their strength - that they retrieve all possible interesting patterns in the database. This is a strength in the sense that it leaves no stone unturned, but it can also be viewed as a weakness because the user can easily become overwhelmed by so many rules that it is difficult to look through all of them. You almost need a second pass of data mining to go through the list of interesting rules generated by the rule induction system in the first place in order to find the most valuable gold nugget amongst them all. This overabundance of patterns can also be problematic for the simple task of prediction: because all possible patterns are culled from the database, there may be conflicting predictions made by equally interesting rules. Automating the process of culling the most interesting rules and of combining the recommendations of a variety of rules is well handled by many of the commercially available rule induction systems on the market today and is also an area of active research.
Rule induction systems are highly automated and are probably the best of the data mining techniques for exposing all possible predictive patterns in a database. They can be modified for use in prediction problems, but the algorithms for combining evidence from a variety of rules come more from rules of thumb and practical experience.
In comparing data mining techniques along an axis of explanation, neural networks would be at one extreme of the data mining algorithms and rule induction systems at the other end. Neural networks are extremely proficient at saying exactly what must be done in a prediction task (e.g. who do I give credit to / who do I deny credit to) with little explanation. Rule induction systems when used for prediction, on the other hand, are like having a committee of trusted advisors, each with a slightly different opinion as to what to do but relatively well grounded reasoning and a good explanation for why it should be done.
The business value of rule induction techniques reflects the highly automated way in which the rules are created, which makes the system easy to use, but also the fact that this approach can suffer from an overabundance of interesting patterns, which can make it complicated to make a prediction that is directly tied to return on investment (ROI).
In rule induction systems the rule itself is of the simple form “if this and this and this then this”. For example, a rule that a supermarket might find in the data collected from its scanners would be “if pickles are purchased then ketchup is purchased”. Or:
If paper plates then plastic forks
If dip then potato chips
If salsa then tortilla chips
In order for the rules to be useful there are two pieces of information that must be supplied as well as the actual rule:
Accuracy - How often is the rule correct?
Coverage - How often does the rule apply?
Just because the pattern in the database is expressed as a rule does not mean that it is true all the time. Thus, just as in other data mining algorithms, it is important to recognize and make explicit the uncertainty in the rule. This is what the accuracy of the rule means. The coverage of the rule has to do with how much of the database the rule “covers” or applies to. Examples of these two measures for a variety of rules are shown in Table 2.2.
In some cases accuracy is called the confidence of the rule and coverage is called the support. Accuracy and coverage appear to be the preferred ways of naming these two measurements.
| Rule | Accuracy | Coverage |
| If breakfast cereal purchased then milk purchased. | 85% | 20% |
| If bread purchased then swiss cheese purchased. | 15% | 6% |
| If 42 years old and purchased pretzels and purchased dry roasted peanuts then beer will be purchased. | 95% | 0.01% |
Table 2.2 Examples of Rule Accuracy and Coverage
The rules themselves consist of two halves. The left hand side is called the antecedent and the right hand side is called the consequent. The antecedent can consist of just one condition or multiple conditions, which must all be true in order for the consequent to be true at the given accuracy. Generally the consequent is just a single condition (prediction of purchasing just one grocery store item) rather than multiple conditions. Thus rules such as “if x and y then a and b and c” are uncommon.
When the rules are mined out of the database, the rules can be used either for understanding better the business problems that the data reflects or for performing actual predictions against some predefined prediction target. Since there is both a left side and a right side to a rule (antecedent and consequent), they can be used in several ways for your business.
Target the antecedent. In this case all rules that have a certain value for the antecedent are gathered and displayed to the user. For instance, a grocery store may request all rules that have nails, bolts or screws as the antecedent in order to try to understand whether discontinuing the sale of these low margin items will have any effect on other higher margin items. For instance, maybe people who buy nails also buy expensive hammers but wouldn’t do so at the store if the nails were not available.
Target the consequent. In this case all rules that have a certain value for the consequent can be used to understand what is associated with the consequent and perhaps what affects the consequent. For instance, it might be useful to know all of the interesting rules that have “coffee” in their consequent. These rules may well identify the items that affect the purchase of coffee, which a store owner may want to place close to the coffee in order to increase the sale of both items. Or they might be the rules that the coffee manufacturer uses to determine in which magazine to place their next coupons.
Target based on accuracy. Sometimes the most important thing for a user is the accuracy of the rules that are being generated. Highly accurate rules of 80% or 90% imply strong relationships that can be exploited even if they have low coverage of the database and only occur a limited number of times. For instance, a rule that has only 0.1% coverage but 95% accuracy can be applied only one time out of one thousand, but it will very likely be correct. If this one time is highly profitable, then it can be worthwhile. This, for instance, is how some of the most successful data mining applications work in the financial markets - looking for that limited amount of time where a very confident prediction can be made.
Target based on coverage. Sometimes users want to know what the most ubiquitous rules are, or which rules are most readily applicable. By looking at rules ranked by coverage they can quickly get a high level view of what is happening within their database most of the time.
Target based on “interestingness”. Rules are interesting when they have high coverage and high accuracy and deviate from the norm. Rules have been ranked by many different measures of interestingness so that the trade off between coverage and accuracy can be made.
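The first four ways of targeting rules amount to simple filters and sorts over the rule list. The sketch below uses the rules listed in Table 2.3 of this chapter as data; the tuple layout and the helper function names are assumptions made for illustration, and no interestingness measure is attempted here.

```python
# (antecedent, consequent, accuracy, coverage), values taken from Table 2.3.
rules = [
    ("bagels", "cream cheese", 0.80, 0.05),
    ("bagels", "orange juice", 0.40, 0.03),
    ("bagels", "coffee", 0.40, 0.02),
    ("bagels", "eggs", 0.25, 0.02),
    ("bread", "milk", 0.35, 0.30),
    ("butter", "milk", 0.65, 0.20),
    ("eggs", "milk", 0.35, 0.15),
    ("cheese", "milk", 0.40, 0.08),
]

def with_antecedent(rules, item):
    """Target the antecedent: every rule whose left hand side is the item."""
    return [r for r in rules if r[0] == item]

def with_consequent(rules, item):
    """Target the consequent: every rule whose right hand side is the item."""
    return [r for r in rules if r[1] == item]

def by_accuracy(rules):
    """Most accurate rules first."""
    return sorted(rules, key=lambda r: r[2], reverse=True)

def by_coverage(rules):
    """Most widely applicable rules first."""
    return sorted(rules, key=lambda r: r[3], reverse=True)

bagel_rules = with_antecedent(rules, "bagels")
milk_rules = with_consequent(rules, "milk")
most_accurate = by_accuracy(rules)[0]
most_ubiquitous = by_coverage(rules)[0]
```

Ranking by accuracy puts the bagels and cream cheese rule first, while ranking by coverage puts the bread and milk rule first, which shows why the two orderings answer different business questions.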
Since rule induction systems are so often used for pattern discovery and unsupervised learning, it is less easy to compare them. For example, it is very easy for just about any rule induction system to generate all possible rules; it is, however, much more difficult to devise a way to present those rules (which could easily be in the hundreds of thousands) in a way that is most useful to the end user. When interesting rules are found, they usually have been created to find relationships between many different predictor values in the database, not just one well defined target of the prediction. For this reason it is often much more difficult to assign a measure of value to the rule aside from its interestingness. For instance, it would be difficult to determine the monetary value of knowing that if people buy breakfast sausage they also buy eggs 60% of the time. For data mining systems that are more focused on prediction for things like customer attrition, targeted marketing response or risk, it is much easier to measure the value of the system and compare it to other systems and other methods for solving the problem.
It is important to recognize that even though the patterns produced from rule induction systems are delivered as if-then rules, they do not necessarily mean that the left hand side of the rule (the “if” part) causes the right hand side of the rule (the “then” part) to happen. Purchasing cheese does not cause the purchase of wine even though the rule “if cheese then wine” may be very strong.
This is particularly important to remember for rule induction systems because the results are presented in the form “if this then that”, just as many causal relationships are presented.
Typically rule induction is used on databases with either fields of high cardinality (many different values) or many columns of binary fields. The classical case of this is the supermarket basket data from store scanners that contains individual product names and quantities and may contain tens of thousands of different items with different packaging that create hundreds of thousands of SKU identifiers (Stock Keeping Units).
Sometimes in these databases the concept of a record is not easily defined within the database - consider the typical Star Schema for many data warehouses that store the supermarket transactions as separate entries in the fact table. The columns in the fact table are some unique identifier of the shopping basket (so all items can be noted as being in the same shopping basket), the quantity, the time of purchase, and whether the item was purchased with a special promotion (sale or coupon). Thus each item in the shopping basket has a different row in the fact table. This layout of the data is not typically the best for most data mining algorithms, which would prefer to have the data structured as one row per shopping basket with each column representing the presence or absence of a given item. This can be an expensive way to store the data, however, since the typical grocery store contains 60,000 SKUs or different items that could come across the checkout counter. This structure of the records can also create a very high dimensional space (60,000 binary dimensions) which would be unwieldy for many classical data mining algorithms like neural networks and decision trees. As we’ll see, several tricks are played to make this computationally feasible for the data mining algorithm while not requiring a massive reorganization of the database.
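One common way to bridge the two layouts, offered here only as an illustration and not necessarily one of the tricks the text goes on to describe, is to group the fact table rows by basket identifier and keep just the set of items present in each basket, rather than materializing 60,000 mostly empty binary columns. The row layout `(basket_id, item)` and the sample rows are assumptions.

```python
from collections import defaultdict

def baskets_from_fact_rows(fact_rows):
    """Group (basket_id, item) fact rows into one set of items per basket."""
    baskets = defaultdict(set)
    for basket_id, item in fact_rows:
        baskets[basket_id].add(item)
    return dict(baskets)

# Hypothetical fact table rows: one row per item scanned in a basket.
fact_rows = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "butter"), (2, "milk"),
    (3, "eggs"),
]
baskets = baskets_from_fact_rows(fact_rows)

# Presence or absence of an item is then a set membership test.
basket_2_has_butter = "butter" in baskets[2]
```

Each basket then costs storage proportional to the handful of items it actually contains, while still answering the presence-or-absence question a binary column would.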
The claim to fame of these rule induction systems is much more for knowledge discovery in unsupervised learning systems than it is for prediction. These systems provide both a very detailed view of the data, where significant patterns that occur only a small portion of the time can be found only by looking at the detail data, and a broad overview of the data, where some systems seek to deliver to the user an overall view of the patterns contained in the database. These systems thus display a nice combination of both micro and macro views:
Macro Level - Patterns that cover many situations are provided to the user that can be used very often and with great confidence and can also be used to summarize the database.
Micro Level - Strong rules that cover only a very few situations can still be retrieved by the system and proposed to the end user. These may be valuable if the situations that are covered are highly valuable (maybe they only apply to the most profitable customers) or represent a small but growing subpopulation which may indicate a market shift or the emergence of a new competitor (e.g. customers are only being lost in one particular area of the country where a new competitor is emerging).
After the rules are created and their interestingness is measured, there is also a call for performing prediction with the rules. Each rule by itself can perform prediction - the consequent is the target and the accuracy of the rule is the accuracy of the prediction. But because rule induction systems produce many rules for a given antecedent or consequent, there can be conflicting predictions with different accuracies. This is an opportunity for improving the overall performance of the systems by combining the rules. This can be done in a variety of ways, for example by summing the accuracies as if they were weights or just by taking the prediction of the rule with the maximum accuracy.
Table 2.3 shows how a given consequent or antecedent can be part of many rules with different accuracies and coverages. From this example consider the prediction problem of trying to predict whether milk was purchased based solely on the other items that were in the shopping basket. If the shopping basket contained only bread, then from the table we would guess that there was a 35% chance that milk was also purchased. If, however, bread and butter and eggs and cheese were purchased, what would be the prediction for milk then? 65% chance of milk because the relationship between butter and milk is the greatest at 65%? Or would all of the other items in the basket increase even further the chance of milk being purchased to well beyond 65%? Determining how to combine evidence from multiple rules is a key part of the algorithms for using rules for prediction.
| Antecedent | Consequent | Accuracy | Coverage |
| bagels | cream cheese | 80% | 5% |
| bagels | orange juice | 40% | 3% |
| bagels | coffee | 40% | 2% |
| bagels | eggs | 25% | 2% |
| bread | milk | 35% | 30% |
| butter | milk | 65% | 20% |
| eggs | milk | 35% | 15% |
| cheese | milk | 40% | 8% |
Table 2.3 Accuracy and Coverage in Rule Antecedents and Consequents
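The two simple combination schemes mentioned in the text, taking the rule with the maximum accuracy or summing the accuracies as if they were weights, can be sketched with the milk rules from Table 2.3. The function names are assumptions, and the summed value is only a score for comparing baskets, not a probability; neither scheme settles the question of how much the extra items should raise the chance of milk.

```python
# Accuracy of each "if <item> then milk" rule, taken from Table 2.3.
milk_rule_accuracy = {"bread": 0.35, "butter": 0.65, "eggs": 0.35, "cheese": 0.40}

def predict_max_accuracy(basket, rule_accuracy):
    """Use only the most accurate rule whose antecedent is in the basket."""
    fired = [acc for item, acc in rule_accuracy.items() if item in basket]
    return max(fired) if fired else None

def predict_summed_score(basket, rule_accuracy):
    """Sum the accuracies of all fired rules as if they were weights."""
    return sum(acc for item, acc in rule_accuracy.items() if item in basket)

bread_only = predict_max_accuracy({"bread"}, milk_rule_accuracy)
full_basket = {"bread", "butter", "eggs", "cheese"}
max_estimate = predict_max_accuracy(full_basket, milk_rule_accuracy)
summed_score = predict_summed_score(full_basket, milk_rule_accuracy)
```

For the full basket the maximum-accuracy scheme stays at 65% no matter how many other rules fire, while the summed score grows past 1 and so cannot be read as a chance of purchase, which illustrates why combining evidence needs more care than either shortcut provides.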
The general idea of a rule classification system is that rules are created that show the relationship between events captured in your database. These rules can be simple with just one element in the antecedent, or they might be more complicated with many column value pairs in the antecedent all joined together by a conjunction (item1 and item2 and item3 … must all occur for the antecedent to be true).
The rules are used to find interesting patterns in the database, but they are also used at times for prediction. There are two main things that are important to understanding a rule:
Accuracy - Accuracy refers to the probability that if the antecedent is true then the consequent will be true. High accuracy means that this is a rule that is highly dependable.
Coverage - Coverage refers to the number of records in the database that the rule applies to. High coverage means that the rule can be used very often and also that it is less likely to be a spurious artifact of the sampling technique or idiosyncrasies of the database.
From a business perspective accurate rules are important because they imply that there is useful predictive information in the database that can be exploited - namely that there is something far from independent between the antecedent and the consequent. The lower the accuracy, the closer the rule comes to just random guessing. If the accuracy is significantly below what would be expected from random guessing, then the negation of the antecedent may well in fact be useful (for instance, people who buy denture adhesive are much less likely to buy fresh corn on the cob than normal).
From a business perspective coverage implies how often you can use a useful rule. For instance, you may have a rule that is 100% accurate but is only applicable in 1 out of every 100,000 shopping baskets. You can rearrange your shelf space to take advantage of this fact, but it will not make you much money since the event is not very likely to happen. Table 2.4 displays the trade off between coverage and accuracy.
| | Accuracy Low | Accuracy High |
| Coverage High | Rule is rarely correct but can be used often. | Rule is often correct and can be used often. |
| Coverage Low | Rule is rarely correct and can be only rarely used. | Rule is often correct but can be only rarely used. |
Table 2.4 Rule coverage versus accuracy.
An analogy between coverage and accuracy and making money is the following from betting on horses. Having a high accuracy rule with low coverage would be like owning a race horse that always won when he raced but could only race once a year. In betting, you could probably still make a lot of money on such a horse. In rule induction for retail stores it is unlikely that finding that one rule between mayonnaise, ice cream and sardines that seems to always be true will have much of an impact on your bottom line.
One way to look at accuracy and coverage is to see how they relate to some simple statistics and how they can be represented graphically. From statistics, coverage is simply the a priori probability of the antecedent occurring. The accuracy is just the probability of the consequent conditional on the antecedent. So, for instance, if we were looking at the following database of supermarket basket scanner data we would need the following information in order to calculate the accuracy and coverage for a simple rule (let’s say milk purchase implies eggs purchased).
T = 100 = Total number of shopping baskets in the database.
E = 30 = Number of baskets with eggs in them.
M = 40 = Number of baskets with milk in them.
B = 20 = Number of baskets with both eggs and milk in them.
Accuracy is then just the number of baskets with eggs and milk in them divided by the number of baskets with milk in them. In this case that would be 20/40 = 50%. The coverage would be the number of baskets with milk in them divided by the total number of baskets. This would be 40/100 = 40%. This can be seen graphically in Figure 2.5.
Figure 2.5 Graphically the total number of shopping baskets can be represented in a space and the number of baskets containing eggs or milk can be represented by the area of a circle. The coverage of the rule “If Milk then Eggs” is just the relative size of the circle corresponding to milk. The accuracy is the relative size of the overlap between the two to the circle representing milk purchased.
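The same arithmetic can be written as a small function. The function and parameter names are assumptions; the counts are the T, M and B values given above for the rule “If Milk then Eggs”.

```python
def rule_accuracy_and_coverage(total, antecedent_count, both_count):
    """Accuracy = both / antecedent; coverage = antecedent / total."""
    accuracy = both_count / antecedent_count
    coverage = antecedent_count / total
    return accuracy, coverage

T = 100  # total number of shopping baskets
M = 40   # baskets with milk (the antecedent)
B = 20   # baskets with both eggs and milk

accuracy, coverage = rule_accuracy_and_coverage(T, M, B)
```

With these counts the function gives an accuracy of 0.5 and a coverage of 0.4, matching the 50% and 40% worked out in the text.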
Notice that we haven't used E, the number of baskets with eggs, in these calculations. One way that E could be used would be to calculate the expected number of baskets with eggs and milk in them based on the independence of the events. This would give us some sense of how unlikely and how special the event is that 20% of the baskets have both eggs and milk in them. Remember from the statistics section that if two events are independent (have no effect on one another), the product of their individual probabilities of occurrence should equal the probability of them both occurring together.
If the purchases of eggs and milk were independent of each other, one would expect that 0.3 x 0.4 = 0.12, or 12% of the time, we would see shopping baskets with both eggs and milk in them. The fact that this combination of products occurs 20% of the time would be out of the ordinary if these events were independent. That is to say, there is a good chance that the purchase of one affects the other, and the degree to which this is the case could be calculated through statistical tests and hypothesis testing.
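As a rough illustration of that last step, the sketch below compares the observed 20% with the 12% expected under independence and then computes a Pearson chi-square statistic on the 2x2 table of milk versus eggs. The choice of the chi-square test is an assumption made for this example (the text does not name a specific test), and the variable names are hypothetical.

```python
# Basket counts from the example: total, eggs, milk, both.
T, E, M, B = 100, 30, 40, 20

p_eggs, p_milk = E / T, M / T
expected_both = p_eggs * p_milk          # 0.12 if purchases were independent
observed_both = B / T                    # 0.20 actually observed
ratio = observed_both / expected_both    # how many times more often than expected

# 2x2 contingency table of observed basket counts.
observed = {
    ("milk", "eggs"): B,
    ("milk", "no eggs"): M - B,
    ("no milk", "eggs"): E - B,
    ("no milk", "no eggs"): T - M - E + B,
}
row_totals = {"milk": M, "no milk": T - M}
col_totals = {"eggs": E, "no eggs": T - E}

# Pearson chi-square: sum of (observed - expected)^2 / expected over the cells,
# where expected counts assume independence of the row and column events.
chi_square = sum(
    (count - row_totals[r] * col_totals[c] / T) ** 2
    / (row_totals[r] * col_totals[c] / T)
    for (r, c), count in observed.items()
)

# Standard critical value for 1 degree of freedom at the 5% level.
CRITICAL_5_PERCENT_1_DF = 3.841

print(f"expected {expected_both:.0%}, observed {observed_both:.0%}")
print(f"chi-square = {chi_square:.2f}, "
      f"reject independence: {chi_square > CRITICAL_5_PERCENT_1_DF}")
```

With these counts the statistic comes out at roughly 12.7, well above the 5% critical value, which is consistent with the claim that the two purchases are probably not independent.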
One of the biggest problems with rule induction systems is the sometimes overwhelming number of rules that are produced, most of which have no practical value or interest. Some of the rules are so inaccurate that they cannot be used, some have so little coverage that though they are interesting they have little applicability, and finally many of the rules capture patterns and information that the user is already familiar with. To combat this problem researchers have sought to measure the usefulness or interestingness of rules.
Certainly any measure of interestingness would have something to do with accuracy and coverage. We might also expect it to have at least the following four basic behaviors:
Interestingness = 0 if the accuracy of the rule is equal to the background accuracy (the a priori probability of the consequent). Table 2.5 shows an example of this, where a rule for attrition is no better than just guessing the overall rate of attrition.
Interestingness increases as accuracy increases (or decreases with decreasing accuracy) if the coverage is fixed.
Interestingness increases or decreases with coverage if accuracy stays fixed.
Interestingness decreases with coverage for a fixed number of correct responses (remember accuracy equals the number of correct responses divided by the coverage).
| Antecedent | Consequent | Accuracy | Coverage |
| <no constraints> | then customer will attrite | 10% | 100% |
| If customer balance > $3,000 | then customer will attrite | 10% | 60% |
| If customer eyes = blue | then customer will attrite | 10% | 30% |
| If customer social security number = 144 30 8217 | then customer will attrite | 100% | 0.000001% |
Table 2.5 Uninteresting rules
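To make the four behaviors concrete, here is a minimal sketch of one measure that satisfies them: coverage multiplied by the difference between the rule's accuracy and the background accuracy. This particular formula is an assumption chosen for illustration (it is not given in the text), and the short rule labels are abbreviations of the rows in Table 2.5.

```python
def interestingness(accuracy: float, coverage: float, background: float) -> float:
    """One possible measure: coverage * (accuracy - background accuracy).

    Illustrative assumption only; accuracy and coverage are fractions in [0, 1].
    """
    return coverage * (accuracy - background)


BACKGROUND = 0.10  # overall attrition rate (accuracy of the unconstrained rule)

# (antecedent, accuracy, coverage) taken from Table 2.5.
rules = [
    ("<no constraints>", 0.10, 1.00),
    ("balance > $3,000", 0.10, 0.60),
    ("eyes = blue", 0.10, 0.30),
    ("SSN = 144 30 8217", 1.00, 0.00000001),  # 0.000001% coverage
]

scores = {name: interestingness(acc, cov, BACKGROUND) for name, acc, cov in rules}
for name, score in scores.items():
    print(f"{name:20s} {score:.10f}")
```

Under this measure the first three rules score exactly zero because their accuracy equals the background rate, and the fourth scores almost zero because its coverage is vanishingly small, which matches the table's label of "uninteresting".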
There are a variety of measures of interestingness in use that have these general characteristics. They are used for pruning back the total possible number of rules that might be generated and then presented to the user.
Another important measure is the simplicity of the rule. This is important solely for the end user, since complex rules, as powerful and as interesting as they might be, may be difficult to understand or to confirm via intuition. Thus the user has a desire to see simpler rules, and this desire can be reflected directly in the rules that are chosen and supplied automatically to the user.
Finally, a measure of novelty is also required during the creation of the rules, so that rules that are redundant but strong are less favored in the search than rules that may not be as strong but cover important examples that are not covered by other strong rules. For instance, there may be few historical records to provide rules on a little-sold grocery item (e.g. mint jelly), and those rules may have low accuracy. But since there are so few possible rules, even though they are not interesting they will be "novel" and should be retained and presented to the user for that reason alone.
Decision trees also produce rules but in a very different way than rule induction systems. The main difference between the rules that are produced by decision trees and rule induction systems is as follows:
Decision trees produce rules that are mutually exclusive and collectively exhaustive with respect to the training database, while rule induction systems produce rules that are not mutually exclusive and may or may not be collectively exhaustive.
In plain English this means that for rules that come from decision trees, any given record will be covered by a rule, and by only one rule. With a rule induction system there may be many rules that match a given record, and for many systems it is not guaranteed that a rule will exist for each and every possible record that might be encountered (though most systems do create very general default rules to capture these records).
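The difference can be checked mechanically by counting how many rules match each record. The sketch below does this for two small, hypothetical rule sets; the predictors, thresholds and records are invented for illustration and do not come from the text.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]
Rule = Callable[[Record], bool]


def match_counts(rules: List[Rule], records: List[Record]) -> List[int]:
    """For each record, count how many rules match it."""
    return [sum(1 for rule in rules if rule(record)) for record in records]


def is_exclusive_and_exhaustive(rules: List[Rule], records: List[Record]) -> bool:
    """True when every record is matched by exactly one rule."""
    return all(count == 1 for count in match_counts(rules, records))


# Hypothetical leaves of a decision tree: they partition the age range.
tree_rules: List[Rule] = [
    lambda r: r["age"] <= 30,
    lambda r: 30 < r["age"] <= 50,
    lambda r: r["age"] > 50,
]

# Hypothetical output of a rule induction system: rules may overlap or leave gaps.
induced_rules: List[Rule] = [
    lambda r: r["age"] <= 50,
    lambda r: r["balance"] > 3000,
]

records: List[Record] = [
    {"age": 25, "balance": 5000},
    {"age": 45, "balance": 1000},
    {"age": 70, "balance": 500},
]

print(match_counts(tree_rules, records))     # [1, 1, 1]
print(match_counts(induced_rules, records))  # [2, 1, 0]
```

The last record is matched by none of the induced rules, which is the situation a very general default rule is meant to catch.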
The reason for this difference is the way in which the two algorithms operate. Rule induction seeks to go from the bottom up, collecting all possible patterns that are interesting and then later using those patterns for some prediction target. Decision trees, on the other hand, work from a prediction target downward in what is known as a "greedy" search, looking for the best possible split on the next step (i.e. greedily picking the best one without looking any further than the next step). Though the greedy algorithm can make choices at the higher levels of the tree that are less than optimal at the lower levels, it is very good at effectively squeezing out any correlations between predictors and the prediction. Rule induction systems, on the other hand, retain all possible patterns even if they are redundant or do not aid in predictive accuracy.
For instance, if two columns of data were highly correlated (or in fact just simple transformations of each other), a rule induction system would produce two rules, whereas in a decision tree one predictor would be chosen and then, since the second one was redundant, it would not be chosen again. An example might be the two predictors annual charges and average monthly charges (average monthly charges being the annual charges divided by 12). If the amount charged was predictive, the decision tree would choose one of the predictors and use it for a split point somewhere in the tree. The decision tree effectively "squeezes" the predictive value out of the predictor and then moves on to the next. A rule induction system would, on the other hand, create two rules, perhaps something like:
If annual charges > 12,000 then default = true, 90% accuracy.
If average monthly charges > 1,000 then default = true, 90% accuracy.
In this case we've shown an extreme case where two predictors were exactly the same, but there can also be less extreme cases. For instance, height might be used rather than shoe size in the decision tree, whereas in a rule induction system both would be presented as rules.
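The redundancy in the annual versus monthly charges example is easy to demonstrate: because one predictor is a simple transformation of the other, the two rules match exactly the same records and so have identical accuracy and coverage. The records below are made-up data for illustration only, so the resulting accuracy differs from the 90% quoted above.

```python
# Hypothetical customer records; average monthly charges are derived from annual charges.
records = [
    {"annual_charges": 6000, "default": False},
    {"annual_charges": 9000, "default": False},
    {"annual_charges": 13000, "default": False},
    {"annual_charges": 15000, "default": True},
    {"annual_charges": 18000, "default": True},
    {"annual_charges": 24000, "default": True},
]
for record in records:
    record["avg_monthly_charges"] = record["annual_charges"] / 12


def evaluate(records, predicate):
    """Return (indexes of matched records, accuracy, coverage) for 'predicate => default'."""
    matched = [i for i, r in enumerate(records) if predicate(r)]
    correct = sum(1 for i in matched if records[i]["default"])
    return matched, correct / len(matched), len(matched) / len(records)


annual_rule = evaluate(records, lambda r: r["annual_charges"] > 12000)
monthly_rule = evaluate(records, lambda r: r["avg_monthly_charges"] > 1000)

print(annual_rule)
print(monthly_rule)
print("same records matched:", annual_rule[0] == monthly_rule[0])
```

A decision tree would gain nothing from splitting on the second predictor after the first, whereas a rule induction system would report both rules.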
Neither technique is necessarily better, though having a variety of rules and predictors helps with prediction when there are missing values. For instance, if the decision tree did choose height as a split point but that predictor was not captured in the record (a null value) while shoe size was, the rule induction system would still have a matching rule to capture this record. Decision trees do have ways of overcoming this difficulty by keeping "surrogates" at each split point that work almost as well at splitting the data as does the chosen predictor. In this case shoe size might have been kept as a surrogate for height at this particular branch of the tree.
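A surrogate split can be sketched as a simple fallback: use the chosen predictor when it is present, and otherwise use the surrogate. The thresholds and function name below are hypothetical, picked only to make the example runnable; real decision tree tools choose surrogates and their split points from the training data.

```python
from typing import Optional

# Hypothetical split points; a real tree would learn these from the data.
HEIGHT_SPLIT_CM = 175.0
SHOE_SIZE_SPLIT = 42.0


def choose_branch(height_cm: Optional[float], shoe_size: Optional[float]) -> Optional[str]:
    """Route a record at one split point, falling back to the surrogate predictor."""
    if height_cm is not None:
        # Primary predictor is available: use the chosen split.
        return "left" if height_cm <= HEIGHT_SPLIT_CM else "right"
    if shoe_size is not None:
        # Height is missing (null): use shoe size as the surrogate.
        return "left" if shoe_size <= SHOE_SIZE_SPLIT else "right"
    # Neither value was captured, so this split cannot route the record.
    return None


print(choose_branch(180.0, 44.0))  # right (primary predictor used)
print(choose_branch(None, 44.0))   # right (surrogate used)
print(choose_branch(None, None))   # None
```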
One other thing that decision trees and rule induction systems have in common is that they both need to find ways to combine and simplify rules. In a decision tree this can be as simple as recognizing that if a lower split on a predictor is more constrained than a split on the same predictor further up in the tree, then only the more restrictive one needs to be provided to the user. For instance, if the first split of the tree is age <= 50 years and the lowest split for the given leaf is age <= 30 years, then only the latter constraint needs to be captured in the rule for that leaf.
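The age example above amounts to keeping only the tightest bound per predictor along the path from the root to a leaf. Here is a minimal sketch that handles only "<=" constraints; the function name and the extra income predictor are assumptions added for illustration.

```python
from typing import Dict, List, Tuple


def simplify_upper_bounds(path: List[Tuple[str, float]]) -> Dict[str, float]:
    """Collapse a root-to-leaf list of (predictor, threshold) '<=' splits.

    Only the most restrictive (smallest) threshold per predictor is kept.
    """
    tightest: Dict[str, float] = {}
    for predictor, threshold in path:
        if predictor not in tightest or threshold < tightest[predictor]:
            tightest[predictor] = threshold
    return tightest


# Splits encountered from the root down to one leaf.
path = [("age", 50), ("income", 40000), ("age", 30)]

rule = simplify_upper_bounds(path)
print(" and ".join(f"{name} <= {value}" for name, value in rule.items()))
# prints: age <= 30 and income <= 40000
```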
Rules from rule induction systems are generally created by taking a simple high level rule and adding new constraints to it until the coverage gets so small as to not be meaningful. This means that the rules actually have families, or what are called "cones of specialization", where one more general rule can be the parent of many more specialized rules. These cones can then be presented to the user as high level views of the families of rules and can be viewed in a hierarchical manner to aid in understanding.
Clearly one of the hardest things to do when deciding to implement a data mining system is to determine which technique to use when. When are neural networks appropriate and when are decision trees appropriate? When is data mining appropriate at all as opposed to just working with relational databases and reporting? When would just using OLAP and a multidimensional database be appropriate?
Some of the criteria that are important in determining the technique to be used are determined by trial and error. There are definite differences in the types of problems that are most conducive to each technique, but the reality of real world data, and the dynamic way in which markets, customers and hence the data that represents them are formed, means that the data is constantly changing. These dynamics mean that it no longer makes sense to build the "perfect" model on the historical data, since whatever was known in the past cannot adequately predict the future because the future is so unlike what has gone before.
In some ways this situation is analogous to the business person who is waiting for all information to come in before they make their decision. They are trying out different scenarios, different formulae and researching new sources of information. But this is a task that will never be accomplished, at least in part because the business, the economy and even the world are changing in unpredictable and even chaotic ways that could never be adequately predicted. Better to take a robust model, perhaps one that underperforms what some of the best data mining tools could provide with a great deal of analysis, and execute it today rather than wait until tomorrow when it may be too late.
There is always a trade-off between exploration (learning more and gathering more facts) and exploitation (taking immediate advantage of everything that is currently known). This theme of exploration versus exploitation is echoed also at the level of collecting data in a targeted marketing system: from a limited population of prospects and customers to choose from, how many do you sacrifice to exploration (trying out new promotions or messages at random) versus optimizing what you already know?
There was, for instance, no reasonable way that Barnes and Noble bookstores could in 1995 look at past sales figures and foresee the impact that Amazon books and others would have based on the internet sales model.
The arrival of the internet could not have been predicted from historic sales and marketing data alone. Instead, perhaps data mining could have been used to detect trends of decreased sales to certain customer sub-populations, such as those in the high tech industry who were the first to begin to buy books online at Amazon.
So caveat emptor: use the data mining tools well but strike while the iron is hot. The performance of predictive models provided by data mining tools has a limited half-life. Unlike a good bottle of wine, they do not increase in value with age.