Blog Directory : Listing Details

Listing Details

Recent Posts:

ID:1297
Title:Data Miners Blog
URL:http://www.data-miners.com/blog/
Category:Business: Data Mining
Description:Data analysis and visualization from an industry point of view.
This blog has moved - Wed, 31 Mar 2010 17:44:00 +0000

This blog is now located at http://blog.data-miners.com/.
You will be automatically redirected in 30 seconds, or you may clickhere.

For feed subscribers, please update your feed subscriptions to
http://blog.data-miners.com/feeds/posts/default.

Bitten by an Unfamiliar Form of Left Truncation - Mon, 15 Mar 2010 02:32:00 +0000
Alternate title:Data Mining Consultant with Egg on Face

Last week I made a client presentation. The project was complete. I was presenting the final results to the client.  The CEO was there. Also the CTO, the CFO, the VPs of Sales and Marketing, and the Marketing Analytics Manager. The client runs a subscription-based business and I had been analyzing their attrition patterns. Among my discoveries was that customers with "blue" subscriptions last longer than customers with "red" subscriptions. By taking the difference of the area under the two survival curves truncated at one year and multiplying by the subscription cost, I calculated the dollar value of the difference. I put forward some hypotheses about why the blue product was stickier and suggested a controlled experiment to determine whether having a blue subscription actually caused longer tenure or was merely correlated with it. Currently, subscribers simply pick blue or red at sign-up. There is no difference in price.  I proposed that half of new customers be given blue by default unless they asked for red and the other half be given red by default unless they asked for blue. We could then look for differences between the two randomly assigned groups.

All this seemed to go over pretty well.  There is only one problem.  The blue customers may not be better after all.  One of the attendees asked me whether the effect I was seeing could just be a result of the fact that blue subscriptions have been around longer than red ones so the oldest blue customers are older than the oldest red customers. I explained that this would not bias my findings because all my calculations were based on the tenure time line, not the calendar time line. We were comparing customers' first years without regard to when they happened. I explained that therewouldbe a problem if the data set suffered from left truncation, but I had tested for that, and it was not a problem because we knew about starts and stops since the beginning of time.

Left truncationis something that creates a bias in many customer databases.  What it means is that there is no record of customers who stopped before some particular date in the past--the left truncation date. The most likely reason is that the company has been in existence longer than its data warehouse. When the warehouse was created, all active customers were loaded in, but customers who had already left were not. Fine, for most applications, but not for survival analysis. Think about customers who started before the warehouse was built.  One (like many thousands of others) stops before the warehouse gets built with a short tenure of two months. Another, who started on the same day as the first, is still around two be loaded into the warehouse with a tenure of two years.  Lots of short-tenure people are missing and long-tenure people are over represented. Average tenure is inflated and retention appears to be better than it really is.

My client's data did not have that problem.  At least, not in the way I am used to looking for it.  Instead, it had a large number of stopped customers for whom the subscription type had been forgotten. I (foolishly) just left these people out of my calculations.  Here is the problem: Although the customer start and stop dates are remembered for ever, certain details, including the subscription type,  are purged after a certain amount of time. For all the people who started back when there were only blue subscriptions and had short or even average tenures, that time had already past. The only ones for whom I could determine the subscription type were those who had unusually long tenures.  Eliminating the subscribers for whom the subscription type had been forgotten had exactly the same effect as left truncation!

If this topic and things related to it sound interesting to you, it is not too late to sign up for a two-day class I will be teaching in New Yorklater this week.  The class is calledSurvival Analysis for Business Time to Event Problems. It will be held at the offices of SAS Institute in Manhattan this Thursday and Friday, March 18-19.

Agglomerative Variable Clustering - Thu, 25 Feb 2010 20:55:00 +0000
Lately, I've been thinking about the topic of reducing the number of variables, and how this is a lot like clustering variables (rather than clustering rows). This post is about a method that seems intuitive to me, although I haven't found any references to it. Perhaps a reader will point me to references and a formal name. This method using Pearson correlation and principal components to agglomeratively cluster the variables.

Agglomerative clustering is the process of assigning records to clusters, starting with the records that are closest to each other. This process is repeated, until all records are placed into a single cluster. The advantage of agglomerative clustering is that it creates a structure for the records, and the user can see different numbers of clusters. Divisive clustering, such as implemented by SAS's varclus proc, produces something similar, but from the top-down.

Agglomerative variable clustering works the same way. Two variables are put into the same cluster, based on their proximity. The cluster then needs to be defined in some manner, by combining information in the cluster.

The natural measure for proximity is the square of the (Pearson) correlation between the variables. This is a value between 0 and 1 where 0 is totally uncorrelated and 1 means the values are colinear. For those who are more graphically inclined, this statistic has an easy interpretation when there are two variables. It is the R-square value of the first principal component of the scatter plot.

Combining two variables into a cluster requires creating a single variable to represent the cluster. The natural variable for this is the first principal component.

My proposed clustering method repeatedly does the following:
  1. Finds the two variables with the highest correlation.
  2. Calculates the principal component for these variables and adds it into the data.
  3. Maintains the information that the two variables have been combined.
The attached SAS code (available atsas-var-hierarchical-clustering-v01.sas) does exactly this, although not in the most efficient and robust way. The bulk of the code is a macro, calledbuildcolumns, that appends the new cluster variables to the data set and maintains another table calledcolumnswhich has the information about the rows. After I run this code, I can select different numbers of variables using the expression:

proc sql;
....select colname
....from columns
....where counter<= [some number]<>

These variables can then be used for predictive models or visualization purposes.

The inner loop of the code works by doing the following:
  1. Callingproc corrto calculate the correlation of all variables not already in a cluster.
  2. Transposing the correlations into a table with three columns, two for the variables and one for the correlation usingproc transpose.
  3. Finding the pair of variables with the largest correlation.
  4. Calculating the first principal component for these variables.
  5. Appending this principal component to the data set.
  6. Updating thecolumnsdata set with information about the new cluster.
The data set referred to in the code comes from the companion site forData Analysis Using SQL and Excel. The code will fail (by running an infinite loop) if any variables are missing or if two variables are exactly correlated.