Listing Details

Recent Posts:

ID:1296
Title:Abbott Analytics
URL:http://abbottanalytics.blogspot.com/
Category:Business: Data Mining
Description:Both industry and research oriented posts covering any topic related to data mining.
Top 5 Posts from 2011 - Fri, 06 Jan 2012 00:44:00 +0000
By far, the most visited post of 2011 was the "What Do Data Miners Need to Learn" post from June.

The top five visited posts that were first posted in 2011 are (with actual ranks for all posts):
1. What Do Data Miners Need to Learn
2. Statistical Rules of Thumb, Part III
3. Statistical Rules of Thumb, Part II
4. Number of Hidden Layer Neurons to Use
5. Statistics: The Need for Integration


The top six viewed posts in 2011 originally created prior to 2011 were:
1. Why Normalization Matters with K-Means (2009)
2. Free and Inexpensive Data Mining Software (2006)
3. Data Mining Data Sets (2008)
4. Can you Learn Data Mining in Undergraduate or Graduate School (2009)
5. Quotes from Moneyball (2007)
6. Business Analytics vs. Business Intelligence (2009)

The "Free Data Mining Tools" post is understandably still relatively popular, even after 5 years. The Moneyball quotes post has a particularly high bounce rate. I'm most surprised that the K-Means normalization post has remained popular for so long.

Models Behaving Badly - Thu, 29 Dec 2011 02:25:00 +0000
I just read a fascinating book review in the Wall Street Journal, Physics Envy: Models Behaving Badly. The author of the book, Emanuel Derman (former head of Quantitative Analysis at Goldman Sachs), argues that financial models involve human beings and are therefore inherently brittle: as human behavior changed, the models failed. "In physics you're playing against God, and He doesn't change His laws very often. In finance, you're playing against God's creatures."

I'll agree with Derman that whenever human beings are in the loop, data suffers. People change their minds based on information not available to the models.

I also agree that human behavioral modeling is not the same as physical modeling. We can use the latter to provide motivation and even mathematics for human behavioral modeling, but we should not take this too far. A simple example is this: purchase decisions sometimes depend not on the person's propensity to purchase alone, but also on whether or not they had an argument that morning, or if they just watched a great movie. There is an emotional component that data cannot reflect. People therefore behave in ways that on the surface are contradictory, seemingly "random", which is why response rates of 1% can be "good".

However, I bristle a bit at the emphasis on the physics analogy. In closed systems, models can explain everything. But once one opens up the world, even physical models are imperfect because they often do not incorporate all the information available. For example, missile guidance is based on pure physics: move a surface on a wing and one can change the trajectory of the missile. There are equations of motion that describe exactly where the missile will go. There is no mystery here.

However, all operational missile guidance systems are "closed loop"; the guidance command sequence is not completely scheduled in advance but is updated throughout the flight. Why? To compensate for unexpected effects of the guidance commands, often due to ballistic winds, thermal gradients, or other effects on the physical system. It is the closed-loop corrections that make missile guidance work. The exact same principle applies to your car's cruise control, chasing down a fly ball in baseball, or even just walking down the street.
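The difference between a pre-scheduled command sequence and a closed-loop one can be sketched with a toy example. This is only an illustration under assumed numbers: a one-dimensional "flight" with a constant, unmodeled wind, and a hypothetical re-planning rule that divides the remaining distance by the remaining steps. It is not a real guidance law.

```python
# Toy 1-D illustration only; every name and number here is an assumption.
TARGET = 10.0   # desired final position
STEPS = 100     # number of guidance updates during the flight
WIND = 0.02     # unmodeled push per step, unknown to the planner


def fly_open_loop():
    """Follow a command schedule fixed before launch (planned as if no wind)."""
    position = 0.0
    scheduled_command = TARGET / STEPS
    for _ in range(STEPS):
        position += scheduled_command + WIND
    return position


def fly_closed_loop():
    """Re-plan at every step from the measured position."""
    position = 0.0
    for step in range(STEPS):
        remaining = STEPS - step
        # Spread the distance still to go over the steps still available.
        command = (TARGET - position) / remaining
        position += command + WIND
    return position


open_miss = abs(fly_open_loop() - TARGET)
closed_miss = abs(fly_closed_loop() - TARGET)
print(f"open-loop miss:   {open_miss:.3f}")
print(f"closed-loop miss: {closed_miss:.3f}")
```

In this toy setup the open-loop miss is the wind accumulated over the whole flight, while the closed-loop miss is only the wind from the final step, because every earlier disturbance has already been measured and corrected.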

For a predictive model to be useful long-term, it needs updating to correct for changes in the population the model is applied to, whether the model is for customer acquisition, churn, fraud detection, or anything else. The "closed loop" typical in data mining is called "model updating" and is critical for long-term modeling success.

The question then becomes this: can the models be updated quickly enough to compensate for changes in the population? If a missile can only be updated at 10 Hz (10 times per second) but uncertainties affect the trajectory significantly in milliseconds, the closed-loop actions may be insufficient to compensate. If your predictive model can only be updated monthly, but your customer behavior changes significantly on a weekly basis, your models will be perpetually behind. Measuring the effectiveness of model predictions is therefore critical in determining the frequency of model updating necessary in your organization.
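The cost of updating too slowly can be sketched with a small simulation. Everything here is assumed for illustration: the population's behavior is a single number that drifts as a random walk each week, and the hypothetical `simulate_staleness` function measures the average gap between that number and a model that is only refreshed every `update_every` weeks.

```python
import random


def simulate_staleness(update_every, n_weeks=5200, step_sd=0.01, seed=0):
    """Mean absolute gap between drifting behavior and a periodically refreshed model.

    Assumptions (illustrative only): behavior follows a Gaussian random walk,
    and a refresh makes the model match the behavior observed so far.
    """
    rng = random.Random(seed)   # same seed gives the same drift path for every schedule
    truth = 0.0                 # hypothetical population behavior
    model = truth               # what the model currently believes
    total_gap = 0.0
    for week in range(n_weeks):
        if week % update_every == 0:
            model = truth                    # model updating: the closed-loop correction
        truth += rng.gauss(0.0, step_sd)     # behavior drifts during the week
        total_gap += abs(truth - model)
    return total_gap / n_weeks


weekly_gap = simulate_staleness(update_every=1)
monthly_gap = simulate_staleness(update_every=4)
print(f"mean gap with weekly updates:  {weekly_gap:.4f}")
print(f"mean gap with 4-week updates:  {monthly_gap:.4f}")
```

Tracking a gap like this on real predictions (for example, predicted versus actual response rates) is one way to judge whether the current update schedule is keeping up.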

To be fair, I can't quibble with the book's arguments until I have read it. The arguments here are based solely on the book review and some ideas it prompted in my mind. I'd welcome comments from anyone who has read the book already.

The book can be found on Amazon here.

UPDATE: Aaron Lai wrote an article for CFA Magazine on the same topic, also quoting Derman. I commend the article to all (note: this is a PDF file download).

Statistical Rules Of Thumb, part III: Always Visualize the Data - Fri, 04 Nov 2011 22:36:00 +0000
As I perused Statistical Rules of Thumb again, as I do from time to time, I came across this gem. (Note: I live in CA, so I get no money from these Amazon links.)

Van Belle uses the term "Graph" rather than "Visualize", but it is the same idea. The point is to visualize in addition to computing summary statistics. Summaries are useful, but can be deceiving; any time you summarize data you will lose some information unless the distributions are well behaved. The scatterplot, histogram, box and whiskers plot, etc. can reveal ways the summaries can fool you. I've seen this myself, especially with variables that have outliers or are bi- or tri-modal.

One of the most famous examples of this effect is Anscombe's Quartet. I'm including the Wikipedia image of the plots here:


All four datasets have essentially the same mean of x, mean of y, x standard deviation, y standard deviation, x-y Pearson correlation coefficient, and regression line of y on x, so the summaries don't reveal the differences in the data.
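The matching summaries can be checked directly. The sketch below types in the standard published Anscombe values and computes the statistics with the Python standard library; the `summarize` helper is just an illustrative name.

```python
from math import sqrt
from statistics import mean, stdev

# Anscombe's quartet (datasets I-III share the same x values).
x123 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Means, sample standard deviations, Pearson r, and least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "sd_x": stdev(x),
        "sd_y": stdev(y),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": my - slope * mx,
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimals, the four rows print the same numbers, which is exactly why the scatterplots are needed to tell the datasets apart.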

I use correlations a lot to get the gist of the relationships in the data, and I've seen how correlations can deceive. In one project, we had 30K data points with a correlation of 0.9+. When we removed just 100 of these data points (the largest magnitudes of x and y), the correlation shrank to 0.23.
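This kind of collapse is easy to reproduce with synthetic data. The sketch below does not use the project's data, and it will not reproduce the 0.23 figure; the distributions, the 0.2 slope, and the seed are all assumptions chosen only to show how 100 extreme points out of 30,000 can dominate a correlation.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Bulk of the data: x and y are only weakly related (assumed slope of 0.2).
bulk = []
for _ in range(29_900):
    x = rng.gauss(0, 1)
    bulk.append((x, 0.2 * x + rng.gauss(0, 1)))

# A small cluster of extreme points far from the bulk, lying near y = x.
extremes = []
for _ in range(100):
    x = rng.gauss(50, 5)
    extremes.append((x, x + rng.gauss(0, 1)))

points = bulk + extremes

# Drop the 100 points with the largest magnitude in either coordinate.
by_magnitude = sorted(points, key=lambda p: max(abs(p[0]), abs(p[1])))
trimmed = by_magnitude[:-100]

r_all = pearson(*zip(*points))
r_trimmed = pearson(*zip(*trimmed))
print(f"correlation, all 30,000 points:   {r_all:.2f}")
print(f"correlation, 100 extremes removed: {r_trimmed:.2f}")
```

A scatterplot of `points` would show the problem at a glance: one tight blob near the origin and a small cluster far away that effectively defines the fitted line.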

Most data mining software has ways to visualize data easily now. Avail yourself of them to avoid subsequent surprises in your data.