Overfitting

Overfitting in Life and other experiments

In statistics as in life, overfitting ('an analysis that corresponds too closely or exactly to a particular set of data') does not yield the best predictions. Focusing on the signal rather than the noise in a dataset helps prevent it.

Wot? (a.k.a. what do I mean by that)

For those of you who have worked with machine learning or taken a statistics class, the idea of overfitting is not new. The best example I found is the xkcd comic "Electoral Precedent" (quoting):

  1. In 1788, "No one has been elected president before" would be a true statement, until Washington was.

  2. In 1796, "No one without false teeth has ever been elected" would be true, until Adams was.

  3. In 1856, "No one can become president without getting married" would be true, until Buchanan did.

You get the idea. Predictions can only be built from historical data. However, not all data are created equal. Overfitting is focusing on irrelevant data or, in other words, on the noise rather than the signal. The result is often an overcomplicated and unrealistic model.
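To make the signal-versus-noise idea concrete, here is a minimal sketch in plain Python. Everything in it is my own illustration and not taken from the comic: the data are made up (a signal of y = 2x plus a fixed alternating noise of +1 and -1), and the helper names `fit_line`, `fit_memorizer` and `mse` are hypothetical. It compares a simple straight-line model with a "memorizer" that repeats whatever it saw at the nearest training point.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """1-nearest-neighbour lookup: return the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: the signal is y = 2x, the noise alternates +1, -1.
train_x = list(range(10))
train_y = [2 * x + (1 if x % 2 == 0 else -1) for x in train_x]

# Unseen points between the training points, with the pure signal as truth.
test_x = [x + 0.25 for x in range(10)]
test_y = [2 * x for x in test_x]

line = fit_line(train_x, train_y)
memo = fit_memorizer(train_x, train_y)

line_train, line_test = mse(line, train_x, train_y), mse(line, test_x, test_y)
memo_train, memo_test = mse(memo, train_x, train_y), mse(memo, test_x, test_y)

print(f"line:      train MSE = {line_train:.3f}, test MSE = {line_test:.3f}")
print(f"memorizer: train MSE = {memo_train:.3f}, test MSE = {memo_test:.3f}")
```

The memorizer looks perfect on the data it has already seen, because it has learned the noise along with the signal. On the unseen points the simpler line does better.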

How does that apply to life?

In our own lives, we tend to overfit in many places. Where are you focusing on the noise instead of the signal? Where are your predictions failing you? When are your models unreliable?

I remember, years ago, a good friend of mine who was looking for the right girl to marry. To say that he was overfitting his model of "the one" is an understatement (underfitting?). What has happened since: he found his dream girl, got his heart broken and revisited his model. He has now been happily married for decades. That heartbreak brought into focus what "really" mattered. In other words, it helped him see which features were signal and which were noise. Suddenly, his model became simpler and more useful.

I remember coaching a young woman who brought up something she wanted to achieve in the upcoming year (specifics redacted for privacy reasons). When I asked what was stopping her, she gave me a list of reasons. For each reason, I countered with "Are you saying that no one in the world who has <reason 1> has achieved goal X?". Here's an example:

The technique used here is called cross-validation. By cross-validating her reasons, she soon realized that they were overfitting: noise, not signal. In machine learning as in life, we tend to misuse data. What results are complicated and underperforming models.
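For readers who want to see the machine learning version of cross-validation, here is a rough sketch of k-fold cross-validation in plain Python. It is an illustration under my own assumptions, not a reference implementation: the data are made up (signal y = 2x with alternating noise of +1 and -1), the folds are simple contiguous blocks with no shuffling, and the names `k_fold_mse`, `fit_line` and `fit_memorizer` are hypothetical. The idea is to hold part of the data back, fit on the rest, and score the model only on what it has not seen.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: slope * x + (mean_y - slope * mean_x)


def fit_memorizer(xs, ys):
    """1-nearest-neighbour lookup: return the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_mse(fit, xs, ys, k):
    """Average held-out MSE over k contiguous folds (assumes k divides len(xs))."""
    fold_size = len(xs) // k
    scores = []
    for i in range(k):
        held_out = set(range(i * fold_size, (i + 1) * fold_size))
        # Fit on everything except the held-out fold...
        tr_x = [x for j, x in enumerate(xs) if j not in held_out]
        tr_y = [y for j, y in enumerate(ys) if j not in held_out]
        # ...and score only on the held-out fold.
        te_x = [xs[j] for j in sorted(held_out)]
        te_y = [ys[j] for j in sorted(held_out)]
        scores.append(mse(fit(tr_x, tr_y), te_x, te_y))
    return sum(scores) / k


# Made-up data: the signal is y = 2x, the noise alternates +1, -1.
xs = list(range(10))
ys = [2 * x + (1 if x % 2 == 0 else -1) for x in xs]

cv_line = k_fold_mse(fit_line, xs, ys, k=5)
cv_memo = k_fold_mse(fit_memorizer, xs, ys, k=5)

print(f"line:      cross-validated MSE = {cv_line:.3f}")
print(f"memorizer: cross-validated MSE = {cv_memo:.3f}")
```

The memorizer would score a perfect zero if it were judged on the points it was fitted to. Judged on held-out points, it does worse than the plain line, which is the same kind of check as testing each "reason" against people outside your own experience.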