Saturday, February 25, 2012

Rule of thumb about number of attributes and data points?

This is a general question about data modeling. I'm more curious than anything else.

There is much talk about over-training a data model, and I'm sure under-training happens as well. As a rule of thumb, depending on the algorithm, what is a good ratio of attributes to data points?

-Young K

Overtraining (overfitting) basically means that you have a model that fits your training data very well, but performs very badly on unseen data. For example, if you are training a decision tree, the algorithm you use may generate a tree that fits your training data perfectly. This is easily achieved by creating splits until each node contains data points from only one specific class. In general, this kind of tree does not generalize well to new data. Let's use an extreme case: say you wish to predict the weather, and you have collected weather data for the past 10 years. If you accidentally choose the date as an input, the decision tree algorithm may generate a tree like this:

date
                   |
        +----------+----------+
        |                     |
  Nov 15, 2006 ...      Nov 15, 2005 ...
        |                     |
      Rainy                 Sunny

This tree is 100% correct on the historical data, but completely useless for predicting future weather.
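To make this concrete, here is a toy sketch in plain Python (with hypothetical, randomly generated data) of the same degenerate "one leaf per date" model: it memorizes the training set perfectly but does no better than chance on dates it has never seen.

```python
import random

random.seed(0)

# Hypothetical historical data: each day's weather is effectively random.
dates = [f"day-{i}" for i in range(1000)]
weather = [random.choice(["Rainy", "Sunny"]) for _ in dates]

train = list(zip(dates[:800], weather[:800]))
test = list(zip(dates[800:], weather[800:]))

# An "overfit" model: one leaf per training date, exactly like the
# degenerate decision tree above.
lookup = dict(train)

def predict(date):
    # Unseen dates get an arbitrary default, because the tree never
    # learned anything beyond the dates it memorized.
    return lookup.get(date, "Sunny")

train_acc = sum(predict(d) == w for d, w in train) / len(train)
test_acc = sum(predict(d) == w for d, w in test) / len(test)

print(f"training accuracy: {train_acc:.2f}")  # perfect fit on seen data
print(f"test accuracy:     {test_acc:.2f}")   # near chance on unseen data
```

Training accuracy is exactly 1.0, while test accuracy hovers around 0.5 — the signature gap of an overfit model.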

Basically, overtraining is a common issue for all algorithms, and many techniques have been developed to reduce the risk of it. The most common traditional one is Occam's razor, which prefers simpler models. More recently, bagging and boosting have also been widely used to address the overtraining problem.
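For illustration, here is a minimal bagging sketch in plain Python, assuming a hypothetical one-dimensional dataset and a simple threshold "stump" as the weak learner: each stump is trained on a bootstrap sample (drawn with replacement), and the ensemble predicts by majority vote.

```python
import random
from collections import Counter

random.seed(1)

# Hypothetical 1-D dataset: the true label is 1 when x > 0.5,
# but 10% of the training labels are flipped (noise).
xs = [random.random() for _ in range(200)]
ys = [1 if x > 0.5 else 0 for x in xs]
for i in random.sample(range(200), 20):
    ys[i] = 1 - ys[i]

def train_stump(pairs):
    """A weak learner: the single threshold that best fits the sample."""
    best_t, best_acc = 0.0, 0.0
    for step in range(21):
        t = step / 20
        acc = sum((x > t) == bool(y) for x, y in pairs) / len(pairs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Bagging: each stump sees a different bootstrap sample; the ensemble
# predicts by majority vote, which averages away much of the noise.
pairs = list(zip(xs, ys))
stumps = [train_stump([random.choice(pairs) for _ in pairs])
          for _ in range(25)]

def predict(x):
    votes = Counter(int(x > t) for t in stumps)
    return votes.most_common(1)[0][0]

# Evaluate against the noise-free rule on fresh points.
test_xs = [random.random() for _ in range(200)]
acc = sum(predict(x) == (1 if x > 0.5 else 0) for x in test_xs) / len(test_xs)
print(f"bagged ensemble accuracy: {acc:.2f}")
```

Boosting works differently (it reweights the data so each new learner focuses on the previous learners' mistakes), but the ensemble-of-weak-learners idea is the same.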

In practical applications, you can check whether your model is overfitting by reserving part of your data as an independent test set: use only the remaining data to train your model, then validate its accuracy on the test data.
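A minimal sketch of that holdout procedure (plain Python, with a placeholder standing in for real labeled data) might look like:

```python
import random

random.seed(2)

def holdout_split(rows, test_fraction=0.3):
    """Shuffle, then reserve a fraction of the data as an independent
    test set; the rest is used for training."""
    rows = rows[:]                  # copy so the caller's list is untouched
    random.shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

rows = list(range(100))             # stand-in for 100 labeled data points
train, test = holdout_split(rows)
print(len(train), len(test))        # 70 / 30 split
```

The key point is that the test rows are never shown to the training algorithm; accuracy measured on them is an honest estimate of performance on unseen data.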

As for the ratio of attributes to data points, there is no absolute requirement, although most algorithms perform better with more training data.

Good luck,

|||Does SQL Server 2005 have any support for boosting or bagging?
|||SQL Server doesn't have built-in support for bagging and boosting. One way you can determine "how much" data you need for your particular problem is to reserve a test set and then create models with different amounts of training data, for example 1000, 2000, 4000, 8000, 16000, etc. cases. Graph the accuracy results of the models and see where the accuracy "flattens out". The point where accuracy stops increasing with additional data is the "ideal" amount of data for that particular problem.
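The learning-curve procedure described in that reply can be sketched as follows (plain Python, with a hypothetical noisy threshold problem standing in for a real dataset): train on increasingly large samples, score each model against the same reserved test set, and watch where accuracy flattens out.

```python
import random

random.seed(3)

def make_data(n):
    # Hypothetical data: the true rule is "1 when x > 0.5",
    # but 15% of the labels are flipped (noise).
    xs = [random.random() for _ in range(n)]
    ys = [int((x > 0.5) != (random.random() < 0.15)) for x in xs]
    return xs, ys

def best_threshold(xs, ys):
    """A toy learner: the grid threshold with the best training accuracy."""
    pairs = list(zip(xs, ys))
    return max((step / 50 for step in range(51)),
               key=lambda t: sum((x > t) == bool(y) for x, y in pairs))

# One fixed, reserved test set shared by every model.
test_xs, test_ys = make_data(2000)

accs = []
for n in [10, 100, 1000, 10000]:
    xs, ys = make_data(n)
    t = best_threshold(xs, ys)
    acc = sum((x > t) == bool(y)
              for x, y in zip(test_xs, test_ys)) / len(test_xs)
    accs.append(acc)
    print(f"n={n:>6}  test accuracy={acc:.3f}")
```

With 15% label noise, accuracy climbs toward roughly 0.85 and then stops improving — that plateau marks the point where more data no longer helps for this particular problem.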
