Tuesday, March 26, 2013

Practical Machine Learning

Bharath and I have been working on a Kaggle competition for a few months: the Blue Book for Bulldozers challenge. Our job is to predict, as accurately as possible, the price a bulldozer will sell for, given the machine's specs and other observations, using the selling prices from over 400,000 past sales. The competition moves on to the next round on April 10th!

Throughout the competition, we've written a lot of functions that are generally helpful for real-world, practical machine learning, the kind of thing we had no clue about when we had only studied theory. One example is caching intermediate results. We found that using Python's pickle to serialize and deserialize a Pandas DataFrame or NumPy array was actually less space-efficient and much slower than simply storing the objects as CSV files! This was pretty amazing, because CSV files are also more portable (e.g., we can use them with R or MATLAB). Another unintuitive discovery involved our function for writing an array or DataFrame into svmlight format (a popular format for machine learning algorithms).
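For anyone who hasn't seen svmlight format before, here is a minimal sketch of what writing it involves. This is not our actual function: the name write_svmlight and its arguments are made up for illustration, and it assumes dense rows of numbers with one numeric label per row. Each output line is the label followed by index:value pairs, with 1-based feature indices and zero-valued features left out.

```python
import io


def write_svmlight(rows, labels, out):
    """Write rows to `out` in svmlight format (hypothetical helper).

    Each line looks like: <label> <index>:<value> <index>:<value> ...
    Feature indices start at 1, and zero-valued features are skipped.
    """
    for row, label in zip(rows, labels):
        # Keep only the non-zero features, numbered from 1.
        pairs = [f"{j}:{value}" for j, value in enumerate(row, start=1) if value != 0]
        out.write(" ".join([str(label)] + pairs) + "\n")


# Two toy rows with three features each; the labels stand in for sale prices.
buf = io.StringIO()
write_svmlight([[0.0, 2.5, 0.0], [1.0, 0.0, 3.0]], [10500, 8000], buf)
print(buf.getvalue(), end="")
```

The rows can be any iterable of numeric sequences, so a plain list of lists works as well as the rows of an array. For real work, scikit-learn already ships sklearn.datasets.dump_svmlight_file, which is probably the better starting point.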

For now, we're working hard on the competition and, for the sake of ease (i.e., we don't have to update the library every time we need a new function), writing these useful functions directly into the project code. When we have free time, they'll be added to ProtoML.
