Tuesday, March 26, 2013

Practical Machine Learning

Bharath and I have been working on a Kaggle competition for a few months: the Blue Book for Bulldozers challenge. The job is to predict, as accurately as possible, the price a bulldozer will sell for, given the machine's specs and other observations, using the selling prices of over 400,000 past sales. The competition moves on to its next round on April 10th!

Throughout the competition, we've written a lot of functions that are generally useful for real-world, practical machine learning, things we had no clue about when we only studied theory. One example was caching intermediate results. We found that using Python's pickle to serialize and deserialize a Pandas DataFrame or NumPy array was actually less space-efficient and much slower than simply storing the objects as CSV files! This was pretty amazing, because the CSV files are also more portable (e.g., we can use them with R or MATLAB). Another useful discovery was our function for writing an array or DataFrame in svmlight format (a popular input format for machine learning tools).
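To give a flavor of what we mean, here's a minimal sketch of two such helpers. The names (`cache_csv`, `to_svmlight`) are hypothetical, not our actual library code, and the svmlight writer only handles the basic "label index:value" case:

```python
import numpy as np
import pandas as pd

def cache_csv(df, path_or_buf):
    """Cache an intermediate DataFrame as CSV (hypothetical helper).

    In our tests this was both smaller on disk and faster to round-trip
    than pickling the DataFrame.
    """
    df.to_csv(path_or_buf, index=False)

def to_svmlight(X, y):
    """Return svmlight-format lines for a 2-D array X and labels y.

    svmlight format is one example per line: 'label index:value ...',
    with 1-based feature indices and zero-valued features omitted.
    """
    lines = []
    for row, label in zip(np.asarray(X), y):
        feats = " ".join("%d:%g" % (i + 1, v)
                         for i, v in enumerate(row) if v != 0)
        lines.append("%g %s" % (label, feats))
    return lines
```

For real projects, scikit-learn's `sklearn.datasets.dump_svmlight_file` covers the same format with more options; the point here is just how little code these conveniences take.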

For now, we've been working hard on the competition, so for the sake of ease (i.e., so we don't have to update the library every time we need a new function), we're writing these useful functions into the project code. When we have free time, they'll be added to ProtoML.

Friday, March 8, 2013

"sofia-kmeans" and Progress on Documentation

Just a quick update: I've been working on integrating other libraries. I've spent most of the time (failing) on WiseRF, a tool for extremely efficient random forests, and a good amount of time on sofia-kmeans, a clustering tool. On my to-do list, I have:

  1. Fixing the WiseRF wrapper.
  2. Making a Vowpal Wabbit wrapper.
  3. Making a sofia-ml wrapper.
But here's the awesome news: documentation is coming along nicely. My wrapper for sofia-kmeans is about 50% docstrings (and I've begun retroactively documenting some utility functions):

Monday, March 4, 2013

ProtoML Features In Action!

As promised, here's the sample! This is a small snippet of code for analyzing a dataset that we've been playing with. I actually had hundreds of lines of non-ProtoML analysis code already written, which I replaced with these ~40 lines, and I can verify that it makes the analysis much easier.

In order to understand how the above code works, you just need to understand the anatomy of a feature transform (note: this may change in the future).
Feature transforms are tuples/lists of the form:

  1. CHILD -  A string or None specifying what the resulting columns will be named. A child of None means no columns will be added.
  2. FILTER - This can be several things. If it's an integer or slice, it selects the columns at those positions. If it's a string, it's treated as a regular expression matched against column names. If it's a function, it's applied to each column name, and the column is used if the result evaluates to true. It can also be a list of any of the above.
  3. TRANSFORM - This takes in the data specified by the filter. It can be a class instance that has fit/fit_transform/transform/predict/etc. called on it with the data, a function that takes in the data, or just raw data whose columns will be added to the Feature's data.
  4. REMOVE - An optional parameter. If set to True, the columns specified by the filter are removed after the transform.
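As a toy illustration of the semantics above, here's a minimal re-implementation of applying one such tuple to a DataFrame. This is a sketch, not ProtoML's actual code, and the output-column naming (`child_0`, `child_1`, ...) is an assumption:

```python
import re
import numpy as np
import pandas as pd

def apply_transform(df, transform):
    """Apply a (CHILD, FILTER, TRANSFORM[, REMOVE]) tuple to a DataFrame.

    Toy re-implementation of the anatomy described above; not ProtoML.
    """
    child, flt, tf = transform[:3]
    remove = transform[3] if len(transform) > 3 else False

    # FILTER: int/slice -> positional columns, str -> regex on names,
    # callable -> predicate on names.
    if isinstance(flt, int):
        cols = [df.columns[flt]]
    elif isinstance(flt, slice):
        cols = list(df.columns[flt])
    elif isinstance(flt, str):
        cols = [c for c in df.columns if re.search(flt, c)]
    elif callable(flt):
        cols = [c for c in df.columns if flt(c)]
    else:
        raise TypeError("unsupported filter: %r" % (flt,))

    data = df[cols]

    # TRANSFORM: estimator with fit_transform, plain function, or raw data.
    if hasattr(tf, "fit_transform"):
        result = tf.fit_transform(data)
    elif callable(tf):
        result = tf(data)
    else:
        result = tf

    # REMOVE: optionally drop the filtered columns afterwards.
    out = df.drop(columns=cols) if remove else df.copy()

    # CHILD: None means no columns are added.
    if child is not None:
        result = np.asarray(result)
        if result.ndim == 1:
            result = result.reshape(-1, 1)
        for i in range(result.shape[1]):
            out["%s_%d" % (child, i)] = result[:, i]
    return out
```

For example, `("sum", "a|b", lambda d: d.sum(axis=1), True)` would add a `sum_0` column holding the row-wise sum of columns matching `a|b` and then remove those source columns.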

Key things:

  • Anyone can make their own custom transforms (I actually made 3 in the sample: one for performing a Barnes-Hut t-SNE, one for randomized PCA, and the other to print some plots, since we're still working on our visualization stuff). My hope was that data scientists would actually build their own personal libraries of transforms, and whenever they work on a new dataset, it's simply a matter of mixing and matching.
  • It's super easy to change the dataflow by just commenting out a feature transform or reordering the transforms.
  • Best of all, exactly what you're doing to the data is crystal clear.
  • This mostly shows just the basic Feature class; we're planning to add integration with a lot of external libraries soon.