Monday, September 9, 2013

Prototype In The Making

This is just a quick update on our project. We're currently building a small working prototype, which should be ready within a week or so. More news will be posted when it's complete.

Wednesday, September 4, 2013

RCOS Fall 2013 Proposal

ProtoML is a machine learning prototyping library built to allow developers to transparently and easily design a workflow for data analysis. This semester, we will be rewriting ProtoML to work primarily with a distributed cluster as opposed to a single machine. The ultimate goal of the project is to reduce the time and effort needed to build the infrastructure of a typical data science project, while providing an easy way to balance the workload among multiple machines.

At its core, ProtoML is a master/client server setup which will use a REST API to transfer data and execute tasks. The core runtime runs on the master server and interfaces with the user, first through a command line interface and later through a JavaScript-based Web UI, both of which will be separate projects that use the REST API. The tasks themselves (hereafter referred to as transforms) are encapsulated as modules, meaning that anyone can write their own custom transforms for ProtoML to use. Additionally, these transforms can be written in any language you want, and we will provide documentation to make it as easy as possible.

However, this requires a standard to be set among the actual transforms in order to serialize the data and models needed to execute. This will be done by defining a type system for the data and the transforms, allowing you to specify exactly what type of data a module can use. Modules will then be encapsulated in a JSON schema which specifies things like input/output types, how to execute the script, parameters, etc.

This is important for the next core part of ProtoML -- the error-checker. Before executing a workflow it’s important to ensure that no incompatibilities between modules exist, and doing a check beforehand will prevent wasting a lot of time waiting on data transfers and model training that will ultimately fail due to semantic reasons.
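
As a rough illustration of the kind of check described above, here is a minimal Python sketch. The descriptor fields ("name", "input", "output") and the type names are assumptions made up for this example, not ProtoML's actual schema, and the check only handles a linear chain of modules.

```python
import json

# Hypothetical module descriptors. Every field name and type name here is
# an assumption for illustration, not ProtoML's real JSON schema.
MODULES_JSON = """
[
  {"name": "load_csv",  "input": null,             "output": "table"},
  {"name": "scale",     "input": "table",          "output": "numeric_matrix"},
  {"name": "train_svm", "input": "numeric_matrix", "output": "model"}
]
"""


def check_workflow(modules):
    """Return a list of readable type errors for a linear workflow."""
    errors = []
    # Compare each module's output type with the next module's input type.
    for upstream, downstream in zip(modules, modules[1:]):
        if upstream["output"] != downstream["input"]:
            errors.append(
                "%s outputs %r but %s expects %r"
                % (upstream["name"], upstream["output"],
                   downstream["name"], downstream["input"])
            )
    return errors


modules = json.loads(MODULES_JSON)
good_errors = check_workflow(modules)

# Skipping the "scale" step makes the workflow inconsistent, and the check
# reports it before any data is transferred or any model is trained.
bad_errors = check_workflow([modules[0], modules[2]])
```

Running the check before execution means the mismatch between "load_csv" and "train_svm" is caught immediately, rather than after a long transfer or training step.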

All of the core ProtoML code will be written in Google’s Go language for a variety of reasons, including but not limited to its many features as a systems-level language and its built-in high-performance server support.

The execution model will be controlled by a scheduling module. This modular structure makes it possible to drop in different schedulers, supporting both distributed-processing frameworks such as Hadoop and single-server systems.
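
To make the drop-in idea concrete, here is a small sketch in Python (used only for illustration, since the core is planned in Go). The class names and the `run` method are hypothetical; a thread pool stands in for a real distributed backend.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class Scheduler(ABC):
    """Hypothetical scheduler interface; not ProtoML's actual API."""

    @abstractmethod
    def run(self, tasks):
        """Run zero-argument callables and return results in task order."""


class SerialScheduler(Scheduler):
    """Runs every task one after another on a single machine."""

    def run(self, tasks):
        return [task() for task in tasks]


class ThreadPoolScheduler(Scheduler):
    """Stand-in for a distributed backend: runs tasks on worker threads."""

    def __init__(self, workers=4):
        self.workers = workers

    def run(self, tasks):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # pool.map preserves the order of the input tasks.
            return list(pool.map(lambda task: task(), tasks))


def execute(tasks, scheduler):
    # The caller depends only on the Scheduler interface, so schedulers
    # can be swapped without touching the workflow code.
    return scheduler.run(tasks)


tasks = [lambda n=n: n * n for n in range(4)]
serial_results = execute(tasks, SerialScheduler())
pooled_results = execute(tasks, ThreadPoolScheduler(workers=2))
```

Both schedulers return the same results for the same tasks, which is the property that lets one be replaced by the other.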

Tuesday, April 23, 2013

Post-Kaggle Updates

The Kaggle competition finally ended and we updated ProtoML with all the cool tools that we made. This was definitely one of those examples of not knowing what we didn't know, having previously only done theoretical machine learning.

Looking back, it's pretty amazing how much we've learned since we started the project, and some of what I've learned I now use every day (like how to use the amazing data analysis library, Pandas).

Moving forward, we have a lot of work to do and great ideas for things to implement. While I can't speak for Bharath, I know that I'll be redoing a lot of the old stuff I made, merging the Feature class and proto_col (a command line tool for doing human-assisted data analysis) into something much, much more powerful. Stay tuned for updates!

Tuesday, March 26, 2013

Practical Machine Learning

Bharath and I have been working on a Kaggle competition for a few months: The Blue Book For Bulldozers Challenge. Our job is to predict, as accurately as possible, the price a bulldozer will sell for, given the machine's specs and other such observations along with the selling prices from over 400,000 past sales. The competition goes on to the next round on April 10th!

Throughout the competition, we've written a lot of functions that are generally helpful for real-world, practical machine learning, things we had no clue about when we only studied theory. One example is caching intermediate results. We found that using Python's pickle to serialize and deserialize a Pandas DataFrame or NumPy array was actually less space efficient and way slower than simply storing the objects as CSV files! This was pretty amazing, because the CSV files are also more portable (e.g., we can use them with R or MATLAB). Another unintuitive discovery involved our function for writing an array or DataFrame into svmlight format (a popular format for machine learning algorithms).
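
For readers who haven't seen svmlight format, here is a minimal pure-Python sketch of what such a writer produces. It is not our actual function: the name `to_svmlight_lines` is made up for this example, it takes plain lists instead of an array or DataFrame, and it assumes the common convention of 1-based feature indices with zero-valued features omitted.

```python
def to_svmlight_lines(rows, labels):
    """Format dense rows as svmlight lines: 'label index:value ...'."""
    lines = []
    for label, row in zip(labels, rows):
        # Assumption: feature indices are 1-based and ascending, and
        # zero-valued features are skipped (the format is sparse).
        features = [
            "%d:%r" % (index, float(value))
            for index, value in enumerate(row, start=1)
            if value != 0
        ]
        lines.append(" ".join([str(label)] + features))
    return lines


rows = [[0.0, 2.5, 0.0], [1.0, 0.0, 3.0]]
labels = [1, -1]
lines = to_svmlight_lines(rows, labels)
# lines[0] is "1 2:2.5" and lines[1] is "-1 1:1.0 3:3.0"
```

Each line holds one sample: the label first, then only the non-zero features, which keeps files small for sparse data.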

For now, we've been working hard on the competition, and for the sake of ease (i.e., so we don't have to update the library every time we need a new function), we've been writing these useful functions into the project code. When we have free time, they'll be added to ProtoML.

Friday, March 8, 2013

"sofia-kmeans" and Progress on Documentation

Just as a quick update, I've been working on integrating other libraries. I've spent most of the time (failing) on WiseRF, a tool for extremely efficient random forests, and a good amount of time on sofia-kmeans, a clustering tool. On my to-do list, I have:

  1. Fixing the WiseRF wrapper.
  2. Making a Vowpal Wabbit wrapper.
  3. Making a sofia-ml wrapper.
But here's the awesome news. Documentation is coming along nicely. My wrapper for sofia-kmeans is about 50% docstrings (and I've begun retroactively documenting some utility functions):

Monday, March 4, 2013

ProtoML Features In Action!

As promised, here's the sample! This is a small snippet of code for analyzing a dataset that we've been playing with. I actually had hundreds of lines of non-ProtoML analysis already made that I replaced with these ~40 lines, and I can verify that it makes the analysis much easier.

In order to understand how the above code works, you just need to understand the anatomy of a feature transform (note: this may change in the future):
Feature Transforms are tuples/lists in the form: 

  1. CHILD -  A string or None specifying what the resulting columns will be named. A child of None means no columns will be added.
  2. FILTER - This can be a lot of things. If it's an integer or slice, it gets the specified column numbers. If it's a string, it's treated as a regular expression used to filter column names. If it's a function, it is applied to each column name, and the names for which the result evaluates to true are used. It can also be a list of any of the above.
  3. TRANSFORM - This takes in the data specified by the filter. This can be either a class which has fit/fit_transform/transform/predict/etc. called on it with the data, a function that takes in the data, or just raw data whose columns can be added to the Feature's data.
  4. REMOVE - An optional parameter. If set to True, the columns specified by the filter are removed after the transform.
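
The FILTER rules above can be sketched in a few lines of Python. This is only an illustration, not the Feature class's actual code: `resolve_filter` is a hypothetical helper, and it assumes a regular expression selects the column names it matches (using `re.search`) and that a list combines the results of its elements.

```python
import re


def resolve_filter(column_filter, columns):
    """Return the column names selected by a FILTER (hypothetical helper)."""
    if isinstance(column_filter, (list, tuple)):
        # A list combines the results of each element, without duplicates.
        selected = []
        for part in column_filter:
            for name in resolve_filter(part, columns):
                if name not in selected:
                    selected.append(name)
        return selected
    if isinstance(column_filter, int):
        return [columns[column_filter]]
    if isinstance(column_filter, slice):
        return list(columns[column_filter])
    if isinstance(column_filter, str):
        # Assumption: a string selects the names the pattern matches.
        pattern = re.compile(column_filter)
        return [name for name in columns if pattern.search(name)]
    if callable(column_filter):
        return [name for name in columns if column_filter(name)]
    raise TypeError("unsupported filter: %r" % (column_filter,))


columns = ["age", "price_usd", "price_eur", "year_made"]

by_index = resolve_filter(0, columns)
by_slice = resolve_filter(slice(1, 3), columns)
by_regex = resolve_filter("^price", columns)
by_function = resolve_filter(lambda name: name.endswith("made"), columns)
by_list = resolve_filter([0, "^price"], columns)

# A feature transform in the (CHILD, FILTER, TRANSFORM, REMOVE) form might
# then look like this; the names are made up for illustration.
example_transform = ("price_abs", "^price", abs, True)
```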

Key things:

  • Anyone can make their own custom transforms (I actually made 3 in the sample: one for performing a Barnes-Hut t-SNE, one for randomized PCA, and the other to print some plots, since we're still working on our visualization stuff). My hope was that data scientists would actually build their own personal libraries of transforms, and whenever they work on a new dataset, it's simply a matter of mixing and matching.
  • It's super easy to change the dataflow by just commenting out a feature transform or reordering the transforms.
  • Best of all, exactly what you're doing to the data is clear as crystal.
  • This mostly shows just the basic Feature class, and we're planning on adding integration with a lot of external libraries soon.

Wednesday, February 27, 2013

The Feature Class

The class that I was spending the most time on appears to be complete. All that's missing now is a lot of test cases to make sure everything is working as intended and some smoothing of the design. I'll have some sample code up shortly (hopefully tonight). I'm actually working on converting my machine learning research code to use ProtoML, so I can show a side-by-side comparison. Some of the cool features, though, are:
  • automated feature transforms
  • regex indexing
  • easy concatenation of data frames (still making it more elegant but it works)
  • lazy hashing for caching
  • and pretty much everything normal data frames can do

Other than that, we're still working on tests and docs, and implementing more features!

Tuesday, February 19, 2013

Designing Usability

Getting the simple stuff to work like fit(), predict(), and transform() was the easy part, we hope. Since we both want to be able to use ProtoML ourselves and also share it with the world, we are working on a ton of different usability issues right now.

Priority number one is a really good test suite (in my opinion). We have a bunch of code that we think should work, but now we need to show it works. Because of this, we spent a good part of the last week working on documentation and unit tests. The idea is that it will make it easier to develop in the long term because every new feature should only require a few extra test cases and we'll be instantly able to see if it works with no guesswork needed. Priority number two has been documentation. We have auto-docs set up, so all we need to do is comment our code nicely. We have most of the new features on pause until we have satisfactory coverage in these two aspects.

One big topic this week was dependencies. We always wanted our library to be a sort of glue connecting a variety of other libraries, and this will inevitably lead to a lot of dependencies. I (Diogo) want to minimize dependencies by having everything non-essential live in its own sub-module, partitioned by dependency, for ease of use, and Bharath wants all dependencies to be required so that it just works. There are obviously pros and cons to each, and we'd love to hear opinions on this.

A second big point of argument was whether to follow standard Python conventions or to use a context manager to add nodes (for an example, see the last post, where the nodes are created). Bharath argues that it makes the code easier to write and more readable, while I argue that it makes it unconventional and thus harder for an average user to understand. Bharath thinks that there could be amazing possibilities in having all sorts of code in the context manager (for loops, for example), and I agree that it looks great and follows the DRY principle. I just think that taking in a list of nodes would only take a little more code and make a lot more sense to most people. Furthermore, we are trying to decide if I should allow feature transforms to be created in the same manner. There may be a big debate soon on this...
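
To show the two styles side by side, here is a toy Python sketch. It is not ProtoML's API: the `Graph` class, `add_nodes`, and `adding` are hypothetical names, and the "nodes" are just strings.

```python
from contextlib import contextmanager


class Graph:
    """Toy dataflow graph that only records node names."""

    def __init__(self):
        self.nodes = []

    def add_nodes(self, names):
        # List style: the nodes are passed in explicitly.
        self.nodes.extend(names)

    @contextmanager
    def adding(self):
        # Context manager style: nodes are registered through a callback
        # that is meant to be used inside the `with` block.
        yield self.nodes.append


# List style.
list_graph = Graph()
list_graph.add_nodes(["input", "svm", "score"])

# Context manager style, with arbitrary code such as a loop inside.
ctx_graph = Graph()
with ctx_graph.adding() as node:
    node("input")
    for name in ["svm", "score"]:
        node(name)
```

Both graphs end up with the same nodes. The difference is purely in how the code reads: the list version is explicit, while the context manager version allows loops and other logic at the point where nodes are declared.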

Thursday, February 14, 2013

An Early Example

This is some alpha sample code from a user perspective on how to work with ProtoML:
This code sets up a basic dataflow going from input data to machines and then scoring. ProtoML is made to be a high-level framework to glue together different data analysis libraries. This is done by constructing nodes that act as containers for these libraries' functions and then connecting the nodes together to create a dataflow network. The nodes themselves are also relatively easy to construct; this will be featured in a later post. The ones shown above are scikit-learn container nodes.

This example used a built-in dataset. Diogo is currently working on data handling and feature transformation, which will be covered in a blog post very soon.

Tuesday, February 5, 2013

An Introduction To ProtoML

What is ProtoML?

ProtoML is a machine learning library built on top of scikit-learn (and hopefully a few more libraries soon!) with an aim for ease of use and rapid prototyping. We are part of a Kaggle group at RPI, and we were searching for easy-to-use machine learning libraries and frameworks to quickly hack out some data analysis. Some of our favorites include scikit-learn, Orange, and Ramp. But none of them made it really easy to get off the ground once you have some clean data. There were always some hoops to jump through to start scaling out, trying different machines, and using different features. That is why we decided to create a meta-modeling machine learning framework to make it as simple as possible to chain together feature selection and machines in different combinators.

Who is behind ProtoML?
Diogo Moitinho de Almeida & Bharath Santosh! Two students from RPI with too much free time and a dream (just kidding about the free time, don’t give me more homework Prof. Goldschmidt -Diogo).

What are our goals for the semester?
  • Make the implementation of machine learning algorithms as simple to try out as possible.
  • Implement some features missing from scikit-learn that are simple yet time consuming.
  • Provide a framework for automating as much of the data analysis process as possible.
  • Have everything run fast. Do as much as possible in Cython, and try to cache everything.
  • Eventually act as the glue between the wide variety of available Python machine learning libraries.
  • Win a Kaggle competition.

Where can you learn more?
To see all that we have available and use our latest prototypes, check out: (you should do it; we love guinea pigs)

What’s next for the blog?
We are going to do a combination of rough overviews of machine learning concepts and how to use them with ProtoML, and keep everyone updated with the latest and greatest features!

Minor update:
Tons of progress, and we just finished our very first meeting as an official RCOS group! Yay for us! We will be putting up a blog post by next week with sample code to run for basic machine learning.