Wednesday, September 4, 2013

RCOS Fall 2013 Proposal

ProtoML is a machine learning prototyping library built to let developers transparently and easily design a workflow for data analysis. This semester, we will be rewriting ProtoML to work primarily with a distributed cluster rather than a single machine. The ultimate goal of the project is to reduce the time and effort needed to build the infrastructure of a typical data science project, while providing an easy way to balance the workload across multiple machines.

At its core, ProtoML is a master/client server setup that uses a REST API to transfer data and execute tasks. The core runtime runs on the master server and interfaces with the user, first through a command-line interface and later through a JavaScript-based web UI, both of which will be separate projects built on the REST API. The tasks themselves (hereafter referred to as transforms) are encapsulated as modules, meaning that anyone can write custom transforms for ProtoML to use. Additionally, these transforms can be written in any language, and we will provide documentation to make doing so as easy as possible.

However, this requires setting a standard among the transforms for serializing the data and models needed to execute them. We will do this by defining a type system for the data and the transforms, allowing you to specify exactly what kind of data a module can use. Each module will then be described by a JSON schema specifying things like its input/output types, how to execute its script, its parameters, and so on.

This is important for the next core part of ProtoML: the error checker. Before executing a workflow, it is important to ensure that no incompatibilities exist between modules; checking beforehand prevents wasting time on data transfers and model training that would ultimately fail for semantic reasons.
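The simplest version of such a check is just walking the workflow and comparing each transform's output type to the next transform's declared input type. The sketch below assumes types are plain strings and the workflow is a linear chain; the real type system and workflow graph will likely be richer:

```go
package main

import "fmt"

// Step pairs a transform name with the type it consumes and the
// type it produces (string types are an illustrative simplification).
type Step struct {
	Name    string
	In, Out string
}

// checkPipeline verifies that each step's output type matches the
// next step's input type, so incompatibilities surface before any
// data is transferred or any model is trained.
func checkPipeline(steps []Step) error {
	for i := 0; i+1 < len(steps); i++ {
		if steps[i].Out != steps[i+1].In {
			return fmt.Errorf("step %q outputs %q but step %q expects %q",
				steps[i].Name, steps[i].Out, steps[i+1].Name, steps[i+1].In)
		}
	}
	return nil
}

func main() {
	good := []Step{{"load", "csv", "matrix"}, {"pca", "matrix", "matrix"}}
	bad := []Step{{"load", "csv", "text"}, {"pca", "matrix", "matrix"}}
	fmt.Println(checkPipeline(good)) // types line up, no error
	fmt.Println(checkPipeline(bad))  // reports the text/matrix mismatch
}
```

Since the check only reads the JSON schemas, it can run in milliseconds even for workflows whose execution would take hours.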

All of the core ProtoML code will be written in Google's Go language for a variety of reasons, including but not limited to its strengths as a systems-level language and its built-in high-performance server support.

The execution model will be controlled by a scheduling module. This modular structure allows different schedulers to be dropped in, supporting both distributed-processing frameworks such as Hadoop and single-server systems.
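In Go, this drop-in structure naturally becomes an interface that each scheduler implements. The method set below is a guess at the minimum needed; a single-server scheduler and a Hadoop-backed one would both satisfy it:

```go
package main

import "fmt"

// Task is a placeholder for one unit of work (a transform execution).
type Task struct{ Name string }

// Scheduler is the interface a drop-in scheduling module would
// satisfy. The core runtime only talks to this interface, so
// swapping schedulers requires no changes elsewhere.
type Scheduler interface {
	Submit(t Task) error
}

// LocalScheduler is a trivial implementation that "runs" tasks on
// the single master machine by recording them in order.
type LocalScheduler struct {
	ran []string
}

func (s *LocalScheduler) Submit(t Task) error {
	s.ran = append(s.ran, t.Name)
	return nil
}

func main() {
	var sched Scheduler = &LocalScheduler{}
	if err := sched.Submit(Task{Name: "train-model"}); err != nil {
		panic(err)
	}
	fmt.Println("task submitted")
}
```

A distributed scheduler would implement the same interface but forward tasks to client machines over the REST API instead of running them locally.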
