Skynet, A New Ruby MapReduce

The MapReduce design pattern to distribute data processing was introduced by Google in 2004, and came with a C++ implementation. A new Ruby implementation is now available under the name of Skynet released by Adam Pisoni.

Skynet is an adaptive, self-upgrading, fault-tolerant, and fully distributed system with no single point of failure.

There are notably 2 key differences between Google's design paper and Skynet:

Skynet cannot send raw code to workers.
Skynet uses a peer recovery system where workers watch out for each other:

If a worker dies or fails for any reason, another worker will notice and pick up that task. Skynet also has no special ‘master’ servers, only workers which can act as a master for any task at any time. Even these master tasks can fail and will be picked up by other workers.

Skynet is very easy to use and set up which is the real strength of the MapReduce concept. Skynet also extends ActiveRecord with MapReduce features such as distributed_find.

Another Ruby MapReduce-like has existed in Ruby for 1.5 years: Starfish. Reading about Peter Cooper's mixed feelings about Starfish, InfoQ caught up with Adam Pisoni about Skynet features and how it compares with Starfish.

How do you compare Skynet with Starfish?
I looked at Starfish before developing Skynet and decided it wasn't robust enough for what I needed. Starfish is a simple system with a great many limitations in terms of scalability, and control. Also there is some question as to how well starfish actually distributes tasks since Ruby actually can't marshal and send code blocks over the wire, only references to it. So if you say, run block X over on machine Y, then machine Y will merely call for that code to be run on X to begin with. In that sense, I'm not sure how it's distributed.

The part of Starfish I'm still confused at and I even exchanged some emails with the author about is how it deals with actual code distribution with DRB. In Starfish you just provide a block of code to use for the map. It turns that block into a DRB object and sends a reference to that object over to a worker. That worker is supposed to execute that code locally... but Ruby DRB does not allow this. The code is always executed on the machine it was compiled on. So as long as all your workers are running on the same machine it would work, but as soon as you try running workers on another machine it will only appear as though the code is being sent to that machine, when in fact the code is really executed on the source.

The biggest other limitation of Starfish is that you can not run a job asynchronously. So, for example, if you want an action on a web page to kick off a map/reduce process, you can't just start a starfish job and let it run while you move on. Whoever kicks off a starfish job has to wait for that job to finish.
In Starfish you write these little apps with the code you want to run built into them. Unless I'm mistaken, you can't run multiple types of MR jobs on the same machine. Skynet is a general MR system that can be running many jobs of many types, ie a lot of different code.

Can you tell us about the strength of Skynet?
Skynet is built on top of a message queue where you can choose what message Q you use based on your scalability requirements. It currently supports tuplespace and mysql. We use mysql because it is far more scalable than TS.
You can then create jobs with total flexibility on how skynet will distribute and execute your job. The most common case we use at geni is to run just asynchronously (which starfish can't do). So you create a new MR job, call run on it and it immediately returns. In the background it added your job to the queue and will get worked on by workers. You can later retrieve the results by calling results on the job object you've got.

Skynet also supports failover. Workers watch each other. If one worker fails to complete a task in time, another worker can pick it up and try and complete it. Skynet also supports streaming map_data, meaning it can work with very large data sets, too large to put into a single data structure.

What is map_data streaming?
Most of the time when you want to run a map_reduce job you have to provide an array of data you want split up and executed in parallel. What if your array would be too large to fit into memory? In this case you provide Skynet with an Enumerable instead of an array. Skynet knows to call :next or :each on that object and start spinning off map_tasks for 'each' one. In this way, no one ever tries to create some massive data structure all at the same time.

Any other features you want to talk about?
There are many more features, but the last feature I'd mention is how well Skynet integrates with your existing applications, including rails apps. It even comes with an extension to ActiveRecord that will let you execute some task on all of your models in a distributed manner. We use this functionality at Geni to run particularly complex migrations that involve executing some ruby code on millions of models.

> Model.distributed_find(:all, :conditions => "id > 20").each(:somemethod)

As long as your running Skynet, it will execute :somemethod on each model, but in a distributed manner (on as many workers as you have). It does this without instantiating the models before distributing it, or even fetching all the ids ahead of time. So it can work on infinitely large data sets.

What are users feedbacks?
There are a few people starting to use it, but its definitely in the early stages. Release 0.9.2 is a pretty significant release with a lot of rewrites, performance improvements, and feature enhancements. We've submitted to give a talk at Railsconf on Skynet, but haven't heard back yet. We're also planning on creating a Screencast on how to use Skynet.

InfoQ Software Architects' Newsletter

Write for InfoQ

Rate this Article

This content is in the Performance & Scalability topic

Related Topics:

Related Editorial

Related Sponsors

Popular across InfoQ

The InfoQ Newsletter