Slightly tangential, but I'd like to shamelessly plug my HDFS client library (with nice Python bindings)[0].
If you want to access files on HDFS from your Python tasks outside of the typical map / shuffle inputs from the streaming API, it might come in handy. It doesn't go through the JVM (the library is written in C), so it may save a little latency for short Python tasks.
[0]: https://github.com/cemeyer/hadoofus
Also, I'm pretty new to publishing my own open source libraries. If people would be so kind, I'd love some constructive criticism. Thanks HN!
The only framework I've used of the ones discussed is Hadoop Streaming with Python. For our use case (rapid prototyping of statistical analytics on several-terabyte structured data), it worked perfectly and was almost frictionless.
As the article calls out, we had to detect the boundaries between keys manually, but that didn't add much complexity. We called into Python through Hive scripts. Our group had little prior experience: some had used Python before, and no one had used Hive or Hadoop much (or at all), but we were all productive within a day of ramp-up time. I'm positive we were more productive than if we'd implemented everything in Java, even for the people more experienced in Java than Python.
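For anyone curious, the boundary detection is only a few lines. Here's a simplified sketch (not our actual code, and it assumes the values are integer counts, as in a word-count-style job) of a Streaming reducer that groups the sorted, tab-separated input on stdin by key:

    #!/usr/bin/env python
    # Hadoop Streaming reducer sketch: Hadoop has already sorted the mapper
    # output by key, so lines with the same key arrive consecutively and can
    # be grouped with itertools.groupby.
    import sys
    from itertools import groupby

    def parse(lines):
        for line in lines:
            key, _, value = line.rstrip('\n').partition('\t')
            yield key, value

    for key, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
        # Assumes the values are integer counts.
        total = sum(int(value) for _, value in group)
        print('{0}\t{1}'.format(key, total))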
If I have any future projects requiring similar analysis, I'd like to use Hive with Hadoop Streaming and Python again.
A while back I did some tests comparing the Hadoop Streaming library to the Dumbo library. I have to agree: if your MapReduce jobs aren't overly complicated and don't require chaining, it's best to just write them using the raw Streaming library.
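For context, "raw Streaming" just means your mapper and reducer are plain scripts wired up to stdin/stdout. A word-count-style mapper sketch (the paths and file names in the invocation are placeholders):

    #!/usr/bin/env python
    # Minimal "raw" Hadoop Streaming mapper: read lines from stdin, emit
    # tab-separated key/value pairs on stdout. Hadoop handles partitioning,
    # sorting, and feeding the matching reducer.
    #
    # Typical invocation (the streaming jar path varies by distribution):
    #   hadoop jar hadoop-streaming.jar \
    #     -input /data/in -output /data/out \
    #     -mapper mapper.py -reducer reducer.py \
    #     -file mapper.py -file reducer.py
    import sys

    for line in sys.stdin:
        for word in line.split():
            print('{0}\t{1}'.format(word, 1))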
You can do HBase work via Streaming as well, with this library: https://github.com/vanship82/hadoop-hbase-streaming
Former[1] primary mrjob maintainer here, thanks for the shout-out! I'd like to make a couple of notes and corrections. Also, thank you for recommending mrjob for EMR usage; it's something we've made a point of trying to be the best at.
First of all, all of these frameworks use Hadoop Streaming. As mentioned in the mrjob 0.4-dev docs [2] (pardon the run-on sentence):
"Although Hadoop is primarly designed to work with JVM code, it supports other languages via Hadoop Streaming, a special jar which calls an arbitrary program as a subprocess, passing input via stdin and gathering results via stdout."
mrjob's role is to give you a structured way to write Hadoop Streaming jobs in Python (or, recently, any language). When your task runs, it takes input and produces output the same way your raw Python example does, except that mrjob runs each input line through a deserialization function before passing it to the methods you've defined, and serializes the results again on the way out. It picks which code to run based on command-line arguments such as --mapper and --reducer. The [de]serialization methods are defined declaratively in your class, as you showed.
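To make that concrete, here's roughly what the canonical word-count job looks like in mrjob. The protocol lines are the declarative [de]serialization I mentioned; the ones shown are the defaults (raw lines in, JSON between steps and on output), so you could leave them out:

    from mrjob.job import MRJob
    from mrjob.protocol import JSONProtocol, RawValueProtocol


    class MRWordCount(MRJob):

        # [De]serialization is declared on the class.
        INPUT_PROTOCOL = RawValueProtocol
        INTERNAL_PROTOCOL = JSONProtocol
        OUTPUT_PROTOCOL = JSONProtocol

        def mapper(self, _, line):
            # 'line' has already been deserialized per INPUT_PROTOCOL.
            for word in line.split():
                yield word, 1

        def reducer(self, word, counts):
            yield word, sum(counts)


    if __name__ == '__main__':
        MRWordCount.run()

You can run the same file locally (python word_count.py input.txt, with the file name being whatever you choose) or on EMR with -r emr; when Hadoop Streaming invokes it on the cluster, mrjob picks the mapper or reducer code based on the flags mentioned above.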
So why did you find mrjob to be slower than bare Hadoop Streaming? I don't know! In theory, you're running approximately the same code between the mrjob and the bare Python versions of your script. If anyone has time to dig into this and find out where that time is being spent, I would be grateful. Results should be sent to the issue tracker on the Github page [3].
Feel free to ask clarifying questions. I realize I may not be explaining this effectively to people unfamiliar with the ins and outs of Python MapReduce frameworks.
I'm thinking of organizing the 2nd "mrjob hackathon" in the near future, so please ping me if you're interested in contributing to an approachable OSS project with lots of low-hanging fruit. (Particularly if you're a Rubyist, because we have an experimental way to use Ruby with it.)
[1] mrjob is maintained by Yelp, where I worked until recently. It's still under active development, though it's slowed somewhat since Dave and I moved on.
[2] http://mrjob.readthedocs.org/en/latest/guides/concepts.html#...
[3] http://www.github.com/yelp/mrjob/issues/new