How to solve such an optimization problem efficiently?

What is the best way to solve a convex optimization problem on Hadoop?

  • I am curious whether the Hadoop infrastructure supports a general-purpose parallel SGD or coordinate-descent (CD) solver, or even a CG solver. What does it look like, and is it available for general use? Examples of the kind of work I have in mind: http://www.cs.utexas.edu/~cjohnson/ParallelCollabFilt.pdf and http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf (Note: I rephrased this question to tighten it up and get to the actual question I am asking.)

  • Answer:

    I don't have a great answer to your question, so, as compensation for your credits, I will kind of answer a different question.

    Solid? Yes, it's been stable for years. It sounds like you're looking for MPI-like infrastructure. Hadoop does nothing specific to linear algebra, and nothing I would call "cache-aware" in the CPU sense. The Mappers and Reducers are black boxes; Hadoop's optimizations are at the I/O and network level, like moving the work to the data. Nothing on Hadoop or Storm will compete with a tuned linear algebra library on one big machine for speed.

    For the kind of design you have in mind, I think you want something lower-level to build on, like YARN. Hadoop in the sense of "MapReduce" is a data-parallel paradigm, and I think you would lose more from its limitations than you gain if you want to build the kinds of things you have in mind. I don't know of a good SGD or matrix library for Hadoop. They're writable, and you will find some starts in Mahout, for example.

    I can speak from experience trying to make NNMF run fast on Hadoop (http://myrrix.com). There's an in-memory, parallel Java implementation (ALS-WR-based) and an implementation on Hadoop. Let's say they're both near-optimal. If the in-memory implementation takes about X resources to factor a given input, the Hadoop implementation takes 2-4X. That is, all the serialization, network traffic and such ends up being most of the work. (In comparison, I found the equivalent in GraphLab, a slick, efficient distributed paradigm, was at about 2X. Better than Hadoop, slower than in-memory, of course.)

    To answer your points: in Hadoop I would not do random access to HBase from a mapper. If you need a join, you usually try to put half the join (the small half!) in memory. This makes all the difference and is a common, necessary design 'cheat' that you can almost always get away with (see the sketch below). You can control partitioning; I myself did only something lightly intelligent there. There is no way to do anything significantly complex in one Map/Reduce, so it's hard to say "what's necessary in the reducer". My pipeline involves 9-15 Map/Reduce jobs, of which the core is one, iterated. I have no benchmarks on a high-speed rack. Amazon EMR performance is "pretty good" relative to a cluster of commodity machines, based on a range of cluster experience. Use m1.xlarge instances for the best bang for the buck.

    Now to the question you didn't ask. Given the above, why would one (like me) bother with Hadoop, and why wouldn't one (like you) just use a heady blend of MPI, Python, LAPACK, GPUs, etc.? Five years ago, anyone who knew what these things were was sophisticated and had a sophisticated problem, and maybe even a large one for the time. Today, the market is broader. Many people don't have big, sophisticated data problems but still want to solve the ones they have. If you don't have a big problem that outgrows a monster 64GB machine, you don't need distributed anything. If you don't have sophisticated needs, you also probably don't want to bother with anything so complex as what you describe. Some people do have a big data problem, or will, or will once it becomes easy to process a lot of data. That will outgrow one machine, so a lot of the sophisticated stuff of the past 30 years goes out the window. For better or worse, Hadoop is the thing people have at their fingertips for distributing work. Can you implement the same stuff on Hadoop? Yes, it's just Java and fully expressive. Is it efficient? No, not compared to in-core implementations. But, weirdly, it doesn't matter.

    Computing is so cheap that being 2-4X slower is not a big deal. Maybe I don't mind if my cost goes from $5/day on a big machine to $20/day by distributing it on something like Hadoop. I care that it works, uses the infrastructure where my data already is, that I didn't have to hire a consultant to build it, and that it scales reasonably. Maybe at $2000/day it hurts and I justify building something custom. That is the shape of the argument for why Hadoop (and its successors) are still the interesting place to build for the mass market, despite a lot of individual reasons it's not optimal for individual problems.
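
    To make the "small half in memory" join concrete, here is a minimal Hadoop Streaming-style mapper sketch in Python. The file name, field layout, and join key are hypothetical, and a real job would ship the small file to each mapper (for example via the distributed cache):

```python
#!/usr/bin/env python
# Map-side join sketch: the small half of the join is loaded into memory in
# every mapper, so the small table never goes through the shuffle.
# Assumes a hypothetical 'users_small.tsv' (user_id \t attribute) shipped with
# the job; the big input arrives on stdin as tab-separated
# (user_id, item_id, rating) records.
import sys

# Load the small half of the join into a dict once per mapper.
small = {}
with open("users_small.tsv") as f:
    for line in f:
        if not line.strip():
            continue
        user_id, attribute = line.rstrip("\n").split("\t")
        small[user_id] = attribute

# Stream the big half and emit joined records.
for line in sys.stdin:
    if not line.strip():
        continue
    user_id, item_id, rating = line.rstrip("\n").split("\t")
    if user_id in small:
        print("\t".join([user_id, item_id, rating, small[user_id]]))
```

    Run as the mapper of a map-only job (no reducer): because the small table never hits the shuffle, the whole join happens on the map side, which is the kind of one-pass "cheat" described above.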

Sean Owen at Quora


Other answers

Mahout's SVD is quite reasonable for very large sparse problems. As Sean says, it isn't going to compete with small solutions like single-machine BLAS or even small-cluster MPI, but it will kill them at sufficiently large scale. In practice, iterative matrix factorizers work very well for very large sparse systems on Hadoop, since you often have a good approximation to start from (i.e. yesterday's answer) and can converge in a single step (see the toy sketch below).

As far as useful large-scale numerical applications go, check out Vowpal Wabbit for SGD, or SkyTree for MPI-like machine learning. And remember that with a system like MapR, it is quite feasible to mix conventional MPI steps and real-time incoming data with Hadoop reduction steps. This is largely because MapR's file system behaves much more like a conventional file system than HDFS does, so conventional programs work the way you would expect.
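
As a toy illustration of that warm-start point (this is not Mahout code; a small dense NumPy problem stands in for the large sparse distributed one, and the sizes and regularization are made up), a single regularized ALS sweep started from "yesterday's" factors is already close to converged:

```python
# Toy warm-started ALS sketch: starting from yesterday's factors, one
# alternating-least-squares sweep is often enough. Dense NumPy stands in for
# the sparse, distributed case; n_users, n_items, k and lam are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k, lam = 50, 40, 5, 0.1

# "True" factors and the ratings they generate (synthetic data).
U_true = rng.normal(size=(n_users, k))
V_true = rng.normal(size=(n_items, k))
R = U_true @ V_true.T

def als_pass(U, V):
    """One ALS sweep: solve for U given V, then V given U (ridge-regularized)."""
    U = np.linalg.solve(V.T @ V + lam * np.eye(k), V.T @ R.T).T
    V = np.linalg.solve(U.T @ U + lam * np.eye(k), U.T @ R).T
    return U, V

def rmse(U, V):
    return np.sqrt(np.mean((R - U @ V.T) ** 2))

# "Yesterday's answer": factors already close to a good solution.
U = U_true + 0.01 * rng.normal(size=U_true.shape)
V = V_true + 0.01 * rng.normal(size=V_true.shape)
print("before pass:", rmse(U, V))
U, V = als_pass(U, V)
print("after one pass:", rmse(U, V))  # typically already near its converged value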

Ted Dunning

Stephen Boyd is active in the area of distributed optimization.  Check out http://www.stanford.edu/~boyd/papers/admm_distr_stats.html for the major algorithm, details on a MapReduce implementation, and related work.
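
For a flavor of the global-variable consensus form of ADMM from that paper, here is a plain NumPy sketch for distributed least squares; it is not the paper's MapReduce implementation, and the block count, sizes, and rho are illustrative. The per-block solve corresponds to the map step and the averaging to the reduce step:

```python
# Consensus-ADMM sketch for distributed least squares: each "worker" holds a
# block (A_i, b_i), solves its local subproblem, and an averaging step plays
# the role of the reduce. Plain NumPy; all problem sizes and rho are made up.
import numpy as np

rng = np.random.default_rng(1)
n_blocks, m, d, rho = 4, 100, 10, 1.0

A_blocks = [rng.normal(size=(m, d)) for _ in range(n_blocks)]
x_true = rng.normal(size=d)
b_blocks = [A @ x_true + 0.01 * rng.normal(size=m) for A in A_blocks]

x = [np.zeros(d) for _ in range(n_blocks)]   # local primal variables
u = [np.zeros(d) for _ in range(n_blocks)]   # scaled dual variables
z = np.zeros(d)                              # global consensus variable

for it in range(50):
    # "Map": each block minimizes 0.5*||A_i x - b_i||^2 + (rho/2)*||x - z + u_i||^2
    for i in range(n_blocks):
        A, b = A_blocks[i], b_blocks[i]
        x[i] = np.linalg.solve(A.T @ A + rho * np.eye(d),
                               A.T @ b + rho * (z - u[i]))
    # "Reduce": average to update the consensus variable, then the duals.
    z = np.mean([x[i] + u[i] for i in range(n_blocks)], axis=0)
    for i in range(n_blocks):
        u[i] += x[i] - z

print("consensus error:", np.linalg.norm(z - x_true))
```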

Justin Rising
