Which data stream mining tools can handle Big Data?
-
For traditional machine learning, it is necessary to periodically train a new model using current data, in order to respond to new circumstances. As this batch approach is sometimes too slow in this real-time world (e.g. for rapidly responding to changing fraud patterns), I'm looking for tools that can refresh models incrementally. I've found some data stream mining tools - tools that can apply machine learning algorithms (e.g. classification, regression, clustering) on a data stream rather than a data set - but I have not heard of any big implementations. Are there any data stream mining tools being used to process big fire hoses of data?
-
Answer:
If you prefer to work within the Java / Apache ecosystem, it should be quite straightforward to reuse the hashing text document vectorizer and the Stochastic Gradient Descent (SGD) logistic regression classifier implementation from Mahout by wrapping them as S4 Processing Elements. You can also use ZooKeeper to periodically share and average the weights of several SGD instances running concurrently on the same S4 event stream, in case one machine cannot keep up with the event rate. Weight averaging should work well enough on linear models. The other approach is to perform feature sharding, as VW (Vowpal Wabbit) does in its distributed streaming mode. In any case, start with a singleton SGD PE instance before diving into a parallelized implementation: SGD updates are really fast, and it is likely that upstream feature extraction will be the bottleneck anyway.

As for online / streaming clustering, it should be quite straightforward to implement sequential k-means as an S4 Processing Element downstream of the Mahout hashing text document vectorizer: http://www.cs.princeton.edu/courses/archive/fall08/cos436/Duda/C/sk_means.htm A good trick to know when implementing this is to count the number of times each centroid is activated, and from time to time discard the least activated centroids and replace them with random variations of the most activated ones. This helps k-means avoid getting stuck in bad local optima.

Also, k-means can benefit a lot from whitening. However, it is not possible (AFAIK) to compute a streaming online PCA estimation. It might be interesting to experiment with running the Mahout SVD implementation on a batch of historical data collected from your stream, and then using the whitened singular vectors as a fixed (or batched, asynchronously updated) projection basis for preprocessing your streaming k-means input.
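The sequential k-means variant described above (per-centroid activation counts, plus the trick of respawning rarely activated centroids near the most activated one) can be sketched in a few lines. This is an illustrative Python sketch, not S4/Mahout code; the class name, the `min_count` threshold, and the `noise` scale are all made-up parameters for the example.

```python
import numpy as np

class SequentialKMeans:
    """Sequential (online) k-means with per-centroid activation counts.

    Each incoming vector pulls its nearest centroid toward it with a
    step size of 1/count, as in the Duda sequential k-means notes
    linked above.
    """

    def __init__(self, k, dim, seed=0):
        self.rng = np.random.default_rng(seed)
        self.centroids = self.rng.normal(size=(k, dim))
        self.counts = np.zeros(k, dtype=int)

    def partial_fit(self, x):
        # Assign x to its nearest centroid and update it online.
        dists = np.linalg.norm(self.centroids - x, axis=1)
        j = int(np.argmin(dists))
        self.counts[j] += 1
        self.centroids[j] += (x - self.centroids[j]) / self.counts[j]
        return j

    def respawn_weak_centroids(self, min_count=5, noise=0.01):
        # The trick from the answer: replace rarely activated centroids
        # with random perturbations of the most activated one, to avoid
        # getting stuck in bad local optima.
        best = int(np.argmax(self.counts))
        for j in np.where(self.counts < min_count)[0]:
            if j == best:
                continue
            self.centroids[j] = (self.centroids[best]
                                 + noise * self.rng.normal(size=self.centroids.shape[1]))
            self.counts[j] = 0
```

In an S4 deployment, `partial_fit` would sit in the PE's event handler, consuming the hashed document vectors, and `respawn_weak_centroids` would run on a timer.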
Olivier Grisel at Quora
Other answers
You may be interested in Vowpal Wabbit: http://hunch.net/~vw/.
Jeff Hammerbacher
Ok, I think some sorting is necessary to sift through the different answers here.

First of all, as has been correctly pointed out, there exist quite a few online versions of algorithms like clustering and classification which can naturally work with data streams. Stochastic gradient descent automatically works with data streams, for example. The main difference between using SGD on large data sets and on data streams is probably that data streams can be non-stationary, and you need ways to deal with that. Here is a pretty good tutorial on this topic: https://sites.google.com/site/advancedstreamingtutorial/ There exist open source projects which implement these kinds of algorithms, for example MOA - Massive Online Analysis (http://moa.cs.waikato.ac.nz/) or RapidMiner (http://rapid-i.com).

Then there are software frameworks for scaling stream processing applications, like Yahoo's S4 (http://incubator.apache.org/s4/) or Twitter's Storm (http://storm-project.net/). But these are general-purpose frameworks with no built-in support for machine learning; you would have to do a fair amount of coding to make them work for you.

Apache Flume is again a different piece of software, which helps you pipe stream data into a Hadoop cluster. But after that you'd have to use a more batch-oriented MapReduce approach to learn, so it's probably not sufficiently real-time.

Complex Event Processing tools are another approach; they are basically something like SQL for stream data. But the focus is more on extracting statistics like running averages from your data, less on advanced machine learning.

Finally, I'd also like to mention http://streamdrill.com, a real-time event analysis project I'm working on which lets you efficiently aggregate event activities over large item spaces. You can then build adaptive ML methods on top of these statistics, something we're working on right now.
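To make the "SGD automatically works with data streams" point concrete, here is a minimal Python sketch using scikit-learn's `SGDClassifier.partial_fit`, which updates a linear model one mini-batch at a time. The synthetic stream (random features, label = sign of the first feature) is made up purely for illustration; a real deployment would also need drift handling for non-stationary streams, which this sketch omits.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()          # linear model trained by SGD
classes = np.array([0, 1])     # must be declared up front for partial_fit
rng = np.random.default_rng(0)

# Consume the "stream" in mini-batches; the model is usable after
# every update, so there is no separate batch retraining step.
for _ in range(50):
    X = rng.normal(size=(32, 10))
    y = (X[:, 0] > 0).astype(int)      # toy labeling rule
    clf.partial_fit(X, y, classes=classes)

X_test = rng.normal(size=(200, 10))
y_test = (X_test[:, 0] > 0).astype(int)
print("held-out accuracy:", clf.score(X_test, y_test))
```

On a non-stationary stream you would additionally monitor error on recent batches and, for example, increase the learning rate or reset the model when drift is detected.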
Mikio L. Braun
DarkStar was built from the ground up for data stream mining, and is currently being deployed in an environment that deals with over 3,000,000,000 messages a day. Yahoo's S4 takes the Hadoop paradigm and makes it event driven; you may find that of interest. Contrary to popular belief, it's actually easier to do a lot of machine learning kinds of things on streams - it can be done in real time. I spend a lot of my day focusing on exactly those types of issues. For some insight, check out Jeff Jonas's blog: http://jeffjonas.typepad.com

DarkStar, unlike S4 or DataSift (mentioned above), is a general purpose, distributed engine that incorporates 4 very important things:

1) Time Windows - the ability to look for things, or calculate aggregates, within time- or length-based windows. Like, "What is the average price of IBM over the last 10 minutes?"

2) Pattern Matching (not regular expressions) - the ability to describe a sequence of events. Like, "Tell me when a user tweets about airplanes, but only after another user tweeted about 9/11, and only if the tweets are within 5 minutes of each other."

3) Continuous Query - once a query is submitted for evaluation, the results are returned continuously until you're no longer interested in that query.

4) A Language - it's nice to have a language to tie all of this together. Some products have simple DSLs that facilitate this, some don't. DarkStar implements a Turing-complete language for the manipulation of streaming data.

There's a lot of research in this area - look for information and books on sensor networks. You'll find the more advanced concepts described there, with discussions of real implementations. There's a lot of work coming out of Portugal lately. One interesting offering is KNIME; or Weka. You may be able to cobble something together using components from either of those offerings. This is an exciting area of research and application.
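The time-window aggregate in point 1 ("average price of IBM over the last 10 minutes") is a good example of the kind of primitive these engines provide. As a rough sketch of the idea, not DarkStar's actual implementation, a sliding time-window average can be kept incrementally with a deque of timestamped events:

```python
from collections import deque

class TimeWindowAverage:
    """Running average over a sliding time window (timestamps in seconds)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        self.total += value
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window:
            _, old = self.events.popleft()
            self.total -= old

    def average(self):
        return self.total / len(self.events) if self.events else 0.0

# A 10-minute window over three price ticks; the t=0 tick expires
# once the t=700s tick arrives.
w = TimeWindowAverage(600)
w.add(0, 100.0)
w.add(300, 110.0)
w.add(700, 120.0)
print(w.average())  # average of the two ticks still inside the window
```

A production engine would also evict on a timer (not only on arrival) and handle out-of-order events, which this sketch does not.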
Colin Clark
Our company specialises in this exact problem. Our core product, DataSift (http://datasift.net), takes in data from many of the most popular social media sites and allows you to programmatically define what content you wish to retrieve. For detailed information about its capabilities, check out the knowledge base: http://support.datasift.net/help/kb We process over 200 million pieces of data each day. The output from the streams can be consumed via HTTP streaming, WebSockets, and a REST API. We are also working on a storage + MapReduce system which will go into alpha testing within a month.
Nick Halstead
You might want to take a look at IBM's InfoSphere Streams (http://www-01.ibm.com/software/data/infosphere/streams/ or just google it). Not only can it handle extremely large volumes of both structured and unstructured data from heterogeneous sources, it also includes a fairly interesting set of analytical tools (e.g. neural network, dynamic decision tree, naive Bayes, Kohonen clustering, various forms of regression, and others) to support complex and dynamic analytics. It was originally conceived in cooperation with the US government (System S) back in 2003 and has been a commercial product for about a year and a half. Even though it's fairly young, there are already some very interesting applications that have been developed using it, some of which are processing extremely high volumes of data in real-time or near real-time. Similar to the previous post, the InfoSphere Streams programming language has great support for various forms of time windows and many other features to support continuous queries on data-in-motion as well as easily interfacing with data-at-rest. They recently published a redbook which you can find for free at http://www.redbooks.ibm.com/redbooks/pdfs/sg247865.pdf
Jim Sharpe
I would recommend utilizing Flume (https://github.com/cloudera/flume) for capturing the stream output and redirecting the output into an HBase sink to do your analysis. You don't mention how the data is being streamed but if you can capture it to a file, Flume will work just fine and has excellent horizontal scalability. It also has the ability to coalesce data from many sources and perform additional transformations before writing it to a sink, in this case HBase. HBase offers the ability to serve big data applications that require low latency, which sounds like it would be a good fit for the use case that you have in mind. From the flume docs: "Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms."
Don Brown
SQLstream Blaze StreamLab (http://www.sqlstream.com/blaze/streamlab/) enables business analysts and data scientists to explore and visualize machine data streams (http://www.sqlstream.com/products/what-is-machine-data/) in real-time; there is a demo at http://vimeo.com/107917677. It offers a graphical stream browser for interactive exploration, with built-in real-time dashboards for visualizing streaming data and analytics. No SQL or Java coding is required - all streaming data interactions are supported through the GUI.
Andrew Bare
You may want to look at http://samoa-project.net an upcoming project that brings machine learning to Storm and S4.
Gimmy Goku