Are my efficiency concerns with our HBase schema warranted?
-
I am currently working on a system that needs to store millions of events for millions of users. Right now we are using HBase, but all of our data is stored as JSON. The row and column keys are randomly generated UUIDs, so there is no way to do efficient filtering using key scans. Since the column names are also random UUIDs (each column is an event in a user's profile), there is no way to process only a single column of data either. I can still process the data using tools like Pig and Hive, but it required me to write some user-defined functions to parse the JSON (we have nested data structures, which are not supported by the UDFs found in libraries such as Elephant Bird). Even so, this approach means there is no way to filter the data without deserializing every row in our HBase table. I raised the concern that this is a highly inefficient use of HBase, but was given the response that running jobs over all the data all the time is normal in the Hadoop/large-data world. I am relatively new to HBase, so I was wondering if I am missing something.
-
Answer:
Optimizing your schema is highly dependent on the application's use case. Are you pulling data out of HBase for a real-time application, or simply using it as a large store of user data for offline analysis? To filter the results you get back from your HBase table more efficiently, you can change the structure of the row key (HBase's primary index). If, for instance, your primary access pattern involves pulling back the last X events for a given user, you could define your key structure as [User-UUID][TS][Event-UUID]. You could also create a summary row at [User-UUID][0000...] and populate the summary data with a periodic MapReduce job using the table input format / output format. To quickly pull the last X events, create a scanner with a key prefix of [User-UUID]; that gives you sequential access to the user's events. If you need to do more filtering, I suggest generating a secondary index using Elasticsearch or Lucene.
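The key layout described above can be sketched as a small Java helper. This is a minimal illustration, not code from the answer: the class and method names are hypothetical, and it assumes a fixed-width binary key of 16 bytes of user UUID, 8 bytes of reversed timestamp (so the newest event sorts first under HBase's unsigned lexicographic byte ordering), and 16 bytes of event UUID.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.UUID;

// Illustrative sketch of a [User-UUID][TS][Event-UUID] row key.
public class EventRowKey {

    // Fixed-width key: 16 bytes user UUID + 8 bytes reversed timestamp
    // + 16 bytes event UUID = 40 bytes total.
    public static byte[] build(UUID user, long epochMillis, UUID event) {
        ByteBuffer buf = ByteBuffer.allocate(40);
        buf.putLong(user.getMostSignificantBits());
        buf.putLong(user.getLeastSignificantBits());
        // Store Long.MAX_VALUE - ts so the newest event sorts first;
        // a prefix scan then returns the last X events in order.
        buf.putLong(Long.MAX_VALUE - epochMillis);
        buf.putLong(event.getMostSignificantBits());
        buf.putLong(event.getLeastSignificantBits());
        return buf.array();
    }

    // 16-byte prefix covering every event row for one user.
    public static byte[] userPrefix(UUID user) {
        return ByteBuffer.allocate(16)
                .putLong(user.getMostSignificantBits())
                .putLong(user.getLeastSignificantBits())
                .array();
    }

    public static void main(String[] args) {
        UUID user = UUID.randomUUID();
        byte[] newer = build(user, 2_000L, UUID.randomUUID());
        byte[] older = build(user, 1_000L, UUID.randomUUID());
        // HBase compares row keys as unsigned bytes; with the reversed
        // timestamp the newer event sorts before the older one.
        System.out.println(Arrays.compareUnsigned(newer, older) < 0); // true
    }
}
```

With the standard HBase client, the prefix would then drive the scan, e.g. `new Scan().setRowPrefixFilter(EventRowKey.userPrefix(user))`, which restricts the scan to that user's contiguous block of event rows.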
Alexander Daw at Quora