How does Google Analytics generate reports so quickly?
-
I'm sure that part of this is huge infrastructure, but I'm more interested in the database structure that allows Google's systems (Analytics and AdWords) to quickly (sub-one-second) go from a keyword-level report to a summary report showing total clicks and cost for all time. Is this based on precompiled views or something else? What kind of maintenance operations are required to allow such broad reporting to be retrieved so quickly?
-
Answer:
I do not know the exact technology, but there are many similarities between data-warehousing/BI systems and web analytics systems. Here are some ways it could be done:
- Pre-calculated OLAP cubes, with the data coming from a slower relational or NoSQL data store.
- A traditional row-oriented relational database with a star schema (ROLAP), which for performance holds the data pre-collapsed along various dimensions to avoid aggregating all the data on the fly.
- A software or hardware solution built around a column-oriented database and a star schema. These systems are hugely faster for analytics queries than row-oriented databases, though you could still use aggregates on top.
One thing I have noticed is that not all combinations of breakdowns are available. This suggests the data is most likely pre-aggregated, either in OLAP cubes or in a ROLAP solution using collapsed dimensions, and that they have not bothered doing this for every single variation you might want to query on, just the most common ones. For more info on some of these terms, Wikipedia is pretty good.
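The pre-aggregation idea above can be sketched in a few lines. This is a minimal illustration, not Google's actual design: the fact rows, the rollup names, and the choice of dimensions are all invented for the example. The point is that only a handful of common breakdowns get their own rollup, so a "total for all time" report becomes a dictionary lookup instead of a scan.

```python
from collections import defaultdict

# Hypothetical raw fact rows: (date, campaign, keyword, clicks, cost).
events = [
    ("2013-01-01", "brand",   "shoes", 120, 35.0),
    ("2013-01-01", "brand",   "boots",  80, 22.5),
    ("2013-01-02", "generic", "shoes", 200, 61.0),
]

# Pre-compute rollups only for the most common breakdowns, mirroring
# the observation that not every dimension combination is offered.
ROLLUPS = {
    "by_date":     lambda e: (e[0],),
    "by_campaign": lambda e: (e[1],),
    "total":       lambda e: (),
}

cubes = {name: defaultdict(lambda: [0, 0.0]) for name in ROLLUPS}
for e in events:
    for name, key_fn in ROLLUPS.items():
        agg = cubes[name][key_fn(e)]
        agg[0] += e[3]  # clicks
        agg[1] += e[4]  # cost

# The "summary report for all time" is now a single lookup,
# not an aggregation over every raw event.
total_clicks, total_cost = cubes["total"][()]
```

A breakdown that was not pre-computed (say, date x keyword) would have to fall back to scanning the raw events, which is exactly the slow path the precomputation avoids.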
Matthew Cooke at Quora Visit the source
Other answers
Dremel and F1. Open source is trying to catch up: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
Neetesh Tiwari
Others have provided good ideas about how the data is processed, and I do not know those specifics. I can add this about the speed of the reports: the fast reports in Google Analytics come from so-called "pre-aggregated tables", i.e., they are already processed. All standard reports are pre-processed, and thus both fast and unsampled.
As soon as you make a custom query (by adding a secondary dimension, applying a custom segment, or generating a custom report), data is processed on the fly. If there are more than 250K-500K visits (depending on your setting of the slider) for the queried date range, GA will sample the data down to your specified sampling setting and show the yellow sampling notice.
Once on-the-fly queries kick in, GA bases the query on the web-property data. So if you're looking at a filtered profile and applying a custom segment, GA will: choose at most 500K visits from the web property (the UA-ID), evenly distributed across the date range (not accounting for peaks); then apply the profile filter; then apply the custom segmentation. Looking at standard (pre-aggregated) reports in a filtered profile will yield unsampled data.
Google recently published this article on sampling: http://support.google.com/analytics/bin/answer.py?hl=en&answer=2637192&topic=2524483&ctx=topic
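The ordering described above (sample first, then filter, then segment) is the key detail, since it means filters do not restore precision. Here is a rough sketch of that pipeline; the function name, the 500K cap, and the even-spacing strategy are taken from the answer, while everything else is illustrative:

```python
SAMPLE_CAP = 500_000  # the cap described in the answer; the UI slider adjusts this

def query_with_sampling(visits, profile_filter, segment, cap=SAMPLE_CAP):
    """Sketch of the described on-the-fly query path.

    visits: all visits for the web property in the date range.
    profile_filter, segment: arbitrary predicates on a visit.
    """
    if len(visits) <= cap:
        sample = visits  # under the cap: no sampling needed
    else:
        # Evenly spaced across the date range, not weighted toward peaks.
        step = len(visits) / cap
        sample = [visits[int(i * step)] for i in range(cap)]
    # Filter and segment are applied AFTER sampling, per the answer above,
    # so a heavily filtered profile can rest on very few sampled visits.
    return [v for v in sample if profile_filter(v) and segment(v)]
```

For example, querying a million visits with a filter that matches 1% of them leaves roughly 5,000 sampled visits behind the report, which is why heavily segmented numbers get noisy.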
David Andersson
Pre-aggregated tables are a great way of delivering results (almost) instantly, so long as you know the range of answers you need to deliver in advance. With fixed dashboards, you typically do. Acunu (disclosure: I founded the company) offers a real-time big data analytics platform (http://www.acunu.com/acunu-analytics.html) built over Apache Cassandra that does just this: it takes JSON events and definitions of real-time views, and maintains those materialized views (or rollups on a semi-structured cube of events, if you like) continuously. Queries (e.g. to populate a dashboard) then typically take milliseconds, since the results are pre-computed or readily computable from intermediate results. Note that this is a very different approach from Dremel, Drill, and Impala, and in fact from most data warehouse systems, which aim to deliver unplanned queries quickly.
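A continuously maintained materialized view of this kind can be sketched as a counter that is updated on every incoming JSON event, so queries never touch raw data. This is only an illustration of the general technique; the class and its methods are invented for the example, not Acunu's API:

```python
import json
from collections import defaultdict

class MaterializedView:
    """Toy rollup over a stream of JSON events (illustrative, not a real API)."""

    def __init__(self, group_by, metric):
        self.group_by = group_by          # dimension names to group on
        self.metric = metric              # numeric field to sum
        self.counters = defaultdict(float)

    def ingest(self, raw_event):
        # Incremental maintenance: each event updates the view in O(1),
        # so the view is always current with no batch recompute step.
        event = json.loads(raw_event)
        key = tuple(event[d] for d in self.group_by)
        self.counters[key] += event[self.metric]

    def query(self, key):
        # Dashboard reads are lookups on pre-computed results.
        return self.counters[tuple(key)]

view = MaterializedView(group_by=["country"], metric="clicks")
view.ingest('{"country": "SE", "clicks": 3}')
view.ingest('{"country": "SE", "clicks": 2}')
```

The trade-off versus Dremel-style systems is visible even at this scale: queries on the declared view are O(1), but any grouping you did not declare up front requires replaying the events.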
Tim Moreton
Please read the Dremel paper, "Dremel: Interactive Analysis of Web-Scale Datasets". Dremel is the analytics technology developed and used at Google.
Radha Krishna Kanth Popuri