How do you store large amounts of impression/log data in MySQL without bringing down the MySQL server?
-
It seems I'm doing it wrong: I'm logging impression data from over 100 websites directly into MySQL in real time, but the server keeps crashing. What are the best practices for doing this? Should I serialize to a file on disk and periodically move the data to SQL?
-
Answer:
Your insertion strategy is going to depend on how you do your reads. The worst case is needing real-time access to the data: for instance, a statistics page that is expected to be current whenever the user loads it, or a streaming real-time graph of traffic. In that scenario you might be best served by something other than a single MySQL server. MySQL is primarily designed around OLTP, so it expects a read-to-write ratio of 1:1 or higher, while most logging systems end up executing hundreds, thousands, or even millions of times more writes than reads.

If you really need to stick with MySQL and need real-time reads, I'd first check exactly what's killing the server. Run vmstat and look for high I/O wait times or high user CPU. If I/O wait is high, the disk is the bottleneck. Outside of purchasing faster disks (SSDs would be great), I'd recommend moving the logging database to its own server (an isolated machine is best, but it can run alongside another instance on the same box) and relaxing some of the durability guarantees in the MySQL configuration.

Make sure you're using MySQL 5.4+ and the InnoDB storage engine, and set innodb_flush_log_at_trx_commit to 0. This allows MySQL to effectively batch up INSERTs and will probably be a huge win in your case. The only downside is that you can lose up to a second's worth of INSERTs if the database crashes, which is likely not a show-stopper for impression logs. I'd also drop any non-critical indexes from the tables. Most people can get away with fewer indexes, and they can easily be the most expensive part of a write transaction.
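A minimal sketch of those two changes; the impressions table and idx_referrer index are hypothetical names, and note that SET GLOBAL does not survive a restart (put innodb_flush_log_at_trx_commit = 0 under [mysqld] in my.cnf to make it permanent):

-- Flush the InnoDB log to disk roughly once per second instead of at
-- every commit; risks losing up to ~1s of writes if the server crashes.
SET GLOBAL innodb_flush_log_at_trx_commit = 0;
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';

-- Drop a non-critical secondary index (hypothetical name) to cheapen writes.
ALTER TABLE impressions DROP INDEX idx_referrer;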
Rick Branson at Quora
Other answers
I'd recommend using Flume to populate an HDFS cluster, potentially running Hive or HBase as well.
Jeff Hammerbacher
If you must stick with MySQL, a good option is to do lazy writes by rotating MEMORY tables in one-minute increments, then writing each table to disk with ALTER TABLE ... ENGINE={MyISAM, InnoDB, ARCHIVE, ...}. INSERTs into MEMORY tables are much faster than those into disk-backed storage engines, and writing the entire table to disk at once (every minute) is faster than writing each row individually. For example:

SHOW TABLES;
log_2010-12-31_01-01  /* Live table receiving log INSERTs. MEMORY storage engine. */

One minute later...

ALTER TABLE `log_2010-12-31_01-01` ENGINE=MyISAM;  /* writes the table to disk */
CREATE TABLE IF NOT EXISTS `log_2010-12-31_01-02` (...) ENGINE=MEMORY;

SHOW TABLES;
log_2010-12-31_01-02  /* Live table receiving log INSERTs. MEMORY storage engine. */
log_2010-12-31_01-01  /* Old live table. No INSERTs are written to it now. Storage engine is now MyISAM. */

This is not as scalable as Scribe or Hadoop, but it is an efficient way to use MySQL and works out of the box with any vanilla MySQL installation.
J. Bryan Scott
1. DON'T PANIC!
2. Change your table type to InnoDB; it's better for inserts (which you are doing a lot of) than MyISAM.
3. Often the crash is caused by something you totally don't expect, like lots of threads being opened. Check how many threads you have open at any one time, how long they take to die, etc. (see the sketch after this list).
4. Check your code: both your MySQL statements and whether you're closing your connections or running persistent connections.
5. Run it all through a proxy. I like HAProxy, but there are also SQL-specific proxies available.
6. Tune, tune, tune.
7. Upgrade your server.
8. If you're still not having joy, start looking at other DB technologies or clustering.

Logging hits from 100 sites should be a fairly low load on a MySQL server unless they're massive sites. You should have fixed your problem by point 5.
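A minimal sketch of the thread check in point 3, using stock MySQL commands (nothing here is specific to any one setup):

SHOW STATUS LIKE 'Threads_%';           /* Threads_connected, Threads_running, Threads_created... */
SHOW FULL PROCESSLIST;                  /* what each connection is doing and for how long */
SHOW VARIABLES LIKE 'max_connections';  /* the ceiling you may be hitting when it crashes */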
Jason Norwood-Young
* Determine the best storage engine for your needs. MyISAM, ARCHIVE and InnoDB are all viable options depending on your retrieval purposes.
* Separate your web logging from other OLTP traffic: a different MySQL instance, generally on a different server. This lets you use a different backup/recovery strategy, for example, and a different replication strategy.
* Partitioning and sharding will greatly improve performance (a sketch follows below).

You need to be more specific than "it's just crashing." Servers can easily support thousands of writes per second; I have worked with highly tuned and segmented MySQL infrastructure that does over 10,000 writes per second.
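As a rough illustration of the partitioning point, a sketch assuming MySQL 5.1+ and a hypothetical impressions table (the column and partition names are made up):

CREATE TABLE impressions (
    site_id INT NOT NULL,
    hit_at  DATETIME NOT NULL,
    url     VARCHAR(255)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(hit_at)) (
    PARTITION p20101230 VALUES LESS THAN (TO_DAYS('2010-12-31')),
    PARTITION p20101231 VALUES LESS THAN (TO_DAYS('2011-01-01')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
);

/* Expiring a whole day is then a metadata operation, not millions of DELETEs. */
ALTER TABLE impressions DROP PARTITION p20101230;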
Ronald Bradford
If you don't want to move away from MySQL, then a vanilla quick-fix approach is to put your incoming data stream into Tokyo Tyrant (or any other persistent key-value store), then let a batch-insert process run at regular intervals and dump the data into MySQL. However, if your data stream is continuous and you have no periods of low traffic, this will not scale and you will have to go for an HBase/Cassandra-style distributed collection system. This may affect your analytics and querying capabilities, so be very careful while designing your indexing schema.
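A minimal sketch of what the batch-insert process would emit, assuming a hypothetical impressions table; one multi-row INSERT per drain of the key-value store is far cheaper than one INSERT per impression:

/* One round trip and one transaction for many rows instead of
   one INSERT per impression. */
INSERT INTO impressions (site_id, hit_at, url) VALUES
    (12, '2010-12-31 01:01:03', '/home'),
    (12, '2010-12-31 01:01:04', '/about'),
    (57, '2010-12-31 01:01:04', '/products');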
Devendra K Rane
Scribe, open-sourced by Facebook, is nowadays used for aggregating log data streamed from a set of servers (http://github.com/facebook/scribe). This, of course, will not be a quick fix for your problem, but considering that you are collecting impressions from over 100 websites, you will need some kind of distributed collection and processing architecture to handle that volume of data.
Raghavendra Kidiyoor
Don't use MySQL in a transactional manner to write the logs one by one in real time. Even if you manage to tune performance to the point where it serves your needs, it will be a major source of operational nervousness: other jobs on the DB server, maintenance tasks, schema changes, etc. all become potential points of error that take your service down.

I've managed teams building large ad systems and analytics systems. In general, they use variants of the log-processing approach you have alluded to in your comments: logs get sent from the source to log-collection servers, which just append to a file. Log-aggregation servers then collect all the logs, merge/summarize them, and write just the summarized data (much smaller) in batch mode to the database. With this model you can afford to take your database offline for a while; logs pile up (limited only by disk space) and are processed when the database is back up. The downside is that your database will not be "current": it may be X minutes behind, where X depends on how long the log-collection and log-processing steps take.

If you want to go cutting-edge, you could also try HBase (which is like BigTable). It appends all writes to an internal log file first and then "fixes up" records later in the background, so it is similar to the log-processing model but abstracts it all away from you. HBase is still very new, so if you go that route, tread with caution.
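A minimal sketch of that final batch-mode write, assuming a hypothetical hourly_summary table with a UNIQUE KEY over (site_id, hour_start); the aggregation servers emit rows like these rather than raw impressions:

/* ON DUPLICATE KEY UPDATE makes a re-run of the same batch safe. */
INSERT INTO hourly_summary (site_id, hour_start, impressions) VALUES
    (12, '2010-12-31 01:00:00', 48210),
    (57, '2010-12-31 01:00:00', 9377)
ON DUPLICATE KEY UPDATE impressions = VALUES(impressions);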
Anonymous
MySQL is not a good fit here unless you rotate the tables hourly or daily against a hard size limit, so that they remain easy to query. One option is buffered writing: store all your impressions/clicks locally and batch-load them into MySQL for processing, so that you can survive database crashes. Alternatively, use MySQL sharding or MySQL Cluster.

Either way, you should look into the alternatives Jeff Hammerbacher has pointed to, because you can't track live impression/click streams in a typical OLTP database; it is neither optimal nor scalable. Instead, compute your aggregated stats and load those back into the database for reporting, not the raw data itself. I have several live implementations across multiple companies that use Hadoop + MapReduce processing and then load into a MySQL DB for click, impression, and conversion tracking; in a few cases there is also a Vertica cluster in place. Both are optimal and scale pretty linearly.

If you want BI and reporting on top of your click tracking, you can use the same setup: generate daily/hourly/weekly aggregated data with a simple ETL process, then use an OLAP server to build cubes by polling the data directly from the fact and aggregate tables, so you can easily generate sliced-and-diced reports (Pentaho, Jasper, or custom).
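A minimal sketch of the buffered batch-load, assuming the local buffer is flushed to a tab-separated file and the same hypothetical impressions table as above:

/* One bulk load per buffer flush; cheap, and easy to replay if the
   database was down when the batch was first attempted. */
LOAD DATA INFILE '/var/log/impressions/batch-2010-12-31-0101.tsv'
INTO TABLE impressions
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(site_id, hit_at, url);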
Venu Anuganti
To add to what has been said, you probably should not store large amounts of log data in a MySQL database; that's just not what MySQL is best used for day to day, and I've seen this very thing bring down numerous applications over time. Take a look at one of the various NoSQL alternatives for storing your log data, such as MongoDB or perhaps Riak. They each have their pros and cons, but both can be useful in this scenario.
Kent Langley