Why does MySQL perform poorly with nested indices?
-
I have a MySQL table (articles) with a nested index (blog_id, published), and it performs poorly. I see a lot of entries like this in my slow query log:

Query_time: 23.184007 Lock_time: 0.000063 Rows_sent: 380 Rows_examined: 6341
SELECT id FROM articles WHERE category_id = 11 AND blog_id IN (13,14,15,16,17,18,19,20,21,22,23,24,26,27,6330,6331,8269,12218,18889) ORDER BY published DESC LIMIT 380;

I have trouble understanding why MySQL would run through all rows with those blog_ids to figure out my top 380 rows. I would expect the whole purpose of the nested index to be speeding that up. At the very least, even a naive implementation should look up each blog_id and get its top 380 rows ordered by published. That should be fast, since the nested index lets us locate those exact 380 rows directly. It would then sort the resulting 19*380=7220 rows.

The optimal implementation would build a heap over the per-blog_id streams, pick the entry with the max(published), and repeat that 380 times. Each operation should be fast.

I'm surely missing something, since Google, Facebook, Twitter, Microsoft and all the big companies are using MySQL in production. Anyone with experience?
CREATE TABLE IF NOT EXISTS `articles` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `category_id` int(11) DEFAULT NULL,
  `blog_id` int(11) DEFAULT NULL,
  `cluster_id` int(11) DEFAULT NULL,
  `title` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `description` text COLLATE utf8_unicode_ci,
  `keywords` text COLLATE utf8_unicode_ci,
  `image_url` varchar(511) COLLATE utf8_unicode_ci DEFAULT NULL,
  `url` varchar(511) COLLATE utf8_unicode_ci DEFAULT NULL,
  `url_hash` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
  `author` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `categories` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `published` int(11) DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  `is_image_crawled` tinyint(1) DEFAULT NULL,
  `image_candidates` text COLLATE utf8_unicode_ci,
  `title_hash` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
  `article_readability_crawled` tinyint(1) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `index_articles_on_url_hash` (`url_hash`),
  KEY `index_articles_on_cluster_id` (`cluster_id`),
  KEY `index_articles_on_published` (`published`),
  KEY `index_articles_on_is_image_crawled` (`is_image_crawled`),
  KEY `index_articles_on_category_id` (`category_id`),
  KEY `index_articles_on_title_hash` (`title_hash`),
  KEY `index_articles_on_article_readability_crawled` (`article_readability_crawled`),
  KEY `index_articles_on_blog_id` (`blog_id`,`published`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=562907 ;
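The heap-based merge described in the question can be sketched in Python. This only illustrates the algorithm and says nothing about what MySQL does internally. The per-blog streams are hypothetical in-memory lists of (published, article_id) tuples, each already sorted newest first, roughly what a scan of the (blog_id, published) index for a single blog_id would yield.

```python
import heapq
from itertools import islice


def top_n_across_blogs(streams, n):
    """Return the n newest (published, article_id) pairs across all streams.

    Every stream must already be sorted by published, newest first.
    """
    # heapq.merge keeps one candidate per stream in a heap and yields lazily,
    # so only about n items are ever pulled from the inputs.
    merged = heapq.merge(*streams, key=lambda row: row[0], reverse=True)
    return list(islice(merged, n))


# Hypothetical data: blog_id -> [(published, article_id), ...], newest first.
streams_by_blog = {
    13: [(900, 1), (500, 2), (100, 3)],
    14: [(800, 4), (700, 5)],
    15: [(950, 6), (50, 7)],
}

newest = top_n_across_blogs(streams_by_blog.values(), 4)
```

With k streams, each of the n picks costs O(log k), which is the "each operation should be fast" property the question relies on.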
-
Answer:
Two possible problems:

1) Your WHERE clause includes "category_id=11", which means the (blog_id, published) key isn't specific enough. MySQL has to pull up the row for every article with one of the listed blog_ids so it can check the category_id. Unless lots of those rows are in your page cache, the query will be slow. A possible solution is to create an index like (blog_id, category_id, published).

2) Even if you remove the category_id constraint or add the index I mentioned (try it and see what happens), MySQL will probably still execute your query inefficiently. Last I heard it wasn't very good at IN+ORDER BY. Check out http://www.mysqlperformanceblog.com/2006/08/14/mysql-followup-on-union-for-query-optimization-query-profiling/, in particular the examination of "SELECT * FROM people WHERE age IN(18,19,20) ORDER BY last_online DESC LIMIT 10;"

Like in the article above (and along the lines of one of your suggestions), you might see an improvement if you string together UNIONs:

...((SELECT * FROM articles WHERE blog_id=blog_id1 ORDER BY published DESC LIMIT n)
UNION ALL (SELECT * FROM articles WHERE blog_id=blog_id2 ORDER BY published DESC LIMIT n)
UNION ALL (...)
UNION ALL (...)) ORDER BY published DESC LIMIT n

This, however, would likely only help if the number of articles per blog_id is often more than n. If the query doesn't have to be that accurate, you could also limit each subquery by a smaller number (say, 20) and then limit the total result by your larger number (380).

I'm thinking off the top of my head here, but if you SELECT id FROM articles in each subselect (as opposed to SELECT *), all the necessary information could conceivably be grabbed from the index, since I believe InnoDB secondary indexes point to the primary key. That way, you wouldn't be hitting the table data at all; everything would be drawn from the index. It could be dramatically faster. Look for "Using index" in the output from EXPLAIN to see if this is actually happening.

After you have your sorted list of article ids, you can do a second query: SELECT * FROM articles WHERE id IN(article_id1, article_id2,...). Then, on the client side, use the sorted article id list to construct a list of sorted articles.

Note, by the way, that running a query with the same list of blog_ids repeatedly won't be a good benchmark of production performance; the hot rows will end up cached for a while, and the query will be quick afterwards. As you're testing, either switch up the list of blog_ids or use EXPLAIN to see what's happening.

If none of this works, you could maintain a map of [blog_id:category_id] -> [list of n most recent (article_id, published) tuples] in memcache. When you need to answer a query, do a multiget on the blog_ids and then sort the resultant articles afterwards (you'd need to hit the db for missing keys). It'd be annoying to maintain the memcache entries, though.
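The two-step approach described above (a UNION ALL over per-blog id lookups, then a second fetch that is re-sorted on the client) could be sketched like this. It is a rough illustration, not a tested recipe. Both helper names are made up for this example, the `%s` placeholder style is an assumption about the Python MySQL driver in use, and the subqueries select `published` alongside `id` so that the outer ORDER BY has a column to sort on.

```python
def build_union_query(blog_ids, category_id, per_blog_limit, total_limit):
    """Build one UNION ALL query with a parameterised subquery per blog_id.

    Returns (sql, params). Assumes a driver that uses %s placeholders.
    """
    subquery = (
        "(SELECT id, published FROM articles"
        " WHERE blog_id = %s AND category_id = %s"
        " ORDER BY published DESC LIMIT %s)"
    )
    sql = " UNION ALL ".join([subquery] * len(blog_ids))
    sql += " ORDER BY published DESC LIMIT %s"

    params = []
    for blog_id in blog_ids:
        params.extend([blog_id, category_id, per_blog_limit])
    params.append(total_limit)
    return sql, params


def order_rows_by_ids(rows, sorted_ids):
    """Arrange full rows (dicts with an 'id' key) to follow sorted_ids.

    Needed because WHERE id IN (...) does not return rows in list order.
    """
    by_id = {row["id"]: row for row in rows}
    return [by_id[i] for i in sorted_ids if i in by_id]
```

The first helper yields the sorted id list. The second is applied to the rows returned by the follow-up SELECT * FROM articles WHERE id IN(...) query. Whether the subqueries are actually served from an index alone still has to be confirmed with EXPLAIN.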
Ted Suzman at Quora
Other answers
Did you try "EXPLAIN SELECT id FROM articles WHERE category_id = 11 AND blog_id IN (13,14,15,16,17,18,19,20,21,22,23,24,26,27,6330,6331,8269,12218,18889) ORDER BY published DESC LIMIT 380"? That should always be your first step. If you did that you'd see "Extra: Using where; Using filesort".

"Using filesort" is your first clue that you don't have the correct indexes for this table. In this case the issue isn't the IN (...) clause; MySQL's handling of IN has gotten better in the 4 years since that article was published. Your problem is that you're missing an index in which the results are sorted by published.

Try

alter table articles add index by_published (published, category_id, blog_id);

and throw a "USE INDEX (by_published)" on your query.
Kellan Elliott-McCrea