What is the optimal way to store and index a GUID field in MySQL (InnoDB)?
-
I'm currently retrofitting an InnoDB-based MySQL table with GUIDs. In the short term, I'll be assigning each existing record a GUID and indexing on it (in the long term, I'll be migrating to a K-V store). I have two optimizations in mind, and I'm wondering what people who understand the actual guts of InnoDB indexing think.

Optimization #1: Only randomize some of the GUID's digits

Because my dataset will remain relatively stable and small (on the order of millions of records) until I have time to move to a K-V store, I'm thinking an optimization is to hold the first 26 characters of my GUIDs constant and randomly generate the final 6 digits, checking for duplicates on insert. Thus, I have an effective ID space of about 16M keys (16^6). I'm thinking this will cluster the record references together at the bottom of the tree, thereby decreasing the number of branching points and increasing locality during index traversals. Am I off base in trying to manipulate the index this way?

Optimization #2: Pack GUID into an integer

From various forums, it sounds like MySQL is better at indexing and joining on integers than on strings, so is it likely to be a substantial improvement for me to pack my GUIDs into an UNSIGNED BIGINT, using "CONV(<guid>, 16, 10)" on the way into the database and "HEX(packed_guid_column)" on the way out?

I ask because whenever too much cleverness is employed while using a complicated, real-world system, I worry that it's either needless effort or, in the worst case, undermining the optimizations that the system is built to accomplish. Does anyone know if either of those is happening here, or if I should be thinking about different tactics?
-
Answer:
Regarding option #1: I don't understand how that would help. The B-tree data structure stores the least value at the left of the tree, the greatest value at the right of the tree, and the median value at the root. B-trees do not span the whole range of possible values; they span the range of values you actually stored. So your idea of only using one end of the B-tree is nonsense.

And even if it did work as you think, that wouldn't help anyway, because the values would just make the tree grow deeper instead of wider, which would actually be worse for performance because it would require more steps to traverse the tree.

Besides, by restricting the range of values in your "GUID," you will increase the likelihood of collisions. You have 16^6 possible values, but the chance of a collision increases as you store more rows. That's why a GUID is 128 bits: to reduce the chance of collisions. With a range that small, the best way to prevent collisions is therefore to generate non-random, incrementing values in your GUID. Which means you might as well use a MEDIUMINT UNSIGNED AUTO_INCREMENT.

Regarding option #2: A GUID is 32 hex digits, or 128 bits. There's no way to fit that in a BIGINT UNSIGNED, which is only 64 bits.

Notice that MySQL's UUID() function doesn't change the lower bits:

    mysql> SELECT UUID();
    +--------------------------------------+
    | UUID()                               |
    +--------------------------------------+
    | 6ba56732-850e-11e4-8730-080027f2a6c9 |
    | 6bf114ba-850e-11e4-8730-080027f2a6c9 |
    | 6c2baceb-850e-11e4-8730-080027f2a6c9 |
    | 6c664b44-850e-11e4-8730-080027f2a6c9 |
    | 6c9fa62d-850e-11e4-8730-080027f2a6c9 |
    | 6cd9067c-850e-11e4-8730-080027f2a6c9 |
    +--------------------------------------+

If you try to convert that value into a 64-bit BIGINT with CONV(), the value overflows 64 bits and you lose the only part of the number that varies. As the output below shows, every UUID comes back as the same maximum 64-bit value:

    mysql> SELECT CONV(REPLACE(UUID(),'-',''), 16, 10);
    +--------------------------------------+
    | CONV(REPLACE(UUID(),'-',''), 16, 10) |
    +--------------------------------------+
    | 18446744073709551615                 |
    | 18446744073709551615                 |
    | 18446744073709551615                 |
    | 18446744073709551615                 |
    | 18446744073709551615                 |
    | 18446744073709551615                 |
    +--------------------------------------+

MySQL has a function, UUID_SHORT() (http://dev.mysql.com/doc/refman/5.6/en/miscellaneous-functions.html#function_uuid-short), that returns a 64-bit unsigned value, but this isn't really random either: it's partially based on the server ID and has an auto-increment component. Also, the function uses a global mutex to ensure that multiple threads don't allocate the same value. This could be bad for scalability if you're inserting into multiple tables in different threads.
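To put rough numbers on the collision point for option #1, here is a small Python sketch of the standard birthday-problem calculation. It assumes the 6 random hex digits are drawn uniformly and independently; the function names are made up for this illustration and nothing here is MySQL-specific.

```python
import math

KEYSPACE = 16 ** 6  # 6 random hex digits give 16,777,216 possible keys


def collision_probability(rows: int, keyspace: int = KEYSPACE) -> float:
    """Chance that at least two of `rows` uniformly random keys are equal."""
    if rows > keyspace:
        return 1.0  # pigeonhole: more rows than keys forces a duplicate
    # P(no collision) = product over i of (1 - i/keyspace).
    # Summing logarithms avoids underflow for large row counts.
    log_no_collision = sum(math.log1p(-i / keyspace) for i in range(rows))
    return 1.0 - math.exp(log_no_collision)


def next_insert_collision_chance(rows_stored: int, keyspace: int = KEYSPACE) -> float:
    """Chance that one new random key hits a key that is already stored."""
    return min(rows_stored / keyspace, 1.0)


if __name__ == "__main__":
    for rows in (1_000, 10_000, 100_000, 1_000_000):
        print(
            f"{rows:>9} rows: "
            f"any collision so far = {collision_probability(rows):.4f}, "
            f"next insert collides = {next_insert_collision_chance(rows):.4f}"
        )
```

Under these assumptions a duplicate somewhere in the table becomes likely well before the table reaches "millions of records", so the check-and-retry on insert would be exercised routinely rather than rarely.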
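The arithmetic behind option #2 can also be checked outside the database. The following Python sketch reuses the six sample values from the UUID() output above. The helpers low_64_bits and clamp_to_uint64 are hypothetical names for this illustration; the clamp only imitates the saturated result shown in the CONV() output and is not MySQL's implementation.

```python
import uuid

UINT64_MAX = 2 ** 64 - 1  # largest BIGINT UNSIGNED value

# Sample values copied from the UUID() output in the answer.
SAMPLES = [
    "6ba56732-850e-11e4-8730-080027f2a6c9",
    "6bf114ba-850e-11e4-8730-080027f2a6c9",
    "6c2baceb-850e-11e4-8730-080027f2a6c9",
    "6c664b44-850e-11e4-8730-080027f2a6c9",
    "6c9fa62d-850e-11e4-8730-080027f2a6c9",
    "6cd9067c-850e-11e4-8730-080027f2a6c9",
]


def low_64_bits(guid: str) -> int:
    """Keep only the lower 64 bits of the 128-bit value."""
    return uuid.UUID(guid).int & UINT64_MAX


def clamp_to_uint64(guid: str) -> int:
    """Saturate at the 64-bit maximum (illustration only, not MySQL code)."""
    return min(uuid.UUID(guid).int, UINT64_MAX)


if __name__ == "__main__":
    for guid in SAMPLES:
        value = uuid.UUID(guid).int
        print(
            guid,
            f"bits={value.bit_length()}",
            f"low64={low_64_bits(guid):#018x}",
            f"clamped={clamp_to_uint64(guid)}",
        )
```

Every sample needs more than 64 bits, all six share the same lower 64 bits, and clamping maps all of them to the same number, so neither shortcut yields a usable 64-bit key for these values.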
Bill Karwin at Quora