How is the size of a hash table determined?
-
Suppose I want to create an efficient hash table for about 200,000 English words. I will use the djb2 hash function, ((seed_hash * 33) + word[i]) % SIZE (http://www.cse.yorku.ca/~oz/hash.html), with a linked list per bucket to handle collisions. I am using C (C99) and compiling with gcc. How should I tune it so that it is both fast and economical with memory? I am thinking of a size of 51200, so that each hash value gets an even distribution of 3-4 nodes.
-
Answer:
Speed is usually more important than memory. Empty cells in the hash table cost you only a little. A program with an average of 5 items per hash table cell will, on average, be about 3-4 times as slow as a program with just 1 item per cell, because of the additional comparisons it has to do. You want a hash table that is larger than the number of keys you are going to store, so that the expected number of items per cell is less than 1. The particular hash function does not matter much if the input is guaranteed to be English words: many functions will behave nicely enough for those. That being said, the simplest solution for your problem is to stop reinventing the wheel: if you can, just use an existing solution, such as the unordered_set in C++11 with its default hash function for strings, and everything will work just fine. Of course, if you know the set of words beforehand, the optimal solution is to use a perfect hash function (https://en.wikipedia.org/wiki/Perfect_hash_function), which has no collisions at all. There is even a tool to find one for you: gperf (http://linux.die.net/man/1/gperf).
Michal Forišek at Quora
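To make the sizing advice above concrete, here is a minimal C99 sketch, not taken from the answer itself. It assumes the usual djb2 starting value of 5381 from the linked page, and the helper names table_size_for and load_factor are hypothetical. Rounding up to a power of two is only one convenient way to get a table larger than the number of keys; any larger size would satisfy the "less than 1 item per cell" rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* djb2: hash = hash * 33 + c, starting from 5381 (value assumed from the
 * linked page). The result is the full 32-bit hash; reduce it to a cell
 * index separately. */
uint32_t djb2(const char *word)
{
    uint32_t hash = 5381u;
    for (const unsigned char *p = (const unsigned char *)word; *p != '\0'; p++)
        hash = hash * 33u + *p;   /* unsigned overflow wraps, which is fine */
    return hash;
}

/* Hypothetical helper: smallest power of two strictly greater than
 * expected_keys, so the expected number of items per cell stays below 1. */
size_t table_size_for(size_t expected_keys)
{
    size_t size = 1;
    while (size <= expected_keys)
        size *= 2;
    return size;
}

/* Hypothetical helper: expected number of items per cell. */
double load_factor(size_t keys, size_t table_size)
{
    assert(table_size > 0);
    return (double)keys / (double)table_size;
}
```

For 200,000 words, table_size_for returns 262144 cells, which is about 0.76 words per cell. The 51200 cells proposed in the question give about 3.9 words per cell.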
Other answers
It sounds like you have a general idea about the characteristics of data you're going to be storing in your hash table. Given this information, if you want to optimize for this data set, you should write your program using different sizes for the hash table, time the implementations, and pick the fastest one. You have a large-ish data set, so maybe start testing in intervals of a couple thousand. You should get a nice inverse-bell curve. Find the data point that's lowest on that graph (lower time = better performance) and "zoom in" on that region, using smaller intervals of a couple hundred to find the optimal value.
William Chargin
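The measure-and-compare approach described above could be sketched in C99 roughly as follows. This is an illustration under assumptions, not the answerer's code: the keys are synthetic strings ("word0", "word1", ...) standing in for a real word list, and the names make_keys, free_keys, time_lookups and sweep are hypothetical. Timing uses the standard clock() function, which is coarse, so each size should be run several times in practice.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef struct node { const char *key; struct node *next; } node;

uint32_t djb2(const char *s)
{
    uint32_t h = 5381u;
    for (; *s != '\0'; s++)
        h = h * 33u + (unsigned char)*s;
    return h;
}

void free_keys(char **keys, size_t n)
{
    for (size_t i = 0; i < n; i++)
        free(keys[i]);
    free(keys);
}

/* Synthetic stand-in for a real word list: "word0", "word1", ... */
char **make_keys(size_t n)
{
    char **keys = calloc(n, sizeof *keys);
    if (keys == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        keys[i] = malloc(32);
        if (keys[i] == NULL) {
            free_keys(keys, i);
            return NULL;
        }
        snprintf(keys[i], 32, "word%zu", i);
    }
    return keys;
}

/* Build a chained table with `size` cells (n and size must be > 0), look
 * every key up once, and return the CPU seconds spent on the lookups.
 * Returns -1.0 if allocation fails. */
double time_lookups(char **keys, size_t n, size_t size, size_t *found)
{
    assert(n > 0 && size > 0);
    node **cells = calloc(size, sizeof *cells);
    node *pool = malloc(n * sizeof *pool);
    if (cells == NULL || pool == NULL) {
        free(cells);
        free(pool);
        return -1.0;
    }
    for (size_t i = 0; i < n; i++) {          /* insert at chain head */
        size_t slot = djb2(keys[i]) % size;
        pool[i].key = keys[i];
        pool[i].next = cells[slot];
        cells[slot] = &pool[i];
    }
    size_t hits = 0;
    clock_t start = clock();
    for (size_t i = 0; i < n; i++) {
        for (node *p = cells[djb2(keys[i]) % size]; p != NULL; p = p->next) {
            if (strcmp(p->key, keys[i]) == 0) {
                hits++;
                break;
            }
        }
    }
    clock_t stop = clock();
    free(pool);
    free(cells);
    *found = hits;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Print one timing per table size from lo to hi; step must be > 0. */
void sweep(char **keys, size_t n, size_t lo, size_t hi, size_t step)
{
    for (size_t size = lo; size <= hi; size += step) {
        size_t found = 0;
        double secs = time_lookups(keys, n, size, &found);
        printf("size=%zu lookups=%zu seconds=%.6f\n", size, found, secs);
    }
}
```

A first pass might call sweep(keys, n, 50000, 400000, 2000) on the real word list, followed by a second pass with a step of a few hundred around the fastest size.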
If the amount of data you are going to store is static (i.e. always the same in every execution), it is best to allocate the hash table array up front with as many rows as you need. This saves execution time by avoiding reallocation and rehashing. But if the amount of data grows over time, you should allocate on demand in order to save memory. You can start with 1024 rows; then, when the table reaches three quarters of its size, quadruple it to 4096, and so on. If you want to employ multithreading, you could also run the rehash/reallocation of the hash table in a different thread. It is just an idea, though; I have never actually tried to implement it. I think it would be painful.
Gilang Mentari Hamidy
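The grow-on-demand scheme above (start at 1024 rows, quadruple when three quarters full) might look like this in C99. It is a single-threaded sketch under stated assumptions: the type and function names are hypothetical, the table stores pointers to strings that the caller keeps alive, and the background-thread rehash is not attempted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct entry { const char *key; struct entry *next; } entry;
typedef struct { entry **cells; size_t size; size_t count; } table;

uint32_t djb2(const char *s)
{
    uint32_t h = 5381u;
    for (; *s != '\0'; s++)
        h = h * 33u + (unsigned char)*s;
    return h;
}

/* Returns 0 on success, -1 on allocation failure. */
int table_init(table *t, size_t size)
{
    assert(t != NULL && size > 0);
    t->cells = calloc(size, sizeof *t->cells);
    t->size = size;
    t->count = 0;
    return t->cells != NULL ? 0 : -1;
}

int table_contains(const table *t, const char *key)
{
    for (entry *e = t->cells[djb2(key) % t->size]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return 1;
    return 0;
}

/* Quadruple the cell array and relink every existing node into it. */
static int table_grow(table *t)
{
    size_t new_size = t->size * 4;
    entry **cells = calloc(new_size, sizeof *cells);
    if (cells == NULL)
        return -1;
    for (size_t i = 0; i < t->size; i++) {
        entry *e = t->cells[i];
        while (e != NULL) {
            entry *next = e->next;
            size_t slot = djb2(e->key) % new_size;
            e->next = cells[slot];
            cells[slot] = e;
            e = next;
        }
    }
    free(t->cells);
    t->cells = cells;
    t->size = new_size;
    return 0;
}

/* Returns 1 if inserted, 0 if already present, -1 on allocation failure.
 * The key pointer is stored as-is; the caller must keep the string alive. */
int table_insert(table *t, const char *key)
{
    if (table_contains(t, key))
        return 0;
    entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    size_t slot = djb2(key) % t->size;
    e->key = key;
    e->next = t->cells[slot];
    t->cells[slot] = e;
    t->count++;
    /* Three quarters full: grow. If growing fails, the table keeps
     * working with longer chains. */
    if (t->count * 4 >= t->size * 3)
        (void)table_grow(t);
    return 1;
}
```

Starting from table_init(&t, 1024), the 768th distinct insert triggers the first rehash and the table grows to 4096 cells.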
It's a tuning parameter: it depends on what you're trying to optimize and what resources you have or are willing to commit. Thinking of performance as proportional to the average collision chain length is the right thing to be managing, though. I don't know what your application is, but assuming you optimize collision handling, chains of 3-4 will be blisteringly fast on any modern laptop (or anything bigger). If you're on a phone or smaller device you might find this is more of a size/speed trade-off.
It's a common myth for this kind of hash table that you should pick a prime number, but because your hash value has a tendency to have a fixed remainder modulo 33, you should pick something co-prime with 33. A smart choice is a power of 2. That's because you can obtain the remainder by masking bits with & and avoid the (relatively) costly division implicit in your %. NB: A side effect of using powers of 2 is that it's easy to divide or combine collision chains if you resize the table dynamically. However, I get the impression you have a static dictionary and won't be resizing.
Optimized? First, make sure you retain the full hash (an unsigned 32-bit int will likely be suitable). Second, when traversing a collision chain, compare the hash before the value. If the hashes don't match you don't need a (relatively) expensive string comparison. The hash you've chosen is known to perform well on English text, so you should find few if any collisions at full 32-bit hash comparison and make next to zero failed string comparisons. Third, consider ordering the collision chain. If access is random, order it by hash value. That way you can bail out of a chain as soon as you realize it can't hold the value you're looking up. Alternatively, if access isn't random, consider ordering the collision chains by a static or dynamic frequency. Static frequency would be based on the occurrence of a word in some text "corpus". That is, you'd want 'the' to appear at the front of its collision chain and 'wayzgoose' likely towards the end! Dynamic frequency would involve moving words that are 'hit' to the front of their collision chain, knowing that words recur in a given text.
If you are writing a spell checker (and I've somewhat assumed that's the application) I really do recommend finding a corpus. It doesn't even have to be very big because (of course) the common words are common and will sort to the front very quickly, and even if the 'uncommon' words aren't optimized, they have less impact because they're uncommon!
PS: I also know a practically perfect (in the formal sense) hash for English words; however, I think you'll find your hash is pretty good.
Dan Allen
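Several of the suggestions above can be combined in one short C99 sketch: a power-of-two table indexed with &, the full 32-bit hash stored in each node and compared before strcmp, and the dynamic-frequency variant (move a hit word to the front of its chain). The names dict, dict_init, dict_add and dict_lookup are hypothetical. The sketch assumes each word is added once and that the strings outlive the table, and it does not show ordering chains by hash value or by corpus frequency.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct node {
    uint32_t hash;        /* full 32-bit djb2 hash, kept for cheap rejects */
    const char *word;
    struct node *next;
} node;

typedef struct { node **cells; uint32_t mask; } dict;

uint32_t djb2(const char *s)
{
    uint32_t h = 5381u;
    for (; *s != '\0'; s++)
        h = h * 33u + (unsigned char)*s;
    return h;
}

/* size must be a power of two so that (hash & mask) == hash % size.
 * Returns 0 on success, -1 on a bad size or allocation failure. */
int dict_init(dict *d, uint32_t size)
{
    if (size == 0 || (size & (size - 1)) != 0)
        return -1;
    d->cells = calloc(size, sizeof *d->cells);
    d->mask = size - 1;
    return d->cells != NULL ? 0 : -1;
}

/* Push a word onto the front of its chain (assumes it is not present yet).
 * Returns 0 on success, -1 on allocation failure. */
int dict_add(dict *d, const char *word)
{
    node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->hash = djb2(word);
    n->word = word;
    n->next = d->cells[n->hash & d->mask];
    d->cells[n->hash & d->mask] = n;
    return 0;
}

/* Returns 1 if the word is present, 0 otherwise. A hit is moved to the
 * front of its chain so that recurring words are found faster next time. */
int dict_lookup(dict *d, const char *word)
{
    assert(d->cells != NULL);
    uint32_t h = djb2(word);
    node **head = &d->cells[h & d->mask];
    node **link = head;
    for (node *n = *link; n != NULL; link = &n->next, n = n->next) {
        /* Integer compare first; strcmp runs only when the hashes match. */
        if (n->hash == h && strcmp(n->word, word) == 0) {
            if (link != head) {        /* unlink, then relink at the head */
                *link = n->next;
                n->next = *head;
                *head = n;
            }
            return 1;
        }
    }
    return 0;
}
```

With a static dictionary, sorting each chain once by corpus frequency at build time would replace the move-to-front step.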