Why do hash functions use prime numbers?

November 20, 2008

Why do hash functions use prime numbers?

In a previous post i pointed out how questions posted in reward based discussions sites like stackoverflow.com never gets answered satisfactorily. This post is a look at one such feeble answer and makes an effort to explain in more detail a basic question about hashes.

The Question

In Java, the hash code for a String object is computed as

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the i^th character of the string, n is the length of the string, and ^ indicates exponentiation. Why is 31 used as a multiplier? So why not 29, or 37, or even 97?

The partially wrong answer

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

NOTE : Primes are part of an older technique for hashing and there are supposedly better hashes which use more sophisticated techniques. I do not know the mathematical foundations for these, so cant vouch for them but its true that the prime number technique is quite old.

The correct but longer explanation

The reason 31 is chosen is because it is prime – not because it is odd. It so happens that primes are un-divisible by any other numbers. This includes 2 and this makes it odd too but the real reason, is its primeness and one other factor which we shall go into shortly .

So why a prime ?

Hashing works by allocating a value into one of the many storages, a hash has, for fast retrieval later on. The storage a hash has is also termed as buckets in comp literature, for reasons which will become clear later on.

Now, how does the hash identify which bucket it needs to store the value in? This is an important question, due to the property of hashes, which makes it compulsory that a hash be able to tell you in constant time (which is hopefully fast) in which bucket the value is stored in.

The slooow hash find logic

for(i = 0; i < totalkeys; i++) { if (CheckForEquality(key[i], "Samuel")) return i; } [/sourcecode] This sort of sequential search, would cause the hash performance to worsen, directly dependent on the number of value it contains. In other words, you would have a linear performance cost (O(n)), which becomes progressively bad with larger and larger no of keys(n). The other complication is the actual type of the value you are dealing with. If you are dealing with strings and other such complex types, the no of checks or comparisons itself becomes prohibitive in terms of cost.

Now we have 2 problems

Complex values which are difficult to compare
Sequential searches are not fast and cannot give constant time lookups

Solving the complex value problem

The easy way out for this issue is to derive a way to decompose complex values into a key or a hash that is easy to work with. The easiest way to achieve this of-course is to generate UNIQUE numbers from your value. The number has to be UNIQUE since we want to distinguish one value from another. This UNIQUE side of things is where primes come in handy.

Primes

Primes are unique numbers. They are unique in that, the product of a prime with any other number has the best chance of being unique (not as unique as the prime itself of-course) due to the fact that a prime is used to compose it. This property is used in hashing functions.

Given a string “Samuel”, you can generate a unique hash by multiply each of the constituent digits or letters with a prime number and adding them up. This is why primes are used.

However using primes is an old technique. The key here to understand that as long as you can generate a sufficiently unique key you can move to other hashing techniques too. Go here for more on this topic about hashes without primes.

Now why is 31 used?

Ok, so now you have your unique identifier generated for each value. Now how do you allocate this value to a bucket or a container you have.

Lets say your container is a fixed array of 16 items. The easiest way to decide where to stick the value or the generated unique number, from now on referred to as the hash key, is to stick the key in the same location as its value.

So if your hash is 3 you stick your key in the location number 3. But of-course what would happen if the hash number is bigger than your container size. For this reason the initial or the very first hashing algorithms used to divide the generated hash key and stick the key in the location pointed to by the remainder. (This again necessitated that the no of storage locations in a hash was a prime number so as to generate unique remainder values).

So assuming your hash has 11 items, the hash key 12, would be stuck into 12%11 remainder 1 – 1st location.

why is 31 used?

Well, when you stick your keys in the containers, depending on the prime used, you would get a different key number and therefore a different distribution of the keys in your array.

So the same key “abc” would be a*31+b*31+c*31 for prime 31 and it would generate a different key for the abc with 29 – a*29+b*29+c*29.

Since the key produced is different, the data would go into a different location depending on the prime used.

Researchers found that using a prime of 31 gives a better distribution to the keys, and lesser no of collisions. No one knows why, the last i know and i had this question answered by Chris Torek himself, who is generally credited with coming up with 31 hash, on the C++ or C mailing list a while back.

Collissions

There are chances that certain strings might cause the same key to be generated. In such cases, the individual has storage is turned into a link list or some other type of storage that can store all the duplicate keys. This is the reason why the individual hash storage is called as a bucket.

Better hashing algorithms

But the modulo hashing algorithm we just looked at is extremely simplistic. There are much more complex and more faster hash algorithms that can be used. There are even tools that will generate the algorithm for you, given the type of keys you use.

Here are some fast and better distributing hash functions / modules if you wish to improve on the ones you have –

Paul Hseih’s Hash

Bob Jenkin’s Hash and his Dr Dobbs Article on the same

One from Google

3 cheers for better performance !!!

42 Responses to “Why do hash functions use prime numbers?”

Rob Says:

May 13, 2009 at 8:22 am
23 % 11 is 1, not 2

Reply
1. computinglife Says:
  
  May 13, 2009 at 9:34 am
  Thanks. Made the edit and added a few fast hash resources towards the end.
  
  Reply
Billy McCafferty Says:

May 25, 2009 at 1:31 pm
Good stuff!

Reply
Radu Says:

June 30, 2009 at 7:23 pm
“Primes are unique numbers. They are unique in that, the product of a prime with any other number is never the same when you change the number you multiple it with.”

Is “multiple” a typo (multiply instead?)?
Anyway, I think the sentance is true for any 2 given numbers (none of them has to be a prime), i.e. given a fixed number A, (not necessary a prime, but not 0) an an other number B, their product will always be different than A and C, where C is not B

Reply
1. Sorin Says:
  
  August 2, 2010 at 6:13 pm
  It is sentence rather than sentance if you want to be pedantic. The statement above is trivial.
  
  Reply
  1. Pete Wilson Says:
    
    September 30, 2010 at 4:01 pm
    “It is sentence rather than sentance if you want to be pedantic. The statement above is trivial.”
    
    Huh? Documentation /requires/ precision, too often mocked as pedantry. Is /any/ correction too “pedantic”? The statement is right on the money.
    
    Reply
2. Kenan Liu Says:
  
  September 15, 2016 at 2:56 am
  The key idea is “Primes are unique numbers”. For example, if you use 4 * 8, it has more possibilities be collided than product of primes like: 3 * 5. 32 can be calculated by 1*32 or 2*16 or 4*8 or 2^5 etc. But you can get 15 only by 1*15 or 3*5.
  
  Reply
computinglife Says:

July 1, 2009 at 6:16 am
Yes – you are right – i made the change to text to read what i meant about the numbers having a better chance of being unique due to the prime being used.

Of-course distribution entropy and such was not mentioned, as i never intended to go too technical anyway

Check this page for some detailed analysis – http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth

Reply
C#: GetHashCode() might cause OverflowException | Blinded by the lights Says:

September 13, 2010 at 1:00 pm
[…] By the way, if you speak on better uniqueness of the calculated hash, you might include prime numbers as a carefully selected random number multiplier, which in details is described there. […]

Reply
Nelson Says:

May 1, 2011 at 5:21 pm
Thanks a lot for the article, an awesome read.

Reply
computinglife Says:

May 1, 2011 at 5:59 pm
You are welcome. Glad you found it useful.

Reply
curious Says:

March 29, 2012 at 1:44 am
“Given a string “Samuel”, you can generate a unique hash by multiply each of the constituent digits or letters with a prime number and adding them up. This is why primes are used.”

I don’t understand. How can this method produce unique hash codes? wouldn’t “Samuel” and say, “Saluem” or even “Sbmudl” (a+1, e-1) produce the same hash codes?

Reply
1. Malarchist Says:
  
  April 8, 2012 at 3:35 pm
  I think that’s why weighting was mentioned right at the beginning. Multiply the first character by 1, second by 31, third by 31^2 and so on. However, this might result in an overflow as numbers could get very large. From experience in java, when integers overflow, they become negative. So I just test if my hash key is negative and if so, multiply it by -1 to avoid overflow
  
  Reply
GetHashCode – what, how, why « Fast .NET Says:

August 23, 2012 at 10:44 am
[…] <Why do hash functions use prime numbers – Computing life> […]

Reply
compare car insurance houston tx Says:

September 29, 2013 at 11:27 pm
I love looking through a post that will make people think.

Also, thank you for permitting me to comment!

Reply
iOS系列译文：值对象 | zengine Says:

December 23, 2013 at 5:31 am
[…] Why do hash functions use prime numbers? […]

Reply
hashCode and equals methods in java | javaworldblogs Says:

May 16, 2014 at 5:26 pm
[…] Because you want the number you are multiplying by and the number of buckets you are inserting into to have orthogonal prime factorizations. Suppose there are 8 buckets to insert into. If the number you are using to multiply by is some multiple of 8, then the bucket inserted into will only be determined by the least significant entry (the one not multiplied at all). Similar entries will collide. Not good for a hash function. 31 is a large enough prime that the number of buckets is unlikely to be divisible by it (and in fact, modern java HashMap implementations keep the number of buckets to a power of 2). https://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/ […]

Reply
travel Insurance after Cancer Says:

June 28, 2014 at 2:15 am
Hi, this weekend is good designed for me, because this occasion i am reading this
impressive educational paragraph here at my residence.

Reply
top Air conditioning Repair services palm beach city fl Says:

August 13, 2014 at 8:14 pm
I like the helpful info you provide in your articles. I will bookmark your weblog and check again here frequently.

I am quite sure I will learn plenty of new
stuff right here! Good luck for the next!

Reply
How to: Why should hash functions use a prime number modulus? | SevenNet Says:

December 10, 2014 at 5:43 pm
[…] https://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/ […]

Reply
Fixed: Why should hash functions use a prime number modulus? #solution #it #dev | Good Answer Says:

January 2, 2015 at 11:02 am
[…] https://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/ […]

Reply
Resolved: Why should hash functions use a prime number modulus? #computers #programming #solution | IT Info Says:

January 12, 2015 at 7:34 pm
[…] https://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/ […]

Reply
Stuff you positively definitely need to know | THE INTERVIEW HACKER Says:

January 11, 2016 at 12:04 am
[…] 29. Why do hash functions use prime numbers? […]

Reply
Java HashCode | Joys Of My Life Says:

February 3, 2016 at 2:35 pm
[…] Why do hash functions use prime numbers? […]

Reply
Help in understanding the properties of prime numbers - MathHub Says:

April 3, 2016 at 11:21 am
[…] At first I couldn’t get why use a prime when I came to this Why hash functions use primes: […]

Reply
Hashing 的技术 – zhukeyao Says:

November 4, 2016 at 4:54 am
[…] https://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/ […]

Reply
Sandeep Kamath Says:

November 29, 2016 at 5:16 pm
The explanation is very well though off and understood

Reply
Chetan Maheshwari Says:

June 30, 2017 at 3:34 pm
Aftre spending a lot of time on other blogs and sites, finally I got the satisfactory solution.

Thanks a lot and nice explainiation.

Reply
Why hashing functions use 31 in Java – Trying to post some tech ideas that enthused me. Says:

October 14, 2020 at 9:22 am
[…] Was reading about equal and hashcode override contract and came across this doubt. Why do Java classes for hashing use 31? Here’s a good read […]

Reply
Two hits « Off the record - Craig Box Says:

February 8, 2021 at 2:23 am
[…] Result #1: Why do hash functions use prime numbers? […]

Reply
[language-agnostic] 해시 함수가 소수 모듈을 사용해야하는 이유는 무엇입니까? - 리뷰나라 Says:

August 13, 2021 at 10:05 am
[…] https://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/ […]

Reply
Brian Says:

September 3, 2021 at 8:02 pm
Brian

Why do hash functions use prime numbers? | Computing Life

Reply
Help in understanding the properties of prime numbers – Math Solution Says:

March 24, 2022 at 8:35 pm
[…] At first I couldn’t get why use a prime when I came to this Why hash functions use primes: […]

Reply
&period;NET performance optimization – you should set initial size for collection type – JIUYASHUIJING Says:

April 27, 2022 at 6:51 am
[…] have no other divisors), which is beyond the scope of this article, you can poke for details.link 1,link 2,link 3learn […]

Reply
.NET性能优化 – 资源中心 Says:

April 27, 2022 at 7:06 am
[…] 从上面的代码我们就可以得出HashSet和Dictionary每次扩容后的大小就是大于二倍Size的第一个素数，和List直接扩容2倍还是有差别的。至于为什么要使用素数来作为扩容的大小，简单来说是使用素数能让数据在Hash以后更均匀的分布在各个桶中(素数没有其它约数)，这不在本文的讨论范围，具体可以戳链接1、链接2、链接3了解更多。 […]

Reply
.NET性能优化-你应该为集合类型设置初始大小 – 知识星宿 Says:

April 27, 2022 at 11:19 am
[…] 从上面的代码我们就可以得出HashSet和Dictionary每次扩容后的大小就是大于二倍Size的第一个素数，和List直接扩容2倍还是有差别的。至于为什么要使用素数来作为扩容的大小，简单来说是使用素数能让数据在Hash以后更均匀的分布在各个桶中(素数没有其它约数)，这不在本文的讨论范围，具体可以戳链接1、链接2、链接3了解更多。 […]

Reply
.NET性能优化-你应该为集合类型设置初始大小 Says:

August 2, 2022 at 12:40 pm
[…] 从上面的代码我们就可以得出HashSet和Dictionary每次扩容后的大小就是大于二倍Size的第一个素数，和List直接扩容2倍还是有差别的。至于为什么要使用素数来作为扩容的大小，简单来说是使用素数能让数据在Hash以后更均匀的分布在各个桶中(素数没有其它约数)，这不在本文的讨论范围，具体可以戳链接1、链接2、链接3了解更多。 […]

Reply
Why can't I compile an unordered_map with a pair as key? - Design Corral Says:

February 10, 2023 at 3:28 am
[…] for std::pair<std::string, std::string>, as presented by the OP. The example also uses a handcrafted combination of std::hash<std::string> function […]

Reply
car ecu location Says:

August 28, 2023 at 4:50 pm
car ecu location

Why do hash functions use prime numbers? | Computing Life

Reply
ford ecu repair Says:

September 2, 2023 at 2:28 pm
ford ecu repair

Why do hash functions use prime numbers? | Computing Life

Reply
92 Farcaleniom`s statement on its official blog Says:

September 16, 2023 at 7:03 am
92 Farcaleniom`s statement on its official blog

Why do hash functions use prime numbers? | Computing Life

Reply
Get My Boyrfriend Back Says:

January 9, 2024 at 5:58 pm
Get My Boyrfriend Back

Why do hash functions use prime numbers? | Computing Life

Reply

Computing Life