I felt the topic to be at the core of what i do. So spun it off as a permanent page always available on the site.

### Data Structures

December 15, 2008

## What is computer science ?

Posted by computinglife under Data Structures, Programming | Tags: computing, Links |Leave a Comment

November 20, 2008

## Why do hash functions use prime numbers?

Posted by computinglife under Data Structures | Tags: Data Structures, Programming |[27] Comments

In a previous post i pointed out how questions posted in reward based discussions sites like stackoverflow.com never gets answered satisfactorily. This post is a look at one such feeble answer and makes an effort to explain in more detail a basic question about hashes.

**The Question**

In Java, the hash code for a `String`

object is computed as

`s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]`

using `int`

arithmetic, where `s[i]`

is the `i`

^{th} character of the string, `n`

is the length of the string, and `^`

indicates exponentiation. Why is 31 used as a multiplier? So why not 29, or 37, or even 97?

**The partially wrong answer**

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance:

`31 * i == (i << 5) - i`

. Modern VMs do this sort of optimization automatically.

*NOTE : Primes are part of an older technique for hashing and there are supposedly better hashes which use more sophisticated techniques. I do not know the mathematical foundations for these, so cant vouch for them but its true that the prime number technique is quite old.
*

**The correct but longer explanation**

The reason 31 is chosen is because it is prime – not because it is odd. It so happens that primes are un-divisible by any other numbers. This includes 2 and this makes it odd too but the real reason, is its primeness and one other factor which we shall go into shortly .

So why a prime ?

Hashing works by allocating a value into one of the many storages, a hash has, for **fast retrieval** later on. The storage a hash has is also termed as buckets in comp literature, for reasons which will become clear later on.

Now, how does the hash identify which bucket it needs to store the value in? This is an important question, due to the property of hashes, which makes it compulsory that a hash be able to tell you in constant time (which is hopefully fast) in which bucket the value is stored in.

**The slooow hash find logic
**

for(i = 0; i < totalkeys; i++)
{
if (CheckForEquality(key[i], "Samuel"))
return i;
}
[/sourcecode]
This sort of sequential search, would cause the hash performance to worsen, directly dependent on the number of value it contains. In other words, you would have a linear performance cost (**O(n)**), which becomes progressively bad with larger and larger no of keys(n). The other complication is the actual type of the value you are dealing with. If you are dealing with strings and other such complex types, the no of checks or comparisons itself becomes prohibitive in terms of cost.

Now we have 2 problems

- Complex values which are difficult to compare
- Sequential searches are not fast and cannot give constant time lookups

**Solving the complex value problem**

The easy way out for this issue is to derive a way to **decompose complex values into a key or a hash that is easy to work with**. The easiest way to achieve this of-course is to generate UNIQUE numbers from your value. The number has to be UNIQUE since we want to distinguish one value from another. This UNIQUE side of things is where primes come in handy.

**Primes**

Primes are unique numbers. They are unique in that, the product of a prime with any other number has the best chance of being unique (not as unique as the prime itself of-course) due to the fact that a prime is used to compose it. This property is used in hashing functions.

Given a string “Samuel”, you can generate a unique hash by multiply each of the constituent digits or letters with a prime number and adding them up. This is why primes are used.

However using primes is an old technique. The key here to understand that as long as you can generate a sufficiently unique key you can move to other hashing techniques too. Go here for more on this topic about hashes without primes.

**Now why is 31 used?**

Ok, so now you have your unique identifier generated for each value. Now how do you allocate this value to a bucket or a container you have.

Lets say your container is a fixed array of 16 items. **The easiest way** to decide where to stick the value or the generated unique number, from now on referred to as the hash key, is to stick the key in the same location as its value.

So if your hash is 3 you stick your key in the location number 3. But of-course what would happen if the hash number is bigger than your container size. For this reason the initial or the very first hashing algorithms used to divide the generated hash key and stick the key in the location pointed to by the remainder. (This again necessitated that the no of storage locations in a hash was a prime number so as to generate unique remainder values).

So assuming your hash has 11 items, the hash key 12, would be stuck into 12%11 remainder 1 – 1st location.

**why is 31 used?**

Well, when you stick your keys in the containers, depending on the prime used, you would get a different key number and therefore a different distribution of the keys in your array.

So the same key “abc” would be a*31+b*31+c*31 for prime 31 and it would generate a different key for the abc with 29 – a*29+b*29+c*29.

Since the key produced is different, the data would go into a different location depending on the prime used.

Researchers found that using a prime of 31 gives a better distribution to the keys, and lesser no of collisions. No one knows why, the last i know and i had this question answered by Chris Torek himself, who is generally credited with coming up with 31 hash, on the C++ or C mailing list a while back.

**Collissions**

There are chances that certain strings might cause the same key to be generated. In such cases, the individual has storage is turned into a link list or some other type of storage that can store all the duplicate keys. This is the reason why the individual hash storage is called as a bucket.

**Better hashing algorithms**

But the modulo hashing algorithm we just looked at is extremely simplistic. There are much more complex and more faster hash algorithms that can be used. There are even tools that will generate the algorithm for you, given the type of keys you use.

Here are some fast and better distributing hash functions / modules if you wish to improve on the ones you have –

Bob Jenkin’s Hash and his Dr Dobbs Article on the same

3 cheers for better performance !!!

October 27, 2008

## Linked List Efficiency

Posted by computinglife under Data Structures, Programming | Tags: DataStructure |Leave a Comment

A person one year into the industry (comp sc degree and all) recently tried to reverse a linked list by copying the contents from one node to another. While I do admire the ingenuity in thinking to go against the conventional wisdom of thought, i felt that a certain lack of grasp of the underlying structure was to blame for this excessiveness in thinking.

Link list is one of the most sloppy data structure that can be used for organizing data. But it is one of the most convenient (coz it is sloppy and most dynamic) for re-arranging data.

Data when put in a link list looks something like this

- You can insert anything anywhere
- You can take out anything from anywhere without disturbing the overall structure (just as messy as before)
- You can take one item out and put it anywhere else.
- You can always view the top of the shelf as the start and the bottom of the shelf as the start – or vice-versa. Nothing major needs to be done to achieve this exept chaning the direction in which you “traverse”
- Any order present in the items inside would be self impossed (eg programmer considers everything on list A as “Things to be done” vs list B “Things already done”.

Compare it with say a vector which might look something like this

However make no mistake. Link lists can be improved in a variety of ingenious ways as no doubt all of us have found out during college years, to catalogue and find the items in it much faster. Their flexbility and removal of all order in ts structure allows creative minds to apply all sorts of ingenuity and bring out grand solutions for specific problems. So to speak, link list is a grand daddy for many super structures available for use (Binary Tree, RB Tree etc)

**Computer Science**

Once the weak analogy has softened the concept, let me try to address why really the link list is considered in-efficient and sloppy.

**Memory Cost aka Latency**

Link list concists of independent pieces of memory with links to other pieces, that allows one to JUMP from one piece to another much like a treasure hunt progresses.

Inside the computer memory this would be how the link list is arranged.

But the real memory which is but a collection of circuits on your computers motherboard is shown below. Where in it do you think are the addresses 10, 15 200 and 4000 going to be at?

These addresses will be mapped to some RANDOM position inside the physical memory, and would be in most likely situation be places quite far apart in different pieces of the memory circuits.

**Memory is slow**

The modern processors work at 3 Ghz and the best server memory modules (PC memory might be faster) can work at 800 Mhz or so. The processor would have to wait for a long time if it where to fetch the contents of the memory for every instruction it operates on. This would have effectively brought back the speed of your processor to the 800 Mhz Pentium III era.

Therefore jumping around in memory is physcally or rather computationally extremely costly. (note that there are caches that sits in between the CPU and memory in the hope of reducing cost. This doesnt help us either, unless the data that we required are situated close to each other)

**Implications of memory latency for link list**

- Traversing the link list, JUMPING from one item to another is expensive (memory wise ie latency wise ie slow)
- However re-arranging items is easy – all you have to do is change one link to point to another.
- There is extreme flexibility in adding items to the list as you do not need contigous area in memory to store the data. All you require are bits and pieces that can be used to fit things in.

**Code**

Here is a piece of code to perform a doubly link list (nodes have pointers to previous node as well as the next for obvious flexibility reasons in traversal) reversal in C.

void ReverseDoublyLinkedList(Node* & phead) { Node * pNode = NULL; if (!phead) { printf( "Bad head pointer"); return; } pNode = pHead; while (pTemp = pNode -> pNext) { /* switch the previous and the next */ pNode -> pNext = pNode -> pPrevious; pNode -> pPrevious = pTemp; pPrevNode = pNode; pNode = pTemp; } /* Handle the switching of the last node too */ pNode -> pNext = pNode -> pPrevious; pNode -> pPrevious = NULL; /* Change the head node such that the end becomes the new head */ pHead = pNode; return; }

ps : If you are curious about what the impressive looking equipment from the second picture is, they are a set of micro phones using in this MIT project