Skip to content

Key VS Value Hashing

Both Key and Value hashing aim to convert strings or any other data into a unique number. However, both do it in different ways to be optimal for their purposes.

Value / Data hashing

Aims to encrypt data so it is a completely unique number of which is relatively impossible to work backwards (number to string). And there should be no encoding collisions.
Since these hashes are normally stored, the length of them is not too much of an issue.


Key Hashing

Allows objects to be stored as an array.
Say that you have a DNS object {'google.com': '8.8.8.8', 'facebook.com': '31.13.95.36'}.
If you have a ton of domain names, then to find the specific IP of a server requires you to scan all of the keys to find the correct one then return its result.
Thus, the larger than your table gets, your request will increase in time exponentially.

Thus, they encode the Key value into a number, then store it as an array.
Therefor when you are trying to find google.com's IP, you just re-encode the string, then you know the exact index of the result.
No lookup time required.

A method on encoding that would be to literally convert it to a number.

Hex 67 6f 6f 67 6c 65 2e 63 6f 6d
Decimal 5.1679268198614805e+23

Of which as you can tell is longer than the longest possible integer, thus we would have a problem indexing it.
So, we can try and restricting the character spacing and making our own string to number algorithm.

let chars = 'abcdefghijklmnopqrstuvwxyz-_.';

function encode(str){
  let res = 0;

  for (let i=0; i<str.length; i++){
    let j=chars.indexOf(str[i]);
    if (j === -1){
      throw Error('Invalid character:', str[i]);
    }

    let exp = Math.pow(chars.length, i);

    res += j * exp;
  }

  return res;
}

encode('google.com'); //returns 181140446280695
This is a better value since we can actually index it, and as long as the string isn't just made up of as it will be fine (it would be like having a number which is just a bunch of zeros). Also, any trailing as will not be accounted for since this is a little endian encoding, and it would be like having a bunch of zeros in front of your number. This could be fixed by adding 1 to j. But this isn't the point.


To store google.com we need to have at least 181140446280695 blank items before this one, or else the index wouldn't actually match.
Thus, we are not getting into the realm of intentional hashing collisions.

Within this algorithm (excluding the a problem) there will be no two strings that make the same result, thus you will never have any conflicting pointers (having two sites attempting to use the same index).
However, we can case match out an algorithm to better fit our problem.


If we were storing an address book, our key values would only include valid names.
Thus, we would not need to be concerned with the fact that aaa, and baaaa may have the same index.
As those two keys will never actually occur within the system.

So now we can realize that we only need to care about valid data.

This does leave problems for intentionally malicious users to intentionally make conflicts within the system, thus destroying data

let chars = 'abcdefghijklmnopqrstuvwxyz';

function encode(str){
  str = str.toLowerCase();
  let res = chars.indexOf(str[0])+1;

  for (let i=1; i<str.length; i++){
    res *= chars.indexOf(str[i])+1;
  }

  return res;
}

// Test a few names
let names = ['Alex', 'Anthony', 'Carla', 'Casandra', 'Jaxon', 'Jackson', 'Jack', 'Jimmy', 'Jim', 'John'];
for (let name of names){
  console.log(name, encode(name));
}

Name Decimal
Alex 1440
Anthony 11760000
Carla 648
Casandra 57456
Jaxon 50400
Jackson 1316700
Jack 330
Jimmy 380250
Jim 1170
John 16800

As you can see we have no conflicts without sample data, and our biggest value is only 11760000.
Thus, that is the length of our array to store these people efficiently.


We can also push our system to the extreme, getting a list of words and testing for conflicts.
After executing it (code) we get the results;

293555 collided of 370093
  79.31925218796356% 
Which for our little algorithm whipped up in a couple of minutes isn't too bad, especially since the algorithm is for names and less than 30% of the words in the list names, we can say it was almost successful, but heavily not recommend for real-world situations.
We can also conclude that it only made 76538 unique numbers, of which is actually very bad for our use case since the largest result index was 6.868313676231082e+30.
However, it is still important to remember that this is still a much better result than our first hashing algorithm of which couldn't even index an eight-character string let alone a 10.
Also, our first algorithm given the same problem has only 1.176% collision rate, but we cannot actually store any reasonable size strings.


All in all, you can see why it is valuable to have specific case-based hashing functions.
Possible solutions to some of the problems we have seen are using hashing to predict the location but also storing the key along with the data, thus if a collision occurs the new value can be stored in the next empty index, of which can then be searched for when the predicted location does not match. Thus, you can get the speedups of hashing while still storing a large variety of key values.

Another possible solution which can cause unpredictable results is training a recursive neural network of example keys to generate values that do not conflict with one another but still use the smallest amount of indexes possible. This effect is like the freeloading of a general solution of which may not always work and will increase initial index prediction processing.

Foot Note

If you are woundering why there is such a large collision rate with an algorithum that appears to not collide with regular words; it is because of number overflows.
If we make a generic algorithm for how many bytes will be needed to store a decimal value based off of string length; $$

\[ 24 ^ {length} \div 256 \]

So to overflow a 32bit number you only need a length of;

\[ \begin{align*} {24} ^ {length} \div 256 &>= 4 && \\ {24} ^ {length} &>= 4 * 256 && && \\ length * log(24) &>= log(1024) && \\ length &>= \frac{ log(1024) }{ log(24) } && \\ length &>= 2.18 && \\ \end{align*} \]

References

  1. James Curran (NCSS)