# Key VS Value Hashing
Both key and value hashing aim to convert strings (or any other data) into a unique number. However, they do it in different ways, each optimized for its own purpose.
## Value / Data Hashing
Aims to digest data into a unique number from which it is practically impossible to work backwards (number to string). There should also be no encoding collisions.
Since these hashes are normally stored rather than used as indexes, their length is not much of an issue.
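For example, a cryptographic hash like SHA-256 (here via Node's built-in `crypto` module; any similar library would do) produces a fixed-length digest:

```js
const crypto = require('crypto');

// A value hash: fixed length, practically irreversible, collision-resistant.
const digest = crypto.createHash('sha256').update('google.com').digest('hex');
console.log(digest); // 64 hex characters, regardless of input length
```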
## Key Hashing
Allows objects to be stored as an array.
Say that you have a DNS object `{'google.com': '8.8.8.8', 'facebook.com': '31.13.95.36'}`.
If you have a ton of domain names, then finding the IP of a specific server requires scanning all of the keys until the correct one is found.
Thus, the larger your table gets, the longer each request takes; the lookup time grows linearly with the number of keys.
Key hashing avoids this by encoding the key into a number and storing the values in an array at those numeric indexes.
Therefore, when you are trying to find `google.com`'s IP, you just re-encode the string and you know the exact index of the result. No scanning required.
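As a sketch of the idea (using an `encode` function like the ones developed below):

```js
// Hypothetical usage: the key is converted to a number, and that number
// is used directly as the array index.
let table = [];
table[encode('google.com')] = '8.8.8.8';  // store
let ip = table[encode('google.com')];     // retrieve: no scanning needed
```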
One method of encoding would be to literally convert the string's bytes to a number.
Hex | 67 6f 6f 67 6c 65 2e 63 6f 6d |
---|---|
Decimal | 5.1679268198614805e+23 |
As you can tell, this is larger than the largest integer JavaScript can safely represent, so we would have a problem indexing with it.
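As a quick check, here is a sketch of that literal conversion, using `BigInt` since the result exceeds `Number.MAX_SAFE_INTEGER`:

```js
// Reinterpret the string's bytes (shown as hex above) as one big
// base-256 number.
function toNumber(str) {
  let res = 0n;
  for (const byte of new TextEncoder().encode(str)) {
    res = res * 256n + BigInt(byte);
  }
  return res;
}

console.log(toNumber('google.com')); // ~5.1679268198614805e+23
```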
So, we can try restricting the character set and making our own string-to-number algorithm.
```js
let chars = 'abcdefghijklmnopqrstuvwxyz-_.';

function encode(str) {
  let res = 0;
  for (let i = 0; i < str.length; i++) {
    let j = chars.indexOf(str[i]);
    if (j === -1) {
      // Error takes a single message string, so build it with a template.
      throw new Error(`Invalid character: ${str[i]}`);
    }
    // Little-endian positional encoding: character i is a base-29 digit
    // with weight chars.length ** i.
    let exp = Math.pow(chars.length, i);
    res += j * exp;
  }
  return res;
}

encode('google.com'); // returns 181140446280695
```
A string with leading `a`s will be fine (since `a` encodes to 0, it would be like having a number with a bunch of zeros in its lowest digits). However, any trailing `a`s will not be accounted for, since this is a little-endian encoding and they act like a bunch of zeros in front of your number; for example, `ba` and `b` both encode to 1. This could be fixed by adding 1 to `j`. But this isn't the point.
To store `google.com` we need to have at least 181140446280695 blank items before this one, or else the index wouldn't actually match.
Thus, we are now getting into the realm of intentional hashing collisions.
Within this algorithm (excluding the `a` problem) no two strings produce the same result, so you will never have conflicting pointers (two sites attempting to use the same index).
However, we can tailor the algorithm to better fit our specific problem.
If we were storing an address book, our key values would only include valid names.
Thus, we would not need to be concerned that `aaa` and `baaaa` may map to the same index, as those two keys will never actually occur within the system.
So we realize that we only need to care about valid data. This does, however, leave an opening for malicious users to intentionally create collisions within the system and destroy data.
```js
let chars = 'abcdefghijklmnopqrstuvwxyz';

function encode(str) {
  str = str.toLowerCase();
  // Multiply together the 1-based alphabet positions of the characters.
  let res = chars.indexOf(str[0]) + 1;
  for (let i = 1; i < str.length; i++) {
    res *= chars.indexOf(str[i]) + 1;
  }
  return res;
}

// Test a few names
let names = ['Alex', 'Anthony', 'Carla', 'Casandra', 'Jaxon',
             'Jackson', 'Jack', 'Jimmy', 'Jim', 'John'];
for (let name of names) {
  console.log(name, encode(name));
}
```
Name | Decimal |
---|---|
Alex | 1440 |
Anthony | 11760000 |
Carla | 648 |
Casandra | 57456 |
Jaxon | 50400 |
Jackson | 1316700 |
Jack | 330 |
Jimmy | 380250 |
Jim | 1170 |
John | 16800 |
As you can see, we have no conflicts with our sample data, and our biggest value is only 11760000. Thus, that is the length of the array needed to store these people efficiently.
We can also push our system to the extreme by taking a large list of words and testing for conflicts.
After executing it (code), we get the results: 293555 collided of 370093, a collision rate of 79.31925218796356%.
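A sketch of such a test, assuming the name-hashing `encode` from above and a newline-separated word list in a hypothetical `words.txt` (the exact list used originally is not specified):

```js
const fs = require('fs');

// words.txt is a placeholder; any large newline-separated word list works.
const words = fs.readFileSync('words.txt', 'utf8').split('\n').filter(Boolean);

const seen = new Set(); // unique hash values produced so far
let collisions = 0;
let largest = 0;

for (const word of words) {
  const h = encode(word);
  largest = Math.max(largest, h);
  if (seen.has(h)) collisions++;
  else seen.add(h);
}

console.log(`${collisions} collided of ${words.length}`);
console.log(`${(collisions / words.length) * 100}%`);
console.log(`${seen.size} unique numbers, largest index: ${largest}`);
```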
We can also conclude that it only produced 76538 unique numbers, which is actually very bad for our use case, since the largest result index was 6.868313676231082e+30. However, it is still important to remember that this is a much better result than our first hashing algorithm, which couldn't even index an eight-character string, let alone a ten-character one.
Also, our first algorithm, given the same problem, has only a 1.176% collision rate, but its indexes are far too large to actually store any reasonably sized strings.
All in all, you can see why it is valuable to have hashing functions built for specific cases.
A possible solution to some of the problems we have seen is to use hashing to predict the location but also store the key along with the data; if a collision occurs, the new value is stored in the next empty index, which can then be searched for when the key at the predicted location does not match. Thus, you get the speedups of hashing while still supporting a large variety of key values.
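A minimal sketch of that approach (open addressing with linear probing), assuming a fixed table size and one of the `encode` functions above as the hash:

```js
class HashTable {
  constructor(size) {
    this.keys = new Array(size).fill(null);
    this.values = new Array(size).fill(null);
    this.size = size;
  }

  set(key, value) {
    let i = encode(key) % this.size; // predicted location
    for (let probes = 0; probes < this.size; probes++) {
      if (this.keys[i] === null || this.keys[i] === key) {
        this.keys[i] = key;   // store the key alongside the data
        this.values[i] = value;
        return;
      }
      i = (i + 1) % this.size; // collision: try the next index
    }
    throw new Error('Table is full');
  }

  get(key) {
    let i = encode(key) % this.size;
    for (let probes = 0; probes < this.size; probes++) {
      if (this.keys[i] === null) return undefined; // never stored
      if (this.keys[i] === key) return this.values[i];
      i = (i + 1) % this.size; // key mismatch: keep scanning
    }
    return undefined;
  }
}
```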
Another possible solution, which can produce unpredictable results, is training a recurrent neural network on example keys to generate indexes that do not conflict with one another while still using the smallest range of indexes possible. This is effectively freeloading a general solution: it may not always work, and it increases the initial index-prediction processing time.
## Footnote
If you are wondering why there is such a large collision rate with an algorithm that appears not to collide on regular words, it is because of number overflows.
If we make a generic formula for how many bytes are needed to store the resulting value based on string length $n$ (worst case, every character being `z`, i.e. a factor of 26 per character):

$$\text{bytes}(n) = \left\lceil \frac{n \log_2 26}{8} \right\rceil \approx \lceil 0.59\,n \rceil$$

So to overflow a 32-bit number you only need a length of $\lceil 32 / \log_2 26 \rceil = 7$ characters.
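A quick way to verify these numbers (the 53-bit figure below is JavaScript's integer-precision limit for `Number`, which is what actually causes the collisions here):

```js
const bitsPerChar = Math.log2(26); // ≈ 4.70 bits per character, worst case

console.log(32 / bitsPerChar); // ≈ 6.8  -> 7 characters overflow 32 bits
console.log(53 / bitsPerChar); // ≈ 11.3 -> 12 characters lose Number precision
```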