Relational Table DBMS
Recommended: UTF-8 Encoding; Two's Complement Integer
This article will not be discussing:
- Data Encryption
- Data replication / distribution
- Query Languages
- Sorting
- Caching
- Data Conflict resolution
- Optimizing disk reads/writes
These are all extra layers of processing on top of what is discussed here, and each can be covered separately.
Tables are essentially just a list of tuples (aka records, or rows) which are stored in a file. However, tables appear to be 2D, so how does one store them in a linear file?
First of all, we need to break down a table into its components.
Attributes
These are the individual fields of a table. They are of a set size, which makes our job of converting the dataset into a file much easier.
Since we know the length of each attribute, we can literally just store them one after another, and then read back just the bytes that belong to each attribute as they are needed.
Below is an example of finding the slice points to get a specific attribute within an entire row.
let start = 0; // Start of the target attribute, in bytes
let end = 0;   // End of the target attribute, in bytes
for (const attr of attributes) {
    end += attr.size;
    if (attr.name === field) {
        break; // [start, end) is now the byte range of the attribute
    }
    start = end;
}
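With those slice points, pulling an attribute out of a row is just a byte slice. A minimal sketch, assuming the row is held in a Node.js Buffer and the attribute is a fixed-length UTF-8 string (both assumptions, not requirements):

// Assuming `row` is a Buffer holding one whole tuple:
const rawField = row.slice(start, end); // only this attribute's bytes
// For a fixed-length string attribute, decode and drop any null padding:
const value = rawField.toString("utf-8").replace(/\0+$/, "");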
Data Types
Some common table data types and their sizes.
* Float: A signed number which can include non-whole numbers. Made up of 32 bits
* Double: A signed number which can include non-whole numbers, as well as numbers impractical to store as ints (e.g. 6×10^56). Made up of 64 bits
* String: An array of characters, stored at a fixed length so that rows keep a set size
* Int: An integer value encoded using two's complement
* Int8: An integer made up of 8 bits
* Int16: An integer made up of 16 bits
* Int32: An integer made up of 32 bits
* Int64: An integer made up of 64 bits
* UInt: A positive-only (unsigned) integer value
* UInt8: An unsigned integer made up of 8 bits
* UInt16: An unsigned integer made up of 16 bits
* UInt32: An unsigned integer made up of 32 bits
* UInt64: An unsigned integer made up of 64 bits
* Boolean: A true/false value which takes up a whole byte, since a byte is the smallest addressable unit of storage.
* Date: The method of storing dates can vary from system to system. You could encode your dates as 32-bit ints using UNIX time, or you could use a string and store it using ISO 8601
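Tying this back to attributes, here is a sketch of how a table's attributes might be declared; the names and sizes below are purely illustrative:

// A hypothetical schema; each attribute has a fixed size in bytes.
const attributes = [
    { name: "id",        size: 4  }, // Int32
    { name: "firstName", size: 20 }, // fixed-length UTF-8 string
    { name: "age",       size: 1  }, // UInt8
    { name: "balance",   size: 8  }, // Double
];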
Tuples
Since we now know how to read attributes from a tuple, we can use a similar approach to read a tuple from a table.
First of all, we need to find the size of each tuple:
let tupleSize = 0; // In bytes
for (const attr of attributes) {
    tupleSize += attr.size;
}
Now we can read a tuple in the exact same way: because every tuple is the same size, the n-th tuple sits at a fixed offset in the file.
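As a sketch, assuming for simplicity that the table's bytes are already in a Buffer called tableBytes (the footnote at the end revisits this assumption):

// Byte range of the n-th tuple (0-indexed), given fixed-size tuples.
const start = n * tupleSize;
const end = start + tupleSize;
const tuple = tableBytes.slice(start, end);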
Primary and Composite Keys
Keys are a way of uniquely identifying individual rows.
A primary key is a single attribute which is unique for every row within the table (e.g. ID).
A composite key is when multiple fields in combination are used to uniquely identify a row (e.g. First & Last Name).
Both of these key systems can be implemented in multiple ways.
Depth Key
This is a simple approach where the system reads just the key values of each row until it finds a match with the request, then it returns the whole row.
Thus, the time taken to find a given tuple increases linearly with the number of rows.
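A minimal sketch of such a scan, assuming Node.js Buffers and reusing the slice points from earlier; keyStart/keyEnd mark where the key sits inside each tuple, and all names are illustrative:

function findByKey(tableBytes, key, tupleSize, keyStart, keyEnd) {
    // Walk tuple by tuple, comparing only the key bytes of each row.
    for (let offset = 0; offset + tupleSize <= tableBytes.length; offset += tupleSize) {
        const candidate = tableBytes.slice(offset + keyStart, offset + keyEnd);
        if (candidate.equals(key)) {
            return tableBytes.slice(offset, offset + tupleSize); // the whole row
        }
    }
    return null; // no row with that key
}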
Key Hashing
The concept of Key Hashing is to convert a given value into a unique number (e.g. James => 1; Bob => 2); this means that you can hash a given key to get the index of the row belonging to that key.
Thus, this method can vastly improve the time it takes to get data from the table, since it does not need to scan each tuple to see if its key matches.
However, no hashing algorithm is perfect, and there is a trade-off of space for speed: you will end up with empty indexes between the data you have stored.
E.g. if you hash 'Bob' => 3 and 'Sally' => 54, and they are the only people in your database, you will still need to store rows at indexes 1-2 & 4-53 (empty) so that the hashing algorithm points to the correct tuple. Since you always use the hashing function to get the index and then just read off that index, the time taken to read a row is a constant, making this optimal for large databases where size limitations are not as much of an issue.
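A toy sketch of the idea; this hashKey is a deliberately naive stand-in, and a real system would need a function with far fewer collisions (or a collision-handling scheme):

// Map a string key to a row index in a table of `capacity` slots.
function hashKey(key, capacity) {
    let h = 0;
    for (const ch of key) {
        h = (h * 31 + ch.charCodeAt(0)) % capacity;
    }
    return h;
}

// Constant-time read: hash the key, then slice that row straight out.
function readByHashedKey(tableBytes, key, tupleSize, capacity) {
    const index = hashKey(key, capacity);
    const start = index * tupleSize;
    return tableBytes.slice(start, start + tupleSize);
}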
Referencing
This allows data to be connected almost directly to allow the storing of more complex data.
For instance, instead of storing a client's address with every online order they make, you can instead just store the client ID as a reference, then read the address from the client table.
Key Referencing
Uses the systems previously described to find the target row.
If it is a composite key, then you will need to store all of the key values for the reference to work.
Exact Referencing
Instead of needing to re-find the target row every time, the reference is the exact index of the row. Thus, no extra processing is required to find the target tuple; instead, it can just be read off.
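A sketch of an exact-reference read, using the same fixed-size-tuple layout as before (function and parameter names are illustrative):

// The stored reference *is* the row index, so reading the target tuple
// is just an offset calculation plus a slice.
function readByExactRef(tableBytes, rowIndex, tupleSize) {
    const start = rowIndex * tupleSize;
    return tableBytes.slice(start, start + tupleSize);
}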
Compacting a Table
This task removes any empty rows from the database, one after another, to decrease the size of the table. This operation, while space-effective, will take time, and during processing no data can reliably be read from or written to the table for fear of corruption. Thus, it is a highly expensive processing task.
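As a sketch of the mechanism (isEmpty is a hypothetical check, e.g. testing an "in use" flag byte stored in each tuple):

// Copy only the non-empty tuples into a new, smaller table.
function compact(tableBytes, tupleSize) {
    const kept = [];
    for (let offset = 0; offset + tupleSize <= tableBytes.length; offset += tupleSize) {
        const tuple = tableBytes.slice(offset, offset + tupleSize);
        if (!isEmpty(tuple)) {
            kept.push(tuple);
        }
    }
    return Buffer.concat(kept); // row indexes after this point have shifted
}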
Also, if you are using Key Hashing or Exact Referencing for performance speedups, compacting will break the entire system: reads will return incorrect tuples, and may even attempt to read outside of the table itself.
Thus, it has become a rather outdated practice, only ever really used for archives or small-scale personal databases.
Footnote
Since entire tables can be bigger than the RAM of a single computer, modern DBMSs never read the whole table at once. Instead, they read it as a stream of data.
For instance, you have an input buffer: you pipe all of the data from your read stream into it, and once you have enough data to make up a tuple, you remove that first tuple's worth of data from the buffer, run your operation on it, then pipe your result to another file.
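A minimal sketch of that loop, assuming Node.js streams; the file names, tupleSize, and processTuple are all illustrative stand-ins:

const fs = require("fs");

const input = fs.createReadStream("clients.table");
const output = fs.createWriteStream("result.table");
let buffer = Buffer.alloc(0); // the input buffer described above

input.on("data", (chunk) => {
    buffer = Buffer.concat([buffer, chunk]);
    // Peel off and process one whole tuple at a time.
    while (buffer.length >= tupleSize) {
        const tuple = buffer.slice(0, tupleSize);
        buffer = buffer.slice(tupleSize);
        output.write(processTuple(tuple)); // processTuple is hypothetical
    }
});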
Thus, you can now work with massive datasets and produce massive query results. When your operation is done, you can just pipe that entire output file to wherever your result needs to go.
Also, it is important to know that this output cache file is not necessary; for instance, you could instead write directly to a table, or pipe your results back to whatever made the request in the first place.
References
- Information Processes and Technology (The HSC Course) - Samuel Davis
- CS186 Berkeley - Lecture 1