Assignment 6 A Hash Table Programming assignment

 

due at the end of exams Dec 20th.

Please email your code, results and the answers to the questions below.

The following is an example implementation of a hash table is on the class website.

*        itemtype.h

*        table.h (sets up an abstract base class for tables)

*        hash.h (derives a hash table class)

*        hash.cpp

*        hash.txt (text file of data to use)

*        hashmake.cpp (creates a hash table)

*        hashread.cpp (reads data from a hash table)

 

There are two main programs in this example. Both of them use the .h files and the hash.cpp file.

 

The main function for the first program is in hashmake.cpp. This program builds a hash table from the data in the hash.txt file. The hash table is placed in a file named hash.dat.

The second program has its main function in the hashread.cpp file. This program allows the user to interactively look up data in the hash table. It reads this data from the hash.dat file, of course. Note that this example uses coalesced chaining (a linked list inside the table with all items after the first in an overflow area of the table).

 

Let's see how it works.

The file is organized as a binary file of records.

Each record has an Info and a Next field.

The Next field will contain the record number for the next record in the linked list (when we have to chain nodes together) or a -1 as a fake NULL pointer.

Record 0 (the first record in the file) is not used to hold node data at all. Instead, it holds in Info.KeyField the number of items in the table. That way this count is stored permanently, along with the file. When an empty table is first created, the entire file is filled with empty nodes.

         Thus, the special record 0, followed by empty nodes 1 through MaxFile = 150 are written to the file. These nodes are marked as empty by placing 0 in the KeyField locations. (This assumes that 0 is not a normal KeyField value.)

         The regular part of the table includes nodes 1 through Prime = 101, and the overflow area is from node 102 to node MaxFile = 150. A field of the HashTableClass object, called OverflowIndex, is used to keep track of the next free record for overflow use. (Thus OverflowIndex is initially set to 102. If it reaches 151, the overflow area is full.)

The following is a drawing of a HashtableClass object, showing the five data fields.
Note that the OpenMode field contains the character r or w, indicating whether the table has been opened in read or write mode. NodeSize is the number of bytes per node. Since this value is often needed, it is convenient to store it in a field of the object. The DataFile field is an fstream object containing information about the file where the table's data is held. The details of an fstream object are beyond what we will study here, but include information such as the current position in the file.

[hash table]

Note that when we open a hash table in write mode, we really open the underlying data file, hash.dat for both input and output. We also specify that it is a binary file. Depending on your compiler, you may need to use ios::trunc as was done here. This tells the program to truncate (erase) any existing file by the same name when the open is carried out.

The data for this program is read from a text file, hash.txt, that contains an integer (long) followed by a name (character string), with these items on separate lines. To make it easy to read in the strings, no spaces are used. Rather, underscores are used where spaces would normally appear, as in Bill_Clinton, the name that goes with key value 7.

If you run the hashmake program from within the Visual C++ environment you need to be sure that the hash.txt file is in the project directory for this program. If instead you want to run the program by double clicking on the hashmake.exe file in MyPrograms, then you want hash.txt to be in the same directory (usually the Debug directory) that contains hashmake.exe.

The hash function used is hash(key) = (key % prime) + 1. The + 1 is there because record 0 is not used for table data. Thus the hash function has to produce the integers from 1 to prime. When inserting an item in the table, the item's key value is hashed to produce the desired location. The record at that location is read and its Info.KeyField examined to see if it is 0 (which indicates an unused record). If so, a record containing the item is written to the file at this location. If, however, the location produced by the hash function is in use, we insert the item at the location given by OverflowIndex (unless OverflowIndex > MaxFile, which means that the overflow area is full). The new record written to this location contains item and a Next field containing the record number from the Next field of the record at the location produced by the hash function. Finally, the record at the location produced by the hash function is rewritten so as to contain in its Next field OverflowIndex, the location of our newly inserted node. OverflowIndex is then incremented, of course. Note that after writing the above 2 records to the file, we have correctly linked the new record into the linked list of records hashing to that same location, with the new record added as the second record of the list.

The following drawing shows our hash table after 4 items have been inserted. One item hashed to location 1 and was inserted there. The other 3 items hashed to location 3, so that two of them had to be placed in the overflow area. Note how these three are linked together in a linked list via the Next field values. Also note how the number of items (4 in this case) is stored in the NumItems field in the object, as well as in node 0 in the file. (However, node 0 is only updated when we finish using the table. This is so that we do not waste time repeatedly writing to node 0 just to update that counter.)

[hash table]

Next, let's consider the second program in this example. It allows us to interactively look up data from the hash table. When a table is first opened for reading, the special record 0 is read in so that the number of items in the table can be discovered by examining Info.KeyField. To do a lookup of a key value, we first compute hash(key) and read in the record with that number as its location. If the key in this record matches, we are done. If not and the Next field of this record contains -1 (used as a fake NULL pointer), then this key is not in the table. Otherwise, the Next field number is the location for the next record that hashes to this location. We read this new record (which will be in the overflow area) and see if the key matches the target key. If not, we check the Next field. If we have -1 we report that the key cannot be found. Otherwise, we read the record whose location is given by this Next field, check its key value, etc. The search ends when we either find a matching key or get a -1 as the Next pointer.

Assignment.

1) Modify this program so that it uses open addressing hash table using linear probe. Determine how the file for the table should be organized for this new method.

2) Describe how you choose the initial hash function for the above?

3) Modify your program which returns the number of collisions encountered when hashing data[0], data[1], data[2], ...,data[n-1] into a hash table. and compare the number of collisions of the two different functions.

4) What are the various advantages and disadvantages of the various collision-resolution strategies?

Extra Credit: open addressing hash table using quadratic probe function