CMPT 225 Assignment 4: Hash Tables and the Dictionary ADT

The provided code word_frequencies.cpp is complete. This uses your hash table to compute word frequencies according to the following algorithm.

for each word (w words) // Each word lookup word in hash table if it is present, increment its value by 1 else insert it into the hash table end keys = array of all keys (words) in the hash table for each key lookup count in hashtable print key + ": " + count end
./word_frequencies data/1400.txt ./word_frequencies data/pg2600.txt
ch0 * 32n-1 + ch1 * 32n-2 + . + chn-2 * 321 + chn-1 * 320

where ch0 is the left most character of the string, characters are represented as numbers in 1-26 (case insensitive), and n is the size of the string. For example the string "cat" has the value:

3 * 322 + 1 * 32 + 20 = 3124

Part 2: Indexing documents (bonus marks)

The provided code index_directory.cpp is complete. It uses a hash table to allow indexing documents, finding all documents that contain a certain substring.

// Preprocessing to build the index for each document for each word in the document add document to the list of documents for this word end end // Answer queries while true ask user for query word find all documents containing word end

InvertedIndex is similar to the first HashTable. However, instead of storing integer counts, it should store lists of document names. This is done in the methods InvertedIndex::add and InvertedIndex::lookup. You can modify how these are implemented if you wish, e.g. with a linked list, a vector, a set, etc. You might need to modify index_directory.cpp based on your decisions. This data structure is called an inverted index, and is a standard, very useful data structure for document retrieval. Modern search engines essentially use this (with many details). You can use the program document_index to find documents in a specified directory. document_index takes a directory as a parameter, and builds an inverted index for all files in all subdirectories. It can further limit the index to specified file types (e.g. just .h and .cpp). After building the index, the user can query for different words.

> ./index_directory . Found a file: ./ii_debug Found a file: ./ii_debug.cpp Found a file: ./index_directory Found a file: ./index_directory.cpp Found a file: ./InvertedIndex.cpp Found a file: ./InvertedIndex.h Found a file: ./InvertedIndex.o Found a file: ./Makefile Indexed . found 0 subdirectories and 8 files built index with 789 keys Enter a word to search for, q to terminate break Searched for break found 1 occurrences: Enter a word to search for, q to terminate for Searched for for found 4 occurrences: Enter a word to search for, q to terminate q

Assessment and Submission

Submission

unzip part2.zip cd part2 make

Assessment