Skip to content

Multi-threaded text file indexing command line application.

Notifications You must be signed in to change notification settings

prafful1/File-Indexer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 

Repository files navigation

File-Indexer

Multi-threaded text file indexing command line application.

System Info:

  • Linux 4.16.0-rc4 #3 SMP Sat Mar 10 02:42:36 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
  • gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)

Configuration:

  • git clone https://github.com/prafful1/File-Indexer.git
  • cd File-Indexer
  • Run "make"
  • Generates "bin" and "obj" directory.
  • Generates "file_indexer" binary in "bin" folder.
  • Run following command in "bin" folder: "./file_indexer <Directory_Path>".
  • Run "make clean" to delete files from "obj/" and "bin/" folder.

Output:

  • Input Command: ./file_indexer /root/text_files-dir

  • Sample Output:

  • ######################################################

  • Scanner Thread started

  • ######################################################

  • Worker Thread 1 started

  • ######################################################

  • Worker Thread 2 started

  • ######################################################

  • Worker Thread 3 started

  • File Path extracted from message queue /root/text_files-dir/file2.txt

  • ######################################################

  • Worker Thread 4 started

  • File Path extracted from message queue /root/text_files-dir/pqrs/pqrs2/zmap.txt

  • ######################################################

  • File Path extracted from message queue /root/text_files-dir/pqrs/text.txt

  • File Path extracted from message queue /root/text_files-dir/pqrs/file3.txt

  • File Path extracted from message queue /root/text_files-dir/pqrs/file.txt

  • File Path extracted from message queue /root/text_files-dir/b.txt

  • File Path extracted from message queue /root/text_files-dir/file.txt

  • File Path extracted from message queue /root/text_files-dir/abcd/e.txt

  • File Path extracted from message queue /root/text_files-dir/abcd/c.txt

  • File Path extracted from message queue /root/text_files-dir/abcd/file3.txt

  • Word: the, Frequency: 95

  • Word: and, Frequency: 70

  • Word: to, Frequency: 67

  • Word: of, Frequency: 50

  • Word: that, Frequency: 49

  • Word: a, Frequency: 48

  • Word: in, Frequency: 38

  • Word: is, Frequency: 33

  • Word: about, Frequency: 30

  • Word: authors, Frequency: 25

Design:

  • 1 scanner thread that recursively iterates through <Directory_Path> and searches for all ".txt" files.
  • Scanner thread adds all paths to message queue.
  • Message Queue is implemented with a singly linked list.
  • A Scanner thread acquires write lock on linked-list and adds a new node to the tail of list with ".txt" file path in it.
  • Linked-list is used as message queue because it can dynamically grow in size.
  • 4 worker threads, read file paths from Message Queue in parallel.
  • Each worker thread acquires write lock on message queue and reads file path from top. Worker thread removes top element and reads file path.
  • After node is removed from front, lock on message queue is released.
  • Each worker thread reads a file provided to them, tokenizes words in each line on the basis of following delimeter " .,-".
  • Each word is added to hash_map. If word is not available in hash_map, then word is added to map with counter set to 1. If word is already available, counter is incremented.
  • I provided my own implementaion of thread safe Hash Map.
  • Every time, a word is added to map, a write lock is aquired and then added.
  • In case of collison, next available is being looked and word is added to that bucket.
  • Current implementation of hash map only supports a maximum of 1000000 words.
  • After all worker threads are done writing into hash map, they exit.
  • main thread runs through hash_map and searches for Top 10 words with maximum frequency and prints that on console.

Current Implementation:

  • 1 Scanner Thread. Can be increased.
  • 4 Worker Threads. Can be changed.
  • Hash_Map can store a maximum of 1000000 unique words. Can be changed.

About

Multi-threaded text file indexing command line application.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published