This project is a prototype for a text-based search engine that ranks documents using the TF-IDF (Term Frequency-Inverse Document Frequency) scoring algorithm. The current implementation parses RSS feeds from the Times of India and BBC news websites, stores the articles' content in text files, and then uses these files to create an index and a vector space model. Once the index is ready, users can query it to get a list of documents ranked by their TF-IDF scores.
- RSS Parsing: Parses RSS links from the Times of India & BBC news websites and stores the articles' content in text files.
- Index Creation: Parses the text files to create an index and a vector space model for each document.
- TF-IDF Ranking: Queries the index and returns a list of documents ranked by TF-IDF score, highest first.
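To make the ranking step concrete, here is a minimal sketch of TF-IDF scoring over a handful of in-memory documents. It is illustrative only and is not taken from the project's scripts: the names `tokenize` and `rank`, the whitespace tokenizer, and the exact weighting (term frequency normalized by document length, multiplied by `log(N / df)`) are all assumptions, and the real implementation may weight terms differently.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; no stemming or stop words.
    return text.lower().split()


def rank(query, docs):
    """Rank documents for a query by summed TF-IDF of the query terms.

    docs maps a document name to its text. Returns (name, score) pairs,
    highest score first.
    """
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # tf normalized by document length, idf = log(N / df).
                score += (tf[term] / len(tokens)) * math.log(n_docs / df[term])
        scores[name] = score

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    sample = {
        "a.txt": "cricket match today",
        "b.txt": "election results today",
        "c.txt": "cricket cricket score",
    }
    for name, score in rank("cricket", sample):
        print(name, round(score, 3))
```

A document that mentions the query term more often relative to its length ranks higher, and a term that appears in every document contributes nothing because `log(N / df)` is zero.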
- rss_feed_scraper_toi.py: Script for parsing RSS feeds and storing article contents in text files.
- rss_feed_scraper_bbc.py: Script for parsing RSS feeds and storing article contents in text files.
- create_index.py: Script for creating an index and vector space model from the text files.
- query_index.py: Script for querying the index and ranking documents based on TF-IDF scores.
- source_files/: Directory where the text files containing article contents are stored.
- index.txt: File where the created index is stored.
- doc_vector_space.txt: File where the vector magnitude for unique terms in each file is stored.
Run the rss_feed_scraper_toi.py script to parse RSS feeds from the Times of India website and store the articles' content in text files:
python3 rss_feed_scraper_toi.py
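For readers unfamiliar with RSS, the sketch below shows roughly what extracting items from a feed involves, using only the standard library. It is an assumption-laden illustration rather than the scraper's actual code: `extract_items` and `SAMPLE_RSS` are hypothetical names, the feed is an inline string instead of a live download, and the real script may use a different parsing library and additionally fetch each article's page.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline feed so the example runs without network access.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>First story</title><link>https://example.com/1</link></item>
<item><title>Second story</title><link>https://example.com/2</link></item>
</channel></rss>"""


def extract_items(rss_xml):
    """Return a list of (title, link) pairs for every <item> in the feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        items.append((title, link))
    return items


if __name__ == "__main__":
    for title, link in extract_items(SAMPLE_RSS):
        print(title, link)
```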
Run the create_index.py script to create an index from the stored text files:
python3 create_index.py
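As a rough picture of what index creation produces, the sketch below builds an inverted index (term to per-document counts) and a vector magnitude per document from in-memory text. The function name `build_index`, the use of raw term counts as vector components, and the in-memory input are assumptions made for illustration. The on-disk layout of index.txt and doc_vector_space.txt is not described here, so this sketch does not try to reproduce it.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index and per-document vector magnitudes.

    docs maps a document name to its text.
    Returns (index, magnitudes) where index[term][name] is the term count
    in that document and magnitudes[name] is the Euclidean norm of the
    document's term-count vector.
    """
    index = defaultdict(dict)
    magnitudes = {}
    for name, text in docs.items():
        # Assumption: lowercase, whitespace-separated tokens.
        counts = Counter(text.lower().split())
        for term, count in counts.items():
            index[term][name] = count
        magnitudes[name] = math.sqrt(sum(c * c for c in counts.values()))
    return dict(index), magnitudes


if __name__ == "__main__":
    index, magnitudes = build_index(
        {"a.txt": "cricket cricket score", "b.txt": "score"}
    )
    print(index)
    print(magnitudes)
```

Storing the magnitudes alongside the index means a query only needs to look up the postings for its own terms, rather than re-reading every document to normalize scores.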
Run the query_index.py script with the query you want to search for. To emphasize a particular term, wrap it in double quotes.
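One way to picture the double-quote convention is a small parser that separates quoted terms from the rest of the query. This is a hypothetical sketch: `parse_query` is not a function from query_index.py, and how the script actually treats emphasized terms (for example, boosting their weight or requiring their presence) is not specified here, so the example only identifies them.

```python
import re

# Matches either a double-quoted span or a run of non-whitespace characters.
TOKEN_RE = re.compile(r'"([^"]+)"|(\S+)')


def parse_query(query):
    """Split a query into lowercase terms and the subset marked as emphasized.

    Returns (terms, emphasized): terms is a list in query order, emphasized
    is the set of terms that appeared inside double quotes.
    """
    terms, emphasized = [], set()
    for quoted, plain in TOKEN_RE.findall(query):
        if quoted:
            words = quoted.lower().split()
            terms.extend(words)
            emphasized.update(words)
        else:
            terms.append(plain.lower())
    return terms, emphasized


if __name__ == "__main__":
    print(parse_query('india "cricket" win'))
```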
With the current implementation, the source_files directory can take up a lot of memory. Optimizations are on the way. Stay tuned :)