Text-Based Search Engine Prototype

Overview

This project is a prototype for a text-based search engine that ranks documents using the TF-IDF (Term Frequency-Inverse Document Frequency) scoring algorithm. The current implementation parses RSS feeds from the Times of India and BBC news websites, stores the articles' content in text files, and then uses these files to create an index and a vector space model. Once the index is ready, users can query it to get a list of documents ranked by their TF-IDF scores.

Features

RSS Parsing: Parses RSS feeds from the Times of India and BBC news websites and stores the articles' content in text files.
Index Creation: Parses the text files to create an index and a vector space model for each document.
TF-IDF Ranking: Queries the index and returns a list of documents ranked by TF-IDF score, highest first.
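
To make the ranking idea concrete, here is a minimal sketch of TF-IDF scoring over a handful of in-memory documents. It is not the code from query_index.py: the function names, the whitespace tokenizer, and the plain log(N / df) form of IDF are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def tf_idf_rank(docs, query):
    """Rank documents by the summed TF-IDF of the query terms.

    docs maps a document name to its text. Hypothetical helper,
    not taken from the project.
    """
    n_docs = len(docs)
    term_counts = {name: Counter(tokenize(text)) for name, text in docs.items()}

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    scores = {}
    for name, counts in term_counts.items():
        total_terms = sum(counts.values())
        score = 0.0
        for term in tokenize(query):
            if counts[term] == 0:
                continue  # term absent from this document
            tf = counts[term] / total_terms
            idf = math.log(n_docs / doc_freq[term])
            score += tf * idf
        if score > 0:
            scores[name] = score

    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample_docs = {
        "a.txt": "cricket match today",
        "b.txt": "election results today",
        "c.txt": "cricket cricket score",
    }
    for name, score in tf_idf_rank(sample_docs, "cricket"):
        print(name, round(score, 4))
```

With this form of IDF, a term that appears in every document scores zero, so such terms do not affect the ranking.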

Project Structure

rss_feed_scraper_toi.py: Script for parsing the Times of India RSS feeds and storing article contents in text files.
rss_feed_scraper_bbc.py: Script for parsing the BBC RSS feeds and storing article contents in text files.
create_index.py: Script for creating an index and vector space model from the text files.
query_index.py: Script for querying the index and ranking documents based on TF-IDF scores.
source_files/: Directory where the text files containing article contents are stored.
index.txt: File where the created index is stored.
doc_vector_space.txt: File where the vector magnitude for the unique terms in each file is stored.

Usage

Step 1: Parse RSS Feeds

Run the rss_feed_scraper_toi.py script to parse RSS feeds from the Times of India website and store the articles' content in text files. The rss_feed_scraper_bbc.py script does the same for the BBC feeds.

python3 rss_feed_scraper_toi.py
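
For readers curious what this step involves, the sketch below parses RSS items with Python's standard library and writes one text file per item. It is an illustration only: the sample feed, the function names, and the article_N.txt naming are assumptions, and the actual scraper scripts may use a different library and fetch the full article text rather than the feed description.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# A tiny hand-written feed so the example runs without network access.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/first</link>
      <description>Summary of the first article.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/second</link>
      <description>Summary of the second article.</description>
    </item>
  </channel>
</rss>
"""


def parse_rss_items(xml_text):
    """Return a list of dicts with the title, link and description of each item."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "description": item.findtext("description", default="").strip(),
        })
    return items


def save_items(items, out_dir):
    """Write each item to its own text file (hypothetical naming scheme)."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for number, item in enumerate(items, start=1):
        target = out_path / f"article_{number}.txt"
        target.write_text(f"{item['title']}\n{item['description']}\n", encoding="utf-8")
        written.append(target)
    return written


if __name__ == "__main__":
    for path in save_items(parse_rss_items(SAMPLE_RSS), "example_source_files"):
        print("wrote", path)
```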

Step 2: Create the index

Run the create_index.py script to build the index and the vector space model from the text files stored in source_files/.

python3 create_index.py
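
As a rough picture of what an index and per-document vector magnitudes can look like, here is a small sketch. It assumes an inverted index that maps each term to per-document counts and a magnitude computed from TF-IDF weights; the real layout of index.txt and doc_vector_space.txt is not documented here, so treat every name and formula below as hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index: term -> {document name: term count}."""
    index = defaultdict(dict)
    for name, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][name] = count
    return dict(index)


def vector_magnitudes(index, n_docs):
    """Compute the Euclidean length of each document's TF-IDF vector."""
    squared = defaultdict(float)
    for postings in index.values():
        # Terms found in every document get idf = 0 and contribute nothing.
        idf = math.log(n_docs / len(postings))
        for name, count in postings.items():
            squared[name] += (count * idf) ** 2
    return {name: math.sqrt(total) for name, total in squared.items()}


if __name__ == "__main__":
    sample_docs = {
        "a.txt": "cricket match",
        "b.txt": "cricket score score",
    }
    index = build_index(sample_docs)
    print(index)
    print(vector_magnitudes(index, len(sample_docs)))
```

Precomputing the magnitudes at index time means a query only has to look up the postings for its own terms, rather than rescanning every document.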

Step 3: Query the index

Run the query_index.py script with the query you want to search for. To emphasize a particular term, wrap it in double quotes.
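
One way quoted terms could be given extra emphasis is sketched below: the query is split into terms, and any term inside double quotes receives a higher weight. How query_index.py actually handles quotes is not described here, so the parse_query function and the boost value of 2.0 are purely illustrative assumptions.

```python
import re


def parse_query(query, boost=2.0):
    """Return {term: weight}; terms inside double quotes get `boost`."""
    weights = {}
    # Each match is either a double-quoted segment or a single unquoted word.
    for quoted, plain in re.findall(r'"([^"]*)"|([^"\s]+)', query):
        weight = boost if quoted else 1.0
        for term in (quoted or plain).lower().split():
            # If a term appears both quoted and unquoted, keep the larger weight.
            weights[term] = max(weights.get(term, 0.0), weight)
    return weights


if __name__ == "__main__":
    print(parse_query('latest "cricket" score'))
```

The resulting weights could then multiply each term's TF-IDF contribution when documents are scored.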

Note

The source_files directory can take up a lot of disk space in the current implementation. Optimizations are on the way. Stay tuned :)
