This project report outlines the specification of the search engine built as the final project for CS 582 - Information Retrieval. The goal of this project is to design a web search engine consisting of roughly three stages: a web crawler, preprocessing, and indexing of the results under a weighting scheme. The web crawler starts from a link in the UIC domain and crawls web pages; each captured page is stored on disk. The stored pages are then preprocessed by the processing system developed during Homework 1, and the results are weighted according to the TF-IDF weighting scheme. The TF-IDF data, together with a similarity measure such as cosine similarity, is then used to return results for the query that the user provides.
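To make the weighting and ranking steps concrete, below is a minimal, illustrative Python sketch of TF-IDF weighting combined with cosine-similarity ranking. It is not the project's actual implementation: function names such as build_index and rank are hypothetical, and the token lists stand in for the output of the Homework 1 preprocessor (tokenization, stop-word removal, stemming).

```python
# Illustrative sketch of TF-IDF weighting + cosine-similarity ranking.
# Names (build_index, rank, ...) are hypothetical, not the project's API.
import math
from collections import Counter

def build_index(docs):
    """docs: {doc_id: list of preprocessed tokens} -> (TF-IDF vectors, idf table)."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for tokens in docs.values():
        df.update(set(tokens))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        vectors[doc_id] = {t: tf[t] * idf[t] for t in tf}
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def rank(query_tokens, vectors, idf, k=10):
    """Score every document against the query vector and return the top-k results."""
    qtf = Counter(query_tokens)
    qvec = {t: qtf[t] * idf.get(t, 0.0) for t in qtf}
    scores = {doc_id: cosine(qvec, dvec) for doc_id, dvec in vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

if __name__ == "__main__":
    docs = {"d1": ["uic", "computer", "science"], "d2": ["uic", "library", "hours"]}
    vectors, idf = build_index(docs)
    print(rank(["uic", "science"], vectors, idf))
```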
Note - This project was created and tested in the PyCharm IDE.
- First, create a folder called "RetrievedDocs" inside the root directory.
- Install the dependencies required for this project: 'pip install requests', 'pip install BeautifulSoup' (the beautifulsoup4 package on Python 3), 'pip install numpy', and 'pip install nltk'. If prompted while running searchRetrival to download the stemmer files, please download and install them accordingly.
- To start scraping, run the crawler by issuing the command 'python crawler.py'. A minimal, illustrative sketch of this crawling step appears after this list.
- To work on search queries directly, type 'python searchRetrival <path_to_RetrievedDocs_directory>'.
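As a rough illustration of what 'python crawler.py' does, the sketch below fetches pages with requests, extracts links with BeautifulSoup (imported from the bs4 package), restricts the crawl to the uic.edu domain, and stores each page in the RetrievedDocs folder. The seed URL, page limit, and file-naming scheme are assumptions made for this example, not the project's actual settings.

```python
# Illustrative sketch of the crawling step, not the project's crawler.py.
# Assumptions (not from the project): the seed URL, the 500-page limit,
# and naming saved files doc_<n>.html inside RetrievedDocs.
import os
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

SEED = "https://www.cs.uic.edu/"   # assumed starting point within the UIC domain
OUT_DIR = "RetrievedDocs"
MAX_PAGES = 500                    # assumed crawl budget

def crawl():
    os.makedirs(OUT_DIR, exist_ok=True)
    frontier, seen, count = deque([SEED]), {SEED}, 0
    while frontier and count < MAX_PAGES:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue
        # Store the raw page so the Homework 1 preprocessor can consume it later.
        with open(os.path.join(OUT_DIR, f"doc_{count}.html"), "w", encoding="utf-8") as f:
            f.write(resp.text)
        count += 1
        # Follow only links that stay inside the uic.edu domain.
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc.endswith("uic.edu") and link not in seen:
                seen.add(link)
                frontier.append(link)

if __name__ == "__main__":
    crawl()
```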