Wikipedia Search Engine

About

A Search Engine that is designed based on Block sort based indexing. It uses TF-IDF statistics to calculate the relevance of a page and ranks them according to this score for query results.

Usage

Indexer

The indexer.py script takes in wikipedia dump file path as an argument.
Format : python index.py <wikipedia-xml-dump-path>
Query

Run the script query.py, it will ask for your query, enter your query, and top ten most relevant results with their title and url will be printed for the query. A prompt will then ask you whether you would like to query again, respond accordingly.
Make sure that query.py is in the same folder as data\ folder which was created by indexer.py during the index generation process.

There is also a feature using which you can limit your queries to only look at titles, categories, or body of a page using tags -t, -c, and -b respectively.
To use tags simply put the corresponding tag before your query, e.g. Enter your query: -t <your-query>.

Query Example:

Work flow of Scripts

Indexer
- The indexer script, when run creates a folder named data/ in the same directory as itself, having subfolders indexes/, posting_lists/ and temp/ (Which will be deleted later).
- It also creates two files in data/ named titles.txt (contains titles of all pages) and titles_location.pkl (Dump of a data structure containing locations of titles in title.txt). The indexer uses block sort based indexing and therefore create a lot of small indexes and posting files for each tag while parsing the wikipedia corpus. These files are stored in temp\ folder.
- The indexer parses the corpus and extracts information about the title, catagories and body(text) of every page, it then cleans the text, and stems it using PorterStemmer from nltk pyhton library.
- The text is then tokenized and stopwords removed, the words are then indexed in a dictionary and its corresponding posting list is also created (Different for each tag).
- When the indexer has completed parsing the wikipedia corpus then it moves on to combining them and creating a single file for index and posting lists for each tag.
- When all files are merged, these are placed in indexes and posting lists with their respective tag names. Rather than the count of words, there TF-IDF scores are stored in posting lists for every word in every document this time (again different files for each tag).
- At last the script deletes the temp folder and all files in it as they are no longer needed.
Query
- When a user enters his/her query, the script first checks whether it contains a tag or not (e.g. -t), if it does then it strips out the tag and processes according to the tag.
- Depending upon the tags, the script looks for query words in indexes of all tags (if no tag was specified in the query) or of a specific tag (if tag was specified) and for all candidate pages it adds up TF-IDF scores multipled by approopriate weights according to the tags.
- It then ranks those documents according to their scores and prints top ten most relevant resutls with their titles and urls.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
src		src
README.md		README.md
indexer.py		indexer.py
query.py		query.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Wikipedia Search Engine

About

Usage

Indexer

Query

Query Example:

Work flow of Scripts

Indexer

Query

About

Releases

Packages

Languages

Bufftowel/Wikipedia-Search-Engine

Folders and files

Latest commit

History

Repository files navigation

Wikipedia Search Engine

About

Usage

Indexer

Query

Query Example:

Work flow of Scripts

Indexer

Query

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages