Homework 3 - Which book would you recommend?

Task

The goal of this project was to experiment with crawling and parsing, and to gain confidence with several search-engine techniques, such as retrieving results for a conjunctive query and scoring results with cosine similarity and tf-idf. Finally, we were asked to write both a recursive and a dynamic programming algorithm for the longest increasing subsequence problem on a string.
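As an illustration of the last point, here is a minimal sketch of the two approaches to the longest increasing subsequence of a string. It is not the code from main.ipynb: the function names `lis_recursive` and `lis_dynamic` are hypothetical, and the sketch assumes that "increasing" means strictly increasing under Python's default character ordering.

```python
from typing import Optional


def lis_recursive(s: str) -> int:
    """Length of the longest strictly increasing subsequence (exponential time)."""

    def helper(i: int, prev: Optional[str]) -> int:
        # No characters left: nothing more can be added.
        if i == len(s):
            return 0
        # Option 1: skip the character at position i.
        best = helper(i + 1, prev)
        # Option 2: take it, if it keeps the subsequence strictly increasing.
        if prev is None or s[i] > prev:
            best = max(best, 1 + helper(i + 1, s[i]))
        return best

    return helper(0, None)


def lis_dynamic(s: str) -> int:
    """Same result with dynamic programming in O(n^2) time and O(n) space."""
    if not s:
        return 0
    # best[i] = length of the longest increasing subsequence ending at s[i].
    best = [1] * len(s)
    for i in range(len(s)):
        for j in range(i):
            if s[j] < s[i]:
                best[i] = max(best[i], best[j] + 1)
    return max(best)


# Both functions should always agree; the recursive one is only
# practical for short strings.
print(lis_recursive("acbd"), lis_dynamic("acbd"))
```

The recursive version explores both choices (take or skip) for every character, which is why its running time grows exponentially, while the dynamic programming version reuses the best result already computed for each earlier position.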

We also decided to produce an original logo for our search engine! It does remind me of something, not sure what though...

Pronunciation: goo·gs

Usage

The repository includes requirements.txt, a file listing the packages to be installed using conda, like so:

conda install --file requirements.txt

Once the requirements are installed, you shouldn't have any problems executing the scripts. Consider also creating a new environment, so that you don't have to worry about which packages are still needed once you're done with this project. With conda, that's easily done with the following command:

conda create --name <env> --file requirements.txt

where you have to replace <env> with the name you want to give to the new environment.

Repo structure

The repository consists of the following files and directories:

data:

This directory contains both the data retrieved right after the crawling part (parsed_books.tsv) and the data after the preprocessing and cleaning part (clean_data.csv).

images:

This directory contains images for the search engine logo and for part of the complexity proof of the recursive algorithm. You can safely ignore it.

indexes:

This directory contains the pickle objects for the vocabulary, the inverted index dictionary and the tf-idf inverted index.

book_links.txt:

A txt file containing the URLs of all the html pages.

data_collector.py:

A Python script containing the functions to download the txt file and the html pages for the books.

functions.py:

A Python script containing all the functions used in main.ipynb, apart from the data collection and parsing parts.

main.ipynb:

A Jupyter notebook which provides the solutions to all the homework questions. The notebook contains only the answers; the only code it includes is the one for exercise 5, for which the answer is the code itself.

parser.py:

A Python script containing the functions to parse the html pages and produce the tsv file.

requirements.txt:

A txt file containing the dependencies of the project; see the Usage section for details.
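
To give an idea of what the objects in indexes are used for, here is a rough sketch of an inverted index, a conjunctive query and a tf-idf cosine similarity ranking. This is not the code in functions.py: all function names are hypothetical, the toy documents are made up, tokenization is a plain whitespace split rather than the real preprocessing, and the pickle round trip is done in memory instead of on the actual files.

```python
import math
import pickle
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def conjunctive_query(index, query):
    """Return the ids of the documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector per document, plus the idf table."""
    counts = {d: Counter(t.lower().split()) for d, t in docs.items()}
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts.values() for term in c)
    idf = {term: math.log(len(docs) / df[term]) for term in df}
    vectors = {
        d: {term: (n / sum(c.values())) * idf[term] for term, n in c.items()}
        for d, c in counts.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(docs, query):
    """Conjunctive query, then rank the matches by cosine similarity."""
    index = build_inverted_index(docs)
    vectors, idf = tfidf_vectors(docs)
    q_counts = Counter(query.lower().split())
    q_vec = {term: n * idf.get(term, 0.0) for term, n in q_counts.items()}
    hits = conjunctive_query(index, query)
    return sorted(((cosine(q_vec, vectors[d]), d) for d in hits), reverse=True)


# Toy documents, invented for this example only.
docs = {
    0: "the hobbit dragon gold",
    1: "dragon rider saga",
    2: "gold rush history",
}
index = build_inverted_index(docs)

# An index can be serialized with pickle and restored unchanged.
restored = pickle.loads(pickle.dumps(index))

print(conjunctive_query(restored, "dragon gold"))
print(search(docs, "dragon"))
```

Keeping the postings as sets makes the conjunctive query a simple intersection, at the cost of losing term positions and counts; a tf-idf inverted index would store a weight next to each document id instead.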
