This project builds a functional web search engine that incorporates what we have learned in CIS555. To achieve this aim, we have the following goals in mind:

a) Our system is able to crawl a large corpus of web documents (4*10^5 URLs, 200GB on AWS S3).

b) Our system can process large amounts of data efficiently.

c) Our system can return accurate and meaningful results based on the user's search query.

d) Our system is robust and can be deployed on the cloud as a real product.

Details of the project's scalability can be found in our final report. In summary:

(1) We crawled 4*10^5 URLs, with 200GB of data stored on AWS S3.

(2) The indexer runs on AWS RDS MySQL with a 35-70GB invertedIndex (including tf-idf scores, summary statistics, etc.).
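
To illustrate what an inverted index with tf-idf scores holds, here is a minimal in-memory sketch in Python. It is not the project's indexer, which stores its index in MySQL. The function name `build_inverted_index`, the whitespace tokenizer, and the specific weighting (term frequency normalized by document length, times `log(N / df)`) are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: tf-idf weight}.

    docs: dict of doc_id -> text. Tokenization is a plain lowercase
    whitespace split, a simplification for illustration only.
    """
    term_counts = {
        doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()
    }

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    index = defaultdict(dict)
    for doc_id, counts in term_counts.items():
        total_terms = sum(counts.values())
        for term, count in counts.items():
            tf = count / total_terms                    # normalized term frequency
            idf = math.log(n_docs / doc_freq[term])     # assumed idf variant
            index[term][doc_id] = tf * idf
    return dict(index)


# Hypothetical two-document corpus.
docs = {
    "d1": "web search engine",
    "d2": "web crawler",
}
index = build_inverted_index(docs)
```

With this weighting, a term that appears in every document (here `web`) gets a score of zero, while a term unique to one document is weighted by how much of that document it makes up.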

Full name: Feng Xiang, Yezheng Li, Xinyu Ma, Shenqi Hu

SEAS login: fxiang, yezheng, xinyuma, hshenqi

Which features did you implement?

(list features, or write 'Entire assignment')

Entire assignment

Did you complete any extra-credit tasks? If so, which ones?

(list extra-credit tasks)

Computed PageRank using Apache Spark
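
For readers unfamiliar with the algorithm, the following is a small single-machine sketch of PageRank by power iteration. It is plain Python rather than Spark and is not the code submitted for this task. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the example link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Return {page: rank} for a link graph.

    links: dict of page -> list of pages it links to.
    """
    # Collect every page, including ones that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        new_ranks = {
            p: (1 - damping) / n + damping * dangling / n for p in pages
        }
        # Each page passes an equal share of its rank to every page it links to.
        for page, targets in links.items():
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical three-page graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

The ranks always sum to 1, and in this example `c` ends up ranked above `b` because it receives links from both `a` and `b`.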

Any special instructions for building and running your solution?

(include detailed instructions, or write 'None')

None

Did you personally write all the code you are submitting (other than code from the course web page)?

[x] Yes

[ ] No

Did you copy any code from the Internet, or from classmates?

[ ] Yes

[x] No

Did you collaborate with anyone on this assignment?

[ ] Yes

[x] No

One of the screenshots is shown below (all saved screenshots are also included in the report).
