Cosine Similarity

CSCI 49376: Big Data Technology

Authors: Liulan Zheng, Yiheng Cen Feng

Overview

This program computes the cosine similarity between a given query term and every other term, using a MapReduce-style algorithm implemented in Spark. The output is sorted by cosine similarity in descending order.

We implemented it in two ways. similarity_slow.py builds a matrix with cartesian(), which is slow and can lead to memory errors. similarity_fast.py computes cosine similarity without building a matrix, by comparing only the document ids. This approach is faster and more efficient.
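The idea behind comparing document ids can be illustrated with a minimal plain-Python sketch. It is not the code in similarity_fast.py: it assumes binary term occurrence (a term either appears in a document or not), in which case the cosine similarity of two terms reduces to the number of shared documents divided by the square root of the product of their document counts. The names `cosine_by_doc_ids`, `rank_terms`, and the `postings` data are hypothetical.

```python
from math import sqrt


def cosine_by_doc_ids(docs_a, docs_b):
    """Cosine similarity of two terms, given the sets of document ids
    in which each term occurs (assumes binary occurrence weights)."""
    if not docs_a or not docs_b:
        return 0.0
    shared = len(docs_a & docs_b)  # dot product of two 0/1 vectors
    return shared / sqrt(len(docs_a) * len(docs_b))


def rank_terms(postings, query_term):
    """Score every other term against the query term, highest first."""
    query_docs = postings[query_term]
    scores = [
        (term, cosine_by_doc_ids(query_docs, docs))
        for term, docs in postings.items()
        if term != query_term
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: term -> set of document ids containing it.
postings = {
    "gene_a_gene": {1, 2, 3, 4},
    "dis_x_dis": {1, 2, 3, 4},
    "dis_y_dis": {4},
    "gene_b_gene": {5},
}
ranked = rank_terms(postings, "gene_a_gene")
```

Because only the document-id sets are compared, no term-by-term matrix is ever materialized, which is the property the fast approach relies on.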

Requirements

Python
Apache Spark
PySpark
```
pip install pyspark
```

Run

spark-submit similarity_fast.py <filename> <query_term>

Output will be partitioned and saved in output/. Make sure you delete output/ before running the program again.
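If you prefer to script the cleanup, something like the following sketch could be run before each spark-submit. It is not part of this repository; the helper name `clear_output_dir` is hypothetical, and it assumes the output directory is the relative path output/ described above.

```python
import shutil
from pathlib import Path


def clear_output_dir(path="output"):
    """Delete the output directory if it exists.

    Returns True if a directory was removed, False if there was nothing to remove.
    """
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)  # removes the directory and all partition files in it
        return True
    return False
```

Calling `clear_output_dir()` from the project directory removes output/ and everything inside it, so use it only when the previous results are no longer needed.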

P.S.

To simplify the process, output/ contains only terms of the form dis_..._dis and gene_..._gene.
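A filter of this kind could be written as below. This is a sketch under the assumption that a term qualifies when it starts with `dis_` or `gene_` and ends with the matching `_dis` or `_gene` suffix; the actual check in the scripts may differ, and the names `TAGGED_TERM` and `is_tagged_term` are hypothetical.

```python
import re

# Prefix and suffix must match: dis_..._dis or gene_..._gene,
# with at least one character in between.
TAGGED_TERM = re.compile(r"^(dis|gene)_.+_\1$")


def is_tagged_term(term):
    """Return True for terms shaped like dis_..._dis or gene_..._gene."""
    return TAGGED_TERM.match(term) is not None


# Hypothetical sample terms, filtered down to the tagged ones.
terms = ["gene_egfr+_gene", "dis_cancer_dis", "protein", "gene_x_dis"]
kept = [t for t in terms if is_tagged_term(t)]
```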

Example Query

spark-submit similarity_fast.py project2_test.txt "gene_egfr+_gene"
