This application is a naive way to find code duplicated from StackOverflow.
These instructions will get a copy of the project up and running on your local machine.
Before running the project you will need to install the StackAPI and BeautifulSoup 4 libraries, along with Redis, RQ, and python-dotenv:
> pip install stackapi beautifulsoup4 redis rq python-dotenv
With Python 3:
> pip3 install stackapi beautifulsoup4 redis rq python-dotenv
Download Redis for Windows from ServiceStack.
Configure redis.windows.conf to use a password, then run Redis with the following command:
> redis-server.exe redis.windows.conf
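Setting the password is a single directive in redis.windows.conf ('yourpassword' below is a placeholder):

```
requirepass yourpassword
```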
In the project, rename '.env-example' to '.env'.
You shouldn't have to change REDIS_HOST or REDIS_PORT if running locally. Set REDIS_PASSWORD to whatever password you chose when configuring redis.windows.conf.
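For reference, a minimal '.env' might look like this (the values are placeholders; REDIS_PASSWORD must match your Redis configuration):

```
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=yourpassword
```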
To run a worker:
> rq worker -u redis://:<password>@localhost:6379
To run the pipeline you will need to open four workers, one listening on each of the following queues: sanitizer, filter, tokenizer, keywordextractor.
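For example, assuming the local Redis setup above, the four workers can be started in separate terminals:

> rq worker sanitizer -u redis://:<password>@localhost:6379
> rq worker filter -u redis://:<password>@localhost:6379
> rq worker tokenizer -u redis://:<password>@localhost:6379
> rq worker keywordextractor -u redis://:<password>@localhost:6379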
The program feeds jobs into these queues one stage after another.
TODO: Create a shell/bash script to open the workers automatically.
To open the RQ dashboard:
> rq-dashboard -H localhost -p 3000 --redis-password <password>
Once the dashboard is running you can navigate to localhost:3000 to access it.
We use StackAPI to query for URLs containing Python code snippets and then use standard web-scraping techniques to gather the "<code>" blocks to parse.
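As a rough sketch of that gathering step (the names and the exact query are illustrative, not the project's actual code, and it assumes the requests library is installed):

```python
import requests
from bs4 import BeautifulSoup
from stackapi import StackAPI

SITE = StackAPI('stackoverflow')
# Fetch questions tagged 'python'; each item carries a 'link' URL.
questions = SITE.fetch('questions', tagged='python')

snippets = []
for item in questions['items']:
    page = requests.get(item['link'])
    soup = BeautifulSoup(page.text, 'html.parser')
    # Every <code> block on the page is a candidate snippet for the pipeline.
    snippets.extend(block.get_text() for block in soup.find_all('code'))
```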
The list of code snippets is then added to a Redis queue, processed through our pipeline, and inserted into a locality-sensitive hash forest for similarity queries.
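Enqueuing into the first stage might look like this (the job's function path is hypothetical; the real names live in the project code):

```python
from redis import Redis
from rq import Queue

conn = Redis(host='localhost', port=6379, password='yourpassword')
sanitizer_queue = Queue('sanitizer', connection=conn)

for snippet in snippets:
    # Workers listening on the 'sanitizer' queue pick these jobs up.
    sanitizer_queue.enqueue('pipeline.sanitizer.sanitize', snippet)
```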
The pipeline is organized as follows:
The proxies gather URLs through StackAPI in a load-balanced fashion so that we do not overload StackOverflow's systems or exhaust our query quota.
The URLs are sent to the Scraper, which produces the page information for the Processor.
The Processor separates the plaintext from the code snippets and sends the code snippets down the pipeline.
The main components of the pipeline are as follows:
- Processor
- Sanitizer
- Filter
- Tokenizer
- Keyword Extractor
The Sanitizer cleans the gathered code snippets by making sure that indentation follows Python's rules (web formatting sometimes mangles indentation). It also strips the snippets of obvious plaintext and comments by parsing each line into an AST (comments need no parsing, since they are easily denoted by pound signs). As a consequence, code sent to the Filter is most likely Python, although this per-line check is a naive way of stripping out non-code text.
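A minimal sketch of that per-line AST check (ours, not the project's exact code) shows both the idea and why it is naive:

```python
import ast

def looks_like_code(line: str) -> bool:
    """Keep a line only if it parses as Python on its own."""
    stripped = line.strip()
    if stripped.startswith('#'):
        return False  # comments are denoted by pound signs, no parsing needed
    try:
        ast.parse(stripped)
        return True
    except SyntaxError:
        # Naive: valid lines such as 'return x' or 'else:' fail in
        # isolation, so real code can be dropped along with plaintext.
        return False
```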
The Filter makes sure that any code snippet handed to it is valid Python 3. Future work would be to either translate Python 2 into Python 3 or validate code with a language-detection library (such as Google's) instead of Python ASTs, which can only compile code written for the same version the script itself runs on.
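A sketch of that validity check using the running interpreter's grammar, which is exactly the limitation noted above:

```python
import ast

def is_valid_python3(snippet: str) -> bool:
    """Accept a snippet only if it compiles under the current (Python 3) grammar."""
    try:
        ast.parse(snippet)
        return True
    except (SyntaxError, ValueError):
        # ValueError covers oddities such as null bytes in scraped text.
        return False
```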
The Tokenizer creates a list of tokens, in order of appearance, from the code snippet so that similarity detection can be performed.
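The standard-library tokenizer illustrates the idea (the project's tokenizer may differ in detail):

```python
import io
import tokenize

def tokens_of(snippet: str) -> list:
    """Return the snippet's tokens in order of appearance."""
    tokens = []
    try:
        for tok in tokenize.generate_tokens(io.StringIO(snippet).readline):
            # Skip pure layout tokens; keep names, keywords, operators, literals.
            if tok.type not in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                                tokenize.DEDENT, tokenize.ENDMARKER):
                tokens.append(tok.string)
    except (tokenize.TokenError, IndentationError):
        pass  # scraped snippets are often cut off mid-statement
    return tokens
```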
The Keyword Extractor component takes the list of tokens and extracts standardized keywords from it, so every code snippet at this point becomes an array of keyword strings. From here the data may be serialized directly, or put into the hash forest and serialized there.
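One way to picture this stage end to end (entirely a sketch: the normalization rule and the use of the datasketch library are our assumptions, not the project's confirmed implementation):

```python
import builtins
import keyword
from datasketch import MinHash, MinHashLSHForest

def extract_keywords(tokens: list) -> list:
    """Collapse user-chosen identifiers so snippets that differ only in
    variable names produce the same keyword array."""
    normalized = []
    for tok in tokens:
        if keyword.iskeyword(tok) or hasattr(builtins, tok):
            normalized.append(tok)      # keep language keywords and builtins
        elif tok.isidentifier():
            normalized.append('IDENT')  # placeholder for user-chosen names
        else:
            normalized.append(tok)      # operators, literals, punctuation
    return normalized

def minhash_of(keywords: list) -> MinHash:
    m = MinHash(num_perm=128)
    for kw in keywords:
        m.update(kw.encode('utf8'))
    return m

# Index a keyword array, then query the forest for near-duplicates.
forest = MinHashLSHForest(num_perm=128)
indexed = ['for', 'count', 'in', 'range', '(', '10', ')', ':']
forest.add('snippet-1', minhash_of(extract_keywords(indexed)))
forest.index()
query = ['for', 'i', 'in', 'range', '(', '10', ')', ':']
print(forest.query(minhash_of(extract_keywords(query)), 3))  # ['snippet-1']
```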
- StackApi - Python library for connecting to the StackOverflow API
- BeautifulSoup4 - Python library for extracting data from HTML and XML
- Redis - In-memory key-value database
- RQ - Python library for queueing jobs and processing them in the background