This application is a naive way to find code duplicated from StackOverflow.
These instructions will get a copy of the project up and running on your local machine.
Before running the project you will need to install the StackAPI and BeautifulSoup 4 libraries, along with Redis, RQ, and python-dotenv:
> pip install stackapi beautifulsoup4 redis rq python-dotenv
With Python 3:
> pip3 install stackapi beautifulsoup4 redis rq python-dotenv
Download Redis for Windows from ServiceStack.
Configure redis.windows.conf to use a password, then run Redis with the following command:
> redis-server.exe redis.windows.conf
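Setting the password is a single directive in redis.windows.conf ('yourpassword' below is a placeholder):

```
requirepass yourpassword
```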
In the project, rename '.env-example' to '.env'.
You shouldn't have to change REDIS_HOST or REDIS_PORT if running locally. Set REDIS_PASSWORD to whatever password you chose when configuring redis.windows.conf.
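For reference, a minimal '.env' might look like this (the values are placeholders; REDIS_PASSWORD must match your Redis configuration):

```
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=yourpassword
```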
To run a worker:
> rq worker -u redis://:<password>@localhost:6379
To run the pipeline you will need to open four workers, one listening on each of the following queues: sanitizer, filter, tokenizer, keywordextractor.
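For example, assuming the local Redis setup above, the four workers can be started in separate terminals:

> rq worker sanitizer -u redis://:<password>@localhost:6379
> rq worker filter -u redis://:<password>@localhost:6379
> rq worker tokenizer -u redis://:<password>@localhost:6379
> rq worker keywordextractor -u redis://:<password>@localhost:6379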
The program feeds jobs into these queues one stage after another.
TODO: Create a shell/bash script to open the workers automatically.
To open the RQ dashboard:
> rq-dashboard -H localhost -p 3000 --redis-password <password>
Once the dashboard is running you can navigate to localhost:3000 to access it.
We use StackAPI to query for URLs containing Python code snippets and then use standard web-scraping techniques to gather the "<code>" blocks to parse.
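As a rough sketch of that gathering step (the names and the exact query are illustrative, not the project's actual code, and it assumes the requests library is installed):

```python
import requests
from bs4 import BeautifulSoup
from stackapi import StackAPI

SITE = StackAPI('stackoverflow')
# Fetch questions tagged 'python'; each item carries a 'link' URL.
questions = SITE.fetch('questions', tagged='python')

snippets = []
for item in questions['items']:
    page = requests.get(item['link'])
    soup = BeautifulSoup(page.text, 'html.parser')
    # Every <code> block on the page is a candidate snippet for the pipeline.
    snippets.extend(block.get_text() for block in soup.find_all('code'))
```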
The list of code snippets is then added to a Redis queue, processed through our pipeline, and inserted into a locality-sensitive hash forest for similarity queries.
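Enqueuing into the first stage might look like this (the job's function path is hypothetical; the real names live in the project code):

```python
from redis import Redis
from rq import Queue

conn = Redis(host='localhost', port=6379, password='yourpassword')
sanitizer_queue = Queue('sanitizer', connection=conn)

for snippet in snippets:
    # Workers listening on the 'sanitizer' queue pick these jobs up.
    sanitizer_queue.enqueue('pipeline.sanitizer.sanitize', snippet)
```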
The pipeline is organized as follows:
The proxies gather URLs through StackAPI in a load-balanced fashion so that we do not overload StackOverflow's systems or exhaust our query quota.
The URLs are sent to the Scraper, which produces the page information for the Processor.
The Processor separates the plaintext from the code snippets and sends the code snippets down the pipeline.
The main components of the pipeline are as follows:
- Processor
- Sanitizer
- Filter
- Tokenizer
- Keyword Extractor
The Sanitizer cleans the gathered code snippets by making sure that indentation follows Python's rules (web formatting sometimes mangles indentation). It also strips the snippets of obvious plaintext and comments by parsing each line into an AST (comments need no parsing, since they are easily denoted by pound signs). As a consequence, code sent to the Filter is most likely Python, although this per-line check is a naive way of stripping out non-code text.
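A minimal sketch of that per-line AST check (ours, not the project's exact code) shows both the idea and why it is naive:

```python
import ast

def looks_like_code(line: str) -> bool:
    """Keep a line only if it parses as Python on its own."""
    stripped = line.strip()
    if stripped.startswith('#'):
        return False  # comments are denoted by pound signs, no parsing needed
    try:
        ast.parse(stripped)
        return True
    except SyntaxError:
        # Naive: valid lines such as 'return x' or 'else:' fail in
        # isolation, so real code can be dropped along with plaintext.
        return False
```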
The Filter makes sure that any code snippet handed to it is valid Python 3. Future work would be to either translate Python 2 into Python 3 or validate code with a language-detection library (such as Google's) instead of Python ASTs, which can only compile code written for the same version the script itself runs on.
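A sketch of that validity check using the running interpreter's grammar, which is exactly the limitation noted above:

```python
import ast

def is_valid_python3(snippet: str) -> bool:
    """Accept a snippet only if it compiles under the current (Python 3) grammar."""
    try:
        ast.parse(snippet)
        return True
    except (SyntaxError, ValueError):
        # ValueError covers oddities such as null bytes in scraped text.
        return False
```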
The Tokenizer creates a list of tokens, in order of appearance, from the code snippet so that similarity detection can be performed.
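The standard-library tokenizer illustrates the idea (the project's tokenizer may differ in detail):

```python
import io
import tokenize

def tokens_of(snippet: str) -> list:
    """Return the snippet's tokens in order of appearance."""
    tokens = []
    try:
        for tok in tokenize.generate_tokens(io.StringIO(snippet).readline):
            # Skip pure layout tokens; keep names, keywords, operators, literals.
            if tok.type not in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                                tokenize.DEDENT, tokenize.ENDMARKER):
                tokens.append(tok.string)
    except (tokenize.TokenError, IndentationError):
        pass  # scraped snippets are often cut off mid-statement
    return tokens
```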
The Keyword Extractor component takes the list of tokens and extracts standardized keywords from it, so every code snippet at this point becomes an array of keyword strings. From here the data may be serialized directly, or put into the hash forest and serialized there.
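One way to picture this stage end to end (entirely a sketch: the normalization rule and the use of the datasketch library are our assumptions, not the project's confirmed implementation):

```python
import builtins
import keyword
from datasketch import MinHash, MinHashLSHForest

def extract_keywords(tokens: list) -> list:
    """Collapse user-chosen identifiers so snippets that differ only in
    variable names produce the same keyword array."""
    normalized = []
    for tok in tokens:
        if keyword.iskeyword(tok) or hasattr(builtins, tok):
            normalized.append(tok)      # keep language keywords and builtins
        elif tok.isidentifier():
            normalized.append('IDENT')  # placeholder for user-chosen names
        else:
            normalized.append(tok)      # operators, literals, punctuation
    return normalized

def minhash_of(keywords: list) -> MinHash:
    m = MinHash(num_perm=128)
    for kw in keywords:
        m.update(kw.encode('utf8'))
    return m

# Index a keyword array, then query the forest for near-duplicates.
forest = MinHashLSHForest(num_perm=128)
indexed = ['for', 'count', 'in', 'range', '(', '10', ')', ':']
forest.add('snippet-1', minhash_of(extract_keywords(indexed)))
forest.index()
query = ['for', 'i', 'in', 'range', '(', '10', ')', ':']
print(forest.query(minhash_of(extract_keywords(query)), 3))  # ['snippet-1']
```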
- StackApi - Python library for connecting to the StackOverflow API
- BeautifulSoup4 - Python library for extracting data from HTML and XML
- Redis - In-memory key-value database
- RQ - Python library for queueing jobs and processing them in the background