One week in late 2013, a code-deduplication hackathon took place in the genome codebase at the McDonnell Genome Institute. Rather than participate directly, some colleagues and I decided to apply well-known bioinformatics techniques for detecting duplicate sequences to our codebase. (Note: my email address has since changed, so I don't appear in the list of contributors; my contributions are listed under my name in the commit history.)
This is the result of our efforts.
Super-maximal repeats are detected as described here. The Pygments Python library is used as a front end to tokenize and pre-process source code in various languages, so repeats are found in the token stream rather than in the raw text.
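To make the idea concrete, here is a minimal sketch of super-maximal repeat detection over a token sequence. It is not the implementation used in this repository: it is a brute-force, quadratic version meant only to illustrate the definition (a sequence that occurs at least twice and cannot be extended by one token on either side and still occur twice). The names `tokenize`, `supermaximal_repeats`, and `min_length` are hypothetical, and the sample tokens are written by hand instead of being produced by Pygments.

```python
from collections import Counter


def tokenize(source, language):
    """Hypothetical helper: lex `source` with Pygments, dropping blanks and comments."""
    # Imported lazily so the repeat finder below runs without Pygments installed.
    from pygments.lexers import get_lexer_by_name
    from pygments.token import Comment

    lexer = get_lexer_by_name(language)
    return [
        value
        for token_type, value in lexer.get_tokens(source)
        if value.strip() and token_type not in Comment
    ]


def supermaximal_repeats(tokens, min_length=2):
    """Brute-force sketch: return super-maximal repeats as tuples of tokens."""
    tokens = tuple(tokens)
    n = len(tokens)
    # Count every contiguous subsequence (overlapping occurrences included).
    counts = Counter(tokens[i:j] for i in range(n) for j in range(i + 1, n + 1))
    repeated = {seq for seq, count in counts.items() if count >= 2}
    # A repeat is extendable if adding one token on the left or right
    # still yields a repeat, so strip one token from each repeat to find them.
    extendable = set()
    for seq in repeated:
        extendable.add(seq[1:])   # seq extends seq[1:] on the left
        extendable.add(seq[:-1])  # seq extends seq[:-1] on the right
    result = [
        seq for seq in repeated
        if seq not in extendable and len(seq) >= min_length
    ]
    # Longest repeats first, ties broken alphabetically for stable output.
    return sorted(result, key=lambda seq: (-len(seq), seq))


# Hand-written token stream standing in for lexer output (an assumption).
tokens = "x = a + b ; y = a + b ; z = 1".split()
repeats = supermaximal_repeats(tokens)
print(repeats)  # [('=', 'a', '+', 'b', ';')]
```

A real tool would replace the quadratic counting step with a suffix tree or suffix array, which is what makes the approach practical on a large codebase.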
The obvious next step is to support the Clang front end for building ASTs and to deduplicate based on those ASTs instead of on flat token streams.
Dependencies:
- pygments
- yaml (PyYAML)
On Debian-based systems:
sudo apt-get install python-pygments python-yaml