plagiarism_detection_pan2015

Plagiarism Detection Approach for PAN 2015 Text Alignment task This system is the implementation as detailed in [1] and [2] for the Text Alignment task at PAN 2015

REQUIREMENTS

To use the algorithm you need to install the following python modules:

PyStemmer 1.3.0 -> https://pypi.python.org/pypi/PyStemmer
nltk -> http://www.nltk.org/

USAGE

python PAN2015 <pairs> <source document folder> <suspicious document folder> <output folder>

Example:

python PAN2015_JCR.py E:/text-alignment/pan13-text-alignment-training-dataset-2013-01-21/pairs E:/text-alignment/pan13-text-alignment-training-dataset-2013-01-21/src E:/text-alignment/pan13-text-alignment-training-dataset-2013-01-21/susp C:/Users/sanchezperez15/Results

INPUT

It is a file containing the pairs of documents to be compare
Folder with all the source documents mentioned in
Folder with all the suspicios documents mentioned in
Folder were the resulting xml files will be store

OUTPUT

The results are store in the as a xml file with the following format as required at PAN 2015 [3]:

<document reference="suspicious-documentXYZ.txt">
<feature
  name="detected-plagiarism"
  this_offset="5"
  this_length="1000"
  source_reference="source-documentABC.txt"
  source_offset="100"
  source_length="1000"
/>
<feature ... />
...
</document>

For example, the above file would specify an aligned passage of text between suspicious-documentXYZ.txt and source-documentABC.txt, and that it is of length 1000 characters, starting at character offset 5 in the suspicious document and at character offset 100 in the source document.

NOTE

In the main method the following lines allow comparing 2 documents:

sgsplag_obj = SGSPLAG(read_document(<path_to_suspicious_document>), read_document(<path_to_source_document>), parameters)
type_plag, summary_flag = sgsplag_obj.process()

where the results are stored in <sgsplag_obj.detections>.

We state this note in order to facilitate the reusing of this method outside the PAN requirements

REFERENCES

[1] Sanchez-Perez, M.A., Gelbukh, A., Sidorov, G.: Adaptive algorithm for plagiarism detection: The best-performing approach at PAN 2014 text alignment competition. In: Mothe, J., Savoy, J., Kamps, J., Pinel-Sauvagnat, K., Jones, G.J.F., SanJuan, E., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multi-modality, and Interaction - 6th International Conference of the CLEF Association, CLEF 2015, Toulouse, France, September 8-11, 2015, Proceedings. Lecture Notes in Computer Science, vol. 9283, pp. 402{413. Springer (2015)

[2] Sanchez-Perez, M.A., Gelbukh, A.F., Sidorov, G.: Dynamically adjustable approach through obfuscation type recognition. In: Cappellato, L., Ferro, N., Jones, G.J.F., SanJuan, E. (eds.) Working Notes of CLEF 2015 - Conference and Labs of the Evaluation forum, Toulouse, France, September 8-11, 2015. CEUR Workshop Proceedings, vol. 1391. CEUR-WS.org (2015), http://ceur-ws.org/Vol-1391/92-CR.pdf

[3] http://pan.webis.de/clef15/pan15-web/plagiarism-detection.html

*** For more questions do not hesitate to contact us! ***

Miguel Ángel Sánchez Pérez <miguel.sanchez.nan(?)gmail.com>
Alexander Gelbukh <gelbukh(?)gelbukh.com>
Grigori Sidorov <sidorov(?)cic.ipn.mx>

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
LICENSE		LICENSE
PAN2015.py		PAN2015.py
README.md		README.md
settings.xml		settings.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

plagiarism_detection_pan2015

About

Releases

Packages

Languages

License

CubasMike/plagiarism_detection_pan2015

Folders and files

Latest commit

History

Repository files navigation

plagiarism_detection_pan2015

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages