This is the official release of the SCANL ensemble part-of-speech tagger.
Please clone recursively, since the repository currently uses submodules. This may change in the future.
git clone --recursive https://github.com/SCANL/ensemble_tagger.git
You will need python3 installed. We will explicitly use the python3 command below but, of course, if your environment is configured to use python3 by default, you do not need to. We have also only tested this on Ubuntu 18. It most likely works on all recent versions of Ubuntu, but we cannot guarantee it will work in other environments.
If you are on Windows, the tagger has been confirmed to work on Ubuntu via WSL.
You will also need to install JDK 15 or later. This page can help - https://www.linuxuprising.com/2020/09/how-to-install-oracle-java-15-on-ubuntu.html
You'll need to install pip3:
sudo apt-get install python3-pip
The tagger is split into two parts:
- A C++ script that SAX-parses a srcML archive and sends identifier names, along with other static information, to the Python script via RESTful HTTP.
- A Python Flask server that listens on a port for input on a specific route. Once it receives input in the form of an identifier, it runs three external taggers: SWUM, POSSE, and the Stanford tagger. It then feeds the output of these taggers into a machine learning algorithm, which decides the final POS tag for each word.
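To make the data flow of the second part concrete, here is a minimal sketch of the ensemble idea: several taggers each propose a tag per word, and a decision step picks the final one. It is not the tagger's actual implementation. The tagger callables, the tag strings, and the function names are hypothetical, and a simple majority vote stands in for the trained machine learning model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-in type for an external tagger: it maps a list of words
# to one tag per word. The real SWUM, POSSE, and Stanford integrations are not
# shown here, and their output formats may differ.
Tagger = Callable[[Sequence[str]], List[str]]


def majority_vote(votes: Sequence[str]) -> str:
    """Return the most frequent tag; ties go to the earliest tagger's vote.

    Placeholder for the trained model that the real tagger uses.
    """
    return max(votes, key=votes.count)


def ensemble_tag(
    words: Sequence[str],
    taggers: Dict[str, Tagger],
    decide: Callable[[Sequence[str]], str] = majority_vote,
) -> List[Tuple[str, str]]:
    """Run every tagger on the words and combine their per-word outputs."""
    per_tagger = [tagger(words) for tagger in taggers.values()]
    result = []
    for i, word in enumerate(words):
        votes = [tags[i] for tags in per_tagger]  # one vote per tagger
        result.append((word, decide(votes)))
    return result


if __name__ == "__main__":
    # Toy taggers with hard-coded answers and arbitrary placeholder tags.
    toy_taggers = {
        "tagger_a": lambda ws: ["VERB", "NOUN"],
        "tagger_b": lambda ws: ["VERB", "ADJ"],
        "tagger_c": lambda ws: ["NOUN", "NOUN"],
    }
    print(ensemble_tag(["get", "number"], toy_taggers))
```

The `decide` parameter is where a trained classifier would plug in; the vote is only there so the sketch runs on its own.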
Before compiling the C++ script, you need to install:
- CMake (tested on 3.5):
sudo apt install cmake
- libxml2-dev:
sudo apt install libxml2-dev
To compile the C++ script, do the following in the root (i.e., ensemble_tagger) directory:
mkdir build
cd build
cmake ..
make -j3
(The -j3 flag lets make run up to three jobs in parallel, which speeds up the build.)
Once it is compiled, you should have an executable in the build/bin folder.
Before running the python server, you need to install required modules. To download all of the required modules, use:
sudo pip3 install -r requirements.txt
Configure PYTHONPATH as well:
export PYTHONPATH=~/path/to/ensemble_tagger/ensemble_tagger_implementation
You will also need to configure POSSE (one of the taggers). Do the following:
- Install wordnet-dev
- Open POSSE/Scripts/getWordNetType.sh
- You MAY need to modify this line, which is at the top of the file:
/usr/bin/wn $1 | grep "Information available for (noun|verb|adj|adv) $1" | cut -d " " -f4
by changing the path to WordNet (/usr/bin/wn) to the path on your own system. However, /usr/bin is the typical installation directory, so it is unlikely that you will need to do this step.
- Set your PERL5LIB path to point to the Scripts folder in POSSE's directory:
export PERL5LIB=~/path/to/ensemble_tagger/POSSE/Scripts
Finally, you need to install Spiral, which we use for identifier splitting:
sudo pip3 install git+https://github.com/casics/spiral.git
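Because the setup has several independent pieces, a quick preflight check can help spot a missing one before starting the server. The following is a rough sketch, not part of the repository. The function name and the lists of commands and variables are assumptions drawn from the steps above. It only checks that commands are on PATH and that the two environment variables point at existing directories; it does not verify the JDK version, libxml2-dev, or the installed Python modules.

```python
import os
import shutil
from typing import Callable, List, Mapping, Optional

# Commands and variables mentioned in the setup steps (assumed list).
REQUIRED_COMMANDS = ("python3", "pip3", "java", "cmake", "wn")
REQUIRED_ENV_VARS = ("PYTHONPATH", "PERL5LIB")


def find_setup_problems(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return human-readable descriptions of missing setup pieces."""
    problems = []
    for command in REQUIRED_COMMANDS:
        if which(command) is None:
            problems.append(f"command not found on PATH: {command}")
    for name in REQUIRED_ENV_VARS:
        value = env.get(name, "")
        if not value:
            problems.append(f"environment variable not set: {name}")
            continue
        # Both variables may hold several entries separated by os.pathsep.
        for entry in value.split(os.pathsep):
            if entry and not os.path.isdir(os.path.expanduser(entry)):
                problems.append(f"{name} entry is not a directory: {entry}")
    return problems


if __name__ == "__main__":
    found = find_setup_problems()
    for problem in found:
        print(problem)
    if not found:
        print("No problems detected by this check.")
```

If `wn` is installed somewhere that is not on PATH, this check reports it as missing even though editing getWordNetType.sh as described above would still make POSSE work.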
Once it is all installed, you should be able to run the server:
cd ensemble_tagger_implementation
python3 routes.py [MODEL]
where MODEL is one of the following. DTCP is the default if you do not specify a model:
- DTCP
- RFCP
- DTCA
- RFCA
- DTNP
- RFNP
- DTNA
- RFNA
This will start the server, which will listen for identifier names sent via HTTP over the route:
http://127.0.0.1:5000/{identifier_type}/{identifier_name}/{code_context}
Where "code context" is one of:
- FUNCTION
- ATTRIBUTE
- CLASS
- DECLARATION
- PARAMETER
For example:
Tag a declaration: http://127.0.0.1:5000/int/numberArray/DECLARATION
Tag a function: http://127.0.0.1:5000/int/GetNumberArray(int* begin, int* end)/FUNCTION
Tag a class: http://127.0.0.1:5000/class/PersonRecord/CLASS
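Function signatures such as the one above contain spaces, parentheses, and asterisks, so a client that builds these URLs programmatically will usually want to percent-encode each path segment. Here is a minimal sketch using only the Python standard library. The function names are hypothetical, and it assumes the server decodes percent-encoded path segments, which is the usual behavior for Flask routes. The response format is not described here, so the sketch returns the raw response body.

```python
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:5000"
CODE_CONTEXTS = {"FUNCTION", "ATTRIBUTE", "CLASS", "DECLARATION", "PARAMETER"}


def build_tag_url(identifier_type, identifier_name, code_context, base_url=BASE_URL):
    """Build the request URL, percent-encoding the type and name segments."""
    if code_context not in CODE_CONTEXTS:
        raise ValueError(f"unknown code context: {code_context}")
    encoded_type = quote(identifier_type, safe="")
    encoded_name = quote(identifier_name, safe="")
    return f"{base_url}/{encoded_type}/{encoded_name}/{code_context}"


def request_tags(identifier_type, identifier_name, code_context, timeout=30):
    """Send one identifier to a running server and return the raw response text."""
    url = build_tag_url(identifier_type, identifier_name, code_context)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Printing the URLs needs no server; request_tags() does.
    print(build_tag_url("int", "numberArray", "DECLARATION"))
    print(build_tag_url("int", "GetNumberArray(int* begin, int* end)", "FUNCTION"))
```

With the server running, `request_tags("class", "PersonRecord", "CLASS")` would send the third example above.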
You should run the tests to validate that everything is set up at this point.
Make sure you're in the ensemble_tagger_implementation directory, then run:
python3 -m unittest
If the tests do not pass, something above is misconfigured. Re-read the instructions carefully. If you can't figure out what's wrong, open an issue.
You can use HTTP to interact with the server and get part-of-speech annotations. This is where the C++ script comes in. You can run this script using the following command, assuming you're in the build folder:
./bin/grabidentifiers {srcML file name}
This runs the program, which automatically queries the route above for every identifier in the srcML file. Make sure the server is running before you run the C++ script; otherwise, it won't be able to communicate with the server.
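If you script this step, it can help to confirm that something is listening on the server's port before launching the C++ program. The sketch below is one possible way to do that and is not part of the repository. The function names are hypothetical, and the default host, port, and executable path are taken from the examples above. A successful TCP connection only shows that some process is listening on the port, not that it is the tagger.

```python
import socket
import subprocess
import time


def server_is_listening(host="127.0.0.1", port=5000, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_server(host="127.0.0.1", port=5000, attempts=10, delay=1.0):
    """Poll the port up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if server_is_listening(host, port):
            return True
        time.sleep(delay)
    return False


def run_grabidentifiers(srcml_file, executable="./bin/grabidentifiers"):
    """Run the C++ program on one srcML file once the server is reachable.

    Assumes the current working directory is the build folder.
    """
    if not wait_for_server():
        raise RuntimeError("nothing is listening on 127.0.0.1:5000")
    return subprocess.run([executable, srcml_file], check=True)
```

Calling `run_grabidentifiers` with the path to a srcML archive then behaves like the command above, but fails early with a clear error if the server was never started.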
If you are unfamiliar with srcML, check it out. Since the actual tagger is a web server, you don't have to use srcML. You could always use other AST-based code representations, or any other method of obtaining identifier information. If you decide not to use srcML, you should ignore the C++ script.
Please open an issue if you run into errors.
- Christian D. Newman, Michael J. Decker, Reem S. AlSuhaibani, Anthony Peruma, Satyajit Mohapatra, Tejal Vishnoi, Marcos Zampieri, Mohamed W. Mkaouer, Timothy J. Sheldon, and Emily Hill, "An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags," in IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2021.3098242.
The data used to train this tagger can be found here: https://github.com/SCANL/datasets/tree/master/ensemble_tagger_training_data
Find our other research at our webpage, and check out the Identifier Name Structure Catalogue.