Thanks for your interest in contributing to pySBD 🎉. The project is maintained by @nipunsadvilkar, and I'll do my best to help you get started. This page gives you a quick overview of how things are organised and, most importantly, how to get involved.
- Issues and bug reports
a. Submitting issues
b. Issue labels
- Contributing to the code base
a. Getting started
b. Add a new rule to existing Golden Rules Set (GRS)
c. Add new language support
d. Add tests
e. Fix bugs
First, do a quick search to see if the issue has already been reported or is already open. If so, it's often better to leave a comment on the existing issue rather than create a new one. Old issues also often include helpful tips and solutions to common problems.
Please understand that the author won't be able to provide individual support via email. The author also believes that help is much more valuable when it's shared publicly, so that more people can benefit from it.
When opening an issue, use an appropriate and descriptive title and include your environment (operating system, Python version, pySBD version). Choose the report type from here; if no type fits, open a blank issue. The issue template helps you remember the most important details to include. If you've discovered a bug, you can also submit a regression test straight away: when you open the issue to report the bug, simply refer to your pull request in the issue body. A few more tips:
- Describing your issue: Try to provide as many details as possible. What exactly goes wrong? How is it failing? Is there an error? "XY doesn't work" usually isn't that helpful for tracking down problems. Always include the code you ran and, if possible, extract only the relevant parts rather than dumping your entire script. Also state the expected output for the given input. This makes it easier for contributors to reproduce the error.
- Getting info about your pySBD installation and environment: You can use the command line to print the installed pySBD version and copy-paste it, along with your Python version, into GitHub issues:
  pip freeze | grep pysbd
- Sharing long blocks of code/logs/tracebacks: If you need to include long code, logs or tracebacks, you can wrap them in <details> and </details>. This collapses the content so it only becomes visible on click, making the issue easier to read and follow.
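If you prefer to gather the environment details from Python rather than the shell, the following is a minimal sketch using only the standard library. The function name `environment_report` is hypothetical and not part of pySBD; it assumes the distribution is installed under the name `pysbd`.

```python
import platform
import sys
from importlib import metadata


def environment_report(package: str = "pysbd") -> dict:
    """Collect the details an issue report should mention.

    `package` is the distribution name to look up; "pysbd" is assumed here.
    """
    try:
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in the current environment.
        package_version = "not installed"
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        package: package_version,
    }


if __name__ == "__main__":
    # Print one "name: value" line per entry, ready to paste into an issue.
    for name, value in environment_report().items():
        print(f"{name}: {value}")
```

The printed lines can be pasted directly into the environment section of an issue.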
See this page for an overview of the system the author uses to tag issues and pull requests.
Happy to see you contribute to the pySBD codebase. To get started and understand the internals of pySBD, a good place to begin is the implementation section of the pySBD research paper. Another great reference is the list of merged pull requests. Depending on the type of your contribution, refer to the assigned labels.
To make changes to pySBD's code base, you need to fork and then clone the GitHub repository to your local machine. Make sure you have a development environment consisting of a Python 3+ distribution, pip and git.
python -m pip install -U pip
git clone https://github.com/nipunsadvilkar/pySBD
cd pySBD
pip install -r requirements-dev.txt
Since pySBD is lightweight, it requires only Python built-in modules, more specifically Python's re module, to function. Development package requirements are listed in requirements-dev.txt. If you want to use pySBD as a spaCy component, install spaCy in your environment.
The language-specific Golden Rules Sets are hand-constructed rules designed to cover sentence boundaries across a variety of domains. The set is by no means complete and will evolve and expand over time. If you would like to report an issue with an existing rule or propose a new rule, please open an issue. If you want to contribute it yourself, please go ahead and send a pull request, referring to the add tests section.
Great to see you adding new language support to pySBD ✨.
Adding new language support involves the following steps:
^^ Please use the commits for already supported languages - Marathi, Spanish, Chinese - as a frame of reference as you go through each step below.
- New Language Specific Golden Rules Set
  You need to create a Golden Rules Set representing basic to complex sentence boundary variations as a test set. Assuming you know the language, its sentence syntax and other intricacies, create a new file at tests/lang/test_<language_name>.py and list input texts and expected outputs in the same way the author has done for existing^^ languages. Refer to the adding tests section for details on how to add and run tests and how to add a language fixture. Next, run the tests using pytest and let them deliberately fail.
- Add your language module
  Create a new file at pysbd/lang/<language_name>.py and define a new class LanguageName that inherits from two base classes, Common and Standard, which hold basic rules common to the majority of languages. Run the tests to see whether your GRS passes. If it fails, override the SENTENCE_BOUNDARY_REGEX and Punctuations class variables and the AbbreviationReplacer class to support your language-specific punctuation and sentence boundaries.
  Illustration: In the Marathi language module, AbbreviationReplacer and its SENTENCE_STARTERS are kept blank to override Standard's SENTENCE_STARTERS. Next, Punctuations is limited to ['.', '!', '?'], and SENTENCE_BOUNDARY_REGEX is constructed accordingly so that the Marathi GRS passes. As with these class variables, if you find any rule that does not apply to your language, you can override it in your language class.
- Add language code
  Your language module and language GRS should be in place by now. The next step is to make the language available to pySBD's languages module: import your language module and add a new entry to the LANGUAGE_CODES dictionary, with the ISO 639-1 code of your language as the key and the language class you imported as the value.
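The language module and language code steps above can be sketched as follows. This is a self-contained illustration, not pySBD's actual code: `Common` and `Standard` here are simplified stand-ins for the real base classes, their default values are invented, `get_language` is a hypothetical lookup helper, and `"xx"` is a placeholder rather than a real ISO 639-1 code. In the repository you would import the real base classes instead of defining them.

```python
class Common:
    """Stand-in for pySBD's Common base class (assumption, not the real code)."""
    SENTENCE_BOUNDARY_REGEX = r".*?[.!?]|.*?$"


class Standard:
    """Stand-in for pySBD's Standard base class (assumption, not the real code)."""
    Punctuations = [".", "!", "?", "。"]

    class AbbreviationReplacer:
        # Invented defaults, only here so the override below is visible.
        SENTENCE_STARTERS = ["A", "The", "This"]


class LanguageName(Common, Standard):
    """Hypothetical new language that overrides only what differs."""
    SENTENCE_BOUNDARY_REGEX = r".*?[.!?]|.*?$"
    Punctuations = [".", "!", "?"]

    class AbbreviationReplacer(Standard.AbbreviationReplacer):
        # Kept blank to override the inherited sentence starters.
        SENTENCE_STARTERS = []


# Registry keyed by language code; "xx" is a placeholder code.
LANGUAGE_CODES = {
    "xx": LanguageName,
}


def get_language(code):
    """Return the language class registered for `code` (hypothetical helper)."""
    try:
        return LANGUAGE_CODES[code]
    except KeyError:
        raise ValueError(f"Language code {code!r} is not registered") from None
```

Overriding only the attributes that differ keeps the new class small, and an unknown code fails loudly instead of silently falling back to another language.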
The author emphasizes a Test-Driven Development (TDD) approach to ensure the robustness of the pySBD module. You will follow a "Red-Green-Refactor" cycle.
- Make sure you have a proper development environment set up.
- Depending on the type of your contribution, your test script will be either feature-specific or bugfix-specific.
- (Red) Once you add those tests, run pytest to make sure they fail deliberately.
- (Green) Write just enough code in the respective Python script to pass the specific test that you added and that failed earlier.
- Once it passes, run all the tests to check that your added code doesn't break existing code.
- (Refactor) Do any necessary refactoring and cleanup while keeping the tests green.
- Repeat 🔁
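The cycle above can be illustrated with a small, self-contained sketch. It uses plain Python instead of pytest so that it runs anywhere. The test cases are illustrative, `segment` is a toy regex splitter standing in for the code under test (it is not pySBD's segmenter), and `run_cases` is a hypothetical helper.

```python
import re

# Illustrative golden-rule style cases: (input text, expected sentences).
GOLDEN_RULES_TEST_CASES = [
    ("Hello World. My name is Jonas.",
     ["Hello World.", "My name is Jonas."]),
    ("What is your name? My name is Jonas.",
     ["What is your name?", "My name is Jonas."]),
]


def segment(text):
    """Toy stand-in for the code under test, not pySBD's segmenter."""
    return [s.strip() for s in re.findall(r"[^.!?]+[.!?]", text)]


def run_cases(cases, segmenter):
    """Return the cases whose actual output differs from the expectation."""
    failures = []
    for text, expected in cases:
        actual = segmenter(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


# Red: before any logic exists, every case fails.
red = run_cases(GOLDEN_RULES_TEST_CASES, lambda text: [text])
# Green: with just enough logic in place, no case fails.
green = run_cases(GOLDEN_RULES_TEST_CASES, segment)
print(f"red failures: {len(red)}, green failures: {len(green)}")
```

Seeing the failure first confirms that the test actually exercises the behaviour you are about to add.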
When fixing a bug, first create an issue if one does not already exist. The description text can be very short; it doesn't need to be verbose.
Next, depending on the type of your issue, add your test to TEST_ISSUE_DATA or TEST_ISSUE_DATA_CHAR_SPANS as a tuple ("#ISSUE_NUMBER", "<input_text>", <expected_output>) in the pysbd/tests/regression folder. Test for the bug you're fixing, and make sure the test fails. Next, add and commit your test file, referencing the issue number in the commit message. Finally, fix the bug, make sure your test passes and reference the issue in your commit message.
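As a rough sketch of the tuple shape described above, the entries below use placeholder values only. The texts are invented, "#ISSUE_NUMBER" is left as a placeholder, and two details are assumptions rather than confirmed conventions: that expected sentences keep their trailing whitespace, and that character spans are written as (sentence, start, end). The helpers `spans_match_text` and `check_entries` are hypothetical and only sanity-check an entry before you add it.

```python
# Placeholder entries in the tuple shape described above.
# "#ISSUE_NUMBER" and the texts are stand-ins, not real pySBD issues.
TEST_ISSUE_DATA = [
    ("#ISSUE_NUMBER", "First sentence. Second sentence.",
     ["First sentence. ", "Second sentence."]),
]

# Assumed span shape: (sentence, start offset, end offset).
TEST_ISSUE_DATA_CHAR_SPANS = [
    ("#ISSUE_NUMBER", "First sentence. Second sentence.",
     [("First sentence. ", 0, 16), ("Second sentence.", 16, 32)]),
]


def spans_match_text(text, spans):
    """Check that every (sentence, start, end) span slices back to its sentence."""
    return all(text[start:end] == sentence for sentence, start, end in spans)


def check_entries(entries):
    """Sanity-check the tuple shape before handing entries to the test runner."""
    for issue_no, text, expected in entries:
        assert issue_no.startswith("#"), "reference the issue as '#<number>'"
        assert isinstance(text, str) and isinstance(expected, list)
```

Checking that the spans slice back to their sentences catches off-by-one offsets before the regression test itself runs.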
Thank you for contributing! ✨ 🍰 ✨