Adding all bible book names as abbreviations #12

Closed
wants to merge 3 commits into from
12 changes: 6 additions & 6 deletions syntok/_segmentation_states.py
@@ -68,12 +68,12 @@ class State(metaclass=ABCMeta):
     abbreviations = frozenset(
         """
         Abb adm Adm Abs afmo alt Alt Anl ap apdo approx Approx art Art atte atto Aufl ave Ave Az
-        bmo Bmo brig Bd Brig bsp Bsp bspw bzgl bzw ca cap capt Capt cf cmdt Cmdt cnel Cnel Co col Col Corp
-        de Dr dgl dt emp en es etc evtl excl exca Exca excmo Excmo exsmo Exsmo ff fig Fig figs Figs fr Fr
-        gal gen Gen ggf gral Gral GmbH gob Gob Hd hno Hno hnos Hnos Inc incl inkl lic Lic lit ldo Ldo Ltd
-        mag Mag max med Med min Mio mos Mr mr Mrd Mrs mrs Ms ms Mt mt MwSt nat Nat Nr nr ntra Ntra ntro Ntro
-        pag phil prof Prof rer Rer resp sci Sci Sr sr Sra sra Srta srta St st synth tab Tab tel Tel
-        univ Univ Urt vda Vda vol Vol vs vta zB zit zzgl
+        bmo Bmo brig Bd Brig bsp Bsp bspw bzgl bzw ca cap capt Capt cf cmdt Cmdt cnel Cnel Co col Col Dan Corp
+        Deut de Dr dgl Ecc Eccl dt Eph emp en Est Esth es etc Ex evtl excl exca Exca Exo excmo Excmo exsmo Exsmo Exod Ezek Ezra ff fig Fig figs Figs fr Fr
+        Gal gal Gen gen ggf gral Gral GmbH gob Hab Hag Gob Heb Hd hno Hno hnos Hos Hnos Jas Isa Jer Jon Inc Josh Judg Lam incl Lev inkl lic Lic lit ldo Ldo Ltd
+        mag Mal Matt Mag max med Mic Med min Mio mos Mr mr Mrd Mrs mrs Ms ms Mt mt Nah MwSt nat Neh Nat Nr nr ntra Ntra ntro Oba Num Obad Ntro
+        Phlm pag Pro Phil phil Ps prof Prof Psalm Prov rer Rev Rer Rom resp sci Sci Sr sr Sra sra Srta srta St st synth tab Tab tel Tel
+        univ Univ Urt vda Vda vol Vol vs Zech vta Zeph zB zit zzgl
         Mon lun Tue mar Wed mie mié Thu jue Fri vie Sat sab Sun dom
         """.split()
         + list(months)
8 changes: 8 additions & 0 deletions syntok/segmenter_test.py
@@ -378,6 +378,14 @@ def test_do_not_split_short_text_inside_parenthesis(self):
         result = segmenter.split(iter(tokens))
         self.assertEqual([tokens], result)

+    def test_do_not_split_bible_citation(self):
+        tokens = Tokenizer().split(
+            "This is not a real quote? (Phil. 4:8) No, it's not."
+        )
+        result = segmenter.split(iter(tokens))
+        self.assertEqual(len(result[0]), 7)
+        self.assertEqual(len(result[1]), 11)
+
     def test_do_not_split_short_text_inside_parenthesis2(self):
         tokens = Tokenizer().split(
             "This is (Proc. ABC with Abs. Reg. Compliance) not here."
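For readers unfamiliar with why an abbreviation list affects segmentation, here is a minimal, self-contained sketch of the general idea. It is not syntok's actual algorithm: `split_sentences` and the tiny `ABBREVIATIONS` set are hypothetical stand-ins, and the real segmenter works on tokens with a state machine rather than on raw strings. The sketch only illustrates how a period after a known abbreviation such as `Phil` can be kept from ending a sentence.

```python
import re

# Hypothetical, much smaller stand-in for the abbreviation set in the diff.
ABBREVIATIONS = frozenset("Dr Mr Mrs Prof Phil Gen Matt Rev".split())


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split after '.', '?' or '!' followed by whitespace, except when a
    period directly follows a word listed in `abbreviations`."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.?!]\s+", text):
        # Find the word that ends right before the terminal character.
        preceding = re.search(r"(\w+)$", text[start:match.start()])
        is_period = match.group()[0] == "."
        if is_period and preceding and preceding.group(1) in abbreviations:
            continue  # treat "Phil." as an abbreviation, not a sentence end
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "This is not a real quote? (Phil. 4:8) No, it's not."
print(split_sentences(text))               # citation stays in one sentence
print(split_sentences(text, frozenset()))  # without the list, it is split
```

With `Phil` in the set, the sketch yields two sentences, which matches the shape the new test expects from the real segmenter; with an empty set, the citation is cut after `(Phil.`.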