Releases: stanfordnlp/CoreNLP
v4.5.8 - Package updates and minor bug fixes
- Update German UD POS tagger to UD 2.14 data
- Add Austrian German month names to the German tokenizer: #1454 Thank you @j3ernhard
- Improve the constituency to dependency converter to remove quite a few validation errors. This includes adding the PTB Corrector as an earlier step when operating specifically on PTB data #1445
- SSurgeon feature to split one word into multiple words: 13ede5a
- Unravel recursion in SemanticGraph: 05804a3. Fixes one server crash observed in #1461
- Package updates: update protobuf -> 3.25.5, javax -> 1.1.6 #1465. Unfortunately, updating Lucene to fix all dependency security issues will require dropping Java 8 support
- Fix the server caching of tokenizer annotators to include segmenter properties as well. Avoids the server not respecting a request for a different segmentation model. 6f6eb93
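The server caching fix above is at heart a cache-key problem: if cached annotators are looked up by a key that leaves out some of the properties that change their behavior, a later request with a different value gets the old object back. The sketch below illustrates only that idea. The class, the prefix list, and the property values are hypothetical and are not CoreNLP's server code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical illustration; not the CoreNLP implementation.
class AnnotatorCacheSketch {
  // Property prefixes assumed to influence the tokenizer. Leaving "segment."
  // out of this list would reproduce the kind of bug described above.
  static final List<String> RELEVANT_PREFIXES = Arrays.asList("tokenize.", "segment.");

  private final Map<String, Object> cache = new HashMap<>();

  // Build a deterministic key from every relevant property, sorted by name.
  static String cacheKey(Properties props) {
    Map<String, String> relevant = new TreeMap<>();
    for (String name : props.stringPropertyNames()) {
      for (String prefix : RELEVANT_PREFIXES) {
        if (name.startsWith(prefix)) {
          relevant.put(name, props.getProperty(name));
        }
      }
    }
    return relevant.toString();
  }

  // Return the cached object for these properties, creating it on first use.
  // A plain Object stands in for the annotator.
  Object getTokenizer(Properties props) {
    return cache.computeIfAbsent(cacheKey(props), key -> new Object());
  }

  public static void main(String[] args) {
    Properties first = new Properties();
    first.setProperty("segment.model", "model-a");
    Properties second = new Properties();
    second.setProperty("segment.model", "model-b");

    AnnotatorCacheSketch server = new AnnotatorCacheSketch();
    // Different segmenter settings must not share a cached annotator.
    System.out.println(server.getTokenizer(first) == server.getTokenizer(second)); // false
  }
}
```

Sorting the relevant properties before building the key keeps the key independent of the order in which the request supplied them.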
v4.5.7 - Constituency to Dependency Converter Upgrades
UD converter upgrades
Inspired by UniversalDependencies/docs#717, although the work is not finished
- Add an option to use the PTBCorrector, which fixes many (although not all) incorrect POS tags 5e57eab
- Treat `sort of` the same as `kind of` bc4acf1
- `en masse` is flat cb338cd
- `dinna` is an MWT 1dd746c
- Use `AUX` as the POS in the converter when appropriate 30f2f8e
- Fix (heh) `all but` and `whether or not` 2513676
- Dependency `dep` -> `ccomp` for fronted `say` verbs a76a854
Parser evaluation improvements
- Include the F1 scores of each tree when scoring a constituency dataset 2725b06
v4.5.6: Lemmatizer & Tokenizer bugfixes
English Lemmatizer upgrades
- enroll, appall as American spellings, instead of enrol & appal. de- as a verb prefix, blog and xfer as double letter exceptions 8adcbfe
- cowritten 2dd08da
- elder / eldest 9b5bec8
- Yazidi as a demonym 2852da8
Tokenizer upgrades
UD Processing upgrades
- 'twas and 'tis as MWT in the UD converter b9f19a6
- Sort morpho features in alphabetical order when writing out UD f77a9b4
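As a rough illustration of the feature-sorting item above, the sketch below reorders a `|`-separated feature string alphabetically. The helper name is hypothetical, and ignoring case while sorting is an assumption; the converter's actual code is in the commit referenced above.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only.
class FeatureSortSketch {
  // Sort the Name=Value entries of a feature string alphabetically,
  // ignoring upper/lower case (an assumption of this sketch).
  static String sortFeatures(String feats) {
    if (feats.equals("_")) {
      return feats;  // "_" marks an empty feature column
    }
    String[] parts = feats.split("\\|");
    Arrays.sort(parts, String.CASE_INSENSITIVE_ORDER);
    return String.join("|", parts);
  }

  public static void main(String[] args) {
    System.out.println(sortFeatures("Number=Sing|Case=Nom|Gender=Fem"));
    // prints Case=Nom|Gender=Fem|Number=Sing
  }
}
```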
Other Bugfixes
- Crash when deleting the endpoints of an `IntervalTree` #1405 6d17c23
- Find and remove extraneous uses of `yield`, which became a keyword: e5c9d44 b084233
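The `yield` item reflects a Java language change: since Java 14, `yield` is a restricted identifier used in switch expressions, so an unqualified call such as `yield()` no longer compiles, while a qualified call such as `this.yield()` still does. The sketch below shows that rule with a hypothetical node class. It is not CoreNLP code and does not reproduce what the commits above changed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree node, for illustration only.
class YieldSketch {
  final String label;
  final List<YieldSketch> children = new ArrayList<>();

  YieldSketch(String label) {
    this.label = label;
  }

  // A method may still be named "yield"; it returns the leaf labels in order.
  List<String> yield() {
    List<String> leaves = new ArrayList<>();
    if (children.isEmpty()) {
      leaves.add(label);
    }
    for (YieldSketch child : children) {
      leaves.addAll(child.yield());  // qualified call: accepted
    }
    return leaves;
  }

  int leafCount() {
    // An unqualified "yield().size()" is rejected by Java 14+ compilers.
    // Qualifying the call with "this." compiles on old and new versions.
    return this.yield().size();
  }

  public static void main(String[] args) {
    YieldSketch root = new YieldSketch("NP");
    root.children.add(new YieldSketch("the"));
    root.children.add(new YieldSketch("dog"));
    System.out.println(root.yield() + " " + root.leafCount());
    // prints [the, dog] 2
  }
}
```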
Minor API change
- Updating the text on a CoreLabel no longer wipes out the Lemma c03522b
- Update to more recent Jakarta Servlet 8a671fd
Ssurgeon
v4.5.5: further Ssurgeon upgrades, SceneGraph server module, security bugfix
Ssurgeon updates beyond the capabilities listed in the GURT paper
- MergeNodes operation: combine two words into one word in a graph. one word must be a leaf headed by the other for this to work 0660fa9
- CombineMWT operation: mark MWT on two or more words. Stanza will treat these as `Token` 010a955
- DeleteLeaf operation: remove a leaf, renumber the subsequent words 429f61a
Bugfixes
- fix graph serialization for sentences longer than 128 words (`IdentityHashSet` doesn't work for integers beyond 128) d8d9d9f
- fix `valueOf` for `SemanticGraph` if a word is just a dash 203eb06
- fix memory usage of evaluating a PCFG model, which would run out of memory because it was saving all of the charts while evaluating b2e67b0
- Tregex pattern would not correctly display when using optional patterns: a9965b2 8659653
- Tregex would infinite loop on certain optional patterns which were theoretically legal cc7983e
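The graph serialization bug above comes from comparing boxed integers by identity. Java guarantees that boxing reuses the same `Integer` object only for values from -128 to 127. Larger values usually box to distinct objects, so an identity-based set can fail to find them. The sketch below uses the JDK's `IdentityHashMap` as a stand-in for CoreNLP's `IdentityHashSet`; the class and method names are made up for the example.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

class IntegerIdentitySketch {
  // Add a boxed value, then look up a separately boxed copy of it.
  static boolean roundTrips(Set<Integer> set, int value) {
    set.add(Integer.valueOf(value));
    return set.contains(Integer.valueOf(value));
  }

  public static void main(String[] args) {
    // Stand-in for an identity-based set: membership uses ==, not equals().
    Set<Integer> identity = Collections.newSetFromMap(new IdentityHashMap<>());
    Set<Integer> equality = new HashSet<>();

    // Values in -128..127 are always cached, so both sets find them.
    System.out.println(roundTrips(identity, 127));
    System.out.println(roundTrips(equality, 127));

    // Larger values are typically boxed into new objects each time, so the
    // identity set usually misses them; the equals()-based set never does.
    System.out.println(roundTrips(identity, 1000));
    System.out.println(roundTrips(equality, 1000));
  }
}
```

The third result depends on the JVM's boxing cache settings, which is why the sketch does not state an expected output for it.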
Security fixes
- update xom to 1.3.9, which should avoid unwanted, potentially vulnerable transitive dependencies c8772b7
- remove bz2 zip & unzip, which used a shell command and therefore could be hijacked https://nvd.nist.gov/vuln/detail/CVE-2023-39020
English dependency converter fixes
- addressing issue #1363
- fix `(QP up to ...)` 8c46648 9a86ece
- fix `up to 1700 kilograms` if misparsed in a predictable manner 6e14527
- better `LST` coverage 5745de5
- `vmod/acl` when the parser misinterprets `NP` vs `NML` ad4556d
- treat lists of `NML` as repeated modifiers of a noun, instead of a list, as that is the likely meaning of `NML`. Example: `a 72-game, three-month season` from PTB 61ef545 5e748dc
Server features
- Scenegraph endpoint 8b40947 #1346
- remove one json library to reduce number of json libraries we depend on 357b1bb
Small changes
- allow `fourty` as a number in SUTime 7fbb7b8
- capture `forty (40) days` as a duration in SUTime b3c47a0
- feature to print out the feature index of an NER model as a text file f636673
- clarify the INTJ rule for the ChineseHeadFinder 56cd6bb
- consider `{` `}` as punctuation when scoring English constituency treebanks a606afa
- fix error in test case, from @tanloong #1373 #1372
- dead code cleanup 86b6a03
v4.5.4: Minor Ssurgeon updates
- Minor Ssurgeon bugfixes (make it harder to infinite loop with EditNode or RelabelNamedEdge)
- Add a ReattachNamedEdge which is a combination of RemoveNamedEdge and AddEdge with new endpoints
- include the Morphology CLI for using the CoreNLP lemmatizer from elsewhere, such as Python
v4.5.3: Ssurgeon interface, Collinizer fixes
Mostly changes to Semgrex, along with adding Ssurgeon to the download package for general consumption. This involved quite a few changes to classes such as `AnnotationLookup`. The released version should now match the Semgrex/Ssurgeon paper published at GURT 2023.
Ssurgeon / Semgrex
- Update Semgrex and Ssurgeon to match the paper published at GURT: https://aclanthology.org/2023.tlt-1.7/
Bugfixes
- Fix "Could not match" errors which occurred when scoring treebanks using a tagger that produces non-gold punct tags: #1344
- Fix typo in KBP children rules: dbdb55b
Minor features
v4.5.2: package dependencies, CLI additions
v4.5.1: Bugfixes
CoreNLP 4.5.1
Bugfixes!
- Fix tokenizer regression: 4.5.0 will tokenize ",5" as one word 974383a
- Use a `LinkedHashMap` in the PTBTokenizer instead of `Properties`. Keeps the option processing order predictable. #1289 6550188
- Fix `\r\n` not being properly processed on Windows: #1291 9889f4e
- Handle one half of surrogate character pairs in the tokenizer w/o crashing #1298 1b12faa
- Attempt to fix semgrex "Unknown vertex" errors which have plagued CoreNLP for years in hard to track down circumstances: #1296 #1229 #1169 f99b5ab
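The `LinkedHashMap` item above matters because `Properties` is a hash table with no defined iteration order, while `LinkedHashMap` iterates in insertion order. The sketch below shows the insertion-order behavior with a hypothetical option parser. The option names are only examples, and this is not the PTBTokenizer's own parsing code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OptionOrderSketch {
  // Parse "key=value,key,key=value" into a map that remembers insertion order.
  // A bare key with no "=" is treated as "true" (a choice made for this sketch).
  static Map<String, String> parseOptions(String spec) {
    Map<String, String> options = new LinkedHashMap<>();
    for (String entry : spec.split(",")) {
      String[] pair = entry.split("=", 2);
      options.put(pair[0].trim(), pair.length > 1 ? pair[1].trim() : "true");
    }
    return options;
  }

  public static void main(String[] args) {
    Map<String, String> options =
        parseOptions("ptb3Escaping=false,normalizeParentheses=true,invertible");
    // Iteration follows the order in which the options were written.
    System.out.println(options.keySet());
    // prints [ptb3Escaping, normalizeParentheses, invertible]
  }
}
```

When a later option is meant to override an earlier one, processing them in the written order gives the same result on every run.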
v4.5.0
CoreNLP 4.5.0
Main features are improved lemmatization of English, improved tokenization of both English and non-English flex-based languages, and some updates to tregex, tsurgeon, and semgrex
- All PTB and German tokens normalized now in PTBLexer (previously only German umlauts). This makes the tokenizer 2% slower, but should avoid issues with resume' for example d46fecd
- log4j removed entirely from public CoreNLP (internal "research" branch still has a use) f05cb54
- Fix NumberFormatException showing up in NER models: #547 5ee2c39
- Fix "seconds" in the lemmatizer: e7a073b
- Fix double escaping of & in the online demos: 8413fa1
- Report the cause of an error if "tregex" is asked for but no parse annotator is added: 4db80c0
- Merge ssplit and cleanxml into the tokenize annotator (done in a backwards compatible manner): #1259
- Custom tregex pattern, ROOT tregex pattern, and tsurgeon operation for simultaneously moving a subtree and pruning anything left behind, used for processing the Italian VIT treebank in stanza: #1263
- Refactor tokenization of punctuation, filenames, and other entities common to all languages, not just English: 3c40ba3 58a2288 8b97d64
- Improved tokenization of number patterns, names with apostrophes such as Sh'reyan, non-American phone numbers, invisible commas 9476a8e 6193934 afb1ea8 7c84960
- Significant lemmatizer improvements: adjectives & adverbs, along with some various other special cases #1266
- Include graph & semgrex indices in the results for a semgrex query (will make the results more usable) 45b47e2
- Trim words in the NER training process. spaces can still be inside a word, but random whitespace won't ruin the performance of the models 0d9e9c8
- Fix NBSP in the Chinese segmenter stanfordnlp/stanza#1052 #1279
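The token normalization item above addresses text where an accented letter is stored as a base letter plus a combining mark instead of a single code point. The notes say only that tokens are "normalized"; the sketch below assumes Unicode composition (NFC) via the JDK's `java.text.Normalizer` and is not the PTBLexer's own code.

```java
import java.text.Normalizer;

class NormalizeSketch {
  // Compose base letters and combining marks into single code points (NFC).
  static String compose(String token) {
    return Normalizer.normalize(token, Normalizer.Form.NFC);
  }

  public static void main(String[] args) {
    String decomposed = "re\u0301sume\u0301";  // each accented e is "e" + combining acute
    String composed = "r\u00e9sum\u00e9";      // each accented e is one code point

    // The two strings look identical on screen but are not equal as written.
    System.out.println(decomposed.equals(composed));           // false
    // After composition they compare equal.
    System.out.println(compose(decomposed).equals(composed));  // true
  }
}
```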
v4.4.0
Enhancements
- added -preTokenized option which will assume text should be tokenized on white space and sentence split on newline
- tsurgeon CLI - python side added to stanza #1240
- sutime WORKDAY definition 0dfb118
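To make the `-preTokenized` behavior above concrete: each line is taken as one sentence and each whitespace-separated chunk as one token. The sketch below shows that splitting rule in plain Java. The class name is hypothetical, skipping blank lines is an assumption, and this is not how CoreNLP implements the option.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PreTokenizedSketch {
  // One sentence per line, one token per whitespace-separated chunk.
  static List<List<String>> split(String text) {
    List<List<String>> sentences = new ArrayList<>();
    for (String line : text.split("\\R")) {  // \R matches any line break
      String trimmed = line.trim();
      if (!trimmed.isEmpty()) {              // blank lines are skipped here
        sentences.add(Arrays.asList(trimmed.split("\\s+")));
      }
    }
    return sentences;
  }

  public static void main(String[] args) {
    System.out.println(split("The dog barked .\nIt was n't loud ."));
    // prints [[The, dog, barked, .], [It, was, n't, loud, .]]
  }
}
```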
Fixes
- rebuilt Italian dependency parser using CoreNLP predicted tags
- XML security issue: #1241
- NER server security issue: 5ee097d
- fix infinite loop in tregex: #1238
- json utf-8 output on windows #1231 stanfordnlp/stanza#894
- fix nondeterministic results in certain SemanticGraph structures #1228 cc806f2
- workaround for NLTK sending % unescaped to the server #1226 20fe1e9
- make TimingTest function on Windows 4aafb84