Releases · stanfordnlp/CoreNLP

07 Jun 20:00

AngledLuffa

v4.5.10

1b7edd1

v4.5.10 - Remove Patterns / Lucene and add Semgrex / Ssurgeon features

Remove Patterns

Older versions of Lucene have a security issue: GHSA-g643-xq6w-r67c. Unfortunately, Lucene V9.12 is not compatible with Java 8, so we are removing Lucene from this release.
The one project using it in CoreNLP is the patterns directory. We remove this, perhaps temporarily. If you are making use of the patterns project, please file an issue and we will include it in a future Java 11 compatible release. (We are aware of at least one group which used that project, back in 2020.)

Semgrex and Ssurgeon upgrades

The :: uniq operator at the end of a Semgrex expression makes the results unique across a set of node values
<> search in Semgrex means connected either as parent or child. Simplifies expressions where the direction of the connection doesn't matter
Ssurgeon EditNode now supports -removemorphofeatures to remove one or more features without removing all features
Ssurgeon SplitWord now allows for exact word splitting, not just regex based splitting
Ssurgeon MergeNodes can now merge multiple nodes at once, not just two
add Ssurgeon SetPhraseHead operation to make a connected phrase in a dependency graph have a different head, possibly updating the relations between the children as well. Useful for changing the head of a proper noun phrase, for example
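
The first two items can be pictured with a small plain-Java sketch. This is not CoreNLP code and not Semgrex syntax. It assumes that `<>` means "an edge exists in either direction" and that `:: uniq` keeps one match per distinct combination of the chosen node values. The edge layout, method names, and sample data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SemgrexIdeasSketch {
    /** True if some edge links a and b in either direction (the assumed meaning of <>). */
    static boolean connected(List<String[]> edges, String a, String b) {
        for (String[] edge : edges) {
            boolean parentToChild = edge[0].equals(a) && edge[1].equals(b);
            boolean childToParent = edge[0].equals(b) && edge[1].equals(a);
            if (parentToChild || childToParent) {
                return true;
            }
        }
        return false;
    }

    /** Keeps the first match for each distinct combination of values of the given node names. */
    static List<Map<String, String>> uniqueBy(List<Map<String, String>> matches, List<String> names) {
        Map<List<String>, Map<String, String>> firstSeen = new LinkedHashMap<>();
        for (Map<String, String> match : matches) {
            List<String> key = new ArrayList<>();
            for (String name : names) {
                key.add(match.get(name));
            }
            firstSeen.putIfAbsent(key, match);
        }
        return new ArrayList<>(firstSeen.values());
    }

    public static void main(String[] args) {
        // Hypothetical edges stored as {governor, dependent} pairs.
        List<String[]> edges = List.of(new String[] {"ate", "Bill"}, new String[] {"ate", "muffins"});
        System.out.println(connected(edges, "Bill", "ate"));      // direction does not matter
        System.out.println(connected(edges, "Bill", "muffins"));  // siblings are not directly connected

        // Hypothetical matches, each mapping a node name to the matched word.
        List<Map<String, String>> matches = List.of(
            Map.of("verb", "ate", "obj", "muffins"),
            Map.of("verb", "ate", "obj", "bagels"),
            Map.of("verb", "slept", "obj", "muffins"));
        System.out.println(uniqueBy(matches, List.of("verb")).size());
    }
}
```

Keying on a list of values rather than a single value is what lets the same helper deduplicate across one node name or several.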

Hopefully minor interface changes

We move VariableStrings from trees/tregex and semgraph/semgrex into util. It turns out there were two copies of this code in the codebase. This may ruin serialized tregex outputs, if such a thing exists.

07 Apr 15:04

AngledLuffa

v4.5.9

cabc020

v4.5.9 - Security Updates and Semgrex / Ssurgeon features

Security updates

Removed the ability to specify an external library for deserialization of annotations in the server. We believe this should not be necessary given the complete nature of the protobuf format, and this was reported as a potential security vulnerability (https://github.com/stanfordnlp/CoreNLP/security/advisories/GHSA-wv35-hv9v-526p). If it turns out someone has a use case for this feature, please file an issue on GitHub.
Remove the naturalli demo, which is unsupported and likely not used anywhere given its Stanford-specific components

Semgrex / Ssurgeon features

Semgrex can now search on negated attributes of a node using !: as the syntax: 7399e9b
Semgrex can now search on maps (especially morphological features) with the :{feature:value} syntax, as well as search for negative matches with {feature!:value}: 84ac932 ff1d903 3c30b3b
Ssurgeon can now reindex nodes with ReindexGraph, such as in cases where a sentence was manually split in a conllu file: 156fad1
Ssurgeon can remove a feature with EditNode using the -remove option: 8e7d121

Other minor updates

Additional demonyms now supported in the lemmatizer, demonyms from LinES and ParTUT: 4f15b08
Output lemmas when training a tagger whenever available if -outputLemmas is set, even if not verbose 94739c7

29 Dec 08:15

AngledLuffa

v4.5.8

5970639

v4.5.8 - Package updates and minor bug fixes

Update German UD POS tagger to UD 2.14 data
Add Austrian German month names to the German tokenizer: #1454 Thank you @j3ernhard
Improve the constituency to dependency converter to remove quite a few validation errors. This includes adding the PTB Corrector as an earlier step when operating specifically on PTB data #1445
SSurgeon feature to split one word into multiple words: 13ede5a
Unravel recursion in SemanticGraph - 05804a3 Fixes one server crash observed in #1461
Package updates: update protobuf -> 3.25.5, javax -> 1.1.6 #1465. Unfortunately, updating Lucene to fix all dependency security issues will require dropping Java 8 support
Fix the server caching of tokenizer annotators to include segmenter properties as well. Avoids the server not respecting a request for a different segmentation model. 6f6eb93
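
The caching fix in the last item can be illustrated with a plain-Java sketch. It is not the server's actual code. It only shows how a cache key built from tokenizer properties alone would make two requests with different segmentation models collide, and how widening the key avoids that. The helper name, the prefix-based key, and the property values are assumptions made for illustration.

```java
import java.util.Properties;
import java.util.TreeMap;

class AnnotatorCacheKeySketch {
    /** Builds a cache key from every property whose name starts with one of the given prefixes. */
    static String cacheKey(Properties props, String... prefixes) {
        // TreeMap keeps the key text stable regardless of property insertion order.
        TreeMap<String, String> relevant = new TreeMap<>();
        for (String name : props.stringPropertyNames()) {
            for (String prefix : prefixes) {
                if (name.startsWith(prefix)) {
                    relevant.put(name, props.getProperty(name));
                }
            }
        }
        return relevant.toString();
    }

    public static void main(String[] args) {
        // Two hypothetical requests: same tokenizer language, different segmenter model.
        Properties first = new Properties();
        first.setProperty("tokenize.language", "zh");
        first.setProperty("segment.model", "model-a.ser.gz");

        Properties second = new Properties();
        second.setProperty("tokenize.language", "zh");
        second.setProperty("segment.model", "model-b.ser.gz");

        // Key on tokenizer properties only: the two requests share a cache slot.
        System.out.println(cacheKey(first, "tokenize.").equals(cacheKey(second, "tokenize.")));
        // Key on tokenizer and segmenter properties: the requests are kept apart.
        System.out.println(cacheKey(first, "tokenize.", "segment.").equals(cacheKey(second, "tokenize.", "segment.")));
    }
}
```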

Contributors

j3ernhard

28 Apr 05:36

AngledLuffa

v4.5.7

2460079

v4.5.7 - Constituency to Dependency Converter Upgrades

UD converter upgrades

Inspired by UniversalDependencies/docs#717, although the work is not finished

Add an option to use the PTBCorrector, which fixes many (although not all) incorrect POS tags 5e57eab
Treat sort of the same as kind of bc4acf1
en masse is flat cb338cd
dinna is an MWT 1dd746c
Use AUX as the POS in the converter when appropriate 30f2f8e
Fix (heh) all but and whether or not 2513676
Dependency dep -> ccomp for fronted say verbs a76a854

Parser evaluation improvements

Include the F1 scores of each tree when scoring a constituency dataset 2725b06

01 Feb 20:39

AngledLuffa

v4.5.6

71bc256

v4.5.6: Lemmatizer & Tokenizer bugfixes

English Lemmatizer upgrades

enroll, appall as American spellings, instead of enrol & appal. de- as a verb prefix, blog and xfer as double letter exceptions 8adcbfe
cowritten 2dd08da
elder / eldest 9b5bec8
Yazidi as a demonym 2852da8

Tokenizer upgrades

#number as a single thing after an abbreviation #1396 ad37f2a

UD Processing upgrades

'twas and 'tis as MWT in the UD converter b9f19a6
Sort morpho features in alphabetical order when writing out UD f77a9b4
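
As a rough illustration of the feature sorting item, the sketch below reorders a UD FEATS string by feature name. It is a standalone example, not the code from the commit. The class and method names are hypothetical, and the case-insensitive comparison is an assumption about what "alphabetical" means here.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

class MorphoFeatureSortSketch {
    /** Re-emits a FEATS string such as "Number=Sing|Case=Nom" with features sorted by name. */
    static String sortFeatures(String feats) {
        if (feats == null || feats.isEmpty() || feats.equals("_")) {
            return "_";  // "_" marks an empty FEATS column in CoNLL-U
        }
        // Assumption: compare feature names without regard to case.
        Map<String, String> sorted = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (String pair : feats.split("\\|")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Feature without '=': " + pair);
            }
            sorted.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        StringJoiner joined = new StringJoiner("|");
        for (Map.Entry<String, String> entry : sorted.entrySet()) {
            joined.add(entry.getKey() + "=" + entry.getValue());
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(sortFeatures("Number=Sing|Case=Nom|Gender=Masc"));
    }
}
```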

Other Bugfixes

Crash when deleting the endpoints of an IntervalTree #1405 6d17c23
Find and remove extraneous uses of yield, which became a keyword: e5c9d44 b084233
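
For background on the yield item: since Java 14, `yield` is a contextual keyword used inside switch expressions, and a bare call to a method named `yield` no longer compiles. The sketch below shows both sides. It is a generic Java example with hypothetical names, not the code touched by those commits, and it needs Java 14 or later.

```java
import java.util.List;

class YieldKeywordSketch {
    private final List<String> words;

    YieldKeywordSketch(List<String> words) {
        this.words = words;
    }

    /** A method that happens to be named yield; declaring it is still legal. */
    List<String> yield() {
        return words;
    }

    int length() {
        // A bare call written as yield().size() is rejected by Java 14+ compilers.
        // Qualifying the call with this. keeps it legal.
        return this.yield().size();
    }

    /** The newer meaning: yield hands a value back from a switch-expression block. */
    static String describe(int count) {
        return switch (count) {
            case 0 -> "empty";
            default -> {
                String label = count + " words";
                yield label;
            }
        };
    }

    public static void main(String[] args) {
        YieldKeywordSketch sketch = new YieldKeywordSketch(List.of("a", "b"));
        System.out.println(describe(sketch.length()));
    }
}
```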

Minor API change

Updating the text on a CoreLabel no longer wipes out the Lemma c03522b
Update to more recent Jakarta Servlet 8a671fd

Ssurgeon

UpdateMorphoFeatures edit 27c6703
Lemmatize operation (only works on English) c26b25e

06 Sep 20:46

AngledLuffa

v4.5.5

b5a632c

v4.5.5: further Ssurgeon upgrades, SceneGraph server module, security bugfix

Ssurgeon updates beyond the capabilities listed in the GURT paper

MergeNodes operation: combine two words into one word in a graph. one word must be a leaf headed by the other for this to work 0660fa9
CombineMWT operation: mark MWT on two or more words. Stanza will treat these as Token 010a955
DeleteLeaf operation: remove a leaf, renumber the subsequent words 429f61a

Bugfixes

fix graph serialization for sentences longer than 128 words (IdentityHashSet doesn't work for integers beyond 128) d8d9d9f
fix valueOf for SemanticGraph if a word is just a dash 203eb06
fix memory usage of evaluating a PCFG model, which would run out of memory because it was saving all of the charts while evaluating b2e67b0
Tregex pattern would not correctly display when using optional patterns: a9965b2 8659653
Tregex would infinite loop on certain optional patterns which were theoretically legal cc7983e
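
The serialization bug in the first item comes from a general Java pitfall: identity-based collections compare boxed integers by reference, and `Integer.valueOf` is only guaranteed to return cached objects for values from -128 to 127. The sketch below reproduces the effect with the standard library. It uses an identity set built from `IdentityHashMap` as a stand-in for CoreNLP's `IdentityHashSet`, and the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

class IdentityBoxingSketch {
    /** Adds one boxed copy of value to an identity-based set, then looks up a second boxed copy. */
    static boolean foundByIdentity(int value) {
        Set<Integer> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        seen.add(Integer.valueOf(value));
        return seen.contains(Integer.valueOf(value));
    }

    public static void main(String[] args) {
        // 100 is inside the guaranteed cache range, so both boxes are the same object.
        System.out.println(foundByIdentity(100));
        // 1000 is outside it; with default JVM settings the two boxes are different objects.
        System.out.println(foundByIdentity(1000));
    }
}
```

A set based on `equals`, such as `HashSet`, does not have this problem, because two boxed integers with the same value are equal even when they are different objects.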

Security fixes

update xom to 1.3.9, which should avoid unwanted, potentially vulnerable transitive dependencies c8772b7
remove bz2 zip & unzip, which used a shell command and therefore could be hijacked https://nvd.nist.gov/vuln/detail/CVE-2023-39020

English dependency converter fixes

addressing issue #1363
fix (QP up to ...) 8c46648 9a86ece
fix up to 1700 kilograms if misparsed in a predictable manner 6e14527
better LST coverage 5745de5
vmod/acl when the parser misinterprets NP vs NML ad4556d
treat lists of NML as repeated modifiers of a noun, instead of a list, as that is the likely meaning of NML. example: a 72-game, three-month season from PTB 61ef545 5e748dc

Server features

Scenegraph endpoint 8b40947 #1346
remove one json library to reduce number of json libraries we depend on 357b1bb

Small changes

allow fourty as a number in SUTime 7fbb7b8
capture forty (40) days as a duration in SUTime b3c47a0
feature to print out the feature index of an NER model as a text file f636673
clarify the INTJ rule for the ChineseHeadFinder 56cd6bb
consider { } as punctuation when scoring English constituency treebanks a606afa
fix error in test case, from @tanloong #1373 #1372
dead code cleanup 86b6a03

Contributors

tanloong

16 Mar 01:23

AngledLuffa

v4.5.4

1398932

v4.5.4: Minor Ssurgeon updates

Minor Ssurgeon bugfixes (make it harder to infinite loop with EditNode or RelabelNamedEdge)
Add a ReattachNamedEdge which is a combination of RemoveNamedEdge and AddEdge with new endpoints
include the Morphology CLI for using the CoreNLP lemmatizer from elsewhere, such as Python

11 Mar 05:40

AngledLuffa

v4.5.3

9ea4f39

v4.5.3: Ssurgeon interface, Collinizer fixes

Mostly changes to Semgrex, along with adding Ssurgeon to the download package for general consumption. This involved quite a few changes to classes such as AnnotationLookup. The released version should now match the Semgrex/Ssurgeon paper published at GURT 2023.

Ssurgeon / Semgrex

Update Semgrex and Ssurgeon to match the paper published at GURT: https://aclanthology.org/2023.tlt-1.7/

Bugfixes

Fix "Could not match" errors which occurred when scoring treebanks using a tagger that produces non-gold punct tags: #1344
Fix typo in KBP children rules: dbdb55b

Minor features

Add the choice of dependency graph to output to the TextOutputter 33e6c42 #1339
Hopefully minor interface change: make relation in SemanticGraphEdge final, get rid of setRelation e7a7657

11 Mar 05:32

AngledLuffa

v4.5.2

a8aaaf2

v4.5.2: package dependencies, CLI additions

Bugfixes

Tokenize c'mon and $$$ 1e216de
Tokenize 'email' 76b5a6b #1316
Return empty mentions for empty document da08664 #1322
Fix CLI protobuf tools running too fast for some network conditions: 412da5c

CLI protobuf tools

Add output of lemmatizer to words 71bc95d
Convert constituency trees to dependencies b118082

Dependency updates

Protobuf 3.19.6 0439b62
xom 1.3.8, which no longer automatically includes xalan 3ded6f0

Semgraph / Semgrex improvements

Allow reuse of indices in SemanticGraph.valueOf cf97e36
Add Semgrex relations to match the capabilities introduced in Spacy 98be52a

30 Aug 04:13

AngledLuffa

v4.5.1

f7782ff

v4.5.1: Bugfixes

CoreNLP 4.5.1

Bugfixes!

Fix tokenizer regression: 4.5.0 will tokenize ",5" as one word 974383a
Use a LinkedHashMap in the PTBTokenizer instead of Properties. Keeps the option processing order predictable. #1289 6550188
Fix \r\n not being properly processed on Windows: #1291 9889f4e
Handle one half of surrogate character pairs in the tokenizer w/o crashing #1298 1b12faa
Attempt to fix semgrex "Unknown vertex" errors which have plagued CoreNLP for years in hard to track down circumstances: #1296 #1229 #1169 f99b5ab
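
The LinkedHashMap item rests on a standard-library property: `Properties` extends `Hashtable` and makes no promise about iteration order, while `LinkedHashMap` iterates in insertion order. The sketch below parses a comma-separated option string into a `LinkedHashMap` so the options come back in the order they were written. It is a standalone illustration, not the tokenizer's parser. The helper name is hypothetical, the option names are only sample input, and treating a bare option as `true` is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OptionOrderSketch {
    /** Parses "a=1,b=2" into a map whose iteration order matches the order in the string. */
    static Map<String, String> parseOrdered(String options) {
        Map<String, String> parsed = new LinkedHashMap<>();
        for (String option : options.split(",")) {
            String trimmed = option.trim();
            if (trimmed.isEmpty()) {
                continue;  // ignore empty segments such as a trailing comma
            }
            String[] parts = trimmed.split("=", 2);
            // Assumption for this sketch: an option with no value means "true".
            parsed.put(parts[0], parts.length > 1 ? parts[1] : "true");
        }
        return parsed;
    }

    public static void main(String[] args) {
        Map<String, String> options = parseOrdered("invertible=true,ptb3Escaping=false,splitHyphenated");
        // Keys are visited in the order they were written.
        for (Map.Entry<String, String> entry : options.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

A predictable order matters whenever one option can override or depend on another, since processing then gives the same result on every run.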
