Releases: stanfordnlp/CoreNLP
v4.5.10 - Remove Patterns / Lucene and add Semgrex / Ssurgeon features
Remove Patterns
- Older versions of Lucene have a security issue: GHSA-g643-xq6w-r67c. Unfortunately, Lucene V9.12 is not compatible with Java 8. We therefore remove Lucene from this release.
- The one project using it in CoreNLP is the patterns directory. We remove this, perhaps temporarily. If you are making use of the patterns project, please file an issue and we will include it in a future Java 11 compatible release. (We are aware of at least one group which used that project, back in 2020)
Semgrex and Ssurgeon upgrades
- `:: uniq` operator at the end of a Semgrex expression allows for making results unique across a set of node values
- `<>` search in Semgrex means connected either as parent or child. Simplifies expressions where the direction of the connection doesn't matter
- Ssurgeon EditNode now supports `-removemorphofeatures` to remove one or more features without removing all features
- Ssurgeon SplitWord now allows for exact word splitting, not just regex based splitting
- Ssurgeon MergeNodes can now merge multiple nodes at once, not just two
- add Ssurgeon SetPhraseHead operation to make a connected phrase in a dependency graph have a different head, possibly updating the relations between the children as well. Useful for changing the head of a proper noun phrase, for example
Hopefully minor interface changes
- We move VariableStrings from `trees/tregex` and `semgraph/semgrex` into `util`. It turns out there were two copies of this code in the codebase. This may ruin serialized tregex outputs, if such a thing exists.
v4.5.9 - Security Updates and Semgrex / Ssurgeon features
Security updates
- Removed the ability to specify an external library for deserialization of annotations in the server. We believe this should not be necessary given the complete nature of the protobuf format, and this was reported as a potential security vulnerability: https://github.com/stanfordnlp/CoreNLP/security/advisories/GHSA-wv35-hv9v-526p. If it turns out someone has a use case for this feature, please file an issue on GitHub.
- Remove the naturalli demo, which is unsupported and likely not used anywhere given its Stanford-specific components
Semgrex / Ssurgeon features
- Semgrex can now search on negated attributes of a node using `!:` as the syntax: 7399e9b
- Semgrex can now search on maps (especially morphological features) with the `:{feature:value}` syntax, as well as search for negative matches with `{feature!:value}`: 84ac932 ff1d903 3c30b3b
- Ssurgeon can now reindex nodes with ReindexGraph, such as in cases where a sentence was manually split in a conllu file: 156fad1
- Ssurgeon can remove a feature with EditNode using the -remove option: 8e7d121
Other minor updates
v4.5.8 - Package updates and minor bug fixes
- Update German UD POS tagger to UD 2.14 data
- Add Austrian German month names to the German tokenizer: #1454 Thank you @j3ernhard
- Improve the constituency to dependency converter to remove quite a few validation errors. This includes adding the PTB Corrector as an earlier step when operating specifically on PTB data #1445
- Ssurgeon feature to split one word into multiple words: 13ede5a
- Unravel recursion in SemanticGraph: 05804a3. Fixes one server crash observed in #1461
- Package updates: update protobuf -> 3.25.5, javax -> 1.1.6 #1465. Unfortunately, updating Lucene to fix all dependency security issues will require dropping Java 8 support
- Fix the server caching of tokenizer annotators to include segmenter properties as well. Avoids the server not respecting a request for a different segmentation model. 6f6eb93
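
The recursion change in 05804a3 is not reproduced here. As a generic illustration of the technique, the sketch below walks a graph depth-first with an explicit stack instead of recursive calls, which keeps very deep graphs from exhausting the call stack. The `IterativeTraversal` class and its adjacency-map representation are hypothetical and not part of CoreNLP, and whether stack depth was the exact failure mode in #1461 is not stated in these notes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class IterativeTraversal {
  /** Returns the nodes reachable from start in depth-first preorder, without recursion. */
  public static List<Integer> reachable(Map<Integer, List<Integer>> children, int start) {
    List<Integer> order = new ArrayList<>();
    Set<Integer> seen = new HashSet<>();
    Deque<Integer> stack = new ArrayDeque<>();
    stack.push(start);
    while (!stack.isEmpty()) {
      int node = stack.pop();
      if (!seen.add(node)) {
        continue;  // already visited; this also guards against cycles
      }
      order.add(node);
      List<Integer> kids = children.getOrDefault(node, Collections.<Integer>emptyList());
      // Push in reverse so that the first child is popped (and visited) first.
      for (int i = kids.size() - 1; i >= 0; i--) {
        stack.push(kids.get(i));
      }
    }
    return order;
  }

  public static void main(String[] args) {
    // A long chain 0 -> 1 -> 2 -> ... is the worst case for a recursive walk,
    // since the recursion depth would equal the chain length.
    Map<Integer, List<Integer>> chain = new HashMap<>();
    int depth = 200_000;
    for (int i = 0; i < depth; i++) {
      chain.put(i, Collections.singletonList(i + 1));
    }
    System.out.println(reachable(chain, 0).size());
  }
}
```

The heap-allocated `Deque` grows with the graph rather than with the thread's stack size, so the same code handles shallow and very deep inputs.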
v4.5.7 - Constituency to Dependency Converter Upgrades
UD converter upgrades
Inspired by UniversalDependencies/docs#717, although the work is not finished
- Add an option to use the PTBCorrector, which fixes many (although not all) incorrect POS tags 5e57eab
- Treat `sort of` the same as `kind of` bc4acf1
- `en masse` is flat cb338cd
- `dinna` is an MWT 1dd746c
- Use `AUX` as the POS in the converter when appropriate 30f2f8e
- Fix (heh) `all but` and `whether or not` 2513676
- Dependency `dep` -> `ccomp` for fronted `say` verbs a76a854
Parser evaluation improvements
- Include the F1 scores of each tree when scoring a constituency dataset 2725b06
v4.5.6: Lemmatizer & Tokenizer bugfixes
English Lemmatizer upgrades
- enroll, appall as American spellings, instead of enrol & appal. de- as a verb prefix, blog and xfer as double letter exceptions 8adcbfe
- cowritten 2dd08da
- elder / eldest 9b5bec8
- Yazidi as a demonym 2852da8
Tokenizer upgrades
UD Processing upgrades
- 'twas and 'tis as MWT in the UD converter b9f19a6
- Sort morpho features in alphabetical order when writing out UD f77a9b4
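
As a rough illustration of the feature ordering mentioned above, the sketch below sorts a UD-style `Name=Value|Name=Value` feature string alphabetically by feature name. The `FeatureSorter` class is hypothetical, and the case-insensitive comparison is an assumption; the actual CoreNLP change in f77a9b4 may differ in its details.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

public class FeatureSorter {
  /** Sorts a UD-style feature string by feature name; "_" denotes no features. */
  public static String sortFeatures(String feats) {
    if (feats == null || feats.isEmpty() || feats.equals("_")) {
      return "_";
    }
    // TreeMap keeps its keys ordered; the comparator ignores case (an assumption).
    Map<String, String> sorted = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    for (String pair : feats.split("\\|")) {
      int eq = pair.indexOf('=');
      if (eq < 0) {
        continue;  // skip malformed entries that have no '='
      }
      sorted.put(pair.substring(0, eq), pair.substring(eq + 1));
    }
    if (sorted.isEmpty()) {
      return "_";
    }
    StringJoiner out = new StringJoiner("|");
    for (Map.Entry<String, String> e : sorted.entrySet()) {
      out.add(e.getKey() + "=" + e.getValue());
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // prints Case=Nom|Gender=Fem|Number=Sing
    System.out.println(sortFeatures("Number=Sing|Case=Nom|Gender=Fem"));
  }
}
```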
Other Bugfixes
- Crash when deleting the endpoints of an `IntervalTree` #1405 6d17c23
- Find and remove extraneous uses of `yield`, which became a keyword: e5c9d44 b084233
Minor API change
- Updating the text on a CoreLabel no longer wipes out the Lemma c03522b
- Update to more recent Jakarta Servlet 8a671fd
Ssurgeon
v4.5.5: further Ssurgeon upgrades, SceneGraph server module, security bugfix
Ssurgeon updates beyond the capabilities listed in the GURT paper
- MergeNodes operation: combine two words into one word in a graph. One word must be a leaf headed by the other for this to work 0660fa9
- CombineMWT operation: mark MWT on two or more words. Stanza will treat these as `Token` 010a955
- DeleteLeaf operation: remove a leaf, renumber the subsequent words 429f61a
Bugfixes
- fix graph serialization for sentences longer than 128 words (`IdentityHashSet` doesn't work for integers beyond 128) d8d9d9f
- fix `valueOf` for `SemanticGraph` if a word is just a dash 203eb06
- fix memory usage of evaluating a PCFG model, which would run out of memory because it was saving all of the charts while evaluating b2e67b0
- Tregex pattern would not correctly display when using optional patterns: a9965b2 8659653
- Tregex would infinite loop on certain optional patterns which were theoretically legal cc7983e
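
The 128-word limit in the serialization bug above is consistent with how Java boxes integers: `Integer.valueOf` is only guaranteed to return the same object for values from -128 to 127, so an identity-based collection can treat two equal larger indices as different elements. The sketch below shows this with the JDK's `IdentityHashMap` rather than CoreNLP's `IdentityHashSet`. The `BoxedIdentityDemo` class is hypothetical, and the link to the actual bug is an inference from the release note, not something checked against commit d8d9d9f.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

public class BoxedIdentityDemo {
  /** Boxes the same int value twice, adds both boxes, and returns the resulting set size. */
  public static int sizeAfterAddingTwice(Set<Integer> set, int value) {
    set.add(Integer.valueOf(value));
    set.add(Integer.valueOf(value));
    return set.size();
  }

  /** A set that compares elements with == instead of equals(). */
  public static Set<Integer> identitySet() {
    return Collections.newSetFromMap(new IdentityHashMap<Integer, Boolean>());
  }

  public static void main(String[] args) {
    // Small values come from the Integer cache, so both boxes are the same object: size 1.
    System.out.println(sizeAfterAddingTwice(identitySet(), 100));
    // Outside the guaranteed cache range the boxes are usually distinct objects,
    // so with default JVM settings this is typically 2.
    System.out.println(sizeAfterAddingTwice(identitySet(), 1000));
    // An equals()-based set is unaffected: size 1.
    System.out.println(sizeAfterAddingTwice(new HashSet<Integer>(), 1000));
  }
}
```

The general lesson is that identity-based collections are only safe for boxed numbers when the same object instance is reused throughout; otherwise an `equals()`-based collection is the safer choice.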
Security fixes
- update xom to 1.3.9, which should avoid unwanted, potentially vulnerable transitive dependencies c8772b7
- remove bz2 zip & unzip, which used a shell command and therefore could be hijacked https://nvd.nist.gov/vuln/detail/CVE-2023-39020
English dependency converter fixes
- addressing issue #1363
- fix `(QP up to ...)` 8c46648 9a86ece
- fix `up to 1700 kilograms` if misparsed in a predictable manner 6e14527
- better `LST` coverage 5745de5
- `vmod/acl` when the parser misinterprets `NP` vs `NML` ad4556d
- treat lists of `NML` as repeated modifiers of a noun, instead of a list, as that is the likely meaning of `NML`. Example: `a 72-game, three-month season` from PTB 61ef545 5e748dc
Server features
- Scenegraph endpoint 8b40947 #1346
- remove one json library to reduce number of json libraries we depend on 357b1bb
Small changes
- allow `fourty` as a number in SUTime 7fbb7b8
- capture `forty (40) days` as a duration in SUTime b3c47a0
- feature to print out the feature index of an NER model as a text file f636673
- clarify the INTJ rule for the ChineseHeadFinder 56cd6bb
- consider `{` `}` as punctuation when scoring English constituency treebanks a606afa
- fix error in test case, from @tanloong #1373 #1372
- dead code cleanup 86b6a03
v4.5.4: Minor Ssurgeon updates
- Minor Ssurgeon bugfixes (make it harder to infinite loop with EditNode or RelabelNamedEdge)
- Add a ReattachNamedEdge which is a combination of RemoveNamedEdge and AddEdge with new endpoints
- include the Morphology CLI for using the CoreNLP lemmatizer from elsewhere, such as Python
v4.5.3: Ssurgeon interface, Collinizer fixes
Mostly changes to Semgrex, along with adding Ssurgeon to the download package for general consumption. This involved quite a few changes to classes such as `AnnotationLookup`. The released version should now match the Semgrex/Ssurgeon paper published at GURT 2023.
Ssurgeon / Semgrex
- Update Semgrex and Ssurgeon to match the paper published at GURT: https://aclanthology.org/2023.tlt-1.7/
Bugfixes
- Fix "Could not match" errors which occurred when scoring treebanks using a tagger that produces non-gold punct tags: #1344
- Fix typo in KBP children rules: dbdb55b
Minor features
v4.5.2: package dependencies, CLI additions
v4.5.1: Bugfixes
CoreNLP 4.5.1
Bugfixes!
- Fix tokenizer regression: 4.5.0 will tokenize ",5" as one word 974383a
- Use a `LinkedHashMap` in the PTBTokenizer instead of `Properties`. Keeps the option processing order predictable. #1289 6550188
- Fix `\r\n` not being properly processed on Windows: #1291 9889f4e
- Handle one half of surrogate character pairs in the tokenizer w/o crashing #1298 1b12faa
- Attempt to fix semgrex "Unknown vertex" errors which have plagued CoreNLP for years in hard to track down circumstances: #1296 #1229 #1169 f99b5ab
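
To illustrate the `LinkedHashMap` change listed above: `java.util.Properties` is backed by a hash table, so iterating over it does not follow the order in which options were set, whereas `LinkedHashMap` iterates in insertion order. The sketch below parses a comma-separated option string into a `LinkedHashMap`. The `OptionOrderDemo` class and the option names are made up for illustration and are not the PTBTokenizer's actual parsing code or option list.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OptionOrderDemo {
  /** Parses "a=1,b,c=false" into a map that iterates in the order the options were written. */
  public static Map<String, String> parseOptions(String options) {
    Map<String, String> parsed = new LinkedHashMap<>();
    for (String option : options.split(",")) {
      option = option.trim();
      if (option.isEmpty()) {
        continue;
      }
      int eq = option.indexOf('=');
      if (eq < 0) {
        parsed.put(option, "true");  // a bare flag is treated as enabled (an assumption)
      } else {
        parsed.put(option.substring(0, eq), option.substring(eq + 1));
      }
    }
    return parsed;
  }

  public static void main(String[] args) {
    // Hypothetical option names; the keys print in the order they were written.
    for (Map.Entry<String, String> e : parseOptions("zeta=1,alpha,mid=false").entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }
}
```

Predictable ordering matters when one option can override or interact with another, since the result then depends on which is processed first.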