Commit e84741a

Remove NOCOREF entities e.g. from AnCora.
1 parent 6c289d3 commit e84741a

2 files changed

Lines changed: 22 additions & 1 deletion

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+from udapi.core.block import Block
+import udapi.core.coref
+import re
+import logging
+
+class RemoveNoCorefEntities(Block):
+    """
+    Some corpora (e.g., AnCora) include annotation of named entities that are
+    not annotated for coreference. To distinguish them, their cluster ID starts
+    with 'NOCOREF' (optionally followed by entity type, so that one cluster
+    still has just one type). We may want to remove such entities from datasets
+    that are used to train coreference resolvers, to prevent the resolvers from
+    thinking that all members of a NOCOREF cluster are coreferential. That is
+    what this block does.
+    """
+
+    def process_document(self, doc):
+        entities = doc.coref_entities
+        if not entities:
+            return
+        doc.coref_entities = [e for e in entities if not re.match(r'^NOCOREF', e.eid)]
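The filter in `process_document` can be tried in isolation. The sketch below is not part of the commit and does not use udapi: `ToyEntity` and `drop_nocoref` are hypothetical names, and the only assumption carried over from the diff is that each entity exposes its ID as `eid`. It shows that only IDs beginning with `NOCOREF` are dropped, whether or not an entity type follows the prefix.

```python
import re
from dataclasses import dataclass


@dataclass
class ToyEntity:
    """Hypothetical stand-in for a coreference entity; only `eid` matters here."""
    eid: str


def drop_nocoref(entities):
    """Return the entities whose ID does not start with 'NOCOREF'."""
    # Same pattern as in the commit. re.match already anchors at the start
    # of the string, so the leading '^' is harmless but redundant.
    return [e for e in entities if not re.match(r'^NOCOREF', e.eid)]


entities = [
    ToyEntity("e1"),
    ToyEntity("NOCOREFperson"),  # prefix followed by an entity type
    ToyEntity("NOCOREF"),        # bare prefix
    ToyEntity("e2NOCOREF"),      # 'NOCOREF' not at the start, so it is kept
]
kept = drop_nocoref(entities)
print([e.eid for e in kept])  # ['e1', 'e2NOCOREF']
```

Since the pattern is a plain prefix, `e.eid.startswith('NOCOREF')` would select the same entities without the `re` import.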

udapi/core/coref.py

Lines changed: 1 addition & 1 deletion
@@ -300,7 +300,7 @@ def __init__(self, eid, etype=None):
         self.split_ante = []

     def __lt__(self, another):
-        """Does this CorefEntity precedes (word-order wise) `another` entity?
+        """Does this CorefEntity precede (word-order wise) `another` entity?

         This method defines a total ordering of all entities
         by the first mention of each entity (see `CorefMention.__lt__`).
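To illustrate the ordering that docstring describes, here is a minimal sketch, not the udapi implementation. `ToyEntity` and `mention_positions` are hypothetical, mentions are reduced to `(sentence_index, word_index)` tuples, and every entity is assumed to have at least one mention with a distinct first position.

```python
class ToyEntity:
    """Hypothetical entity ordered by the position of its first mention."""

    def __init__(self, eid, mention_positions):
        self.eid = eid
        # Keep mentions in word order so that index 0 is the first mention.
        self.mention_positions = sorted(mention_positions)

    def __lt__(self, another):
        # Tuples compare lexicographically: sentence index first, then word index.
        return self.mention_positions[0] < another.mention_positions[0]


entities = [
    ToyEntity("e2", [(1, 4), (0, 7)]),  # first mention at (0, 7)
    ToyEntity("e1", [(0, 2)]),
    ToyEntity("e3", [(2, 1)]),
]
ordered = sorted(entities)  # sorted() only needs __lt__
print([e.eid for e in ordered])  # ['e1', 'e2', 'e3']
```

Defining `__lt__` alone is enough for `sorted()` and `list.sort()`; the other comparison operators would need `functools.total_ordering` or explicit methods.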