udapi-python/udapi/core/coref.py at 0.5.2 · udapi/udapi-python

"""Classes for handling coreference.

# CorefUD 1.0 format implementation details

## Rules for ordering "chunks" within `node.misc['Entity']`

Entity mentions are annotated using "chunks" stored in `misc['Entity']`.
Chunks are of three types:

1. opening bracket, e.g. `(e1-person`
2. closing bracket, e.g. `e1)`
3. single-word span (both opening and closing), e.g. `(e1-person)`

The `Entity` MISC attribute contains a sequence of chunks
without any separators, e.g. `Entity=(e1-person(e2-place)`
means an opening `e1` mention and a single-word `e2` mention
starting on a given node.
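
As a minimal sketch (not the loader itself), splitting an `Entity` value into chunks
and classifying each chunk could look as follows.
The names `CHUNK_RE` and `split_chunks` are hypothetical and exist only for this illustration;
the pattern mirrors the one used in `load_coref_from_misc`,
written with character classes instead of backslash escapes.

```python
import re

# Hypothetical helper names, for illustration only.
# An opening or single-word chunk starts with "(", a closing chunk ends with ")".
CHUNK_RE = re.compile("([(][^()]+[)]?|[^()]+[)])")

def split_chunks(misc_entity):
    # Return a list of (chunk, kind) pairs,
    # where kind is "opening", "closing", "single" or "invalid".
    result = []
    for chunk in CHUNK_RE.split(misc_entity):
        if not chunk:
            continue
        opening, closing = chunk[0] == "(", chunk[-1] == ")"
        if opening and closing:
            kind = "single"
        elif opening:
            kind = "opening"
        elif closing:
            kind = "closing"
        else:
            kind = "invalid"
        result.append((chunk, kind))
    return result

print(split_chunks("(e1-person(e2-place)"))
```

For the example value above, the first chunk is classified as opening and the second as single-word.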

### Crossing mentions

Two mentions are crossing iff their spans have non-empty intersection,
but neither is a subset of the other, e.g. `e1` spanning nodes 1-3
and `e2` spanning 2-4 would be represented as:

```
1 ... Entity=(e1
2 ... Entity=(e2
3 ... Entity=e1)
4 ... Entity=e2)
```

This may be an annotation error and we may forbid such cases in future annotation guidelines,
but in CorefUD 0.2, there are thousands of such cases (see https://github.com/ufal/corefUD/issues/23).

It can even happen that one entity ends and another starts at the same node: `Entity=e1)(e2`
For this reason, we need

**Rule1**: closing brackets MUST always precede opening brackets.
Otherwise, we would get `Entity=(e2e1)`, which could not be parsed.

Note that we cannot have same-entity crossing mentions in the CorefUD 1.0 format,
so e.g. if we substitute `e2` with `e1` in the example above, we'll get
`(e1`, `e1)`, `(e1`, `e1)`, which will be interpreted as two non-overlapping mentions of the same entity.
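
The definition of crossing and the effect of Rule1 can be sketched as follows.
This is only an illustration under simplifying assumptions: spans are modelled as sets of word positions,
and the helper names `is_crossing` and `entity_value` are hypothetical, not part of this module.

```python
def is_crossing(span_a, span_b):
    # Crossing: the spans intersect, but neither is a subset of the other.
    common = span_a & span_b
    return bool(common) and not span_a <= span_b and not span_b <= span_a

def entity_value(closing_eids, opening_eids):
    # Rule1: all closing brackets are written before all opening brackets.
    closings = "".join(f"{eid})" for eid in closing_eids)
    openings = "".join(f"({eid}" for eid in opening_eids)
    return closings + openings

e1 = {1, 2, 3}
e2 = {2, 3, 4}
print(is_crossing(e1, e2))           # True
print(is_crossing({1, 2, 3}, {2}))   # False (nested, not crossing)
print(entity_value(["e1"], ["e2"]))  # e1)(e2
```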

### Nested mentions

One mention (span) can often be embedded within another mention (span).
It can happen that both these mentions correspond to the same entity (i.e. are in the same cluster),
for example, "`<the man <who> sold the world>`".
It can even happen that both mentions start at the same node, e.g. "`<<w1 w2> w3>`" (TODO: find nice real-world examples).
In such cases, we need to make sure the brackets are well-nested:

**Rule2**: when opening multiple brackets at the same node, longer mentions MUST be opened first.

This is important because

- The closing bracket has the same form for both mentions of the same entity - it includes just the entity ID (`eid`).
- The opening-bracket annotation contains other mention attributes, e.g. head index.
- The two mentions may differ in these attributes, e.g. the "`<w1 w2 w3>`" mention's head may be w3.
- When breaking Rule2, we would get

```
1 w1 ... Entity=(e1-person-1(e1-person-3
2 w2 ... Entity=e1)
3 w3 ... Entity=e1)
```

which would be interpreted as if the head of the "`<w1 w2>`" mention is its third word, which is invalid.
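
A minimal sketch of Rule2, assuming each mention is simplified to a tuple
`(eid, etype, head_index, word_positions)`; the helper `opening_chunks` is hypothetical
and only illustrates the ordering, not the actual serialization code.

```python
# Two same-entity mentions starting at w1: <w1 w2> with head w1, <w1 w2 w3> with head w3.
mentions = [
    ("e1", "person", 1, [1, 2]),
    ("e1", "person", 3, [1, 2, 3]),
]

def opening_chunks(mentions_at_node):
    # Rule2: longer mentions (more words) are opened first.
    ordered = sorted(mentions_at_node, key=lambda m: len(m[3]), reverse=True)
    return "".join(f"({eid}-{etype}-{head}" for eid, etype, head, _ in ordered)

print(opening_chunks(mentions))  # (e1-person-3(e1-person-1
```

With this ordering, the first `e1)` closes the inner "`<w1 w2>`" mention, whose head index 1 is valid.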

### Other rules

**Rule3**: when closing multiple brackets at the same node, shorter mentions SHOULD be closed first.
See Rule4 for a single exception to this rule regarding crossing mentions.
I'm not aware of any problems when breaking this rule, but it seems intuitive
(to make the annotation well-nested if possible) and we want to define some canonical ordering anyway.
The API should be able to load even files breaking Rule3.

**Rule4**: single-word chunks SHOULD follow all opening brackets and precede all closing brackets if possible.
When considering single-word chunks as a subtype of both opening and closing brackets,
this rule follows from the well-nestedness (and Rule2).
So we should have `Entity=(e1(e2)` and `Entity=(e3)e1)`,
but the API should be able to load even `Entity=(e2)(e1` and `Entity=e1)(e3)`.

In case of crossing mentions (annotated following Rule1), we cannot follow Rule4.
If we want to add a single-word mention `e2` to a node with `Entity=e1)(e3`,
it seems intuitive to prefer Rule2 over Rule3, which results in `Entity=e1)(e3(e2)`.
So the canonical ordering will be achieved by placing single-word chunks after all opening brackets.
The API should be able to load even `Entity=(e2)e1)(e3` and `Entity=e1)(e2)(e3`.

**Rule5**: ordering of same-span single-word mentions
TODO: I am not sure here. We may want to forbid such cases or define canonical ordering even for them.
E.g. `Entity=(e1)(e2)` vs. `Entity=(e2)(e1)`.

**Rule6**: ordering of same-start same-end multiword mentions
TODO: I am not sure here.
These can be either same-span multiword mentions (which may be forbidden)
or something like

```
1 w1 ... Entity=(e1(e2[1/2])
2 w2 ...
3 w3 ... Entity=(e2[2/2])e1)
```

where both `e1` and `e2` start at w1 and end at w3, but `e2` is discontinuous and does not contain w2.
If we interpret "shorter" and "longer" in Rule2 and Rule3 as `len(mention.words)`
(and not as `mention.words[-1].ord - mention.words[0].ord`),
we get the canonical ordering as in the example above.
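
The two interpretations of "length" can be compared with a small sketch.
Words are modelled as plain integer positions here (an assumption for this illustration),
and both helper functions are hypothetical.

```python
e1_words = [1, 2, 3]  # continuous mention w1 w2 w3
e2_words = [1, 3]     # discontinuous mention: w1 and w3, without w2

def length_by_words(words):
    # Interpretation used for the canonical ordering: number of words.
    return len(words)

def length_by_extent(words):
    # Alternative interpretation: distance between the first and the last word.
    return words[-1] - words[0]

print(length_by_words(e1_words), length_by_words(e2_words))    # 3 2
print(length_by_extent(e1_words), length_by_extent(e2_words))  # 2 2
```

Only the first interpretation distinguishes the two mentions,
so `e1` is treated as the longer one: it is opened first and closed last.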

"""

import re
import functools
import collections
import collections.abc
import copy
import logging
import bisect

@functools.total_ordering
class CorefMention(object):
    """Class for representing a mention (instance of an entity)."""
    __slots__ = ['_head', '_entity', '_bridging', '_words', '_other']

    def __init__(self, words, head=None, entity=None, add_word_backlinks=True):
        if not words:
            raise ValueError("mention.words must be non-empty")
        self._head = head if head else words[0]
        self._entity = entity
        if entity is not None:
            entity._mentions.append(self)
        self._bridging = None
        self._other = None
        self._words = words
        if add_word_backlinks:
            for new_word in words:
                if not new_word._mentions or not entity or self > new_word._mentions[-1]:
                    new_word._mentions.append(self)
                else:
                    new_word._mentions.append(self)
                    new_word._mentions.sort()

    def _subspans(self):
        mspan = self.span
        if ',' not in mspan:
            return [CorefMentionSubspan(self._words, self, '')]
        root = self._words[0].root
        subspans = mspan.split(',')
        result = []
        for idx,subspan in enumerate(subspans, 1):
            result.append(CorefMentionSubspan(span_to_nodes(root, subspan), self, f'[{idx}/{len(subspans)}]'))
        return result

    def __lt__(self, another):
        """Does this mention precede (word-order wise) `another` mention?

        This method defines a total ordering of all mentions
        (within one entity or across different entities).
        The position is primarily defined by the first word in each mention.
        If two mentions start at the same word,
        their order is defined by their length (i.e. number of words)
        -- the shorter mention follows the longer one.

        In the rare case of two same-length mentions starting at the same word, but having different spans,
        their order is defined by the order of the last word in their span.
        For example <w1, w2> precedes <w1, w3>.

        The order of two same-span mentions is currently defined by their eid.
        There should be no same-span (or same-subspan) same-entity mentions.
        """
        #TODO: no mention.words should be handled already when loading
        if not self._words:
            self._words = [self._head]
        if not another._words:
            another._words = [another._head]

        if self._words[0] is another._words[0]:
            if len(self._words) > len(another._words):
                return True
            if len(self._words) < len(another._words):
                return False
            if self._words[-1].precedes(another._words[-1]):
                return True
            if another._words[-1].precedes(self._words[-1]):
                return False
            return self._entity.eid < another._entity.eid
        return self._words[0].precedes(another._words[0])

    @property
    def other(self):
        if self._other is None:
            self._other = OtherDualDict()
        return self._other

    @other.setter
    def other(self, value):
        if self._other is None:
            self._other = OtherDualDict(value)
        else:
            self._other.set_mapping(value)

    @property
    def head(self):
        return self._head

    @head.setter
    def head(self, new_head):
        if self._words and new_head not in self._words:
            raise ValueError(f"New head {new_head} not in mention words")
        self._head = new_head

    @property
    def entity(self):
        return self._entity

    @entity.setter
    def entity(self, new_entity):
        if self._entity is not None:
            original_entity = self._entity
            original_entity._mentions.remove(self)
            if not original_entity._mentions:
                logging.warning(f"Original entity {original_entity.eid} is now empty.")
        self._entity = new_entity
        bisect.insort(new_entity._mentions, self)

    @property
    def bridging(self):
        if not self._bridging:
            self._bridging = BridgingLinks(self)
        return self._bridging

    # TODO add/edit bridging

    @property
    def words(self):
        # Words in a sentence could have been reordered, so we cannot rely on sorting self._words in the setter.
        # The serialization relies on storing the opening bracket in the first word (and closing in the last),
        # so we need to make sure the words are always returned sorted.
        # TODO: benchmark updating the order of mention._words in node.shift_*() and node.remove().
        self._words.sort()
        return self._words

    @words.setter
    def words(self, new_words):
        if new_words and self.head not in new_words:
            raise ValueError(f"Head {self.head} not in new_words {new_words} for {self._entity.eid}")
        kept_words = []
        # Make sure each word is included just once and they are in the correct order.
        new_words = sorted(list(set(new_words)))
        for old_word in self._words:
            if old_word in new_words:
                kept_words.append(old_word)
            else:
                old_word._mentions.remove(self)
        self._words = new_words
        for new_word in new_words:
            if new_word not in kept_words:
                if not new_word._mentions or self > new_word._mentions[-1]:
                    new_word._mentions.append(self)
                else:
                    new_word._mentions.append(self)
                    new_word._mentions.sort()

    @property
    def span(self):
        return nodes_to_span(self._words)

    @span.setter
    def span(self, new_span):
        self.words = span_to_nodes(self._head.root, new_span)

    def __str__(self):
        """String representation of the CorefMention object: Mention<m.entity.eid: m.head>."""
        return f"Mention<{self._entity._eid}: {self._head}>"

    def remove(self):
        for word in self._words:
            word._mentions.remove(self)
        self._entity._mentions.remove(self)

@functools.total_ordering
class CorefMentionSubspan(object):
    """Helper class for representing a continuous subspan of a mention."""
    __slots__ = ['words', 'mention', 'subspan_id']

    def __init__(self, words, mention, subspan_id):
        if not words:
            raise ValueError("mention.words must be non-empty")
        self.words = sorted(words)
        self.mention = mention
        self.subspan_id = subspan_id

    def __lt__(self, another):
        if self.words[0] is another.words[0]:
            if len(self.words) > len(another.words):
                return True
            if len(self.words) < len(another.words):
                return False
            return self.mention < another.mention
        return self.words[0].precedes(another.words[0])

    @property
    def subspan_eid(self):
        return self.mention._entity.eid + self.subspan_id

CHARS_FORBIDDEN_IN_ID = "-=| \t()"

@functools.total_ordering
class CorefEntity(object):
    """Class for representing all mentions of a given entity."""
    __slots__ = ['_eid', '_mentions', 'etype', 'split_ante']

    def __init__(self, eid, etype=None):
        self._eid = None  # prepare the _eid slot
        self.eid = eid  # call the setter and check the ID is valid
        self._mentions = []
        self.etype = etype
        self.split_ante = []

    def __lt__(self, another):
        """Does this CorefEntity precede (word-order wise) `another` entity?

        This method defines a total ordering of all entities
        by the first mention of each entity (see `CorefMention.__lt__`).
        If one of the entities has no mentions (which should not happen normally),
        there is a backup solution (see the source code).
        If entity IDs are not important, it is recommended to use block
        `corefud.IndexClusters` to re-name entity IDs in accordance with this entity ordering.
        """
        if not self._mentions or not another._mentions:
            # Entities without mentions should go first, so the ordering is total.
            # If both entities are missing mentions, let's use eid, so the ordering is stable.
            if not self._mentions and not another._mentions:
                return self._eid < another._eid
            return not self._mentions
        return self._mentions[0] < another._mentions[0]

    @property
    def eid(self):
        return self._eid

    @eid.setter
    def eid(self, new_eid):
        if any(x in new_eid for x in CHARS_FORBIDDEN_IN_ID):
            raise ValueError(f"{new_eid} contains forbidden characters [{CHARS_FORBIDDEN_IN_ID}]")
        self._eid = new_eid

    @property
    def eid_or_grp(self):
        root = self._mentions[0].head.root
        meta = root.document.meta
        if 'GRP' in meta['global.Entity'] and meta['_tree2docid']:
            docid = meta['_tree2docid'][root]
            if self._eid.startswith(docid):
                return self._eid.replace(docid, '', 1)
            else:
                logging.warning(f"GRP in global.Entity, but eid={self._eid} does not start with docid={docid}")
        return self._eid

    @property
    def mentions(self):
        return self._mentions

    def create_mention(self, head=None, words=None, span=None):
        """Create a new CorefMention object within this CorefEntity.

        Args:
        head: a node where the annotation about this CorefMention will be stored in MISC.
            The head is supposed to be the linguistic head of the mention,
            i.e. the highest node in the dependency tree,
            but if such information is not available (yet),
            it can be any node within the `words`.
            If no head is specified, the first word from `words` will be used instead.
        words: a list of nodes of the mention.
            This argument is optional, but if provided, it must contain the head.
            The nodes can be both normal nodes or empty nodes.
        span: an alternative way to specify `words`
            using a string such as "3-5,6,7.1-7.2".
            (which means, there is an empty node 5.1 and normal node 7,
            which are not part of the mention).
            At most one of the args `words` and `span` can be specified.
        """
        if words and span:
            raise ValueError("Cannot specify both words and span")
        if head and words and head not in words:
            raise ValueError(f"Head {head} is not among the specified words")
        if head is None and words is None:
            raise ValueError("Either head or words must be specified")
        if head is None:
            head = words[0]

        mention = CorefMention(words=[head], head=head, entity=self)
        if words:
            mention.words = words
        if span:
            mention.span = span
        self._mentions.sort()
        return mention

    # TODO or should we create a BridgingLinks instance with a fake src_mention?
    def all_bridging(self):
        for m in self._mentions:
            if m._bridging:
                for b in m._bridging:
                    yield b

    def __str__(self):
        """String representation of the CorefEntity object: Entity<e.eid: m.head>."""
        first_mention_head = self._mentions[0].head.form if self._mentions else ""
        return f"Entity<{self._eid}: {first_mention_head}>"

# BridgingLink
# Especially the relation should be mutable, so we cannot use
#   BridgingLink = collections.namedtuple('BridgingLink', 'target relation')
# TODO once dropping support for Python 3.6, we could use
# from dataclasses import dataclass
# @dataclass
# class DataClassCard:
#    target: CorefEntity
#    relation: str
class BridgingLink:
    __slots__ = ['target', 'relation']

    def __init__(self, target, relation=''):
        self.target = target
        self.relation = '' if relation is None else relation

    def __lt__(self, another):
        if self.target == another.target:
            return self.relation < another.relation
        return self.target < another.target

class BridgingLinks(collections.abc.MutableSequence):
    """BridgingLinks class serves as a list of BridgingLink tuples with additional methods.

    Example usage:
    >>> bl = BridgingLinks(src_mention)                                   # empty links
    >>> bl = BridgingLinks(src_mention, [(c12, 'part'), (c56, 'subset')]) # from a list of tuples
    >>> (bl8, bl9) = BridgingLinks.from_string('c12<c8:part,c56<c8:subset,c5<c9', entities, node)
    >>> for entity, relation in bl:
    >>>     print(f"{bl.src_mention} ->{relation}-> {entity.eid}")
    >>> print(str(bl)) # c12<c8:part,c56<c8:subset
    >>> bl('part').targets == [c12]
    >>> bl('part|subset').targets == [c12, c56]
    >>> bl.append((c57, 'funct'))
    """

    @classmethod
    def from_string(cls, string, entities, node, strict=True, tree2docid=None):
        """Return a sequence of BridgingLink objects representing a given string serialization.
        The bridging links are also added to the mentions (`mention.bridging`) in the supplied `entities`,
        so the returned sequence can be usually ignored.
        If `tree2docid` parameter is provided (mapping trees to document IDs used as prefixes in eid),
        the entity IDs in the provided string are interpreted as "GRP", i.e. as document-wide IDs,
        which need to be prefixed by the document IDs, to get corpus-wide unique "eid".
        """
        src_str2bl = {}
        for link_str in string.split(','):
            try:
                trg_str, src_str = link_str.split('<')
            except ValueError as err:
                _error(f"invalid Bridge {link_str} {err} at {node}", strict)
                continue
            relation = ''
            if ':' in src_str:
                src_str, relation = src_str.split(':', 1)
            if trg_str == src_str:
                _error(f"Bridge cannot self-reference the same entity {trg_str} at {node}", strict)
            if tree2docid:
                src_str = tree2docid[node.root] + src_str
                trg_str = tree2docid[node.root] + trg_str
            bl = src_str2bl.get(src_str)
            if not bl:
                bl = entities[src_str].mentions[-1].bridging
                src_str2bl[src_str] = bl
            if trg_str not in entities:
                entities[trg_str] = CorefEntity(trg_str)
            bl._data.append(BridgingLink(entities[trg_str], relation))
        return src_str2bl.values()

    def __init__(self, src_mention, value=None, strict=True):
        self.src_mention = src_mention
        self._data = []
        self.strict = strict
        if value is not None:
            if isinstance(value, collections.abc.Sequence):
                for v in value:
                    if v[0] is src_mention._entity:
                        _error("Bridging cannot self-reference the same entity: " + v[0].eid, strict)
                    self._data.append(BridgingLink(v[0], v[1]))
            else:
                raise ValueError(f"Unknown value type: {type(value)}")
        self.src_mention._bridging = self
        super().__init__()

    def __getitem__(self, key):
        return self._data[key]

    def __len__(self):
        return len(self._data)

    # TODO delete backlinks of old links, dtto for SplitAnte
    def __setitem__(self, key, new_value):
        if new_value[0] is self.src_mention._entity:
            _error("Bridging cannot self-reference the same entity: " + new_value[0].eid, self.strict)
        self._data[key] = BridgingLink(new_value[0], new_value[1])

    def __delitem__(self, key):
        del self._data[key]

    def insert(self, key, new_value):
        if new_value[0] is self.src_mention._entity:
            _error("Bridging cannot self-reference the same entity: " + new_value[0].eid, self.strict)
        self._data.insert(key, BridgingLink(new_value[0], new_value[1]))

    def __str__(self):
        # TODO in future link.relation should never be None, 0 nor "_", so we could delete the <not in (None, "_", "")> below.
        return ','.join(f'{l.target.eid_or_grp}<{self.src_mention.entity.eid_or_grp}{":" + l.relation if l.relation not in (None, "_", "") else ""}' for l in sorted(self._data))

    def __call__(self, relations_re=None):
        """Return a subset of links contained in this list as specified by the args.
        Args:
        relations_re: only links with a relation matching this regular expression will be returned
        """
        if relations_re is None:
            return self
        return BridgingLinks(self.src_mention, [l for l in self._data if re.match(relations_re, l.relation)])

    @property
    def targets(self):
        """Return a list of the target entities (without relations)."""
        return [link.target for link in self._data]

    def _delete_targets_without_mentions(self, warn=True):
        # Iterate over a copy, because links are removed from self._data inside the loop.
        for link in list(self._data):
            if not link.target.mentions:
                if warn:
                    logging.warning(f"Entity {link.target.eid} has no mentions, but is referred to in bridging of {self.src_mention.entity.eid}")
                self._data.remove(link)

def _error(msg, strict):
    if strict:
        raise ValueError(msg)
    logging.error(msg)

RE_DISCONTINUOUS = re.compile(r'^([^[]+)\[(\d+)/(\d+)\]')

# When converting doc-level GRP IDs to corpus-level eid IDs,
# we need to assign each document a short ID/number (document names are too long).
# These document numbers must be unique even when loading multiple files,
# so we need to store the highest number generated so far here, at the Python module level.
highest_doc_n = 0

def load_coref_from_misc(doc, strict=True):
    global highest_doc_n
    entities = {}
    unfinished_mentions = collections.defaultdict(list)
    discontinuous_mentions = collections.defaultdict(list)
    global_entity = doc.meta.get('global.Entity')
    was_global_entity = True
    if not global_entity:
        was_global_entity = False
        global_entity = 'eid-etype-head-other'
        doc.meta['global.Entity'] = global_entity
    tree2docid = None
    if 'GRP' in global_entity:
        tree2docid, docid = {}, ""
        for bundle in doc:
            for tree in bundle:
                if tree.newdoc or docid == "":
                    highest_doc_n += 1
                    docid = f"d{highest_doc_n}."
                tree2docid[tree] = docid
        doc.meta['_tree2docid'] = tree2docid
    elif 'eid' not in global_entity:
        raise ValueError("No eid in global.Entity = " + global_entity)
    fields = global_entity.split('-')

    for node in doc.nodes_and_empty:
        misc_entity = node.misc["Entity"]
        if not misc_entity:
            continue

        if not was_global_entity:
            raise ValueError("No global.Entity header found, but Entity= annotations are present")

        # The Entity attribute may contain multiple entities, e.g.
        # Entity=(abstract-7-new-2-coref(abstract-3-giv:act-1-coref)
        # means a start of entity id=7 and start&end (i.e. single-word mention) of entity id=3.
        # The following re.split line splits this into
        # chunks = ["(abstract-7-new-2-coref", "(abstract-3-giv:act-1-coref)"]
        chunks = [x for x in re.split(r'(\([^()]+\)?|[^()]+\))', misc_entity) if x]
        for chunk in chunks:
            opening, closing = (chunk[0] == '(', chunk[-1] == ')')
            chunk = chunk.strip('()')
            # 1. invalid
            if not opening and not closing:
                logging.warning(f"Entity {chunk} at {node} has no opening nor closing bracket.")
            # 2. closing bracket
            elif not opening and closing:
                # closing brackets should include just the ID, but GRP needs to be converted to eid
                if tree2docid:
                    # TODO delete this legacy hack once we don't need to load UD GUM v2.8 anymore
                    if '-' in chunk:
                        if not strict and global_entity.startswith('entity-GRP'):
                            chunk = chunk.split('-')[1]
                        else:
                            _error("Unexpected closing eid " + chunk, strict)
                    chunk = tree2docid[node.root] + chunk

                # closing discontinuous mentions
                eid, subspan_idx = chunk, None
                if chunk not in unfinished_mentions:
                    m = RE_DISCONTINUOUS.match(chunk)
                    if not m:
                        raise ValueError(f"Mention {chunk} closed at {node}, but not opened.")
                    eid, subspan_idx, total_subspans = m.group(1, 2, 3)

                try:
                    mention, head_idx = unfinished_mentions[eid].pop()
                except IndexError as err:
                    raise ValueError(f"Mention {chunk} closed at {node}, but not opened.")
                last_word = mention.words[-1]
                if node.root is not last_word.root:
                    # TODO cross-sentence mentions
                    if strict:
                        raise ValueError(f"Cross-sentence mentions not supported yet: {chunk} at {node}")
                    else:
                        logging.warning(f"Cross-sentence mentions not supported yet: {chunk} at {node}. Deleting.")
                        entity = mention.entity
                        mention.words = []
                        entity._mentions.remove(mention)
                        if not entity._mentions:
                            del entities[entity.eid]
                for w in node.root.descendants_and_empty:
                    if last_word.precedes(w):
                        mention._words.append(w)
                        w._mentions.append(mention)
                        if w is node:
                            break
                if head_idx and (subspan_idx is None or subspan_idx == total_subspans):
                    try:
                        mention.head = mention.words[head_idx - 1]
                    except IndexError as err:
                        _error(f"Invalid head_idx={head_idx} for {mention.entity.eid} "
                               f"closed at {node} with words={mention.words}", strict)
                        if not strict and head_idx > len(mention.words):
                            mention.head = mention.words[-1]
                if subspan_idx and subspan_idx == total_subspans:
                    m = discontinuous_mentions[eid].pop()
                    if m is not mention:
                        _error(f"Closing mention {mention.entity.eid} at {node}, but it has unfinished nested mentions ({m.words})", 1)

            # 3. opening or single-word
            else:
                eid, etype, head_idx, other = None, None, None, OtherDualDict()
                for name, value in zip(fields, chunk.split('-')):
                    if name == 'eid':
                        eid = value
                    elif name == 'GRP':
                        eid = tree2docid[node.root] + value
                    elif name == 'etype' or name == 'entity': # entity is an old name for etype used in UD GUM 2.8 and 2.9
                        etype = value
                    elif name == 'head':
                        try:
                            head_idx = int(value)
                        except ValueError as err:
                            _error(f"Non-integer {value} as head index in {chunk} in {node}: {err}", strict)
                            head_idx = 1
                    elif name == 'other':
                        if other:
                            new_other = OtherDualDict(value)
                            # Copy the key-value pairs collected so far (items(), not values()).
                            for k,v in other.items():
                                new_other[k] = v
                            other = new_other
                        else:
                            other = OtherDualDict(value)
                    else:
                        other[name] = value
                if eid is None:
                    raise ValueError("No eid in " + chunk)
                subspan_idx, total_subspans = None, '0'
                if eid[-1] == ']':
                    m = RE_DISCONTINUOUS.match(eid)
                    if not m:
                        _error(f"eid={eid} ending with ], but not valid discontinuous mention ID ", strict)
                    else:
                        eid, subspan_idx, total_subspans = m.group(1, 2, 3)

                entity = entities.get(eid)
                if entity is None:
                    if subspan_idx and subspan_idx != '1':
                        _error(f'Non-first subspan of a discontinuous mention {eid} at {node} does not have any previous mention.', 1)
                    entity = CorefEntity(eid)
                    entities[eid] = entity
                    entity.etype = etype
                elif etype and entity.etype and entity.etype != etype:
                    logging.warning(f"etype mismatch in {node}: {entity.etype} != {etype}")
                    other["orig_etype"] = etype
                # CorefEntity could be created first with "Bridge=" without any type
                elif etype and entity.etype is None:
                    entity.etype = etype

                if subspan_idx and subspan_idx != '1':
                    opened = [pair[0] for pair in unfinished_mentions[eid]]
                    mention = next(m for m in discontinuous_mentions[eid] if m not in opened)
                    mention._words.append(node)
                    if closing and subspan_idx == total_subspans:
                        m = discontinuous_mentions[eid].pop()
                        if m is not mention:
                            _error(f"{node}: closing mention {mention.entity.eid} ({mention.words}), but it has an unfinished nested mention ({m.words})", 1)
                        try:
                            mention.head = mention._words[head_idx - 1]
                        except IndexError as err:
                            _error(f"Invalid head_idx={head_idx} for {mention.entity.eid} "
                                   f"closed at {node} with words={mention._words}", 1)
                else:
                    mention = CorefMention(words=[node], entity=entity, add_word_backlinks=False)
                    if other:
                        mention._other = other
                    if subspan_idx:
                        discontinuous_mentions[eid].append(mention)
                node._mentions.append(mention)

                if not closing:
                    unfinished_mentions[eid].append((mention, head_idx))

        # Bridge, e.g. Entity=(e12-event|Bridge=e12<e124,e12<e125
        # or with relations Bridge=e173<c188:subset,e174<e188:part
        misc_bridge = node.misc['Bridge']
        if misc_bridge:
            BridgingLinks.from_string(misc_bridge, entities, node, strict, tree2docid)

        # SplitAnte, e.g. Entity=(e11-person(e12-person)|SplitAnte=e3<e11,e4<e11,e6<e12,e7<e12
        # which means that both e11 and e12 have split antecedents (e11=e3+e4, e12=e6+e7).
        misc_split = node.misc['SplitAnte']
        if not misc_split and 'Split' in node.misc:
            misc_split = node.misc.pop('Split')
        if misc_split:
            ante_entities = []
            for x in misc_split.split(','):
                ante_str, this_str = x.split('<')
                if ante_str == this_str:
                    _error("SplitAnte cannot self-reference the same entity: " + this_str, strict)
                if tree2docid:
                    ante_str = tree2docid[node.root] + ante_str
                    this_str = tree2docid[node.root] + this_str
                # split cataphora, e.g. "We, that is you and me..."
                if ante_str not in entities:
                    entities[ante_str] = CorefEntity(ante_str)
                entities[this_str].split_ante.append(entities[ante_str])

for eid, mentions in unfinished_mentions.items():

for mention, head_idx in mentions:

logging.warning(f"Mention {eid} opened at {mention.head}, but not closed. Deleting.")

entity = mention.entity

mention.words = []

entity._mentions.remove(mention)

if not entity._mentions:

del entities[eid]

# c=doc.coref_entities should be sorted, so that c[0] < c[1] etc.

# In other words, the dict should be sorted by the values (according to CorefEntity.__lt__),

# not by the keys (eid).

# In Python 3.7+ (3.6+ in CPython), dicts are guaranteed to preserve insertion order.

for entity in entities.values():

if not entity._mentions:

_error(f"Entity {entity.eid} referenced in SplitAnte or Bridge, but not defined with Entity", strict)

entity._mentions.sort()

for mention in entity._mentions:

for node in mention._words:

node._mentions.sort()

doc._eid_to_entity = {c._eid: c for c in sorted(entities.values())}

def store_coref_to_misc(doc):

if not doc._eid_to_entity:

return

tree2docid = doc.meta.get('_tree2docid')

global_entity = doc.meta.get('global.Entity')

if not global_entity:

global_entity = 'eid-etype-head-other'

doc.meta['global.Entity'] = global_entity

# global.Entity won't be written without newdoc

if not doc[0].trees[0].newdoc:

doc[0].trees[0].newdoc = True

fields = global_entity.split('-')

# GRP and entity are legacy names for eid and etype, respectively.

other_fields = [f for f in fields if f not in 'eid etype head other GRP entity'.split()]

attrs = "Entity SplitAnte Bridge".split()

for node in doc.nodes_and_empty:

for attr in attrs:

del node.misc[attr]

# Convert each subspan of each discontinuous mention into a fake CorefMention instance,

# so that we can sort both real and fake mentions and process them in the correct order.

doc_mentions = []

for mention in doc.coref_mentions:

if ',' not in mention.span:

doc_mentions.append(mention)

else:

entity = mention.entity

head_str = str(mention.words.index(mention.head) + 1)

subspans = mention.span.split(',')

root = mention.words[0].root

for idx,subspan in enumerate(subspans, 1):

eid = entity.eid

if tree2docid and 'GRP' in fields:

eid = re.sub(r'^d\d+\.', '', eid) # TODO or "eid = entity.eid_or_grp"?

subspan_eid = f'{eid}[{idx}/{len(subspans)}]'

subspan_words = span_to_nodes(root, subspan)

fake_entity = CorefEntity(subspan_eid, entity.etype)

fake_mention = CorefMention(subspan_words, head_str, fake_entity, add_word_backlinks=False)

if mention._other:

fake_mention._other = mention._other

if mention._bridging and idx == 1:

fake_mention._bridging = mention._bridging

doc_mentions.append(fake_mention)

doc_mentions.sort()

for mention in doc_mentions:

entity = mention.entity

values = []

for field in fields:

if field == 'eid' or field == 'GRP':

eid = entity.eid

if field == 'GRP':

eid = re.sub(r'^d\d+\.', '', eid)

if any(x in eid for x in CHARS_FORBIDDEN_IN_ID):

_error(f"{eid} contains forbidden characters [{CHARS_FORBIDDEN_IN_ID}]", strict)

for c in CHARS_FORBIDDEN_IN_ID:

eid = eid.replace(c, '')

values.append(eid)

elif field == 'etype' or field == 'entity':

if not entity.etype:

values.append('')

else:

values.append(entity.etype)

elif field == 'head':

if isinstance(mention.head, str):

values.append(mention.head) # fake mention for discontinuous spans

else:

values.append(str(mention.words.index(mention.head) + 1))

elif field == 'other':

if not mention._other:

values.append('')

elif not other_fields:

values.append(str(mention.other))

else:

other_copy = OtherDualDict(mention.other)

for other_field in other_fields:

del other_copy[other_field]

values.append(str(other_copy))

elif field == 'identity':

values.append(mention.other[field]) # don't replace('%2C', ',') in wikification

else:

values.append(mention.other[field].replace('%2C', ',')) # but de-escape commas e.g. in minspan

# optional fields

while values and values[-1] == '':

del values[-1]

mention_str = '(' + '-'.join(values)

# First, handle single-word mentions.

# If there are no opening brackets (except for single-word),

# single-word mentions should precede all closing brackets, e.g. `Entity=(e10)(e9)e4)e3)`.

# Otherwise, single-word mentions should follow all opening brackets,

# e.g. `Entity=(e1(e2(e9)(e10)` or `Entity=e4)e3)(e1(e2(e9)(e10)`.

firstword = mention.words[0]

if len(mention.words) == 1:

orig_entity = firstword.misc['Entity']

# empty --> (e10)

# (e1(e2 --> (e1(e2(e10)

# e3)(e1(e2 --> e3)(e1(e2(e10)

if not orig_entity or orig_entity[-1] != ')':

firstword.misc['Entity'] += mention_str + ')'

# e4)e3) --> (e10)e4)e3)

elif '(' not in orig_entity:

firstword.misc['Entity'] = mention_str + ')' + orig_entity

# (e9)e4)e3) --> (e10)(e9)e4)e3)

elif any(c and c[0] == '(' and c[-1] != ')' for c in re.split(r'(\([^()]+\)?|[^()]+\))', orig_entity)):

firstword.misc['Entity'] += mention_str + ')'

# (e1(e2(e9) --> (e1(e2(e9)(e10)

# e3)(e1(e2(e9)--> e3)(e1(e2(e9)(e10)

else:

firstword.misc['Entity'] = mention_str + ')' + orig_entity

# Second, multi-word mentions. Opening brackets should follow closing brackets.

else:

firstword.misc['Entity'] += mention_str

eid = entity.eid

if tree2docid and 'GRP' in fields:

eid = re.sub(r'^d\d+\.', '', eid)

mention.words[-1].misc['Entity'] = eid + ')' + mention.words[-1].misc['Entity']

# Bridge=e1<e5:subset,e2<e6:subset|Entity=(e5(e6

if mention._bridging:

mention._bridging._delete_targets_without_mentions()

str_bridge = str(mention._bridging)

if firstword.misc['Bridge']:

str_bridge = firstword.misc['Bridge'] + ',' + str_bridge

firstword.misc['Bridge'] = str_bridge

# SplitAnte=e5<e61,e10<e61

for entity in doc.coref_entities:

if entity.split_ante:

for ante_entity in list(entity.split_ante):  # iterate over a copy, items may be removed below

if not ante_entity.mentions:

logging.warning(f"Entity {ante_entity.eid} has no mentions, but is referred to in SplitAnte of {entity.eid}")

entity.split_ante.remove(ante_entity)

if not entity.split_ante or len(entity.split_ante) < 2:

logging.warning(f"SplitAnte of {entity.eid} has fewer than two antecedents, omitting")

continue

first_word = entity.mentions[0].words[0]

if tree2docid:

strs = ','.join(f'{sa.eid_or_grp}<{entity.eid_or_grp}' for sa in entity.split_ante)

else:

strs = ','.join(f'{sa.eid}<{entity.eid}' for sa in entity.split_ante)

if first_word.misc['SplitAnte']:

strs = first_word.misc['SplitAnte'] + ',' + strs

first_word.misc['SplitAnte'] = strs

def span_to_nodes(root, span):

ranges = []

for span_str in span.split(','):

try:

if '-' not in span_str:

lo = hi = float(span_str)

else:

lo, hi = (float(x) for x in span_str.split('-'))

except ValueError as e:

raise ValueError(f"Cannot parse '{span}': {e}") from e

ranges.append((lo, hi))

ranges.sort()

def _num_in_ranges(num):

for (lo, hi) in ranges:

if num < lo:

return False

if num <= hi:

return True

return False

return [w for w in root.descendants_and_empty if _num_in_ranges(w.ord)]
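# A minimal standalone sketch of the span-parsing logic above, working on plain
# ord numbers instead of udapi nodes. The helper names `parse_span` and
# `ords_in_span` are hypothetical and are not part of udapi; the sketch only
# assumes that ords are ints or floats (empty nodes use ords such as 5.1).

```python
def parse_span(span):
    """Parse a span string such as "3-5,7" into a sorted list of (lo, hi) floats."""
    ranges = []
    for part in span.split(','):
        try:
            if '-' not in part:
                lo = hi = float(part)
            else:
                lo, hi = (float(x) for x in part.split('-'))
        except ValueError as e:
            raise ValueError(f"Cannot parse '{span}': {e}") from e
        ranges.append((lo, hi))
    return sorted(ranges)


def ords_in_span(ords, span):
    """Keep only those ords that fall into one of the ranges of the span."""
    ranges = parse_span(span)
    return [o for o in ords if any(lo <= o <= hi for lo, hi in ranges)]


# The empty node 5.1 lies outside the range 3-5, so it is not selected.
selected = ords_in_span([1, 2, 3, 4, 5, 5.1, 6, 7], "3-5,7")
```

# Note that an empty node is selected only if the span covers its ord
# explicitly, e.g. "3-5.1" rather than "3-5".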

def nodes_to_span(nodes):

"""Converts a list of nodes into a string specifying ranges of their ords.

For example, nodes with ords 3, 4, 5 and 7 will be converted to "3-5,7".

The function also handles empty nodes, so e.g. 3.1, 3.2 and 3.3 will be converted to "3.1-3.3".

Note that empty nodes may form gaps in the span, so if a given tree contains

an empty node with ord 5.1, but only nodes with ords 3, 4, 5, 6, 7.1 and 7.2

are provided as `nodes`, the resulting string will be "3-5,6,7.1-7.2".

This means that the implementation needs to iterate over all nodes

in a given tree (root.descendants_and_empty) to check for such gaps.

"""

if not nodes:

return ''

all_nodes = nodes[0].root.descendants_and_empty

i, found, ranges = -1, 0, []

while i + 1 < len(all_nodes) and found < len(nodes):

i += 1

if all_nodes[i] in nodes:

lo = all_nodes[i].ord

while i < len(all_nodes) and all_nodes[i] in nodes:

i, found = i + 1, found + 1

hi = all_nodes[i - 1].ord

ranges.append(f"{lo}-{hi}" if hi > lo else f"{lo}")

return ','.join(ranges)
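# A minimal sketch of the gap handling described in the docstring of
# nodes_to_span, again on plain ord numbers instead of udapi nodes.
# The function name `ords_to_span` and its `all_ords` parameter are hypothetical
# stand-ins for the node list and root.descendants_and_empty, respectively.

```python
def ords_to_span(selected, all_ords):
    """Compress selected ords into ranges, breaking at every unselected ord.

    `all_ords` lists every ord in the tree in order (including empty nodes),
    so an unselected ord between two selected ones ends the current range.
    """
    chosen = set(selected)
    ranges = []
    i = 0
    while i < len(all_ords):
        if all_ords[i] in chosen:
            lo = all_ords[i]
            # Extend the range while consecutive tree ords are selected.
            while i < len(all_ords) and all_ords[i] in chosen:
                i += 1
            hi = all_ords[i - 1]
            ranges.append(f"{lo}-{hi}" if hi > lo else f"{lo}")
        else:
            i += 1
    return ','.join(ranges)


# The tree contains the empty node 5.1, which is not selected,
# so 5 and 6 cannot be merged into a single range.
tree_ords = [1, 2, 3, 4, 5, 5.1, 6, 7, 7.1, 7.2, 8]
span = ords_to_span([3, 4, 5, 6, 7.1, 7.2], tree_ords)
```

# Iterating over all ords of the tree, rather than only the selected ones,
# is what makes the gap at 5.1 visible.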

# TODO fix code duplication with udapi.core.dualdict after making sure benchmarks are not slower

class OtherDualDict(collections.abc.MutableMapping):

"""OtherDualDict class serves as dict with lazily synchronized string representation.

>>> ddict = OtherDualDict('anacata:anaphoric,antetype:entity,nptype:np')

>>> ddict['mention'] = 'np'

>>> str(ddict)

'anacata:anaphoric,antetype:entity,mention:np,nptype:np'

>>> ddict['NonExistent']

''

This class provides access to both

* a structured (dict-based, deserialized) representation,

e.g. {'anacata': 'anaphoric', 'antetype': 'entity'}, and

* a string (serialized) representation of the mapping, e.g. `anacata:anaphoric,antetype:entity`.

There is a clever mechanism that makes sure that users can read and write

both of the representations which are always kept synchronized.

Moreover, the synchronization is lazy, so the serialization and deserialization

are done only when needed. This speeds up scenarios where access to dict is not needed.

A value can be deleted in any of the following three ways:

>>> del ddict['nptype']

>>> ddict['nptype'] = None

>>> ddict['nptype'] = ''

and it works even if the value was already missing.

"""

__slots__ = ['_string', '_dict']

def __init__(self, value=None, **kwargs):

if value is not None and kwargs:

raise ValueError('If value is specified, no other kwarg is allowed ' + str(kwargs))
