Commit e28b8c9

bpo-35018: Sax parser should provide user access to lexical handlers (pythonGH-20958)
Co-Authored-By: Jonathan Gossage <jgossage@gmail.com>
1 parent 67acf74 commit e28b8c9

5 files changed: +264 -9 lines changed
Doc/library/xml.sax.handler.rst

Lines changed: 55 additions & 7 deletions
@@ -11,12 +11,12 @@
 
 --------------
 
-The SAX API defines four kinds of handlers: content handlers, DTD handlers,
-error handlers, and entity resolvers. Applications normally only need to
-implement those interfaces whose events they are interested in; they can
-implement the interfaces in a single object or in multiple objects. Handler
-implementations should inherit from the base classes provided in the module
-:mod:`xml.sax.handler`, so that all methods get default implementations.
+The SAX API defines five kinds of handlers: content handlers, DTD handlers,
+error handlers, entity resolvers and lexical handlers. Applications normally
+only need to implement those interfaces whose events they are interested in;
+they can implement the interfaces in a single object or in multiple objects.
+Handler implementations should inherit from the base classes provided in the
+module :mod:`xml.sax.handler`, so that all methods get default implementations.
 
 
 .. class:: ContentHandler
@@ -47,6 +47,12 @@ implementations should inherit from the base classes provided in the module
    application. The methods of this object control whether errors are immediately
    converted to exceptions or are handled in some other way.
 
+
+.. class:: LexicalHandler
+
+   Interface used by the parser to represent low frequency events which may not
+   be of interest to many applications.
+
 In addition to these classes, :mod:`xml.sax.handler` provides symbolic constants
 for the feature and property names.
 
@@ -114,7 +120,7 @@ for the feature and property names.
 .. data:: property_lexical_handler
 
    | value: ``"http://xml.org/sax/properties/lexical-handler"``
-   | data type: xml.sax.sax2lib.LexicalHandler (not supported in Python 2)
+   | data type: xml.sax.handler.LexicalHandler (not supported in Python 2)
    | description: An optional extension handler for lexical events like
      comments.
    | access: read/write
@@ -413,3 +419,45 @@ the passed-in exception object.
    information will continue to be passed to the application. Raising an exception
    in this method will cause parsing to end.
 
+
+.. _lexical-handler-objects:
+
+LexicalHandler Objects
+----------------------
+Optional SAX2 handler for lexical events.
+
+This handler is used to obtain lexical information about an XML
+document. Lexical information includes information describing the
+document encoding used and XML comments embedded in the document, as
+well as section boundaries for the DTD and for any CDATA sections.
+The lexical handlers are used in the same manner as content handlers.
+
+Set the LexicalHandler of an XMLReader by using the setProperty method
+with the property identifier
+``'http://xml.org/sax/properties/lexical-handler'``.
+
+
+.. method:: LexicalHandler.comment(content)
+
+   Reports a comment anywhere in the document (including the DTD and
+   outside the document element).
+
+.. method:: LexicalHandler.startDTD(name, public_id, system_id)
+
+   Reports the start of the DTD declarations if the document has an
+   associated DTD.
+
+.. method:: LexicalHandler.endDTD()
+
+   Reports the end of the DTD declarations.
+
+.. method:: LexicalHandler.startCDATA()
+
+   Reports the start of a CDATA marked section.
+
+   The contents of the CDATA marked section will be reported through
+   the characters handler.
+
+.. method:: LexicalHandler.endCDATA()
+
+   Reports the end of a CDATA marked section.

Doc/whatsnew/3.10.rst

Lines changed: 7 additions & 0 deletions
@@ -139,6 +139,13 @@ Add :data:`sys.orig_argv` attribute: the list of the original command line
 arguments passed to the Python executable.
 (Contributed by Victor Stinner in :issue:`23427`.)
 
+xml
+---
+
+Add a :class:`~xml.sax.handler.LexicalHandler` class to the
+:mod:`xml.sax.handler` module.
+(Contributed by Jonathan Gossage and Zackery Spytz in :issue:`35018`.)
+
 
 Optimizations
 =============

Lib/test/test_sax.py

Lines changed: 155 additions & 2 deletions
@@ -13,7 +13,8 @@
 from xml.sax.saxutils import XMLGenerator, escape, unescape, quoteattr, \
                              XMLFilterBase, prepare_input_source
 from xml.sax.expatreader import create_parser
-from xml.sax.handler import feature_namespaces, feature_external_ges
+from xml.sax.handler import (feature_namespaces, feature_external_ges,
+                             LexicalHandler)
 from xml.sax.xmlreader import InputSource, AttributesImpl, AttributesNSImpl
 from io import BytesIO, StringIO
 import codecs
@@ -1356,6 +1357,155 @@ def test_nsattrs_wattr(self):
         self.assertEqual(attrs.getQNameByName((ns_uri, "attr")), "ns:attr")
 
 
+class LexicalHandlerTest(unittest.TestCase):
+    def setUp(self):
+        self.parser = None
+
+        self.specified_version = '1.0'
+        self.specified_encoding = 'UTF-8'
+        self.specified_doctype = 'wish'
+        self.specified_entity_names = ('nbsp', 'source', 'target')
+        self.specified_comment = ('Comment in a DTD',
+                                  'Really! You think so?')
+        self.test_data = StringIO()
+        self.test_data.write('<?xml version="{}" encoding="{}"?>\n'.
+                             format(self.specified_version,
+                                    self.specified_encoding))
+        self.test_data.write('<!DOCTYPE {} [\n'.
+                             format(self.specified_doctype))
+        self.test_data.write('<!-- {} -->\n'.
+                             format(self.specified_comment[0]))
+        self.test_data.write('<!ELEMENT {} (to,from,heading,body,footer)>\n'.
+                             format(self.specified_doctype))
+        self.test_data.write('<!ELEMENT to (#PCDATA)>\n')
+        self.test_data.write('<!ELEMENT from (#PCDATA)>\n')
+        self.test_data.write('<!ELEMENT heading (#PCDATA)>\n')
+        self.test_data.write('<!ELEMENT body (#PCDATA)>\n')
+        self.test_data.write('<!ELEMENT footer (#PCDATA)>\n')
+        self.test_data.write('<!ENTITY {} "&#xA0;">\n'.
+                             format(self.specified_entity_names[0]))
+        self.test_data.write('<!ENTITY {} "Written by: Alexander.">\n'.
+                             format(self.specified_entity_names[1]))
+        self.test_data.write('<!ENTITY {} "Hope it gets to: Aristotle.">\n'.
+                             format(self.specified_entity_names[2]))
+        self.test_data.write(']>\n')
+        self.test_data.write('<{}>'.format(self.specified_doctype))
+        self.test_data.write('<to>Aristotle</to>\n')
+        self.test_data.write('<from>Alexander</from>\n')
+        self.test_data.write('<heading>Supplication</heading>\n')
+        self.test_data.write('<body>Teach me patience!</body>\n')
+        self.test_data.write('<footer>&{};&{};&{};</footer>\n'.
+                             format(self.specified_entity_names[1],
+                                    self.specified_entity_names[0],
+                                    self.specified_entity_names[2]))
+        self.test_data.write('<!-- {} -->\n'.format(self.specified_comment[1]))
+        self.test_data.write('</{}>\n'.format(self.specified_doctype))
+        self.test_data.seek(0)
+
+        # Data received from handlers - to be validated
+        self.version = None
+        self.encoding = None
+        self.standalone = None
+        self.doctype = None
+        self.publicID = None
+        self.systemID = None
+        self.end_of_dtd = False
+        self.comments = []
+
+    def test_handlers(self):
+        class TestLexicalHandler(LexicalHandler):
+            def __init__(self, test_harness, *args, **kwargs):
+                super().__init__(*args, **kwargs)
+                self.test_harness = test_harness
+
+            def startDTD(self, doctype, publicID, systemID):
+                self.test_harness.doctype = doctype
+                self.test_harness.publicID = publicID
+                self.test_harness.systemID = systemID
+
+            def endDTD(self):
+                self.test_harness.end_of_dtd = True
+
+            def comment(self, text):
+                self.test_harness.comments.append(text)
+
+        self.parser = create_parser()
+        self.parser.setContentHandler(ContentHandler())
+        self.parser.setProperty(
+            'http://xml.org/sax/properties/lexical-handler',
+            TestLexicalHandler(self))
+        source = InputSource()
+        source.setCharacterStream(self.test_data)
+        self.parser.parse(source)
+        self.assertEqual(self.doctype, self.specified_doctype)
+        self.assertIsNone(self.publicID)
+        self.assertIsNone(self.systemID)
+        self.assertTrue(self.end_of_dtd)
+        self.assertEqual(len(self.comments),
+                         len(self.specified_comment))
+        self.assertEqual(f' {self.specified_comment[0]} ', self.comments[0])
+
+
+class CDATAHandlerTest(unittest.TestCase):
+    def setUp(self):
+        self.parser = None
+        self.specified_chars = []
+        self.specified_chars.append(('Parseable character data', False))
+        self.specified_chars.append(('<> &% - assorted other XML junk.', True))
+        self.char_index = 0  # Used to index specified results within handlers
+        self.test_data = StringIO()
+        self.test_data.write('<root_doc>\n')
+        self.test_data.write('<some_pcdata>\n')
+        self.test_data.write(f'{self.specified_chars[0][0]}\n')
+        self.test_data.write('</some_pcdata>\n')
+        self.test_data.write('<some_cdata>\n')
+        self.test_data.write(f'<![CDATA[{self.specified_chars[1][0]}]]>\n')
+        self.test_data.write('</some_cdata>\n')
+        self.test_data.write('</root_doc>\n')
+        self.test_data.seek(0)
+
+        # Data received from handlers - to be validated
+        self.chardata = []
+        self.in_cdata = False
+
+    def test_handlers(self):
+        class TestLexicalHandler(LexicalHandler):
+            def __init__(self, test_harness, *args, **kwargs):
+                super().__init__(*args, **kwargs)
+                self.test_harness = test_harness
+
+            def startCDATA(self):
+                self.test_harness.in_cdata = True
+
+            def endCDATA(self):
+                self.test_harness.in_cdata = False
+
+        class TestCharHandler(ContentHandler):
+            def __init__(self, test_harness, *args, **kwargs):
+                super().__init__(*args, **kwargs)
+                self.test_harness = test_harness
+
+            def characters(self, content):
+                if content != '\n':
+                    h = self.test_harness
+                    t = h.specified_chars[h.char_index]
+                    h.assertEqual(t[0], content)
+                    h.assertEqual(t[1], h.in_cdata)
+                    h.char_index += 1
+
+        self.parser = create_parser()
+        self.parser.setContentHandler(TestCharHandler(self))
+        self.parser.setProperty(
+            'http://xml.org/sax/properties/lexical-handler',
+            TestLexicalHandler(self))
+        source = InputSource()
+        source.setCharacterStream(self.test_data)
+        self.parser.parse(source)
+
+        self.assertFalse(self.in_cdata)
+        self.assertEqual(self.char_index, 2)
+
+
 def test_main():
     run_unittest(MakeParserTest,
                  ParseTest,
@@ -1368,7 +1518,10 @@ def test_main():
                  StreamReaderWriterXmlgenTest,
                  ExpatReaderTest,
                  ErrorReportingTest,
-                 XmlReaderTest)
+                 XmlReaderTest,
+                 LexicalHandlerTest,
+                 CDATAHandlerTest)
+
 
 if __name__ == "__main__":
     test_main()

Lib/xml/sax/handler.py

Lines changed: 45 additions & 0 deletions
@@ -340,3 +340,48 @@ def resolveEntity(self, publicId, systemId):
                   property_xml_string,
                   property_encoding,
                   property_interning_dict]
+
+
+class LexicalHandler:
+    """Optional SAX2 handler for lexical events.
+
+    This handler is used to obtain lexical information about an XML
+    document, that is, information about how the document was encoded
+    (as opposed to what it contains, which is reported to the
+    ContentHandler), such as comments and CDATA marked section
+    boundaries.
+
+    To set the LexicalHandler of an XMLReader, use the setProperty
+    method with the property identifier
+    'http://xml.org/sax/properties/lexical-handler'."""
+
+    def comment(self, content):
+        """Reports a comment anywhere in the document (including the
+        DTD and outside the document element).
+
+        content is a string that holds the contents of the comment."""
+
+    def startDTD(self, name, public_id, system_id):
+        """Report the start of the DTD declarations, if the document
+        has an associated DTD.
+
+        A startEntity event will be reported before declaration events
+        from the external DTD subset are reported, and this can be
+        used to infer from which subset DTD declarations derive.
+
+        name is the name of the document element type, public_id the
+        public identifier of the DTD (or None if none was supplied)
+        and system_id the system identifier of the external subset (or
+        None if none was supplied)."""
+
+    def endDTD(self):
+        """Signals the end of DTD declarations."""
+
+    def startCDATA(self):
+        """Reports the beginning of a CDATA marked section.
+
+        The contents of the CDATA marked section will be reported
+        through the characters event."""
+
+    def endCDATA(self):
+        """Reports the end of a CDATA marked section."""
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+Add the :class:`xml.sax.handler.LexicalHandler` class that is present in
+other SAX XML implementations.
