|
| 1 | +""" |
| 2 | +Module to parse a Matroska EBML file. |
| 3 | +
|
| 4 | +* Overview |
| 5 | +
|
| 6 | +An EBML file is a sequence of EBML Elements one after another. An Element |
| 7 | +consists of a two-part header encoding the Element ID and its data size, |
| 8 | +followed by that many bytes of data. The Matroska specification defines some |
| 9 | +number of EBML IDs, which can be found in a Matroska project file called |
| 10 | +specdata.xml. Each defined ID has a human-readable name, e.g. 'Segment'. The |
| 11 | +semantics of the data depend on the Element type. EBML defines the following |
| 12 | +primitive Element types: |
| 13 | +
|
| 14 | + + Master: the data is a sequence of child Elements. |
| 15 | + + Unsigned, Signed: the data is an integer in big-endian form. |
| 16 | + + String, Unicode: the data is a string encoded in ASCII or UTF-8. |
| 17 | + + Float: the data is a 4-byte or 8-byte floating point number. |
| 18 | + + Date: the data is an 8-byte signed integer representing the number of |
| 19 | + nanoseconds since the Matroska epoch. |
| 20 | + + Binary: the data is opaque. |
| 21 | +
|
| 22 | +The Matroska EBML Element specifications are stored in a special-purpose |
| 23 | +dictionary called MATROSKA_TAGS which is used to create Elements of the |
| 24 | +appropriate class when reading them from a stream. |
| 25 | +
|
| 26 | +This module defines the following classes for reading, storing, editing, and |
| 27 | +writing the data in a Matroska file. |
| 28 | +
|
| 29 | + + Header: stores and manipulates the Element header. |
| 30 | + + Container: stores child Elements. This is subclassed by ElementMaster and |
| 31 | + File. As File is not an Element -- it has no header -- neither is Container. |
| 32 | + + File: facilitates reading and writing Elements from a stream. |
| 33 | + + Element: base class for all EBML Elements. |
| 34 | +
|
| 35 | +Immediate subclasses of Element: |
| 36 | +
|
| 37 | + + ElementMaster: Inherits both Element and Container. |
| 38 | + + ElementAtomic: Base class for all kinds of Elements that actually know how to |
| 39 | + interpret their data. Subclassed by ElementUnsigned, ElementUnicode, etc. |
| 40 | + + ElementVoid: Element that ignores its data on read and writes undefined |
| 41 | + values. |
| 42 | + + ElementUnsupported: An element this module does not support. It cannot be |
| 43 | + resized or written. |
| 44 | +
|
| 45 | +This module provides the Parsed descriptor, which is a convenience class that |
| 46 | +allows Master Elements to read and write the data in child Elements using |
| 47 | +attributes. For instance, the ElementInfo class has the segment_uid attribute; |
| 48 | +if info is an instance of ElementInfo then info.segment_uid reads and writes the |
| 49 | +value of its child SegmentUID. If no such child exists, reading |
| 50 | +info.segment_uid returns a default value, and setting it creates the child. |
| 51 | +This is much easier to use than, say, |
| 52 | +
|
| 53 | +     uid_elements = list(info.children_named('SegmentUID')) |
| 54 | +     if uid_elements: |
| 55 | +         return uid_elements[-1].value |
| 56 | +     else: |
| 57 | +         return default_value |
| 58 | +
|
| 59 | +The ElementSegment class takes advantage of Parsed descriptors to give easy |
| 60 | +access to the segment metadata. Classes using this facility: ElementEBML, |
| 61 | +ElementSegment, ElementSeek, ElementInfo, ElementTrackEntry, ElementVideo, |
| 62 | +ElementAudio, ElementAttachedFile, etc. |
| 63 | +
|
| 64 | +* Reading |
| 65 | +
|
| 66 | +The Container.read() method reads a list of children. It calls |
| 67 | +Container.read_element() for each child, which checks if the Element is already |
| 68 | +loaded; if so, it returns that Element, and otherwise it reads the header and |
| 69 | +creates the appropriate Element instance. It then calls Element.read_data(), |
| 70 | +which for Master Elements will recursively call Container.read(), and for Atomic |
| 71 | +Elements will read, decode, and store its data. A Void Element will skip over |
| 72 | +its data. |
| 73 | +
|
| 74 | +The Container.read() method supports a summary option, which causes it to call |
| 75 | +Element.read_summary() instead of read_data(). The purpose of summary mode is |
| 76 | +for large master Elements to read their metadata without reading the entire |
| 77 | +Element, which may not even fit in memory. Currently the Elements implementing |
| 78 | +read_summary() are ElementMasterDefer and ElementSegment. The former simply |
| 79 | +skips reading its children in summary mode, and the latter intelligently finds |
| 80 | +its metadata using SeekHead entries without reading its Cluster entries, which |
| 81 | +generally comprise over 99% of the file. For the other Elements, read_summary() |
| 82 | +simply calls read_data(). |
| 83 | +
|
| 84 | +An Element stores its state of loadedness in the read_state attribute. |
| 85 | +Container.read_element() will in fact read a partially loaded Element when not |
| 86 | +in summary mode. |
| 87 | +
|
| 88 | +File implements the read_summary() method, which calls read_summary() on each |
| 89 | +top-level child. By default, the constructor of File runs read_summary(). |
| 90 | +
|
| 91 | +* Writing |
| 92 | +
|
| 93 | +This module supports in-place modification of Matroska EBML files. In theory it |
| 94 | +supports creating such files from scratch, except that it has no facility for |
| 95 | +creating EBML Header elements (beyond doing so "by hand") or for writing |
| 96 | +elements incrementally (so any data to be written must be stored in memory). |
| 97 | +The system for in-place modifications is described here. |
| 98 | +
|
| 99 | +When modifying potentially very large EBML files, it is important to write only |
| 100 | +the elements that have actually changed. The following are the ways in which an |
| 101 | +Element may differ from its state in a stream: |
| 102 | +
|
| 103 | + 1. Its position in the stream can be changed. |
| 104 | + 2. It can be resized. |
| 105 | + 3. Its value can be changed. |
| 106 | + 4. Child elements can be added, deleted, or moved. |
| 107 | + 5. It can be created programmatically. |
| 108 | +
|
| 109 | +The Element.dirty property is True if any of the above conditions holds. It is |
| 110 | +calculated as follows. An Element stores the position in the stream at which it |
| 111 | +was read along with its original size, so that it knows if either has changed. |
| 112 | +An ElementAtomic also stores its original value (or a way of recognizing its |
| 113 | +original value). An ElementMaster recursively checks if any of its children is |
| 114 | +dirty. An element not read from a stream has no stored position or size, so it |
| 115 | +is always dirty. |
| 116 | +
|
| 117 | +The Container.write() method writes its children to a stream. It only writes |
| 118 | +children for which the dirty property is True. For each such child it calls the |
| 119 | +Element's write() method. Master elements will recursively call the container's |
| 120 | +write() method, and Atomic elements will encode and write their data. A Void |
| 121 | +Element just seeks the stream. An Atomic Element which is not dirty should |
| 122 | +reproduce the byte stream used to read it when write() is called. |
| 123 | +
|
| 124 | +Performing modifications may place a Container (e.g. a Master element) in an |
| 125 | +inconsistent state. For example, a child element might grow, so that its data |
| 126 | +overlaps the beginning of the next element, or a child might be deleted, which |
| 127 | +leaves empty space. A Container's state is said to be consistent if the |
| 128 | +following hold: |
| 129 | +
|
| 130 | + 1. The first element starts at relative position zero. |
| 131 | + 2. Element i+1 starts immediately after element i ends. |
| 132 | + 3. The container's children are all permitted by the Matroska specification. |
| 133 | + 4. Required children (as defined by the Matroska specification) are present. |
| 134 | + 5. Children that are required to be unique by the Matroska specification are in |
| 135 | + fact unique. |
| 136 | + 6. Every child container is consistent. |
| 137 | + 7. The values of non-container children are consistent with the Matroska |
| 138 | + specification (e.g. are contained in a specified range of values). |
| 139 | +
|
| 140 | +If a Container is an Element, it must satisfy the following properties in |
| 141 | +addition: |
| 142 | +
|
| 143 | + 8. The end of the last child coincides with the end of the Element's data. |
| 144 | + 9. Its parent is not None. |
| 145 | +
|
| 146 | +A Container will generally refuse to write its contents to disk if it is in an |
| 147 | +inconsistent state. To facilitate putting the Container in a consistent state, |
| 148 | +it provides the rearrange() method, which should be called before write(). This |
| 149 | +method rearranges the Container's children, potentially shrinking and moving |
| 150 | +them, so that there are no overlaps, recursively calling rearrange() on each |
| 151 | +Master child. It deletes and creates Void elements as necessary, and supports |
| 152 | +several options for controlling its behavior. The Container may be in an |
| 153 | +inconsistent state after calling rearrange() if its contents violate the |
| 154 | +Matroska specification in some way (e.g. if it has an impermissible child). |
| 155 | +
|
| 156 | +The Segment Element is more intelligent in its rearrange() method. It generates |
| 157 | +a SeekHead element at the beginning of the file with links to its children. It |
| 158 | +tries to move the more important children before the Clusters, and moves the |
| 159 | +rest to the end of the file. Its requirements for consistency are also a bit |
| 160 | +more specific than the ones outlined above. |
| 161 | +
|
| 162 | +* Viewing |
| 163 | +
|
| 164 | +Each Element implements __repr__() and __str__(). The former returns the class |
| 165 | +name and some size information, whereas the latter also includes some |
| 166 | +information about the contents of the Element. The return value of each should |
| 167 | +fit on one line. |
| 168 | +
|
| 169 | +Element instances also implement the summary() method, which returns a summary |
| 170 | +of the Element contents. By default, summary() returns the output of __str__(). |
| 171 | +The output may span multiple lines, although it is not terminated by a newline. |
| 172 | +
|
| 173 | +Container instances implement two additional methods, print_children() and |
| 174 | +print_space(). The former recursively runs __str__() on all child Elements (up |
| 175 | +to a specified recursion depth) and concatenates them with indentation in a |
| 176 | +newline-terminated string. The latter returns a newline-terminated table |
| 177 | +summarizing which child (and descendant) elements occupy which blocks of space. |
| 178 | +""" |
| 179 | + |
| 180 | +__all__ = ['EbmlException', 'Inconsistent', 'DecodeError', 'MAX_DATA_SIZE'] |
| 181 | + |
| 182 | +################################################################################ |
| 183 | +# * Exception |
| 184 | + |
| 185 | +from .. import MoviesException |
| 186 | + |
| 187 | +class EbmlException(MoviesException): |
| 188 | + """Class for general EBML exceptions.""" |
| 189 | + |
| 190 | +class Inconsistent(EbmlException): |
| 191 | + """Raised when a Container is not in a consistent state.""" |
| 192 | + |
| 193 | +class DecodeError(EbmlException): |
| 194 | + """Class for EBML decoding errors.""" |
| 195 | + |
| 196 | + |
| 197 | +################################################################################ |
| 198 | +# * Constants |
| 199 | + |
| 200 | +# Maximum data size that EBML can encode |
| 201 | +MAX_DATA_SIZE = (1<<56) - 2 |
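
The value of `MAX_DATA_SIZE` follows from how EBML encodes the data size in an Element header: a variable-length integer of 1 to 8 bytes, where the position of the first set bit gives the width and each byte contributes 7 value bits. The widest form therefore carries 56 bits, and the all-ones pattern is reserved to mean "unknown size", which leaves `(1 << 56) - 2` as the largest encodable size. The sketch below illustrates this under those assumptions; `encode_data_size` and `decode_data_size` are hypothetical names for illustration and are not taken from this module.

```python
MAX_DATA_SIZE = (1 << 56) - 2  # same constant as in the module


def encode_data_size(size, length=None):
    """Encode size as an EBML variable-length integer (hypothetical helper).

    If length is None, the shortest width that can hold size is used.
    """
    if not 0 <= size <= MAX_DATA_SIZE:
        raise ValueError("size out of range for EBML: %r" % (size,))
    if length is None:
        length = 1
        # A width of `length` bytes carries 7*length value bits, and the
        # all-ones value is reserved for "unknown size".
        while size > (1 << (7 * length)) - 2:
            length += 1
    elif not 1 <= length <= 8 or size > (1 << (7 * length)) - 2:
        raise ValueError("size %r does not fit in %r bytes" % (size, length))
    marker = 1 << (7 * length)  # the single bit that announces the width
    return (marker | size).to_bytes(length, 'big')


def decode_data_size(data):
    """Decode a data size from the start of data (hypothetical helper).

    Returns (size, length); size is None for the reserved "unknown" value.
    """
    if not data or data[0] == 0:
        raise ValueError("invalid EBML data size")
    length = 9 - data[0].bit_length()  # leading zero bits + 1
    if len(data) < length:
        raise ValueError("truncated EBML data size")
    mask = (1 << (7 * length)) - 1
    value = int.from_bytes(data[:length], 'big') & mask
    if value == mask:
        return None, length  # all value bits set: unknown size
    return value, length
```

For example, `encode_data_size(127)` needs two bytes because the one-byte pattern with all value bits set is reserved, and `decode_data_size(encode_data_size(MAX_DATA_SIZE))` round-trips through the full 8-byte form.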