Commit 3d1f1db

Initial commit
1 parent dcb0c02 commit 3d1f1db

16 files changed

Lines changed: 7660 additions & 0 deletions

__init__.py

Lines changed: 201 additions & 0 deletions
@@ -0,0 +1,201 @@
"""
Module to parse a Matroska EBML file.

* Overview

An EBML file is a sequence of EBML Elements one after another. An Element
consists of a two-part header encoding the Element ID and its data size,
followed by that many bytes of data. The Matroska specification defines a set
of EBML IDs, which can be found in a Matroska project file called
specdata.xml. Each defined ID has a human-readable name, e.g. 'Segment'. The
semantics of the data depend on the Element type. EBML defines the following
primitive Element types:

+ Master: the data is a sequence of child Elements.
+ Unsigned, Signed: the data is an integer in big-endian form.
+ String, Unicode: the data is a string encoded in ASCII or UTF-8,
  respectively.
+ Float: the data is a 4-byte or 8-byte floating point number.
+ Date: the data is an 8-byte signed integer representing the number of
  nanoseconds since the Matroska epoch.
+ Binary: the data is opaque.
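
The header layout and a few of the primitive types above can be illustrated
with a minimal sketch. It is not this module's Header class: the function
names are hypothetical, it works on an in-memory bytes object, and it assumes
IDs and sizes of at most 8 bytes. The Matroska epoch is taken to be
2001-01-01T00:00:00 UTC.

```python
import struct
from datetime import datetime, timedelta, timezone

# Matroska dates count nanoseconds from this instant.
MATROSKA_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def read_vint(buf, pos, keep_marker):
    # Decode the variable-length integer at buf[pos]; return (value, width).
    first = buf[pos]
    if first == 0:
        raise ValueError('widths above 8 bytes are not handled in this sketch')
    width = 9 - first.bit_length()          # leading zero bits, plus one
    if pos + width > len(buf):
        raise ValueError('truncated variable-length integer')
    value = int.from_bytes(buf[pos:pos + width], 'big')
    if not keep_marker:
        value &= (1 << (7 * width)) - 1     # clear the length-marker bit
    return value, width


def parse_header(buf, pos=0):
    # Return (element_id, data_size, header_size); data_size None = unknown.
    element_id, id_width = read_vint(buf, pos, keep_marker=True)
    size, size_width = read_vint(buf, pos + id_width, keep_marker=False)
    if size == (1 << (7 * size_width)) - 1:
        size = None                         # all ones is reserved
    return element_id, size, id_width + size_width


def decode_unsigned(data):
    return int.from_bytes(data, 'big')


def decode_signed(data):
    return int.from_bytes(data, 'big', signed=True)


def decode_float(data):
    # 4-byte payloads are single precision, 8-byte payloads double precision.
    return struct.unpack('>f' if len(data) == 4 else '>d', data)[0]


def decode_date(data):
    # Python datetimes stop at microseconds, so sub-microsecond digits drop.
    nanoseconds = int.from_bytes(data, 'big', signed=True)
    return MATROSKA_EPOCH + timedelta(microseconds=nanoseconds // 1000)


# A TimecodeScale element: 3-byte ID, 1-byte size, 3 bytes of data.
raw = bytes.fromhex('2ad7b1830f4240')
element_id, size, header_size = parse_header(raw)
value = decode_unsigned(raw[header_size:header_size + size])
```

Note that the marker bit is kept in the ID but cleared from the size, which
is why the two reads differ only in the keep_marker flag.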

The Matroska EBML Element specifications are stored in a special-purpose
dictionary called MATROSKA_TAGS which is used to create Elements of the
appropriate class when reading them from a stream.

This module defines the following classes for reading, storing, editing, and
writing the data in a Matroska file.

+ Header: stores and manipulates the Element header.
+ Container: stores child Elements. This is subclassed by ElementMaster and
  File. As File is not an Element -- it has no header -- neither is Container.
+ File: facilitates reading and writing Elements from a stream.
+ Element: base class for all EBML Elements.

Immediate subclasses of Element:

+ ElementMaster: Inherits both Element and Container.
+ ElementAtomic: Base class for all kinds of Elements that actually know how to
  interpret their data. Subclassed by ElementUnsigned, ElementUnicode, etc.
+ ElementVoid: Element that ignores its data on read and writes undefined
  values.
+ ElementUnsupported: An element this module does not support. It cannot be
  resized or written.

This module provides the Parsed descriptor, which is a convenience class that
allows Master Elements to read and write the data in child Elements using
attributes. For instance, the ElementInfo class has the segment_uid attribute;
if info is an instance of ElementInfo then info.segment_uid reads and writes the
value of its child SegmentUID. If no such child exists, reading
info.segment_uid returns a default value, and setting it creates the child.
This is much easier to use than, say,

    uid_elements = list(info.children_named('SegmentUID'))
    if uid_elements:
        return uid_elements[-1].value
    else:
        return default_value

The ElementSegment class takes advantage of Parsed descriptors to give easy
access to the segment metadata. Classes using this facility: ElementEBML,
ElementSegment, ElementSeek, ElementInfo, ElementTrackEntry, ElementVideo,
ElementAudio, ElementAttachedFile, etc.
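
How such a descriptor can work is sketched below using Python's descriptor
protocol. This is an illustration only, not the module's Parsed class: the
constructor signature, the Child stand-in, and the plain children list are
all assumptions made for the example.

```python
class Child:
    # Hypothetical stand-in for an atomic child Element.
    def __init__(self, name, value):
        self.name = name
        self.value = value


class Parsed:
    # Descriptor mapping an attribute onto the value of a named child.
    def __init__(self, child_name, default=None):
        self.child_name = child_name
        self.default = default

    def _matches(self, instance):
        return [c for c in instance.children if c.name == self.child_name]

    def __get__(self, instance, owner=None):
        if instance is None:
            return self                     # accessed on the class itself
        matches = self._matches(instance)
        # As in the snippet above, the last matching child wins.
        return matches[-1].value if matches else self.default

    def __set__(self, instance, value):
        matches = self._matches(instance)
        if matches:
            matches[-1].value = value
        else:
            instance.children.append(Child(self.child_name, value))


class Info:
    # Hypothetical stand-in for ElementInfo.
    segment_uid = Parsed('SegmentUID', default=None)

    def __init__(self):
        self.children = []


info = Info()
before = info.segment_uid         # no SegmentUID child yet: the default
info.segment_uid = bytes(16)      # creates the child
after = info.segment_uid          # now reads the child's value
```

Because the descriptor lives on the class, each new attribute costs one line
in the class body rather than a getter and setter pair.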

* Reading

The Container.read() method reads a list of children. It calls
Container.read_element() for each child, which checks if the Element is already
loaded; if so, it returns that Element, and otherwise it reads the header and
creates the appropriate Element instance. It then calls Element.read_data(),
which for Master Elements will recursively call Container.read(), and for Atomic
Elements will read, decode, and store its data. A Void Element will skip over
its data.

The Container.read() method supports a summary option, which causes it to call
Element.read_summary() instead of read_data(). The purpose of summary mode is
for large master Elements to read their metadata without reading the entire
Element, which may not even fit in memory. Currently the Elements implementing
read_summary() are ElementMasterDefer and ElementSegment. The former simply
skips reading its children in summary mode, and the latter intelligently finds
its metadata using SeekHead entries without reading its Cluster entries, which
generally comprise over 99% of the file. For the other Elements, read_summary()
simply calls read_data().
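
The idea behind summary mode, reading headers while seeking past bulky
payloads, can be sketched as follows. This is not how the module does it
(ElementSegment locates its metadata through SeekHead entries rather than by
scanning), and scan_children and skip_ids are hypothetical names. Elements of
unknown size are not handled.

```python
import io


def read_vint(stream, keep_marker):
    # Read one variable-length integer (up to 8 bytes) from the stream.
    first = stream.read(1)
    if not first or first[0] == 0:
        raise ValueError('end of stream or unsupported width')
    width = 9 - first[0].bit_length()
    rest = stream.read(width - 1)
    if len(rest) != width - 1:
        raise ValueError('truncated variable-length integer')
    value = int.from_bytes(first + rest, 'big')
    if not keep_marker:
        value &= (1 << (7 * width)) - 1     # clear the length-marker bit
    return value


def scan_children(stream, end, skip_ids):
    # Return (id, data_pos, size, data) per child; data is None if skipped.
    found = []
    while stream.tell() < end:
        element_id = read_vint(stream, keep_marker=True)
        size = read_vint(stream, keep_marker=False)
        data_pos = stream.tell()
        if element_id in skip_ids:
            stream.seek(size, io.SEEK_CUR)  # skip the payload unread
            data = None
        else:
            data = stream.read(size)
        found.append((element_id, data_pos, size, data))
    return found


INFO_ID = 0x1549A966
CLUSTER_ID = 0x1F43B675

# Two sibling elements: a tiny Info payload and a tiny Cluster payload.
raw = bytes.fromhex('1549a966820102' + '1f43b67584deadbeef')
children = scan_children(io.BytesIO(raw), len(raw), skip_ids={CLUSTER_ID})
```

Recording data_pos and size for skipped elements is what allows them to be
loaded later on demand.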

An Element records how much of it has been loaded in the read_state attribute.
Outside summary mode, Container.read_element() will still read an Element that
is only partially loaded.

File implements the read_summary() method, which calls read_summary() on each
top-level child. By default, the constructor of File runs read_summary().

* Writing

This module supports in-place modification of Matroska EBML files. In theory it
supports creating such files from scratch, except that it has no facility for
creating EBML Header elements (beyond doing so "by hand") or for writing
elements incrementally (so any data to be written must be stored in memory).
The system for in-place modifications is described here.

When modifying potentially very large EBML files, it is important to write only
the elements that have actually changed. The following are the ways in which an
Element may differ from its state in a stream:

1. Its position in the stream can be changed.
2. It can be resized.
3. Its value can be changed.
4. Child elements can be added, deleted, or moved.
5. It can be created programmatically.

The Element.dirty property is True if any of the above conditions holds. It is
calculated as follows. An Element stores the position in the stream at which it
was read along with its original size, so that it knows if either has changed.
An ElementAtomic also stores its original value (or a way of recognizing its
original value). An ElementMaster recursively checks if any of its children is
dirty. An element not read from a stream has no stored position or size, so it
is always dirty.
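
The dirty calculation described above might look roughly like this. The
classes are simplified, hypothetical stand-ins for ElementAtomic and
ElementMaster; in particular, tracking children by identity is an assumption
of the sketch, not something the module is known to do.

```python
class Atomic:
    # Hypothetical stand-in for ElementAtomic.
    def __init__(self, value, pos=None, size=None):
        self.value, self.pos, self.size = value, pos, size
        # A snapshot exists only if the element was read from a stream.
        self._original = None if pos is None else (pos, size, value)

    @property
    def dirty(self):
        if self._original is None:
            return True                     # created programmatically
        return (self.pos, self.size, self.value) != self._original


class Master:
    # Hypothetical stand-in for ElementMaster.
    def __init__(self, children, pos=None, size=None):
        self.children, self.pos, self.size = list(children), pos, size
        self._original = None if pos is None else (
            pos, size, [id(child) for child in self.children])

    @property
    def dirty(self):
        if self._original is None:
            return True
        # Comparing the identity list catches added, deleted, moved children.
        current = (self.pos, self.size, [id(c) for c in self.children])
        if current != self._original:
            return True
        return any(child.dirty for child in self.children)


title = Atomic('old title', pos=12, size=11)
master = Master([title], pos=0, size=23)
clean_before = master.dirty       # nothing has changed yet
title.value = 'new title'
dirty_after = master.dirty        # the change propagates upward
```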

The Container.write() method writes its children to a stream. It only writes
children for which the dirty property is True. For each such child it calls the
Element's write() method. Master elements will recursively call the container's
write() method, and Atomic elements will encode and write their data. A Void
Element just seeks the stream. When write() is called on an Atomic Element that
is not dirty, it should reproduce the bytes from which it was read.

Performing modifications may place a Container (e.g. a Master element) in an
inconsistent state. For example, a child element might grow, so that its data
overlaps the beginning of the next element, or a child might be deleted, which
leaves empty space. A Container's state is said to be consistent if the
following hold:

1. The first element starts at relative position zero.
2. Element i+1 starts immediately after element i ends.
3. Every child is one that the Matroska specification allows in this
   container.
4. Required children (as defined by the Matroska specification) are present.
5. Children that are required to be unique by the Matroska specification are in
   fact unique.
6. Every child container is consistent.
7. The values of non-container children are consistent with the Matroska
   specification (e.g. are contained in a specified range of values).

If a Container is an Element, it must satisfy the following properties in
addition:

8. The end of the last child coincides with the end of the Element's data.
9. Its parent is not None.
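
The purely positional conditions (1, 2 and 8) are easy to check mechanically.
The sketch below is an illustration under simplifying assumptions: children
are given as (relative position, total size) pairs in stream order, and
layout_problems is a hypothetical helper, not part of this module. The
specification-dependent conditions (3 to 7) are not covered.

```python
def layout_problems(children, data_size=None):
    # children: list of (relative_pos, total_size), in stream order.
    # data_size: size of the enclosing Element's data, or None for a File.
    problems = []
    expected = 0                            # where the next child must start
    for index, (pos, size) in enumerate(children):
        if pos > expected:
            problems.append(
                f'gap of {pos - expected} bytes before child {index}')
        elif pos < expected:
            problems.append(
                f'child {index} overlaps its predecessor by '
                f'{expected - pos} bytes')
        expected = pos + size
    if data_size is not None and expected != data_size:
        problems.append(
            f'children end at {expected} but the data ends at {data_size}')
    return problems


consistent = layout_problems([(0, 4), (4, 6)], data_size=10)
after_growth = layout_problems([(0, 6), (4, 6)], data_size=10)
after_delete = layout_problems([(0, 4), (7, 3)], data_size=10)
```

Here after_growth reports an overlap (the first child grew by two bytes) and
after_delete reports a gap (a three-byte child was removed).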

A Container will generally refuse to write its contents to disk if it is in an
inconsistent state. To facilitate putting the Container in a consistent state,
it provides the rearrange() method, which should be called before write(). This
method rearranges the Container's children, potentially shrinking and moving
them, so that there are no overlaps, recursively calling rearrange() on each
Master child. It deletes and creates Void elements as necessary, and supports
several options for controlling its behavior. The Container may still be in an
inconsistent state after calling rearrange() if its contents violate the
Matroska specification in some way (e.g. if it has an impermissible child).
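
A much-simplified picture of what rearranging involves: lay the children out
back to back and pad the remaining space with a Void element. The sketch
below is not the module's rearrange(); pack_children is a hypothetical
function that never shrinks or reorders children, and it assumes the smallest
possible Void element is 2 bytes (a 1-byte ID, 0xEC, plus a 1-byte size).

```python
VOID_MIN_SIZE = 2   # 1-byte Void ID plus a 1-byte size field, no data


def pack_children(sizes, data_size):
    # sizes: total size (header plus data) of each child, in order.
    # Return (label, relative_pos, size) triples filling data_size exactly.
    layout = []
    pos = 0
    for index, size in enumerate(sizes):
        layout.append((index, pos, size))   # each child follows the last
        pos += size
    leftover = data_size - pos
    if leftover < 0:
        raise ValueError('children do not fit in the available space')
    if 0 < leftover < VOID_MIN_SIZE:
        raise ValueError('gap is too small to hold a Void element')
    if leftover:
        layout.append(('Void', pos, leftover))
    return layout


layout = pack_children([4, 6], data_size=15)
```

A real implementation needs a strategy for the two error cases, for example
moving a child elsewhere or resizing a neighbor, which is presumably what the
options of rearrange() control.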

The Segment Element is more intelligent in its rearrange() method. It generates
a SeekHead element at the beginning of the file with links to its children. It
tries to move the more important children before the Clusters, and moves the
rest to the end of the file. Its requirements for consistency are also a bit
more specific than the ones outlined above.

* Viewing

Each Element implements __repr__() and __str__(). The former returns the class
name and some size information, whereas the latter also includes some
information about the contents of the Element. The return value of each should
fit on one line.

Element instances also implement the summary() method, which returns a summary
of the Element contents. By default, summary() returns the output of __str__().
The output may span multiple lines, although it is not terminated by a newline.

Container instances implement two additional methods, print_children() and
print_space(). The former recursively runs __str__() on all child Elements (up
to a specified recursion depth) and concatenates them with indentation in a
newline-terminated string. The latter returns a newline-terminated table
summarizing which child (and descendant) elements occupy which blocks of space.
"""

__all__ = ['EbmlException', 'Inconsistent', 'DecodeError', 'MAX_DATA_SIZE']

################################################################################
# * Exception

from .. import MoviesException

class EbmlException(MoviesException):
    """Class for general EBML exceptions."""

class Inconsistent(EbmlException):
    """Raised when a Container is not in a consistent state."""

class DecodeError(EbmlException):
    """Class for EBML decoding errors."""


################################################################################
# * Constants

# Maximum data size that EBML can encode: an 8-byte size field carries 56
# value bits, and the all-ones value is reserved to mean "unknown size".
MAX_DATA_SIZE = (1 << 56) - 2
