Batteries-included structural document IR parser for hwp, hwpx, and docx.
Provides a unified Pydantic-based data model for document structure, styles, and
content, along with APIs for parsing, editing, annotation, and HTML export.
Requires Python 3.13+

Install with pip:

```bash
pip install document-processor
```

Local development:

```bash
uv pip install -e /path/to/document-processor
```

| Package | Purpose |
|---|---|
| pydantic | IR models and validation |
| python-docx | DOCX parsing and native write-back |
| jpype1 | HWP conversion via Java interop |

```python
from document_processor import DocIR

doc = DocIR.from_file("/path/to/file.docx")
print(doc.paragraphs[0].text)
print(doc.paragraphs[0].runs[0].run_style.bold)
html = doc.to_html(title="Preview")
```

The package covers:
- document parsing (DOCX, HWPX, HWP)
- style extraction (fonts, colors, alignment, spacing, borders, ...)
- structural IR creation
- embedded image extraction for `docx` and `hwpx`
- stateless text editing with native file write-back
- annotation resolution and review HTML rendering
All IR models include a .meta field for attaching processing metadata
(e.g. for LLMs, RAG, analysis).
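
```python
import json

from pydantic import BaseModel

from document_processor import DocIR


class MyMetaData(BaseModel):
    a: int = 1
    b: str = "test"


# `files` (an iterable of pathlib.Path inputs) and `out_dir` (an output
# directory as a pathlib.Path) are assumed to be defined by the caller.
for file_ in files:
    doc = DocIR.from_file(file_)

    # Attach custom metadata to the first run of the first paragraph.
    metainfo = MyMetaData(a=2)
    doc.paragraphs[0].runs[0].meta = metainfo

    # Export the IR as JSON and as HTML, side by side.
    with (
        open((out_dir / file_.stem).with_suffix(".json"), "w", encoding="utf-8") as json_f,
        open((out_dir / file_.stem).with_suffix(".html"), "w", encoding="utf-8") as html_f,
    ):
        json.dump(doc.model_dump(mode="json"), json_f, indent=4, ensure_ascii=False)
        html_f.write(doc.to_html())

    print(f"completed: {file_}")
```

Note: Metadata objects must extend Pydantic `BaseModel`; otherwise a validation error is raised.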
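
Once a document has been exported with `model_dump(mode="json")`, the attached metadata can be located again without the library. The following is a minimal sketch that assumes only that metadata appears under keys named `meta` in the dumped structure; the helper `iter_meta` and the hand-written `sample` are hypothetical and not part of the package.

```python
from typing import Any, Iterator


def iter_meta(node: Any, path: str = "$") -> Iterator[tuple[str, Any]]:
    """Yield (path, value) for every non-null "meta" key in a dumped document."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "meta" and value is not None:
                yield child, value
            else:
                yield from iter_meta(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from iter_meta(item, f"{path}[{index}]")


# Hand-written stand-in for a loaded JSON export; a real dump has more fields.
sample = {
    "meta": None,
    "paragraphs": [
        {
            "text": "Hello",
            "meta": None,
            "runs": [{"text": "Hello", "meta": {"a": 2, "b": "test"}}],
        }
    ],
}

found = list(iter_meta(sample))
for location, value in found:
    print(location, value)
```

Walking by key name rather than by a fixed schema keeps the sketch independent of the exact IR layout, at the cost of also matching any unrelated field that happens to be called `meta`.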
Parsed image binaries are stored once on DocIR.assets, and paragraph-like nodes keep ordered
content entries so text, tables, and images can be rendered in source order.

```python
from document_processor import DocIR

doc = DocIR.from_file("/path/to/file.docx")
first_asset = next(iter(doc.assets.values()))
html = doc.to_html()
```

The stateless edit API lets you apply text edits to documents. Edits are
validated before application, and results can be returned as an updated DocIR,
written back to the native file format, or returned as bytes.

```python
from document_processor import (
    apply_text_edits,
    ApplyTextEditsRequest,
    DocumentInput,
    TextEdit,
)

result = apply_text_edits(ApplyTextEditsRequest(
    document=DocumentInput(source_path="/path/to/file.docx"),
    edits=[TextEdit(
        target_unit_id="s1.p3",
        expected_text="old text",
        new_text="new text",
    )],
))
```

Related helpers:
- `get_document_context()`: fetch surrounding paragraphs for target IDs
- `list_editable_targets()`: enumerate safe edit targets
- `validate_text_edits()`: dry-run validation without applying

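
The validate-then-apply flow described above can be pictured with a small standalone sketch. It assumes that `expected_text` acts as a guard that must be present in the target unit before the replacement happens; that reading, and the names `SketchEdit`, `validate_edits`, and `apply_edits`, are illustrative assumptions rather than the library's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchEdit:
    target_unit_id: str
    expected_text: str
    new_text: str


def validate_edits(units: dict[str, str], edits: list[SketchEdit]) -> list[str]:
    """Dry run: return a list of problems without touching the units."""
    errors = []
    for edit in edits:
        current = units.get(edit.target_unit_id)
        if current is None:
            errors.append(f"{edit.target_unit_id}: unknown target")
        elif edit.expected_text not in current:
            errors.append(f"{edit.target_unit_id}: expected text not found")
    return errors


def apply_edits(units: dict[str, str], edits: list[SketchEdit]) -> dict[str, str]:
    """Validate first, then return an updated copy; the input is never mutated."""
    errors = validate_edits(units, edits)
    if errors:
        raise ValueError("; ".join(errors))
    updated = dict(units)
    for edit in edits:
        text = updated[edit.target_unit_id]
        updated[edit.target_unit_id] = text.replace(edit.expected_text, edit.new_text, 1)
    return updated


units = {"s1.p3": "some old text here"}
edit = SketchEdit("s1.p3", "old text", "new text")
result = apply_edits(units, [edit])
stale = validate_edits(units, [SketchEdit("s1.p3", "missing", "x")])
print(result, stale)
```

Returning a fresh mapping instead of editing in place is what makes the sketch stateless: a failed validation leaves nothing half-applied, and the caller decides what to do with the result.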
Resolve text annotations against a document and render a highlighted review page:

```python
from document_processor import (
    render_review_html,
    RenderReviewHtmlRequest,
    DocumentInput,
    TextAnnotation,
)

result = render_review_html(RenderReviewHtmlRequest(
    document=DocumentInput(source_path="/path/to/file.docx"),
    annotations=[TextAnnotation(
        anchor_text="some phrase",
        comment="Needs revision",
    )],
))
html = result.html
```

Render a parsed document to styled HTML:

```python
from document_processor import DocIR

doc = DocIR.from_file("/path/to/file.docx")
html = doc.to_html(title="Preview")
```

Install the visualization extra first:

```bash
pip install "document-processor[viz]"
```

Erdantic also needs Graphviz available on the system.

Render the default DocIR model diagram:

```bash
document-processor-diagram --out docir.svg
```

Render a package-scope diagram with IR fields/methods plus the main `core/` modules:

```bash
document-processor-diagram --kind package --out package.svg
```

Render a custom model by dotted import path:

```bash
document-processor-diagram --model document_processor.DocIR --out docir.png
```

Or use the Python helper:

```python
from document_processor import draw_model_diagram

draw_model_diagram(out="docir.svg")
```

Figure: ERD for the pydantic models.
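
Because the diagram commands depend on a system-level Graphviz install, a quick preflight check can save a confusing failure. The sketch below only looks for Graphviz's `dot` executable on `PATH` using the standard library; the helper name `graphviz_available` is hypothetical and not part of the package.

```python
import shutil


def graphviz_available(executable: str = "dot") -> bool:
    """Return True if the given Graphviz executable can be found on PATH."""
    return shutil.which(executable) is not None


if not graphviz_available():
    print("Graphviz 'dot' not found on PATH; install Graphviz before rendering diagrams.")
```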