gh-151763: Fix OOM-0034 tokenizer offset error handling by zainnadeem786 · Pull Request #151798 · python/cpython

zainnadeem786 · 2026-06-20T13:29:54Z

Summary

This PR addresses OOM-0034 from gh-151763.

It fixes an out-of-memory (OOM) failure path in the tokenizer offset conversion logic that could lead to either of the following:

- a NULL pointer dereference (access violation)
- a tokenizer result being returned while a MemoryError was still pending

Issue

_PyPegen_byte_offset_to_character_offset_line() in Parser/pegen.c calls PyUnicode_AsUTF8() and immediately dereferences the returned pointer.

Under memory-allocation failure, PyUnicode_AsUTF8() can return NULL with a pending MemoryError. The existing code did not check for this condition before accessing the returned buffer.

Example crash path:

tokenizeriter_next()
└── _get_col_offsets()
    └── _PyPegen_byte_offset_to_character_offset_line()
        └── PyUnicode_AsUTF8()
            └── returns NULL under OOM
                └── NULL dereference

The issue was reproducible with _testcapi.set_nomemory() using non-ASCII source lines that require byte-offset to character-offset conversion.
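To illustrate why only non-ASCII lines reach this path, here is a minimal Python model of byte-offset to character-offset conversion. It is a sketch of the idea, not the C implementation in Parser/pegen.c; the function name `byte_offset_to_char_offset` is hypothetical.

```python
def byte_offset_to_char_offset(line: str, byte_offset: int) -> int:
    """Return the character offset that corresponds to a UTF-8 byte offset."""
    encoded = line.encode("utf-8")
    # Decode only the bytes before the offset. errors="replace" keeps a cut
    # that lands in the middle of a multi-byte character from raising.
    return len(encoded[:byte_offset].decode("utf-8", errors="replace"))


line = "  \u00e9 = 1\n"  # the non-ASCII line from the reproduction source
print(byte_offset_to_char_offset(line, 2))  # 2: only ASCII spaces precede the offset
print(byte_offset_to_char_offset(line, 4))  # 3: the two-byte character counts once
```

For a pure-ASCII line the two offsets are always equal, so no conversion buffer is needed and the failing allocation is never attempted.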

Reproduction

Using a CPython debug build and OOM injection:

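import _testcapi
import _tokenize

# The second line contains a non-ASCII identifier, so its tokens need
# byte-offset to character-offset conversion.
source = "if True:\n  \u00e9 = 1\n"

it = _tokenize.TokenizerIter(
    iter(source.splitlines(True)).__next__,
    extra_tokens=False,
)

# Consume the first five tokens normally.
for _ in range(5):
    next(it)

# Inject allocation failures, then request the next token.
_testcapi.set_nomemory(2, 3)
next(it)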

Observed result before the fix:

Windows fatal exception: access violation
RC 3221225477
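
The return code is easier to recognize in hexadecimal. A quick check, assuming only the standard Windows status code for access violations:

```python
rc = 3221225477
# 0xC0000005 is the Windows status code STATUS_ACCESS_VIOLATION.
print(hex(rc))  # 0xc0000005
```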

Further investigation showed that a helper-only NULL check was not sufficient.

_get_col_offsets() ignored failure returns from offset-conversion helpers, allowing tokenizeriter_next() to continue execution with a pending MemoryError.

This could also result in:

SystemError: <built-in function next> returned a result with an exception set

Root Cause

Two independent problems existed:

- _PyPegen_byte_offset_to_character_offset_line() did not check whether PyUnicode_AsUTF8() failed.
- _get_col_offsets() ignored failures from:
  - _PyPegen_byte_offset_to_character_offset_line()
  - _PyPegen_byte_offset_to_character_offset_raw()

As a result, tokenizer execution could continue after offset conversion had already failed.

Fix

This PR:

- Adds a NULL check after PyUnicode_AsUTF8().
- Returns -1 when UTF-8 conversion fails.
- Changes _get_col_offsets() from void to int.
- Propagates failures from both offset-conversion helpers.
- Stops token generation immediately when offset conversion fails.
- Preserves the pending MemoryError.
- Avoids modifying tokenizer state when offset computation fails.
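
The propagation pattern in the list above can be modelled in Python as a rough sketch. This is not the C code from the PR: `TokenOffsets`, `convert_offset`, `get_col_offsets`, and the `fail` / `fail_on` switches are hypothetical names used only to imitate the C convention of returning -1 on failure.

```python
from dataclasses import dataclass


@dataclass
class TokenOffsets:
    # Hypothetical stand-in for the tokenizer's per-token column state.
    col_offset: int = -1
    end_col_offset: int = -1


def convert_offset(line: str, byte_offset: int, fail: bool = False) -> int:
    """Hypothetical helper: returns -1 on failure, like the C helpers."""
    if fail:
        # Models PyUnicode_AsUTF8() returning NULL under OOM.
        return -1
    return len(line.encode("utf-8")[:byte_offset].decode("utf-8", errors="replace"))


def get_col_offsets(state: TokenOffsets, line: str, start: int, end: int,
                    fail_on: str = "") -> int:
    """Return 0 on success and -1 on failure; write to state only on success."""
    col = convert_offset(line, start, fail=(fail_on == "start"))
    if col < 0:
        return -1
    end_col = convert_offset(line, end, fail=(fail_on == "end"))
    if end_col < 0:
        return -1
    # Both conversions succeeded, so it is now safe to update the state.
    state.col_offset = col
    state.end_col_offset = end_col
    return 0


state = TokenOffsets()
print(get_col_offsets(state, "  \u00e9 = 1\n", 2, 4, fail_on="end"))  # -1
print(state)  # unchanged: TokenOffsets(col_offset=-1, end_col_offset=-1)
```

Computing both offsets into locals before assigning is what keeps a half-finished conversion from leaking into the caller's state.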

Regression Tests

Adds regression coverage in Lib/test/test_tokenize.py.

The test:

- Uses _testcapi.set_nomemory() in a subprocess.
- Exercises the non-ASCII tokenizer path.
- Exercises the raw offset-conversion path.
- Sweeps allocation-failure indexes rather than relying on a single build-specific allocation point.
- Verifies that OOM conditions produce MemoryError instead of crashes or invalid tokenizer results.
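
A sweep of this kind might look roughly like the sketch below. It is not the test added in Lib/test/test_tokenize.py; the names `CHILD_TEMPLATE`, `classify`, and `sweep`, the sweep range, and the one-allocation failure window are all assumptions made for illustration. The child script needs a build that provides `_testcapi` and `_tokenize`.

```python
import subprocess
import sys
import textwrap

# Script run in a child process so that a crash cannot take down the runner.
CHILD_TEMPLATE = textwrap.dedent("""\
    import _testcapi
    import _tokenize

    source = "if True:\\n  \\u00e9 = 1\\n"
    it = _tokenize.TokenizerIter(
        iter(source.splitlines(True)).__next__,
        extra_tokens=False,
    )
    try:
        _testcapi.set_nomemory({start}, {stop})
        for _ in it:
            pass
    except MemoryError:
        pass
    finally:
        _testcapi.remove_mem_hooks()
""")


def classify(returncode: int, stderr: str) -> str:
    """Map a child's exit status to a coarse outcome label."""
    if returncode == 0:
        return "ok"
    if "SystemError" in stderr:
        return "system-error"  # a result was returned with an exception set
    if "MemoryError" in stderr:
        return "memory-error"  # MemoryError escaped outside the try block
    return "crash"  # e.g. access violation or fatal signal


def sweep(max_start: int = 20) -> dict:
    """Run the child once per allocation-failure index and collect outcomes."""
    results = {}
    for start in range(1, max_start + 1):
        script = CHILD_TEMPLATE.format(start=start, stop=start + 1)
        proc = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True, text=True, errors="replace",
        )
        results[start] = classify(proc.returncode, proc.stderr)
    return results
```

Calling `sweep()` on a debug build and asserting that no entry is `"crash"` or `"system-error"` would approximate the check described above without depending on which allocation index happens to hit the conversion path.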

Validation

Built and tested using a CPython debug build:

PCbuild\build.bat -p x64 -c Debug

Executed:

PCbuild/amd64/python_d.exe -m test test_tokenize

Result:

131 tests passed

Also verified:

git diff --check

No whitespace errors were reported.

Impact

With this change, allocation failures during column-offset conversion in the tokenizer raise MemoryError instead of crashing or returning a token while an exception is still set.

Addresses OOM-0034 from gh-151763.

zainnadeem786 · 2026-06-20T14:07:42Z

The TSan free-threading job fails in test_abc, which appears unrelated to this tokenizer/pegen change. All other relevant checks, including test_tokenize, lint, ASAN, UBSAN, normal TSan, Windows, macOS, and Ubuntu jobs, are passing. Happy to rerun the failed job if needed.

pythongh-151763: Fix OOM-0034 tokenizer offset error handling

82b78b0

zainnadeem786 requested review from lysnikolaou and pablogsal as code owners June 20, 2026 13:29

bedevere-app Bot mentioned this pull request Jun 20, 2026

CPython crashes on memory-allocation failure (OOM): 35 findings #151763

Open

bedevere-app Bot added the awaiting review label Jun 20, 2026

zainnadeem786 wants to merge 1 commit into python:main from zainnadeem786:fix/oom-0034-tokenizer
