gh-151763: Fix OOM-0034 tokenizer offset error handling #151798

Open
zainnadeem786 wants to merge 1 commit into python:main from zainnadeem786:fix/oom-0034-tokenizer

@zainnadeem786

Summary

This PR addresses OOM-0034 from gh-151763.

It fixes an out-of-memory (OOM) failure path in the tokenizer offset conversion logic that could lead to:

  • a NULL pointer dereference (access violation), or
  • a tokenizer result being returned while a MemoryError was still pending.

Issue

_PyPegen_byte_offset_to_character_offset_line() in Parser/pegen.c calls PyUnicode_AsUTF8() and immediately dereferences the returned pointer.

Under memory-allocation failure, PyUnicode_AsUTF8() can return NULL with a pending MemoryError. The existing code did not check for this condition before accessing the returned buffer.

Example crash path:

tokenizeriter_next()
└── _get_col_offsets()
    └── _PyPegen_byte_offset_to_character_offset_line()
        └── PyUnicode_AsUTF8()
            └── returns NULL under OOM
                └── NULL dereference

The issue was reproducible with _testcapi.set_nomemory() using non-ASCII source lines that require byte-offset to character-offset conversion.
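To make the byte-offset versus character-offset distinction concrete, here is a minimal Python sketch of the conversion for a single line. It is an illustration only, not the C implementation in Parser/pegen.c; the function name and the use of errors="replace" are assumptions of this sketch.

```python
def byte_offset_to_char_offset(line: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset within `line` to a character offset.

    Hypothetical helper for illustration; not part of CPython's API.
    """
    encoded = line.encode("utf-8")
    # Decode only the bytes before the offset and count the characters.
    # errors="replace" (an assumption here) keeps the call from raising
    # if the offset lands inside a multi-byte sequence.
    return len(encoded[:byte_offset].decode("utf-8", errors="replace"))


line = "  \u00e9 = 1\n"
# "\u00e9" occupies two bytes in UTF-8, so offsets diverge after it:
# the "=" sits at byte offset 5 but at character offset 4.
assert byte_offset_to_char_offset(line, 5) == 4
# For the ASCII-only prefix, both offsets agree.
assert byte_offset_to_char_offset(line, 2) == 2
```

For pure-ASCII lines the two offsets are always equal, which is why only non-ASCII source lines reach the conversion path that fails under OOM.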

Reproduction

Using a CPython debug build and OOM injection:

import _testcapi
import _tokenize

source = "if True:\n  \u00e9 = 1\n"

it = _tokenize.TokenizerIter(
    iter(source.splitlines(True)).__next__,
    extra_tokens=False,
)

for _ in range(5):
    next(it)

_testcapi.set_nomemory(2, 3)
next(it)

Observed result before the fix:

Windows fatal exception: access violation
RC 3221225477

Further investigation showed that a helper-only NULL check was not sufficient.

_get_col_offsets() ignored failure returns from offset-conversion helpers, allowing tokenizeriter_next() to continue execution with a pending MemoryError.

This could also result in:

SystemError:
<built-in function next> returned a result with an exception set

Root Cause

Two independent problems existed:

  1. _PyPegen_byte_offset_to_character_offset_line() did not check whether PyUnicode_AsUTF8() failed.

  2. _get_col_offsets() ignored failures from:

    • _PyPegen_byte_offset_to_character_offset_line()
    • _PyPegen_byte_offset_to_character_offset_raw()

    As a result, tokenizer execution could continue after offset conversion had already failed.

Fix

This PR:

  • Adds a NULL check after PyUnicode_AsUTF8().
  • Returns -1 when UTF-8 conversion fails.
  • Changes _get_col_offsets() from void to int.
  • Propagates failures from both offset-conversion helpers.
  • Stops token generation immediately when offset conversion fails.
  • Preserves the pending MemoryError.
  • Avoids modifying tokenizer state when offset computation fails.
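The control flow described in the list above can be modeled in a few lines of Python. This is a sketch of the pattern only, not the C patch: it deliberately mirrors the C convention of returning -1 on failure, and every name in it (TokenPosition, utf8_prefix_length, get_col_offsets, the injectable convert argument) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TokenPosition:
    """Hypothetical stand-in for the tokenizer's column-offset state."""
    col_offset: int = -1
    end_col_offset: int = -1


def utf8_prefix_length(line: bytes, byte_offset: int) -> int:
    """Count the characters before `byte_offset`; a helper that never fails."""
    return len(line[:byte_offset].decode("utf-8", errors="replace"))


def get_col_offsets(
    convert: Callable[[bytes, int], int],
    line: bytes,
    start: int,
    end: int,
    pos: TokenPosition,
) -> int:
    """Return 0 on success and -1 on failure, leaving `pos` untouched on failure."""
    col = convert(line, start)
    if col < 0:
        return -1  # propagate the failure instead of ignoring it
    end_col = convert(line, end)
    if end_col < 0:
        return -1
    # Commit only after every conversion has succeeded.
    pos.col_offset, pos.end_col_offset = col, end_col
    return 0


line = "  \u00e9 = 1\n".encode("utf-8")

ok = TokenPosition()
assert get_col_offsets(utf8_prefix_length, line, 2, 4, ok) == 0
assert (ok.col_offset, ok.end_col_offset) == (2, 3)

# A converter that always fails stands in for the OOM case.
failed = TokenPosition()
assert get_col_offsets(lambda _line, _offset: -1, line, 2, 4, failed) == -1
assert failed == TokenPosition()
```

Computing both offsets into locals and assigning them only at the end is what keeps the state unchanged when either conversion fails.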

Regression Tests

Adds regression coverage in Lib/test/test_tokenize.py.

The test:

  • Uses _testcapi.set_nomemory() in a subprocess.
  • Exercises the non-ASCII tokenizer path.
  • Exercises the raw offset-conversion path.
  • Sweeps allocation-failure indexes rather than relying on a single build-specific allocation point.
  • Verifies that OOM conditions produce MemoryError instead of crashes or invalid tokenizer results.
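A sweep of this kind might look roughly like the sketch below. It is not the test added in Lib/test/test_tokenize.py; the helper names, the index range, and the pass/fail criterion are assumptions. It relies on _testcapi.set_nomemory() and _testcapi.remove_mem_hooks(), which are only available in CPython builds that ship the test modules.

```python
import subprocess
import sys
import textwrap


def build_oom_script(start: int) -> str:
    """Return child-process source that fails one allocation at index `start`."""
    return textwrap.dedent(f"""
        import _testcapi
        import _tokenize

        source = "if True:\\n  \\u00e9 = 1\\n"
        it = _tokenize.TokenizerIter(
            iter(source.splitlines(True)).__next__,
            extra_tokens=False,
        )
        try:
            _testcapi.set_nomemory({start}, {start + 1})
            for _ in it:
                pass
        except MemoryError:
            pass  # the expected outcome under injected OOM
        finally:
            _testcapi.remove_mem_hooks()
    """)


def run_sweep(indexes=range(1, 20)):
    """Run one child per failure index; return (index, returncode) for failures."""
    failures = []
    for start in indexes:
        proc = subprocess.run(
            [sys.executable, "-c", build_oom_script(start)],
            capture_output=True,
        )
        # A crash or an uncaught SystemError both give a nonzero exit status.
        if proc.returncode != 0:
            failures.append((start, proc.returncode))
    return failures
```

Running each index in its own subprocess means a crash in one child is reported as a nonzero return code rather than taking down the test runner. Calling run_sweep() on a fixed debug build would be expected to return an empty list.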

Validation

Built and tested using a CPython debug build:

PCbuild\build.bat -p x64 -c Debug

Executed:

PCbuild/amd64/python_d.exe -m test test_tokenize

Result:

131 tests passed

Also verified:

git diff --check

No whitespace errors were reported.

Impact

This change converts tokenizer OOM crash paths into correct exception propagation and ensures that allocation failures during column-offset conversion are handled safely and consistently.

Addresses OOM-0034 from gh-151763.

@zainnadeem786

Author

The failing TSan free-threading job appears to fail in test_abc, which is unrelated to this tokenizer/pegen change. All other relevant checks, including the test_tokenize, lint, ASAN, UBSAN, normal TSan, Windows, macOS, and Ubuntu jobs, are passing. Happy to rerun the failed job if needed.
