Skip expensive debug_lines computation in AOT autograd cache #179733
frgossen wants to merge 1 commit into gh/frgossen/15/base
FxGraphCachePickler.debug_lines re-hashes every attribute of the cache details object individually. This runs unconditionally even when debug logging is disabled.

Gate the computation behind log.isEnabledFor(logging.DEBUG) so the cost is only paid when someone is actively debugging cache key differences.

On a vLLM Meta-Llama-3-70B-Instruct TP=4 benchmark, this reduces cold compile time from 30.50 ± 0.50 s to 29.40 ± 0.90 s (1.04x) and cache lookup time from 11.35 ± 0.45 ms to 6.25 ± 0.30 ms (1.82x).

Authored with Claude.
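The gating pattern described above can be sketched in isolation. This is a minimal illustration, not the actual PyTorch code: the logger name, `expensive_debug_lines`, and `compute_cache_key` are hypothetical stand-ins, and the per-attribute hashing only approximates what FxGraphCachePickler.debug_lines does.

```python
import hashlib
import logging
import pickle

# Hypothetical logger; the real module uses its own `log` object.
log = logging.getLogger("cache_sketch")


def expensive_debug_lines(details: dict) -> list[str]:
    """Hash every attribute on its own (stand-in for the per-attribute re-hash)."""
    lines = []
    for name, value in details.items():
        digest = hashlib.sha256(pickle.dumps(value)).hexdigest()[:16]
        lines.append(f"[{digest}] {name}: {value!r}")
    return lines


def compute_cache_key(details: dict) -> tuple[str, list[str]]:
    """Return the cache key, plus debug lines only when DEBUG logging is on."""
    key = hashlib.sha256(pickle.dumps(details)).hexdigest()
    debug_lines: list[str] = []
    # The guard: skip the per-attribute work unless someone is debugging.
    if log.isEnabledFor(logging.DEBUG):
        debug_lines = expensive_debug_lines(details)
        log.debug("cache key %s components:\n%s", key, "\n".join(debug_lines))
    return key, debug_lines
```

With the logger at INFO the second element of the result is an empty list and no per-attribute hashing happens; at DEBUG it contains one line per attribute. The cache key itself is the same in both cases.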
Starting merge as part of PR stack under #179910
@pytorchbot merge
_get_dict is called from save_config_portable on every AOT autograd cache key computation.

1. It called copy.deepcopy on every config value, but the vast majority are immutable types (bool, int, str, None) that don't need copying. Now only list/set/dict values are deep-copied.
2. It went through __getattr__ for every value, which includes deprecation warning checks, alias resolution, and other overhead. Now reads values directly from config entries.

On a vLLM Meta-Llama-3-70B-Instruct TP=4 benchmark, this reduces cold compile time from 29.40 ± 0.90 s to 28.50 ± 0.40 s (1.03x) and cache lookup time from 6.25 ± 0.30 ms to 5.20 ± 0.45 ms (1.20x).

Authored with Claude.

Pull Request resolved: #179734
Approved by: https://github.com/aorenste
ghstack dependencies: #179733
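The selective deep-copy in point 1 can be illustrated with a small sketch. It assumes the config is available as a plain dict of values; `snapshot_config` and `_MUTABLE_CONTAINERS` are hypothetical names, not the real _get_dict implementation, and the __getattr__ bypass from point 2 is not modeled here.

```python
import copy

# Only these container types are copied; everything else is shared as-is.
_MUTABLE_CONTAINERS = (list, set, dict)


def snapshot_config(config: dict) -> dict:
    """Copy a config mapping, deep-copying only mutable container values."""
    return {
        name: copy.deepcopy(value) if isinstance(value, _MUTABLE_CONTAINERS) else value
        for name, value in config.items()
    }


config = {"flag": True, "threads": 4, "backend": "inductor", "paths": ["a"], "opts": {"k": [1]}}
snapshot = snapshot_config(config)
snapshot["opts"]["k"].append(2)  # mutating the snapshot leaves the source untouched
```

Immutable values such as bool, int, str, and None are safe to share between the original and the snapshot, which is why skipping the copy for them does not change behavior.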
cc @oulgen @jamesjwu @aorenste @anijain2305 @laithsakka @penguinwu @masnesral @coconutruben @aditvenk