Columbia IBM Project: Implementing Paged Attention with Flex Attention by thomasjoshi · Pull Request #421 · foundation-model-stack/foundation-model-stack

thomasjoshi · 2025-06-02T17:30:18Z

Our project aims to integrate PyTorch's Paged Attention into the Foundation Model Stack (FMS) using Flex Attention. We intend to enhance memory efficiency and inference speed for long-context language models without sacrificing model accuracy. Specifically, we will implement a dynamic, paged key-value (KV) cache that minimizes memory fragmentation, benchmark its performance against standard attention mechanisms, and evaluate the impact of various paging strategies on overall model performance.
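As a rough illustration of the paged KV cache idea described above, here is a minimal pure-Python sketch of a page table that hands out fixed-size pages on demand and returns them to a free list. The class `PagedKVCache` and its methods are hypothetical names chosen for this example. They are not the PR's code and not PyTorch's experimental paged attention API. The sketch only tracks where tokens would live and stores no tensors.

```python
class PagedKVCache:
    """Toy page table (hypothetical): maps each sequence's token positions
    to (physical_page, offset) slots in a pool of fixed-size pages."""

    def __init__(self, num_pages, page_size):
        if num_pages <= 0 or page_size <= 0:
            raise ValueError("num_pages and page_size must be positive")
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # pool of unused physical pages
        self.page_table = {}  # seq_id -> list of physical page ids, in logical order
        self.seq_lens = {}    # seq_id -> number of tokens stored so far

    def append(self, seq_id, num_tokens=1):
        """Reserve slots for num_tokens new tokens and return them."""
        pages = self.page_table.setdefault(seq_id, [])
        slots = []
        for _ in range(num_tokens):
            pos = self.seq_lens.get(seq_id, 0)
            # A new page is needed only when the last one is full.
            if pos // self.page_size == len(pages):
                if not self.free_pages:
                    raise MemoryError("KV cache is out of pages")
                pages.append(self.free_pages.pop())
            slots.append((pages[pos // self.page_size], pos % self.page_size))
            self.seq_lens[seq_id] = pos + 1
        return slots

    def free(self, seq_id):
        """Return all pages of a finished sequence to the pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because pages are allocated one at a time and need not be contiguous, a sequence wastes at most one partially filled page, which is the sense in which paging limits fragmentation.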

* Add `_flex_attn` attribute with lazy instantiation so we avoid recreating `FlexAttention` on every forward pass; sync dropout dynamically with training / eval mode.
* Improve `_validate_paged_attention` with positive-length check.
* Extend class docstring to document `paged_attention_config`.
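The pattern in these changes can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's implementation. Only `_flex_attn`, `_validate_paged_attention`, and `paged_attention_config` are names from the description. `FlexKernelStub`, `PagedAttentionSketch`, the `flex_attn` property, and the config keys `page_size` and `max_seq_len` are hypothetical stand-ins. A plain Python stub replaces the real kernel so the example runs without PyTorch.

```python
class FlexKernelStub:
    """Hypothetical stand-in for an attention kernel that is costly to build."""
    instances = 0  # counts constructions, to show the kernel is built once

    def __init__(self):
        FlexKernelStub.instances += 1
        self.dropout_p = 0.0


class PagedAttentionSketch:
    """Sketch of lazy kernel creation plus config validation.

    paged_attention_config: dict with assumed keys `page_size` and
    `max_seq_len`, both of which must be positive integers.
    """

    def __init__(self, paged_attention_config, dropout_p=0.1):
        self._validate_paged_attention(paged_attention_config)
        self.paged_attention_config = paged_attention_config
        self.dropout_p = dropout_p
        self.training = True     # mirrors a module's train/eval flag
        self._flex_attn = None   # created on first use, then reused

    @staticmethod
    def _validate_paged_attention(config):
        # Positive-length check on every size-like field (assumed key names).
        for key in ("page_size", "max_seq_len"):
            value = config.get(key)
            if not isinstance(value, int) or value <= 0:
                raise ValueError(f"{key} must be a positive integer, got {value!r}")

    @property
    def flex_attn(self):
        if self._flex_attn is None:
            self._flex_attn = FlexKernelStub()
        # Re-sync dropout on every access: active in training, off in eval.
        self._flex_attn.dropout_p = self.dropout_p if self.training else 0.0
        return self._flex_attn
```

Accessing `flex_attn` repeatedly returns the same object, and flipping `training` to `False` sets the kernel's dropout to `0.0` on the next access without rebuilding it.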


thomasjoshi and others added 30 commits April 6, 2025 20:00

Paged Attention class

b0e9c1e

Create dir for final project

b69e042

Add Paged Attention and testing

e34a23b

Update README

3f0bcdc

Add README rubric

4ffb0fb

Update template for final report

2448744

Update Makefile targets

82c230c

Add Makefile

0fe410e

Switch to PyTorch 2.8.0.dev nightly to use Flex Attention API, update unit tests

9ce6adb

Upgrade to using PyTorch's experimental paged attention implementation

20d21a4

Use flex attention kernel

37745d4

Setup PagedAttention for memory management

2ae1915

Fix unit test

a1748a1

Update test_paged_attention_memory_flash

97fb557

Clean-up

2d9b368

Switch to using --extra-index-url PyTorch nightly wheel

4aac14e

Fix test_llama_paged_attention

08074d9

Add inference benchmark Makefile targets

db5eb9a

Update presets for NVIDIA T4

a9ae423

Add Makefile target for bench-llama-paged-t4

0933184

Create final project README

b894a8a

Create code to measure memory footprint (GB)

0f056cb

Remove main and restructure execution tree

dfc32c4

Added attention memory benchmarking to Makefile

ec35a9a

Added multi-request throughput benchmarking to ensure our system is able to perform under such loads for capacity planning.

6f3da48

New Seq Benches

8a0fc14

Created new sequence benchmark files

Update benchmark_attention_runtime.py

7844747

update attention test model setup

Update benchmark_attention_runtime.py

2eebb5f

Remove backward pass benchmark

Update benchmark_attention_runtime.py

840f040

update tokenizer

nsd2147 and others added 30 commits May 8, 2025 04:38

Add bench-llama-t4-sweep target

6e756fa

Update benchmark_attention_runtime.py

ae3022d

OOM Fix

attention csv and plot

4c15966

create csv and plot

Update benchmark_attention_runtime.py

975792d

remove 8192 case

Update requirements.txt

ccb26f9

add matplotlib to req.txt

Update benchmark_attention_runtime.py

28bccc0

update plot code

Update bench-llama-t4-sweep

c9e62a8

Update benchmark_attention_runtime.py

29ea53e

no cache

Add bench-llama-mem-paged-t4

49bf730

Update benchmark_attention_runtime.py

2d29518

remove 4096 test

Update benchmark_attention_runtime.py

982fce4

plot fix

Update benchmark_attention_runtime.py

76ed5e6

none

Update benchmark_attention_runtime.py

483dd26

update csv write

profile_memory

65445ec

memory

Update benchmark_profile_memory.py

9b096eb

seq

mem prof

f1422ea

use cache mem prof

b7ade99

graphs

aed651a

graphs

Merge branch 'main' into hss2173/seq-bench

7bd9344

Merge pull request #1 from thomasjoshi/hss2173/seq-bench

8d3dcd7

Hss2173/seq bench

Add wandb support

b6bbed9

Generate CSVs and plots

4f68ac5

Change location of CSV and plots

7c0cabb

Update README

b71abd5

Update subdirs

5d38d36

Update README to new format

6d4b4ed

Remove latex targets

b18be52

Draft final report

3fc354f

Fixed report

2396754

Fixed report

c1722ec

thomasjoshi wants to merge 65 commits into foundation-model-stack:main from thomasjoshi:main

3 participants