This benchmark evaluates whether frontier LLMs can implement high-performance GPU kernels from a mathematical description. No hand-holding, no fill-in-the-blank templates, no pytest. Just: here's the math, here's a GPU, make it work and make it fast.
This requires materializing an (N × N) attention matrix, where N is the sequence
length. At N=16384 with 16 heads in bfloat16, that matrix alone is **8 GB** —
making it impractical for long sequences.

Your job: **implement attention without materializing the full N×N matrix, then optimize it to the absolute limit of the hardware.**

You have an **NVIDIA H100 GPU**. You will write code, test it on
real hardware, and optimize relentlessly until you run out of time.

## Interface

Edit **`/app/repo/solution.py`**. Implement a single function:

```
def efficient_attention(Q, K, V, is_causal=False) -> O
```

- **Q**: `(B, Nq, D)` — queries, float32 or bfloat16, CUDA
- **K**: `(B, Nk, D)` — keys
- **V**: `(B, Nk, D)` — values
- **is_causal**: bool — when True, mask positions where query_index < key_index
- **O**: `(B, Nq, D)` — output, same dtype as Q

Must support `torch.autograd` (backward pass must produce correct dQ, dK, dV).
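As a small-N correctness oracle, a naive reference that *does* materialize the score matrix can be sketched as below. This is an illustrative NumPy sketch only (the actual verifier reference and your solution operate on CUDA torch tensors); it follows the causal-mask convention above of masking positions where query_index < key_index.

```python
import numpy as np

def naive_attention(Q, K, V, is_causal=False):
    # Q: (B, Nq, D), K/V: (B, Nk, D). Materializes the full (Nq, Nk)
    # score matrix, so it is only usable at small N as a reference.
    scale = 1.0 / np.sqrt(Q.shape[-1])
    S = np.einsum('bqd,bkd->bqk', Q, K) * scale
    if is_causal:
        q_idx = np.arange(Q.shape[1])[:, None]
        k_idx = np.arange(K.shape[1])[None, :]
        S = np.where(q_idx < k_idx, -np.inf, S)  # mask future keys
    S = S - S.max(axis=-1, keepdims=True)        # stabilized softmax
    P = np.exp(S)
    P = P / P.sum(axis=-1, keepdims=True)
    return np.einsum('bqk,bkd->bqd', P, V)
```

With `is_causal=True`, query 0 can attend only to key 0, so `O[:, 0, :]` equals `V[:, 0, :]` exactly; that makes a convenient spot check.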

You are free to use **any approach**: Triton kernels, `torch.compile`, custom
autograd functions, whatever works. You may create additional `.py` files in
`/app/repo/` and import from them.

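The standard way to avoid the N×N matrix is online-softmax tiling, as in FlashAttention: walk over K/V in blocks while maintaining a running row maximum, a running softmax denominator, and an unnormalized output accumulator. A minimal single-head NumPy sketch of the forward pass (illustrative only; the real solution needs a fused Triton kernel plus a backward pass):

```python
import numpy as np

def tiled_attention(Q, K, V, block_n=64):
    # Q: (Nq, D), K/V: (Nk, D). Never forms the full (Nq, Nk) matrix:
    # only one (Nq, block_n) score tile is alive at any time.
    Nq, D = Q.shape
    scale = 1.0 / np.sqrt(D)
    m = np.full((Nq, 1), -np.inf)        # running row max
    l = np.zeros((Nq, 1))                # running softmax denominator
    acc = np.zeros((Nq, D))              # unnormalized output accumulator
    for start in range(0, K.shape[0], block_n):
        Kb, Vb = K[start:start + block_n], V[start:start + block_n]
        s = (Q @ Kb.T) * scale           # (Nq, block_n) tile of scores
        m_new = np.maximum(m, s.max(axis=1, keepdims=True))
        alpha = np.exp(m - m_new)        # rescale old state to the new max
        p = np.exp(s - m_new)
        l = alpha * l + p.sum(axis=1, keepdims=True)
        acc = alpha * acc + p @ Vb
        m = m_new
    return acc / l                       # normalize once at the end
```

The peak live memory is O(Nq · block_n) for the score tile plus O(Nq · D) for the state, independent of Nk.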
## Constraints

1. Must be **memory-efficient**: run at B=16, N=16384, D=64, bf16 on a single H100 without OOM.
2. Must be **correct**: match naive attention output within rtol=1e-2, atol=1e-2.
3. Must support **backward pass**: gradients dQ, dK, dV must be correct.
4. Must support **causal masking**.
5. **You must write your own implementation from scratch.** You may use PyTorch and Triton as building blocks, but the attention algorithm itself must be yours.

## Anti-Cheat Policy (READ THIS)

The verifier performs comprehensive cheat detection. If ANY of the following are
detected, your score is **instantly zero** with no partial credit:

**Banned libraries and functions:**
- `torch.nn.functional.scaled_dot_product_attention` (including via `F.scaled_dot_product_attention`)
- `flash_attn` library (including any submodule)
…kernel launch overhead via persistent kernel (loop over Q tiles in one launch)
should save ~0.5ms from grid scheduling overhead."

#### Step 3: Implement the change

Make ONE change at a time. Not five. ONE.

#### Step 4: Measure and log

```
bash /app/tools.sh bench --quick
bash /app/tools.sh ncu
```

Compare numbers before and after. Was your hypothesis correct? Write it down.
If the change regressed performance, **revert it immediately** (`git checkout -- solution.py`).

#### Repeat until time runs out.

**Optimizations to explore (ordered by typical impact):**

1. **exp2/log2 instead of exp/log** — the hardware `exp2` instruction is roughly 2x faster than `exp` on the H100's special function units. Fold the `1/log(2)` factor into the softmax scale.
2. **Separate fwd/bwd benchmarks** — find which is the bottleneck. Backward is usually 2-3x slower.
3. **Persistent kernels** — instead of 1 tile per program, loop over multiple tiles per SM. Eliminates grid launch overhead and improves SM utilization. This is doable in Triton 3.1.0 with a `while` loop.
4. **Swizzled tile scheduling** — remap `program_id` to a Z-order or swizzled pattern for better L2 cache locality across thread blocks.
5. **Split-K backward** — partition the K dimension across thread blocks for the dQ kernel, then reduce. Helps when N >> BLOCK_N.
6. **Tile size tuning** — MUST be guided by ncu occupancy and shared memory data, not random sweeps.
7. **Warp count tuning** — 4 vs 8 warps changes occupancy and register pressure. ncu tells you which is limiting.
8. **Pipeline stages** — `num_stages` controls software pipelining depth. 1-3 only on Triton 3.1.0 (4 crashes).

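The exp2 rewrite in item 1 rests on the identity exp(x) = exp2(x · log2 e); in a kernel you multiply the softmax scale by the constant once and then call the exp2 intrinsic on the scores. A quick NumPy check of the identity:

```python
import numpy as np

LOG2E = 1.4426950408889634  # log2(e) == 1/ln(2)

# exp(x) == exp2(x * log2(e)); in a Triton kernel the LOG2E factor is
# folded into the softmax scale so the hot loop only calls exp2.
x = np.linspace(-20.0, 20.0, 401)
assert np.allclose(np.exp(x), np.exp2(x * LOG2E))
```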
**DO NOT do random parameter sweeps.** Every config change must be motivated
by profiling data and roofline math. If you can't explain WHY a particular
block size should be better based on arithmetic intensity, kernel timing, and
H100 peak throughput, don't try it.

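For the swizzled tile scheduling item above, the grouped remapping from the Triton matmul tutorial is one concrete pattern: consecutive program ids are packed into groups of `group_m` tile rows, so blocks launched close together reuse the same K/V tiles in L2. A pure-Python sketch of the index math (function and parameter names are illustrative):

```python
def swizzle(pid, grid_m, grid_n, group_m):
    """Remap a linear program id to (pid_m, pid_n) in grouped order.

    Consecutive pids walk down a group of `group_m` tile rows before
    moving to the next column, improving L2 reuse between neighboring
    thread blocks.
    """
    width = group_m * grid_n                          # programs per full group
    group_id = pid // width
    first_pid_m = group_id * group_m
    group_size = min(grid_m - first_pid_m, group_m)   # last group may be short
    pid_m = first_pid_m + (pid % width) % group_size
    pid_n = (pid % width) // group_size
    return pid_m, pid_n

# Sanity check: every tile in a 3x5 grid is visited exactly once.
tiles = [swizzle(p, 3, 5, 2) for p in range(3 * 5)]
assert len(set(tiles)) == 15
```

Inside a Triton kernel the same arithmetic runs on `tl.program_id(0)` at the top of the kernel.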
### Phase 3: Submit (do this early, then keep optimizing)

Submit your current best **as soon as correctness is established**, then keep
optimizing and re-submitting. Do not wait until the end.
| Check | Weight | Criterion |
|---|---|---|
| backward correctness | 0.10 | gradients match reference |
| causal masking | 0.05 | correct with is_causal=True |
| memory efficiency | 0.05 | runs at N=16384 without OOM, doesn't materialize N×N |
| timing | 0.70 | fwd+bwd latency — **lower is better, no cap** |

Timing score = {{reference_timing_ms}} / your_ms. **There is no ceiling — faster is always better.**
Correctness is pass/fail. Timing requires all correctness checks to pass.

**Correctness alone is worth almost nothing. Performance is everything.**
A correct implementation at 80ms is a failure. A correct implementation at 5ms is exceptional.

## Tips

- **Run `triton_smoke.py` first.** Verify Triton works before writing flash attention.
- **Profile before optimizing.** `bash /app/tools.sh profile` shows per-kernel CUDA time. Compute roofline numbers. Don't guess.
- **ncu may be available** depending on your environment. Run `bash /app/tools.sh ncu` — it auto-detects access (tries sudo if needed) and falls back to torch.profiler if ncu can't run. On bare-metal servers ncu gives hardware counters (occupancy, throughput, stall reasons). On cloud sandboxes use torch.profiler + manual roofline math.
- **Start simple.** Get a correct implementation first, then immediately start profiling.
- **Read the papers.** The references in `reference.md` describe efficient approaches. The FlashAttention-2 paper has the key algorithmic insights.
- **Use float32 accumulators.** Accumulating in lower precision will cause precision errors in softmax and dot products.
- **Separate fwd and bwd optimization.** Use `--forward-only` and `--backward-only` to find which is the bottleneck and focus there.
- **The backward pass is typically 2-3x slower than forward.** If your backward is >3x your forward, there's room to improve it.
- **Think about data reuse.** Each byte loaded from HBM should do as much compute as possible. Larger tiles = more reuse = closer to compute-bound.

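A back-of-envelope roofline for the target shape shows why this kernel should end up compute-bound, and roughly what "the limit of the hardware" means in milliseconds. The peak figures below are approximate published H100 SXM specs, not measured values:

```python
# Roofline estimate for B=16, N=16384, D=64 in bf16 (forward pass only).
B, N, D, BYTES = 16, 16384, 64, 2
flops = 4 * B * N * N * D            # 2*B*N^2*D for Q@K^T + 2*B*N^2*D for P@V
ideal_io = 4 * B * N * D * BYTES     # ideal: read Q, K, V once, write O once
ai = flops / ideal_io                # arithmetic intensity = N/2 FLOP/byte

PEAK_FLOPS = 990e12                  # approx. H100 SXM dense bf16 throughput
PEAK_BW = 3.35e12                    # approx. HBM3 bandwidth, bytes/s
ridge = PEAK_FLOPS / PEAK_BW         # ~295 FLOP/byte ridge point

assert ai > ridge                    # well past the ridge: compute-bound
print(f"AI = {ai:.0f} FLOP/B, ideal fwd time = {flops / PEAK_FLOPS * 1e3:.2f} ms")
```

With perfect tiling the forward pass sits around 8192 FLOP/byte, far past the ~295 FLOP/byte ridge, so tuning should chase tensor-core utilization rather than bandwidth; at peak compute the forward pass alone is on the order of a millisecond for this shape.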
## Known Triton 3.1.0 Limitations on Hopper (H100)

- **`num_stages=4` crashes the compiler** ("SharedEncodingAttr builder when MMAEncodingAttr is Hopper has not been implemented yet"). Use `num_stages <= 3`.
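One way to respect this limit while autotuning is to cap the stage axis of the config sweep. A hypothetical config list (block sizes and warp counts here are placeholders; pick values your ncu data justifies):

```python
import triton

# Cap num_stages at 3: num_stages=4 trips the Hopper compiler bug above.
CONFIGS = [
    triton.Config({"BLOCK_M": bm, "BLOCK_N": bn}, num_warps=w, num_stages=s)
    for bm, bn in [(128, 64), (64, 64)]
    for w in (4, 8)
    for s in (2, 3)
]
```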