Add single-dispatch layer-by-layer multi-head attention #91
Draft
Conversation
andrej (Collaborator, Author) commented on Apr 6, 2026
Can we reuse the reference from the existing mha? (Note: does not include RoPE and Q, K, V projections, but some code reuse should be possible.)
Contributor
📊 Test Results for Test Example Applications: 1d87fe8 (2026_04_07_21_05_39) IRONCLAD
📈 Trends (vs main branch) for Test Example Applications: 1d87fe8 (2026_04_07_21_05_39) IRONCLAD

Tested on llama_3.2_1b:
- llama_3.2_1b_prompt_1024_tokens_1
- llama_3.2_1b_prompt_1024_tokens_40
- llama_3.2_1b_prompt_13_tokens_1
- llama_3.2_1b_prompt_13_tokens_40
- llama_3.2_1b_prompt_2048_tokens_1
- llama_3.2_1b_prompt_2048_tokens_40
"Naive" alternative implementation for multi-head attention from the currently checked-in data-flow design. This is a simple layer-by-layer implementation, but it uses the single-dispatch mechanism to fuse it all into one MLIR file and save on CPU roundtrips and XRT overheads.
Includes two variants:
- Q, K, V. This matches the functionality of the checked-in dataflow MHA.
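For reference, the layer-by-layer structure being fused here can be sketched in plain numpy. This is only an illustrative sketch of naive multi-head attention (Q/K/V projections, per-head scaled dot-product attention, output projection); it assumes no RoPE and no causal masking, and the function and weight names are hypothetical, not taken from this PR's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def naive_mha(x, wq, wk, wv, wo, num_heads):
    """Layer-by-layer MHA: each op below would be one step of the
    fused single-dispatch graph (names are illustrative only)."""
    seq, d_model = x.shape
    d_head = d_model // num_heads
    q, k, v = x @ wq, x @ wk, x @ wv          # Q, K, V projections
    # Split heads: (seq, d_model) -> (heads, seq, d_head)
    q = q.reshape(seq, num_heads, d_head).transpose(1, 0, 2)
    k = k.reshape(seq, num_heads, d_head).transpose(1, 0, 2)
    v = v.reshape(seq, num_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head scores
    attn = softmax(scores, axis=-1)
    out = attn @ v                             # weighted values
    # Merge heads back: (heads, seq, d_head) -> (seq, d_model)
    out = out.transpose(1, 0, 2).reshape(seq, d_model)
    return out @ wo                            # output projection
```

Each line corresponds to a separate kernel in a naive execution; the point of the single-dispatch mechanism is that these steps run as one fused dispatch rather than N round trips.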