intel/intel-xpu-backend-for-triton
OpenAI Triton backend for Intel® GPUs
Introduction
This is a collection of my open source contributions to intel-xpu-backend-for-triton.
Intel® XPU Backend for Triton is an out-of-tree backend module for Triton that aims to provide best-in-class performance and productivity on Intel GPUs, for both PyTorch and standalone usage.
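The backend plugs into the standard Triton Python API: once tensors live on an Intel GPU, existing Triton kernels launch through the same interface used on other hardware. Below is a minimal sketch of such standalone usage, assuming a PyTorch build with XPU support and this backend installed; the vector-add kernel, shapes, and block size are illustrative and not taken from this repository.

```python
# Minimal sketch: run a plain Triton kernel on an Intel GPU via PyTorch's "xpu"
# device. Assumes an XPU-enabled PyTorch and the Intel XPU backend for Triton.
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the inputs.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out


if __name__ == "__main__":
    # Illustrative sizes; "xpu" is the Intel GPU device in XPU-enabled PyTorch.
    x = torch.rand(98432, device="xpu")
    y = torch.rand(98432, device="xpu")
    print(torch.allclose(add(x, y), x + y))
```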
Contributions
Summary
PRs
- #4484 Use well tuned kernel options for flex attention
- #4271 Support global scratch in launcher
- #4448 Add softmax onednn impl in benchmarks/triton_kernels_benchmark/fused_softmax.py
- #4146 Handle op with multi results case in changeAndPropagateLayout
- #3937 Add dot3d[8-2-64-64-64-32-32-float32-float32] to skiplists
- #3875 Make sure install setuptools>=78.1.0 in setup-triton
- #3803 add f32 rtne to tf32 in DPAS
- #3795 Reland “Check the non 4-bytes aligned base/offsetX/width on block pointer (#3712)”
- #3712 Check the non 4-bytes aligned base/offsetX/width on block pointer
- #3705 [GEMM] Add the tensor of pointer benchmark
- #3644 Fix dpas_to_block_layout_convert.mlir
- #3497 Add rewrite_stack_ptr post process pass
- #3571 Fix test_reduce_layouts for LinearLayout
- #3135 Fix AOT compilation failed in Test with pip workflow
- #3108 Add Flash Attention backward to benchmarks/triton_kernels_benchmark
- #2953 Port and run tests in python/test/unit/tools
- #3010 Fix test_gather
- #2839 Improve performance of shape 1024x1024x1024 out of the box
- #2646 Improve GEMM performance of shape 4096x8x128x16384
- #2601 Improve GEMM performance of shape 4096x8x128x16384
- #2520 [XeTLA] Add xetla splitk gemm
- #2438 [Benchmark] Run xetla streamk gemm in benchmark
- #2367 Add XeTLA FA backward implementation to benchmark
- #2357 Add causal variant in fa benchmark
- #2309 [Benchmarks] Add more variants in XeTLA FA implementation
- #2060 Add Triton benchmark support in compare script
- #2157 Add attention adv path benchmark
- #1877 Update XeTLA’s attn implementation of Triton benchmark
- #1799 Eliminate XeTLA GEMM performance gap
- #1714 Add streamk xetla kernel
- #1741 Remove cache triton in triton benchmark
- #1730 [Benchmark] Fix xetla batch gemm cases
- #1707 Add debug code for capture failure details
- #1695 [BUG] Workaround attr allocation.offset assertion failure
- #1597 Update GEMM XeTLA kernel of triton benchmarks
- #1539 Integrate flash attention XeTLA kernel into triton repo
- #1383 Add triton bench deps step
- #1092 [Performance] Clean and refine softmax and gemm benchmarks
- #877 [Performance] xetla kernels benchmark integration
- #977 enable block pointer gemm tutorial with new passes
- #845 [UT] use cpu result as reference for fdiv
- #614 Enable test_attention_fwd_bwd
- #308 minimize token permissions in workflows
- #246 [UT] Port and run operator tests
- #143 [ut] some operators and language cases
- #133 [CI] Update ut scope and pass rate calculation
- #136 [CI] Add dockerfiles
- #129 [CI] Refine CI workflows
- #127 [CI] Migrate action runners to dedicated one
- #124 [CI] add nightly failure notify support
- #120 update ZE_AFFINITY_MASK
- #112 upd triton hash
- #109 update usage on env_triton.sh
- #97 env prepare explicitly in workflows
- #81 add e2e perf test to nightly
- #60 fix wrong script path
- #59 reduce e2e test iterations
- #51 Add Inductor e2e workflow for triton xpu backend