intel/intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

Introduction

This is a collection of my open-source contributions to intel-xpu-backend-for-triton.

Intel® XPU Backend for Triton* is an out-of-tree backend module for Triton that provides best-in-class performance and productivity on Intel GPUs, for both PyTorch and standalone usage.
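
Because the backend plugs into PyTorch's XPU device, Triton kernels can be written and launched on Intel GPUs the same way as on other hardware. The snippet below is a minimal sketch, assuming a PyTorch build with XPU support (torch.xpu.is_available() returns True) and this backend installed; the kernel and names are illustrative, not code from this repository.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the input.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Tensors on the "xpu" device are handled by this backend (assumed setup).
x = torch.rand(4096, device="xpu")
y = torch.rand(4096, device="xpu")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```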

Contributions

Summary

PRs

  • #4484 Use well tuned kernel options for flex attention
  • #4271 Support global scratch in launcher
  • #4448 Add softmax onednn impl in benchmarks/triton_kernels_benchmark/fused_softmax.py (see the softmax sketch after this list)
  • #4146 Handle op with multi results case in changeAndPropagateLayout
  • #3937 Add dot3d[8-2-64-64-64-32-32-float32-float32] to skiplists
  • #3875 Make sure install setuptools>=78.1.0 in setup-triton
  • #3803 add f32 rtne to tf32 in DPAS
  • #3795 Reland “Check the non 4-bytes aligned base/offsetX/width on block pointer (#3712)”
  • #3712 Check the non 4-bytes aligned base/offsetX/width on block pointer
  • #3705 [GEMM] Add the tensor of pointer benchmark
  • #3644 Fix dpas_to_block_layout_convert.mlir
  • #3497 Add rewrite_stack_ptr post process pass
  • #3571 Fix test_reduce_layouts for LinearLayout
  • #3135 Fix AOT compilation failed in Test with pip workflow
  • #3108 Add Flash Attention backward to benchmarks/triton_kernels_benchmark
  • #2953 Port and run tests in python/test/unit/tools
  • #3010 Fix test_gather
  • #2839 Improve performance of shape 1024x1024x1024 out of box
  • #2646 Improve GEMM performance of shape 4096x8x128x16384
  • #2601 Improve GEMM performance of shape 4096x8x128x16384
  • #2520 [XeTLA] Add xetla splitk gemm
  • #2438 [Benchmark] Run xetla streamk gemm in benchmark
  • #2367 Add XeTLA FA backward implementation to benchmark
  • #2357 Add causal variant in fa benchmark
  • #2309 [Benchmarks] Add more variants in XeTLA FA implementation
  • #2060 Add Triton benchmark support in compare script
  • #2157 Add attention adv path benchmark
  • #1877 Update XeTLA’s attn implementation of Triton benchmark
  • #1799 Eliminate XeTLA GEMM performance gap
  • #1714 Add streamk xetla kernel
  • #1741 Remove cache triton in triton benchmark
  • #1730 [Benchmark] Fix xetla batch gemm cases
  • #1707 Add debug code for capture failure details
  • #1695 [BUG] Workaround attr allocation.offset assertion failure
  • #1597 Update GEMM XeTLA kernel of triton benchmarks
  • #1539 Integrate flash attention XeTLA kernel into triton repo
  • #1383 Add triton bench deps step
  • #1092 [Performance] Clean and refine softmax and gemm benchmarks
  • #877 [Performance] xetla kernels benchmark integration
  • #977 enable block pointer gemm tutorial with new passes
  • #845 [UT] use cpu result as reference for fdiv
  • #614 Enable test_attention_fwd_bwd
  • #308 minimize token permissions in workflows
  • #246 [UT] Port and run operator tests
  • #143 [ut] some operators and language cases
  • #133 [CI] Update ut scope and pass rate calculation
  • #136 [CI]Add dockerfiles
  • #129 [CI] Refine CI workflows
  • #127 [CI] Migrate action runners to dedicated one
  • #124 [CI] add nightly failure notify support
  • #120 update ZE_AFFINITY_MASK
  • #112 upd triton hash
  • #109 update usage on env_triton.sh
  • #97 env prepare explicitly in workflows
  • #81 add e2e perf test to nightly
  • #60 fix wrong script path
  • #59 reduce e2e test iterations
  • #51 Add Inductor e2e workflow for triton xpu backend
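
For context on the benchmark-related PRs above (e.g. #4448, #1092), the fused softmax benchmarks compare Triton kernels against oneDNN and XeTLA implementations. The sketch below shows the general shape of such a Triton kernel: one program instance per row, with a numerically stable softmax. It is illustrative only; block sizes and names are assumptions, not the repository's actual benchmark code.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, stride, n_cols, BLOCK_SIZE: tl.constexpr):
    # One program instance computes the softmax of one row.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(in_ptr + row * stride + cols, mask=mask, other=-float("inf"))
    # Numerically stable softmax: subtract the row max before exponentiating.
    x = x - tl.max(x, axis=0)
    num = tl.exp(x)
    y = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * stride + cols, y, mask=mask)

# Assumes a PyTorch build with XPU support, as in the earlier sketch.
x = torch.randn(1024, 781, device="xpu")
out = torch.empty_like(x)
BLOCK = triton.next_power_of_2(x.shape[1])
softmax_kernel[(x.shape[0],)](out, x, x.stride(0), x.shape[1], BLOCK_SIZE=BLOCK)
```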