Senior AI Performance Engineer
Advanced Micro Devices
- Helsinki
- Permanent
- Full-time
- Drive performance optimization across the stack on leading models and customer-relevant serving configurations, closing competitive gaps through kernel and systems-level optimizations
- Profile, diagnose, and resolve cross-stack performance bottlenecks, from GPU kernels and operator dispatch to framework-level scheduling and multi-node communication
- Diagnose kernel-level performance issues using profiling tools: identify occupancy limitations, L2 cache thrashing, register pressure, memory coalescing issues, and similar problems, then translate findings into actionable optimizations
- Participate in customer-facing technical engagements: present findings, recommend optimizations, and deliver measurable performance uplifts
- Integrate and optimize custom kernels (Triton, Gluon, CK, PyDSL, ASM, AITER) within serving frameworks, understanding dispatch paths, shape extraction, and backend selection
- Optimize multi-node distributed inference: communication-compute overlap, parallelism strategies, and scale-out performance
- Contribute to shared performance optimization methodology across the broader team
- Leverage AI agents to accelerate daily work
- Upstream optimizations into open-source frameworks such as vLLM, SGLang, and PyTorch
- 5+ years of software development experience in GPU computing, AI systems, or high-performance computing
- Hands-on experience with AI serving frameworks (vLLM, SGLang, TensorRT-LLM, or similar) and their internals
- Strong background in end-to-end workload profiling and bottleneck diagnosis
- Understanding of GPU kernel performance characteristics: occupancy, register and LDS pressure, memory coalescing, cache utilization, and instruction-level bottlenecks
- Ability to read and reason about kernel-level profiling data and translate it into concrete optimization actions
- Understanding of model architectures (transformers, MoE, diffusion), inference paradigms (speculative decoding, prefill-decode disaggregation, continuous batching), and how they map to hardware
- Experience with custom kernel development or integration (HIP, CUDA, Triton, CK, or similar)
- Understanding of multi-GPU and multi-node distributed systems
- Strong proficiency in Python and C++
- Customer-facing technical experience
- Daily user of AI agents and development tools
- Strong Linux systems knowledge
- Excellent written and verbal English communication skills