Senior Framework Engineer — Diffusion Inference
Advanced Micro Devices
- Helsinki
- Permanent
- Full-time
- Develop and maintain a diffusion inference framework for image/video generation with clean APIs and strong compatibility with widely used diffusion ecosystems (e.g., HuggingFace diffusers-style pipelines and model wrappers).
- Own scalable parallel inference features for DiT workloads, single-node and multi-node, including:
  - Sequence-parallel approaches (e.g., Unified Sequence Parallelism)
  - Pipeline-style methods
  - CFG-parallel execution strategies
- Integrate optimized operator backends (attention, GEMM, quantized paths) by bridging Python/C++ layers and ensuring correctness and high performance.
- Ship production-grade packaging and releases, including:
  - Containers (Docker)
  - Versioned artifacts
  - Dependency hygiene
  - Pip-installable distributions
- Build continuous testing & benchmarking infrastructure, including correctness tests, performance regression gating, and "known-good configurations" across GPU SKUs and cluster topologies.
- Collaborate across the GPU software stack (runtime, collectives, libraries, compilers) and translate framework needs into actionable upstream improvements.
- Support strategic customers by mapping real-world inference constraints (latency, throughput, memory) into framework features, reference configurations, and reproducible deployment recipes.
- Communicate clearly around technical tradeoffs, performance bottlenecks, and roadmap decisions with teams and stakeholders across Europe, the US, and Asia.
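To illustrate the CFG-parallel strategy mentioned above: the conditional and unconditional denoising passes of classifier-free guidance are independent, so they can run concurrently on separate GPUs, with one small exchange to combine the results. Below is a toy single-process sketch of that combination step; all function and variable names are illustrative, not part of any actual framework API.

```python
def denoise(latent, prompt_embedding):
    # Stand-in for a DiT forward pass; returns a fake "noise prediction".
    return [x + 0.1 * e for x, e in zip(latent, prompt_embedding)]

def cfg_combine(uncond, cond, guidance_scale):
    # Standard classifier-free guidance formula: uncond + s * (cond - uncond).
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

latent = [0.5, -0.2, 0.1]
cond_emb = [1.0, 1.0, 1.0]    # prompt embedding
uncond_emb = [0.0, 0.0, 0.0]  # empty-prompt embedding

# In a CFG-parallel deployment these two calls would run concurrently
# on different ranks instead of sequentially in one process:
cond_pred = denoise(latent, cond_emb)
uncond_pred = denoise(latent, uncond_emb)

guided = cfg_combine(uncond_pred, cond_pred, guidance_scale=7.5)
```

The point of the strategy is that the two forward passes dominate the cost, so distributing them roughly halves per-step latency at the price of one small collective to form `guided`.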
- Implement “day-1” enablement for a new image/video DiT model, including wrapper creation, pipeline integration, correctness verification, and performance baseline setup.
- Develop or improve distributed inference recipes (e.g., USP + communication-aware tuning) and validate across multi-node clusters.
- Integrate a new high-performance attention backend (e.g., FlashAttention v3-class) and expose it cleanly via the framework API.
- Build a regression suite targeting attention-heavy and GEMM-heavy workloads and add CI gates to prevent performance regressions.
- Define and drive a framework roadmap covering features, milestones, and prioritized upstream kernel/runtime requests.
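The CI performance gate described above can be sketched minimally: compare a fresh benchmark run against a stored baseline and fail the job when latency regresses past a tolerance. Benchmark names, numbers, and the 5% threshold below are illustrative assumptions, not a prescribed policy.

```python
def gate(baseline_ms, current_ms, tolerance=0.05):
    """Return (passed, regression_ratio) for one benchmark."""
    ratio = (current_ms - baseline_ms) / baseline_ms
    return ratio <= tolerance, ratio

def run_gates(baseline, current, tolerance=0.05):
    # baseline/current map benchmark name -> median latency in ms.
    failures = {}
    for name, base_ms in baseline.items():
        passed, ratio = gate(base_ms, current[name], tolerance)
        if not passed:
            failures[name] = ratio
    return failures

baseline = {"dit_attention_fwd": 12.0, "vae_decode": 30.0}
current = {"dit_attention_fwd": 12.4, "vae_decode": 33.0}

# Only vae_decode regresses beyond the 5% tolerance here.
failures = run_gates(baseline, current)
```

A production version would pull baselines per GPU SKU and cluster topology (the "known-good configurations" above) and use medians over repeated runs to avoid flagging noise.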
- Diffusion inference internals (DiTs, attention scaling, VAE integration points, scheduler loops, memory behavior).
- Distributed inference systems (multi-GPU & multi-node), including communication patterns and latency-sensitive execution.
- Framework plumbing: integrating optimized kernels/operators into user-facing APIs with correctness and reliability guarantees.
- Release engineering: containers, dependency management, CI pipelines, performance dashboards, reproducible benchmarks.
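The "scheduler loop" mentioned under diffusion inference internals has a simple shape worth sketching: at each timestep the model predicts noise and the scheduler steps the latent. The toy below uses a trivial Euler-style update with made-up numbers; a real pipeline would call the DiT transformer and a proper scheduler (e.g., DDIM or Euler) in place of these stand-ins.

```python
def fake_model(latent, t):
    # Stand-in for the DiT forward pass: pretend predicted noise is
    # proportional to the current latent (purely illustrative).
    return [0.5 * x for x in latent]

def scheduler_step(latent, noise_pred, dt):
    # Euler-style update: move the latent against the predicted noise.
    return [x - dt * n for x, n in zip(latent, noise_pred)]

latent = [1.0, -2.0]
timesteps = [1.0, 0.75, 0.5, 0.25]
for t in timesteps:
    noise_pred = fake_model(latent, t)
    latent = scheduler_step(latent, noise_pred, dt=0.25)
```

This loop structure is where most of the listed concerns meet: attention scaling inside the model call, memory behavior across steps, and the points where a VAE decode and parallelism strategies hook in.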
- Strong Python and/or C++ engineering skills (debugging, profiling, testing, navigating complex codebases, clean abstractions).
- Experience with ML frameworks (PyTorch strongly preferred; JAX/TF welcome) and familiarity with diffusion model execution.
- Proven ability to work in GPU-accelerated environments with intuition for performance, memory/compute tradeoffs, and profiling.
- Comfort with containers (Docker) and modern dev workflows (git, CI, build systems).
- Strong cross-functional collaboration and clear technical communication skills.
- Experience with diffusion inference engines or parallel inference frameworks for DiTs (sequence, pipeline, CFG-parallel concepts).
- Exposure to operator libraries such as AITER-style kernel collections (attention/GEMM/quant/comm).
- GPU kernel development experience (HIP/CUDA/Triton) or familiarity with compiler/codegen backends.
- Knowledge of high-performance networking (RDMA, RoCE, InfiniBand, UCX) for multi-node inference.
- Experience building benchmarking and performance regression systems at scale.
- BSc, MSc, PhD, or equivalent experience in Computer Science, Electrical Engineering, or a related field.