Principal AI Infrastructure Engineer - Hosting

Helsinki
Permanent
Full-time

23 days ago

Looking to work with data at the heart of decision-making and insight?

Join a data analytics and engineering firm that helps organisations turn complex information into actionable intelligence. The team delivers tailored solutions across data strategy, architecture, engineering and visualisation, empowering clients to unlock value from their data and make smarter, evidence-driven decisions. With a focus on practical delivery and measurable outcomes, the organisation combines deep technical expertise with a collaborative, client-centric approach.

They are looking for a high-impact Principal AI Infra Engineer: a technical visionary who is genuinely passionate about the "nitty-gritty" of InfiniBand-connected GPU fabrics.

Apply now if you thrive on scaling the world’s most demanding machine learning workloads and want to solve real-world challenges through insight and innovation!

Responsibilities:

Cluster Architecture: Architecting and scaling high-density GPU training environments utilising advanced InfiniBand interconnects (1000+ node scale).
System Optimisation: Owning end-to-end performance engineering, translating high-level uptime and speed requirements into concrete IOPS and throughput targets.
Development & Automation: Crafting high-level Python automation to handle everything from bare-metal provisioning and OS deployment to complex firmware orchestration.
Strategic Leadership: Defining the technical roadmap for the engineering organisation, spearheading peer reviews, and cultivating top-tier internal talent.
Monitoring & Reliability: Developing robust observability frameworks to track fabric health and compute baselines while deploying proactive anomaly detection.
Cross-Team Integration: Acting as the technical bridge between ML research teams, datacenter operations, and hardware procurement to turn theoretical needs into physical capacity.

Skills/Must have:

Track Record: Deep systems engineering or infrastructure architecture experience in massive-scale environments.
Specialised Knowledge: Extensive experience managing large-scale HPC or AI clusters with specific expertise in InfiniBand fabrics and NVIDIA Hopper-class hardware.
Automation: Expert-level Python skills paired with a deep background in Infrastructure-as-Code (Terraform, Ansible, or SaltStack).
Technical Authority: Proven ability to lead complex engineering workstreams, establish technical standards, and mentor senior-level peers.

Benefits:

Stock option equity
Comprehensive healthcare, lunch, and wellbeing benefits

Salary:

€170,000 base per year

Hamilton Barnes
