AI Systems Engineer – AI Model (Training & Inference)
Advanced Micro Devices
- Markham, ON
- Permanent
- Full-time
Key responsibilities:
- Enable and optimize large-scale model training (LLMs, VLMs, MoE architectures) on AMD Instinct GPU clusters, ensuring correctness, reproducibility, and competitive throughput.
- Build and maintain training infrastructure: job orchestration, distributed checkpointing, data loading pipelines, and storage optimization for multi-thousand GPU clusters on Kubernetes.
- Debug and resolve training-specific issues, including gradient-norm explosions, non-deterministic behavior across GPU generations, and suboptimal compute-communication overlap in distributed training (FSDP, DeepSpeed, Megatron-LM); a minimal FSDP sketch follows this list.
- Optimize RCCL collective communication patterns for training workloads, including all-reduce, all-gather, and reduce-scatter across multi-node topologies; an illustrative all-reduce timing sketch follows this list.
- Develop monitoring, alerting, and compliance infrastructure to ensure training cluster health, data security, and SLA adherence at scale.
- Design and build end-to-end validation and testing infrastructure using proxy workloads, synthetic benchmarks, and configurable workload generators to systematically validate platform readiness across AMD Instinct GPU generations.
- Write and optimize high-performance GPU kernels (GEMM, attention, quantized matmul, GPTQ/AWQ) in HIP, Triton, and MLIR targeting AMD Instinct architectures, with demonstrated ability to outperform open-source baselines; a bare-bones Triton GEMM sketch follows this list.
- Drive end-to-end inference enablement on new AMD GPU silicon: be among the first to get frontier models running on each new Instinct generation, creating reproducible guides and reference implementations.
- Optimize inference serving frameworks (vLLM, SGLang, TorchServe) for AMD GPUs: batching strategies, KV-cache management, speculative decoding, and continuous batching for production throughput/latency targets; a vLLM smoke-test sketch follows this list.
- Develop novel approaches to inference acceleration, including bio-inspired algorithms, SLM-assisted batching, and custom scheduling strategies that exploit AMD hardware characteristics.
- Build quantization pipelines (FP8, FP6, FP4, GPTQ, AWQ) for production model deployment, ensuring quality-performance tradeoffs are well characterized across AMD GPU generations; an FP8 round-trip sketch follows this list.
- Collaborate with AMD silicon architecture and pre-silicon teams to provide software feedback and validate software stack integration on next-generation Instinct GPU designs for both training and inference workloads.
- Build observability and automated analysis tooling: log analysis pipelines, anomaly detection, performance baselining, regression detection, and diagnostic workflows for large-scale GPU clusters; a toy regression-check sketch follows this list.
- Contribute to the open ROCm ecosystem and AMD's developer experience: SDKs, CI dashboards, documentation, and developer cloud enablement.
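For concreteness, the sketches below illustrate several of the responsibilities above; they are minimal, hypothetical examples, not AMD reference code. First, the distributed-training work: a single FSDP training step with gradient clipping. It assumes a `torchrun` launch with one process per GPU; the model, shapes, and clipping threshold are placeholders, and on ROCm builds of PyTorch the `nccl` backend name dispatches to RCCL.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun sets LOCAL_RANK; the "nccl" backend maps to RCCL on ROCm builds
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    # placeholder model; a real job would wrap an LLM/VLM/MoE trunk
    model = FSDP(torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16).cuda())
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 128, 1024, device="cuda")  # (seq, batch, d_model)
    loss = model(x).square().mean()               # dummy loss for illustration
    loss.backward()
    model.clip_grad_norm_(1.0)  # guards against the gradient-norm explosions noted above
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```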
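Next, the RCCL collective work: timing a single all-reduce through `torch.distributed` and converting it with the standard ring all-reduce bus-bandwidth estimate, 2(n-1)/n multiplied by bytes over elapsed time. The payload size is arbitrary; real validation would sweep message sizes, collectives, and node topologies.

```python
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")  # dispatches to RCCL on ROCm builds
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

buf = torch.ones(64 * 1024 * 1024, device="cuda")  # 64M fp32 elements = 256 MiB
dist.all_reduce(buf)       # warm-up so lazy initialization doesn't skew the timing
torch.cuda.synchronize()

start = time.perf_counter()
dist.all_reduce(buf)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

nbytes = buf.numel() * buf.element_size()
busbw = 2 * (world - 1) / world * nbytes / elapsed / 1e9  # ring all-reduce estimate
if rank == 0:
    print(f"all_reduce 256 MiB: {elapsed * 1e3:.2f} ms, ~{busbw:.1f} GB/s bus bandwidth")
dist.destroy_process_group()
```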
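For the kernel work, a bare-bones Triton GEMM (Triton ships an AMD backend targeting Instinct GPUs). The 64/64/32 block sizes are placeholders rather than tuned values for any MI-series part, and accumulation stays in fp32 for simplicity.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def gemm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # each program instance computes one BLOCK_M x BLOCK_N tile of C = A @ B
    pid_m, pid_n = tl.program_id(0), tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + rm[:, None] * stride_am + (k + rk)[None, :] * stride_ak,
                    mask=(rm[:, None] < M) & ((k + rk)[None, :] < K), other=0.0)
        b = tl.load(b_ptr + (k + rk)[:, None] * stride_bk + rn[None, :] * stride_bn,
                    mask=((k + rk)[:, None] < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)  # hits the matrix-core path on Instinct via the AMD backend
    tl.store(c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn, acc,
             mask=(rm[:, None] < M) & (rn[None, :] < N))

def gemm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    gemm_kernel[grid](a, b, c, M, N, K,
                      a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                      c.stride(0), c.stride(1),
                      BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c
```

A competitive kernel would layer autotuned tile shapes, software pipelining, and low-precision data paths on top of this skeleton before it could credibly chase the open-source baselines mentioned above.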
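For serving, vLLM's offline API makes a compact smoke test (vLLM has ROCm support). The checkpoint name is a placeholder, and `max_num_seqs`, which caps how many sequences the continuous-batching scheduler keeps in flight, is set to an arbitrary value.

```python
from vllm import LLM, SamplingParams

# placeholder checkpoint; max_num_seqs is a first-order throughput/latency knob
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_num_seqs=256)
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize paged KV-cache management in two sentences."], params)
for out in outputs:
    print(out.outputs[0].text)
```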
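For quantization, the core of quality characterization is a quantize-dequantize round trip plus an error metric. This toy uses per-tensor scaling into PyTorch's `float8_e4m3fn` dtype, whose maximum normal value is 448; a production pipeline would use finer-grained scales and measure downstream model accuracy, not just weight error.

```python
import torch

def fp8_round_trip(w: torch.Tensor) -> torch.Tensor:
    """Quantize to float8_e4m3fn with per-tensor scaling, then dequantize."""
    scale = 448.0 / w.abs().max()  # map the largest magnitude onto e4m3's max normal value
    w_fp8 = (w * scale).to(torch.float8_e4m3fn)
    return w_fp8.to(torch.float32) / scale

w = torch.randn(4096, 4096)
rel_err = (w - fp8_round_trip(w)).norm() / w.norm()
print(f"relative error after FP8 round trip: {rel_err:.4f}")
```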
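Finally, for the observability work, regression detection at its simplest is a statistical check of a fresh sample against a baseline window. The metric, window, and three-sigma threshold below are all illustrative.

```python
import statistics

def is_regression(baseline: list[float], latest: float, sigmas: float = 3.0) -> bool:
    """Flag `latest` (e.g. tokens/sec) if it falls `sigmas` below the baseline mean."""
    mean = statistics.fmean(baseline)
    spread = statistics.stdev(baseline)
    return latest < mean - sigmas * spread

history = [1510.0, 1498.2, 1522.7, 1505.9, 1517.3]  # made-up tokens/sec samples
print(is_regression(history, 1384.0))  # True: well outside the baseline band
```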
Preferred experience:
- Direct experience enabling frontier models (GPT-4 class) on AMD Instinct hardware end-to-end.
- Background in building anomaly detection, log analysis, or observability systems for large-scale distributed GPU infrastructure.
- Familiarity with AMD Instinct MI-series architectures (MI300X, MI350X, MI355X) and RCCL communication library.
- Contributions to open-source AI frameworks (PyTorch, vLLM, SGLang, DeepSpeed, Megatron-LM).
- Experience designing validation frameworks, proxy benchmarks, or synthetic workload suites for GPU infrastructure at scale.
- Experience with pre-silicon software validation or hardware-software co-verification workflows.
- Publications or patents in HPC, ML systems, or GPU kernel optimization.
Academic credentials:
- Bachelor’s, Master’s, or Ph.D. in Computer/Software Engineering, Computer Science, or a related technical discipline.