Lead Platform Engineer - Global AI Platform
Manulife
- Toronto, ON
- Permanent
- Full-time
- Builds and maintains high-performance, fault-tolerant, secure, and scalable AI platform services and abstractions that support diverse AI solutions with automation-first delivery.
- Designs, builds, and maintains the technology platform's features and infrastructure, including hardware, software, and network components.
- Build scalable microservices and event-driven pipelines for model training/inference using Akka Streams and Cluster Sharding.
- Integrate AdaptiveML workflows for continuous/online learning, feature stores, model registries, and A/B experimentation.
- Implement AI Foundry components for orchestration, feature engineering, model deployment, and governance.
- Develop reusable reference patterns and inner-source components that meet reliability, security, and compliance standards.
- Implement shared runtimes for multi-agent coordination, state management, memory persistence, and messaging.
- Design interoperable APIs/SDKs used by data scientists and developers to build agent-powered applications.
- Maintain and improve CI/CD pipelines and developer toolchains for AI services to enable rapid, compliant delivery.
- Evaluate emerging AI/ML infrastructure capabilities; prototype and introduce tools that improve developer productivity and reliability.
- Develop and operate scalable backend services supporting high-traffic agent interactions, retrieval operations, and real-time execution flows.
- Use cloud-native technologies (containers, orchestration, IaC, CI/CD) to deliver reliable, cost-efficient services.
- Optimize runtime performance across CPU/GPU/accelerator workloads.
- Monitors and resolves persistent platform issues, such as bottlenecks, connectivity problems, and system failures, when surfaced by technical support teams.
- Considers compliance and regulatory requirements throughout the platform lifecycle. Implements security measures, such as access controls, encryption, and vulnerability assessments, when applicable.
- Partners with architects and business leaders to design and build robust platforms across all Global AI Platform capability layers.
- 7–10+ years in software engineering; 3+ years leading teams/projects in AI/ML or distributed systems.
- Strong expertise in Akka (Actors, Streams, Cluster, Typed) and event-driven microservices at scale.
- Hands-on experience with AI Foundry and AdaptiveML (or equivalent platforms for model lifecycle, orchestration, and continuous learning).
- Proficiency in Scala or Java (Akka ecosystem), plus Python for ML tooling.
- Experience with stream processing and data pipelines.
- Solid MLOps background: model registries, feature stores, CI/CD for ML, containerization (Docker), orchestration (Kubernetes).
- Cloud proficiency (AWS/Azure), Terraform or other IaC tooling, and secrets/IAM management.
- Deep understanding of distributed systems, observability stacks, and resilience patterns.
- Strong communication, documentation, and stakeholder management skills.
- Experience with online learning, reinforcement learning, or active learning in production.
- Knowledge of responsible AI frameworks, model risk management, and fairness/bias assessment.
- Performance optimization for low-latency inference; GPU/accelerator utilization.
- Experience in regulated industries (e.g., financial services/insurance) with audit and governance requirements.
- We’ll empower you to learn and grow the career you want.
- We’ll recognize and support you in a flexible environment where well-being and inclusion are more than just words.
- As part of our global team, we’ll support you in shaping the future you want to see.