Lead Platform Reliability Engineer, Global AI Platform & Solutions
Manulife
- Toronto, ON
- Permanent
- Full-time
- Reliability and performance: Define SLOs/SLIs, track operations budgets, reduce MTTR, plan capacity, and tune autoscaling.
- Observability: Build and maintain logging, metrics, tracing, and alerting; instrument platform components; create runbooks and dashboards.
- Incident response: On-call for platform incidents; triage, mitigate, root-cause, and drive postmortems and corrective actions.
- Automation and tooling: Develop self-service capabilities, AIOps/MLOps/GitOps/CI/CD pipelines, and operational automations (provisioning, upgrades, backups).
- Infrastructure as code: Manage clusters, networks, storage, and policies via Terraform/Ansible; prevent configuration drift.
- Security and compliance: Enforce identity/RBAC, secrets management, supply chain security, and regulatory controls; collaborate with risk and audit.
- Scalability and cost: Optimize resource usage, plan capacity, control spend (rightsizing, autoscaling, reservations/spot).
- Change management: Safe rollouts, progressive delivery, and policy-as-code guardrails.
- Platform productization: Treat the platform as a product; define operations SLAs in alignment with the product roadmap, service catalog, and developer experience.
- Collaborate with global engineering, security, and AI governance teams to ensure compliance with cross-geo regulations and Asia’s data residency requirements.
- Operate scalable backend services supporting high-traffic agent interactions, retrieval operations, and real-time execution flows.
- Maintain AI services runbooks, playbooks, and enablement for GOCC.
- Bachelor’s in Computer Science/Engineering or equivalent experience (not strictly required if skills demonstrated).
- 5-8 years of experience in DevOps/Platform Engineering or Production Operations.
- Proven track record operating large-scale distributed systems and running on-call.
- Operational experience with cloud-native development: Azure, Kubernetes, containers, CI/CD, and observability stacks.
- Knowledge of Python and/or Java/Scala/TypeScript for building backend services and automation.
- Understanding of AI solutions, LLM systems, retrieval architectures, embeddings, vector stores, prompt/tool orchestration, and agent workflow fundamentals.
- Knowledge of API design, asynchronous workflows, concurrency, reliability engineering (SLOs, error budgets), and performance tuning.
- Familiarity with security, governance, and compliance for AI/data systems (authN/authZ, data protection, audit logging, model governance).
- Ability to collaborate across global teams and translate business requirements into platform capabilities and operational SLAs.
- ITIL & ITSM certification
- Azure Administrator/DevOps certificate (nice to have)
- Kubernetes: CKA/CKS certificate (nice to have)
- HashiCorp Terraform Associate certificate (nice to have)
- We’ll empower you to learn and grow the career you want.
- We’ll recognize and support you in a flexible environment where well-being and inclusion are more than just words.
- As part of our global team, we’ll support you in shaping the future you want to see.