Senior Platform Reliability Engineer - Global AI Platform
Manulife
- Toronto, ON
- Permanent
- Full-time
- Provides a reliable and scalable platform experience to Global AI Platform users
- Responsible for monitoring, analyzing, optimizing, and maintaining the software environment to best support testing and deployment in a continuous integration/continuous delivery environment.
- Develop self-service capabilities, AIOps/MLOps/GitOps/CI/CD pipelines, and operational automations (provisioning, upgrades, backups).
- Manage clusters, networks, storage, and policies via Terraform/Ansible; prevent configuration drift.
- Enforce identity/RBAC, secrets management, supply chain security, and regulatory controls; collaborate with risk and audit.
- Optimize resource usage, plan capacity, control spending (rightsizing, autoscaling, reservations/spot).
- Implement safe rollouts, progressive delivery, and policy-as-code guardrails.
- Resolves persistent platform issues when surfaced by technical support teams
- Provides performance enhancements through automation and pushes for enhanced reliability of the platform to support product development
- Delivers resilient and scalable applications, with a focus on continuous delivery and operational insight
- Collaborates with platform and software engineers, platform reliability engineers, Product Owners, and engineering leadership to uncover pain points and opportunities to accelerate the delivery of new value through software
- Investigates new platform solutions to enhance service delivery experience
- Delivers a good user experience to other engineers, with a focus on self-service and continuous delivery
- Addresses incidents and problems, with rotational accountability for on-call support
- Familiarity with agile and DevOps principles, test-driven development, continuous integration, and other approaches to accelerate the delivery of new features
- Understanding of software development lifecycle
- Understanding of how technology supports Manulife business strategy
- Deep understanding of DevOps principles; prioritizes platform over products
- Attends advanced training sessions and is certified on multiple domains of expertise
- Demonstrates all core skills, and good interpersonal skills for the role
- Good working and background knowledge of area of practice
- Use and combine knowledge of the discipline and the market to formulate the right approach
- Participates in functional demos utilizing new tech; designs own control structures
- Sees actions partly in terms of longer-term goals
- Understands the corporate climate & culture
- Strong knowledge of the business
- Experience with virtual infrastructure and CI/CD tools such as Jenkins, GitHub, TeamCity, etc.
- Experience in languages such as Python, Java, JavaScript, .NET, HTML5, CSS3, Swift and/or similar technologies
- Understanding of systems monitoring tools and analytics (New Relic, Moogsoft, xMatters, etc.)
- Experience with Cloud Foundry and other components supporting a highly-automated global engineering platform
- Collaborative attitude, willingness to work with team members; able to coach, participate in code reviews, share skills and methods
- Constantly learns from both success and failure
- Experience with open-source technologies preferred
- Good organizational and problem-solving abilities that enable you to manage through creative abrasion
- Good verbal and written communication; able to effectively articulate technical vision, possibilities, and outcomes
- Experiments with emerging technologies and understands how they will impact what comes next.
- Bachelor’s in Computer Science/Engineering or equivalent experience (not strictly required if skills demonstrated).
- 5–8+ years in DevOps/Platform Engineering or Production Operations (8+ preferred for senior level).
- Proficiency in Python and/or Java/Scala/TypeScript for backend services and automation.
- Hands-on experience with Azure, Kubernetes, containers, CI/CD, and observability stacks.
- Strong understanding of LLM systems, retrieval architectures, embeddings, vector stores, prompt/tool orchestration, and agent workflow fundamentals.
- Expertise in API design, asynchronous workflows, concurrency, reliability engineering concepts (SLOs, error budgets), and performance tuning.
- Familiarity with security, governance, and compliance for AI/data systems (authN/authZ, data protection, audit logging, model governance).
- Proven track record operating large-scale distributed systems and running on-call.
- Ability to collaborate across global teams and translate business needs into platform capabilities and operational SLAs.
- We’ll empower you to learn and grow the career you want.
- We’ll recognize and support you in a flexible environment where well-being and inclusion are more than just words.
- As part of our global team, we’ll support you in shaping the future you want to see.