Production Support Engineer
Manulife
- Waterloo, ON
- Permanent
- Full-time
You’ll work closely with development teams, operations, and business partners, becoming the steady hands that keep our platforms healthy, scalable, and resilient.
Position Responsibilities:
- While this role is not a pure development position, it blends engineering fundamentals with operational excellence. You will spend your time:
- Responding to daytime production support inquiries
- Improving reliability and stability through proactive engineering
- Managing change, incidents, and problems
- Enhancing observability and health systems
- Ensuring issues not only get resolved, but get resolved permanently
- Act as the primary daytime contact for production‑related questions, blocking issues, and support requests.
- Perform initial triage, resolution, and root cause analysis.
- Collaborate with engineering teams to drive permanent fixes.
- Communicate clearly with stakeholders, ensuring visibility and transparency.
- Use App Insights/Azure Monitor, Log Analytics, and similar tools, and apply experience with SLOs/SLIs, error budgets, MTTR, change failure rate, resource management, DR/BCP, and ITIL‑aligned incident/problem/change practices.
Site Reliability & System Health:
- Strengthen system reliability through monitoring, alerting, and proactive maintenance.
- Improve observability using tools like Moogsoft, New Relic, dashboards, logs, and distributed tracing.
- Build or update runbooks to increase operational readiness.
- Contribute to reliability improvements such as reducing alert noise, closing systemic gaps, and improving service resilience.
- Make thoughtful use of AI tools and AI-assisted workflows to enhance reliability, observability, and resilience.
- 3+ years of experience in technical support, DevOps, or an SRE‑adjacent role.
- Strong problem‑solving and diagnostic skills across distributed systems.
- Hands‑on experience with observability platforms (e.g., New Relic, Moogsoft).
- Solid understanding of incident, change, and problem management standard processes.
- Proficiency with the ServiceNow ITSM platform.
- Experience with SDLC processes, CI/CD pipelines, Infrastructure as Code (IaC), Blue/Green deployments, and standard release management practices.
- Strong grasp of ITSM processes, particularly the ITIL framework, to ensure consistent alignment to service management standards.
- A data‑driven approach with an “automation‑first” perspective.
- Ability to communicate clearly with both technical and non‑technical audiences.
- A “fix it right” mentality, favoring long‑term solutions over repeated manual interventions.
- Curiosity and a desire to grow in site reliability engineering and deepen your technical expertise.
- Full-Stack Skills: Proficiency in modern frontend (React/Angular) and backend (Java/Node.js/Python) development.
- Tooling: Experience with observability and delivery tools, specifically New Relic, Jenkins, GitHub Actions, Snyk, Azure (AKS), SonarQube, and Jira.
- The "Fix it Once" Mentality: You prefer writing a script or refactoring a service over manually fixing the same bug twice.
- Flexibility: Availability to manage production escalations, including after-hours rotations.
What You Will Gain:
- On-Call Compensation: Competitive "Beeper Bucks" (on-call stipend and standby compensation) through our On-Call Incentive Program.
- We’ll empower you to learn and grow the career you want.
- We’ll recognize and support you in a flexible environment where well-being and inclusion are more than just words.
- As part of our distributed team, we’ll support you in shaping the future you want to see.