Site Reliability Principal Specialist, IT Operations
Sherweb
- Quebec City, QC
- $91,000-130,000 per year
- Permanent
- Full-time
- Define and evolve reliability standards across platforms and services, including service level objectives (SLOs) and service level indicators (SLIs), to improve mission-critical services.
- Establish a shared reliability language and expectations across IT Operations teams.
- Drive consistency in monitoring and operational practices across services, systems and platforms.
- Influence system and operational design to improve reliability, availability and resilience.
- Drive the reduction of operational toil through automation, AI, platform capabilities, and repeatable operational patterns.
- Improve end-to-end observability and system understanding, enabling teams to reason clearly about system behavior and failure modes. Improve logging, metrics, tracing, and telemetry across systems.
- Enable teams to take end-to-end ownership of platform reliability, including deeper investigation across infrastructure and application layers.
- Partner closely with infrastructure and platform teams to ensure access, tooling, and visibility support full operational ownership and to drive reliability improvements.
- Act as a reliability advocate and technical advisor during operational reviews, incident learning, and platform evolution.
- Partner closely with DevOps teams to implement reliability and observability as code, ensuring integration with CI/CD pipelines and platform tooling.
- Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field, or equivalent practical experience.
- 10+ years of experience in Site Reliability Engineering, operating and improving large-scale production environments.
- Demonstrated experience improving the reliability, availability, and scalability of production systems, platforms and services.
- Hands-on experience operating distributed systems in business-critical and customer-facing environments.
- Proven experience reducing manual operational work through automation and standardization.
- Experience defining and applying reliability standards (e.g., SLOs, error budgets) across multiple services or platforms.
- Demonstrated ability to influence technical direction across multiple teams without direct authority.
- Strong understanding of distributed systems, failure modes, and operational resilience.
- Solid experience with observability practices (metrics, logs, traces) and system diagnostics.
- Ability to analyze complex systems end-to-end across infrastructure, platform, and application layers.
- Strong systems thinking with a track record of addressing reliability issues through design rather than reactive intervention.
- Experience acting as a trusted technical advisor to senior engineers and leaders.
- Ability to clearly communicate complex reliability concepts to both technical and nontechnical stakeholders.
- Cloud platform certifications: Microsoft Azure Solutions Architect Expert or DevOps Engineer Expert would be a strong asset.
- Certifications related to reliability, operations, or systems engineering (e.g., Kubernetes, Linux, or observability platforms) are considered an asset.
- Equivalent demonstrated expertise through hands-on experience is acceptable in lieu of formal certifications.
- A friendly and diverse work culture with inclusion and equality at the heart of our actions
- State-of-the-art technology and tools
- A results-oriented culture where talent, action, and thinking outside the box are given due recognition
- Annual salary review based on performance
- Generous and caring colleagues of various professional and cultural backgrounds
- A base salary ranging between $91,000 and $130,000 per year
- Vacation time that considers your previous experience
- Advanced paid hours to recharge your batteries (holidays and floating days)
- Flexible benefits plan that adapts to your needs
- Flexible savings fund option
- A monthly home internet allowance
- A career path with opportunities to learn and grow
- Proximity to your direct manager and open, honest communication to support your development
- Multiple initial and on-the-job training opportunities and tools to track your progress and help you scale up in your career