
Senior Site Reliability Engineer
- Brampton, ON
- Permanent
- Full-time
- Champion System Reliability – Lead efforts to monitor, maintain, and enhance the availability, performance, and scalability of complex distributed systems in production.
- Own Incident Response & Resilience Engineering – Drive incident management processes, participate in and lead on-call rotations, perform root cause analyses, and design long-term solutions to prevent recurrence.
- Automate at Scale – Design and implement advanced automation frameworks, tools, and self-healing mechanisms to reduce toil and increase operational efficiency.
- Infrastructure as Code (IaC) Leadership – Architect and maintain infrastructure using tools like Terraform, Ansible, or equivalents, applying version control, modularization, and automation best practices.
- Optimize CI/CD Pipelines – Improve and maintain robust CI/CD workflows, enabling fast, secure, and reliable application delivery with minimal risk.
- Cross-Functional Collaboration & Mentorship – Partner with engineering, platform, and support teams to improve system design and reliability; mentor junior engineers and contribute to a culture of operational excellence.
- Deep understanding of SRE principles and distributed system reliability
- Strong scripting skills (e.g., Python, Bash, Go) for automation and tooling
- Hands-on experience with AWS, GCP, or Azure and with IaC tools such as Terraform or Ansible
- Proven ability to design and improve CI/CD, observability, and alerting systems
- Experience leading incident response, root cause analysis, and postmortems
- Excellent problem-solving and communication skills; mentorship mindset
Candidates who are 18 years or older are required to complete a criminal background check. Details will be provided through the application process.