Director, Site Reliability Engineering
Bank of the West
- Toronto, ON
- Permanent
- Full-time
- Work in collaboration with Application Engineering, Quality, Product and Data Engineering teams to champion SRE/DevOps culture and practices.
- Develop and collaborate with a team of Reliability Engineers who work closely with software development, Quality, Product and Data Engineering teams as champions of SRE/DevOps culture and practices.
- Champion SRE principles to address Mean Time to Resolve (MTTR) and Mean Time to Identify (MTTI) issues while maintaining Service Level Objectives (SLOs), Service Level Agreements (SLAs), and Service Level Indicators (SLIs). Enable observability across end-to-end components and apply Chaos Engineering practices to develop highly available, resilient, and reliable applications and infrastructure that are ready for production.
- Contribute to management of Service Level Objectives with senior development and business leads.
- Lead initiatives to continuously refine our build, plan and deploy practices for improved stability, reliability, efficiency, repeatability and security. You’ll create plans and collaborate with other SREs and DevOps team members, coordinating activity with development and business leads to increase service levels, lower costs, and support delivery velocity objectives.
- Working with application teams, implement and improve service management best practices, and coach teams on them, to improve overall service delivery.
- Contribute to the prioritization of reliability features and to the design, development and delivery of effective tooling, alerts, and automated responses that identify and address reliability risks.
- Contribute to proactive technical communication of reliability, stability and efficiency results (based on Service Level Objectives), service health (via dashboards), and key reliability risks and issues to senior business and technology stakeholders.
- Manage a team of System Reliability Engineers who support Finance and ERPM Applications and Services.
- Ensure solutions are automated where possible while improving operational efficiency, reducing operating risk, delivering quality services and optimizing cost.
- Mentor and coach others within the assigned area and transfer subject matter expertise to other Systems Reliability Engineers where appropriate.
- Regularly connect work to BMO's purpose, set inspirational goals, define clear expected outcomes, and ensure clear accountability for follow-through.
- Build interdependent teams that collaborate across functional and operating groups to create the highest value for all stakeholders.
- 15+ years of work experience in technology (specializing in SRE, DevOps, DevSecOps and cloud computing)
- Proven experience managing technology platforms of large scale and complexity.
- Understanding of the functional aspects and technical behavior of the underlying operating system, development environment, and deployment practices.
- Strong analytical mindset and good communication skills
- Expert in the SRE approach and emerging SRE/Chaos Engineering practices (SLA, SLI, SLO, MTTI, MTTR)
- Hands-on experience with DevOps CI/CD tools, e.g. GitHub, Jenkins, Ansible, UrbanCode Deploy
- Hands-on experience with Docker and/or Kubernetes
- Hands-on experience with Agile methodologies, e.g. Scrum, Kanban
- Experience with ITSM tools (ServiceNow a plus) and a strong understanding of SRE and service management principles.
- Drive alignment with, and improvement of, broader Enterprise services.
- Apply SRE techniques to DevOps and Compute Services.
- Lead hands-on automation and elimination of manual toil.
- Coach application teams on how to leverage DevOps offerings and help drive productivity gains.
- Partner on or lead new tool adoption. Recommend improvements to process.