
Lead Site Reliability Administrator
- Waterloo, ON
- Permanent
- Full-time
OpenText is a global leader in information management, where innovation, creativity, and collaboration are the key components of our corporate culture. As a member of our team, you will have the opportunity to partner with the most highly regarded companies in the world, tackle complex issues, and contribute to projects that shape the future of digital transformation.

OpenText™ Cybersecurity SMB & Consumer solutions help organizations protect their most valuable and sensitive information with products, services, and training. The OpenText Cybersecurity portfolio offers a broad range of data security and data protection solutions that help companies prevent, detect, respond to, and recover from the latest cybersecurity threats.

As a global leader in secure information management, OpenText empowers businesses to stay ahead of ever-evolving cyber threats. Our Cybersecurity Enterprise portfolio is formidable, offering innovative solutions that safeguard organizations from malicious attacks, data breaches, and cyber vulnerabilities. By joining our team, you'll be at the forefront of developing and implementing state-of-the-art security technologies, protecting critical assets and sensitive information for clients worldwide.

Your Impact:
As a Lead Site Reliability Administrator, you will be responsible for architecting, implementing, and maintaining resilient cloud infrastructure across AWS and Azure. You will champion reliability engineering principles, drive automation, and collaborate closely with development, product, and operations teams to ensure our systems are scalable, secure, and performant. This role demands deep technical expertise, strategic thinking, and a passion for continuous improvement.

As a Lead Site Reliability Administrator, you will:
- Architect and maintain scalable, secure, and highly available cloud infrastructure in AWS and Azure, with a focus on automation and reliability.
- Lead the implementation, management, and optimization of Kubernetes (EKS) environments, including service mesh, autoscaling, and GitOps workflows.
- Collaborate closely with development and product teams to ensure infrastructure supports rapid delivery, performance, and resilience of applications.
- Build and maintain CI/CD pipelines using Jenkins, GitLab, and Terraform, enabling consistent and efficient deployment practices.
- Drive observability and monitoring strategies using tools like CloudWatch, Prometheus, Grafana, and OpenTelemetry to ensure system health and performance.
- Apply SRE principles such as SLIs, SLOs, and error budgets to guide operational decisions and improve service reliability.
- Cloud Platforms: Deep expertise in AWS (EKS, Lambda, RDS, PrivateLink, WAF, NLB, API Gateway, DynamoDB) and Istio; working knowledge of Azure.
- Operating Systems: Proficiency in Linux administration and working knowledge of Windows.
- Scripting & Automation: Strong skills in Bash and Python.
- Monitoring & Observability: Experience with CloudWatch, Grafana, Prometheus, Zabbix, OpenTelemetry.
- CI/CD & IaC: Proven experience with Jenkins, GitLab, and Terraform.
- Service Management: Administration of OpenSearch and other third-party services.
- Experience managing MongoDB Atlas, ArangoDB, or Confluent Kafka.
- Familiarity with reliability engineering principles: SLOs, SLIs, error budgets, and service level management.
- Demonstrated success as a steward of change, driving adoption of agile methodologies, automation, and process improvements.
- Strong project management skills, including stakeholder engagement and collaboration with developers and product managers.
- Solid understanding of version control systems (Git), including branching strategies and release management.