
Senior Cloud Platform Developer
- Ottawa, ON
- Permanent
- Full-time
- Design, deploy, and manage Apache Kafka clusters in development/testing/production environments.
- Proven experience deploying and managing Apache Spark and Apache Flink in production environments.
- Optimize Kafka performance, reliability, and scalability for high-throughput data pipelines.
- Ensure seamless integration of Kafka with other systems and services.
- Manage and troubleshoot Linux-based systems (Ubuntu) supporting Kafka infrastructure.
- Deploy, operate, and fine-tune Kafka on Kubernetes clusters using Helm, Operators, or custom manifests.
- Collaborate with cross-functional teams to identify and implement Kafka use cases.
- Contribute to automation and Infrastructure as Code (IaC) practices through CI/CD pipelines with GitLab.
- Monitor system health, implement alerting, and ensure high availability.
- Participate in incident response and root cause analysis for Kafka and related systems.
- Evaluate and recommend Kafka ecosystem tools like Kafka Connect, Schema Registry, MirrorMaker, and Kafka Streams.
- Build automation and observability tools for Kafka using Prometheus, Grafana, Fluent Bit, etc.
- Deep understanding of streaming and batch processing architectures.
- Familiarity with Spark Structured Streaming and Flink DataStream API.
- Work with teams to build end-to-end Kafka-based pipelines for various applications (data integration, event-driven microservices, logging, monitoring).
- Experience running Spark and Flink on Kubernetes, YARN, or standalone clusters.
- Proficiency in configuring resource allocation, job scheduling, and cluster scaling.
- Knowledge of checkpointing, state management, and fault tolerance mechanisms.
- Ability to tune Spark and Flink jobs for low latency, high throughput, and resource efficiency.
- Experience with memory management, shuffle tuning, and parallelism settings.
- Familiarity with Spark UI, Flink Dashboard, and integration with Prometheus/Grafana.
- Ability to implement metrics collection, log aggregation, and alerting for job health and performance.
- Understanding of TLS encryption, Kerberos, and RBAC in distributed environments.
- Experience integrating with OAuth or other identity providers.
- Familiarity with time-series databases.
- 5+ years of experience administering and supporting Apache Kafka in production environments.
- Strong expertise in Linux system administration (Red Hat and Debian).
- Solid experience with Kubernetes (CNCF distributions, OpenShift, Rancher, or upstream K8s).
- Proficiency in scripting (Bash, Python) and automation tools (Ansible, Terraform).
- Experience with Kafka security, monitoring (Prometheus, Grafana, Istio), and schema management.
- Familiarity with CI/CD pipelines and DevOps practices.
- Proficient in scripting and automation (Bash, Python, or Ansible).
- Comfortable with Helm, YAML, Kustomize, GitLab, and GitOps principles.
- 4+ years of experience in Apache Spark development, including building scalable data pipelines and optimizing distributed processing.