
High-Performance Computing Team Leader
- Ontario
- Permanent
- Full-time
- Monitor, maintain, and troubleshoot HPC nodes.
- Manage hardware, install software, and configure and optimize the environment.
- Identify, diagnose, and resolve level-two software and hardware issues for users.
- Act as second-line support for issues escalated by first-line support.
- Oversee the day-to-day operations and support of the Linux-based HPC environment.
- Take ownership of the capacity, availability, and performance of the HPC clusters.
- Migrate existing nodes.
- Implement and manage a system based on Foreman (or similar) to handle patching and cluster management.
- Implement patches and upgrades to Linux, Slurm, and OpenHPC.
- Install new servers and storage, build new clusters, and configure and manage Linux distributions, hypervisors (KVM), and tooling.
- Provide day-to-day operational support.
- Diploma or degree in a relevant area of study; Computer Science combined with demonstrated operational network experience is preferred.
- Experience with HPC tools such as Slurm, OpenHPC, LSF, or GridEngine.
- Experience using KVM or other hypervisors.
- Knowledge of HPC clusters and use cases.
- Experience installing and operating Linux platforms (Ubuntu/Red Hat) in an enterprise environment.
- Experience with identity management using Microsoft Identity Manager and Azure AD Connect.
- Industry certifications such as MCSE or CISSP are desired.