
AIOps Specialist
- Montreal, QC
- Permanent
- Full-time

Responsibilities
- Model Deployment and Integration: Design, build, and maintain the infrastructure for deploying generative AI models like Stable Diffusion, FLUX, and various Large Language Models (LLMs) into production environments.
- Hands-on Implementation: Install, configure, manage, and troubleshoot new and untested Python-based AI applications and libraries on Linux systems.
- Rapid Prototyping: Quickly develop and code proofs-of-concept and interactive demos to showcase the potential of new AI models and features.
- Model Customization and Training: Take the lead on training diffusion models and fine-tuning LLMs to meet specific project requirements.
- Dataset and Asset Creation: Build and manage high-quality datasets for training and create custom LoRA (Low-Rank Adaptation) models to adapt foundation models for specialized tasks.
- AIOps Pipeline Management: Develop and maintain CI/CD pipelines for machine learning, automating the processes of model training, evaluation, and deployment.
- Full-Stack Development: Write and adjust code across the stack (Python, PHP, Bash, HTML, JavaScript) to ensure seamless integration of AI models with our platforms.
- Performance Monitoring: Collaborate with DevOps and Engineering teams to monitor model performance, troubleshoot integration issues, and ensure scalable, secure deployments using tools like vLLM for high-throughput serving.
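
To give a concrete flavor of the vLLM serving work mentioned above, here is a minimal sketch of batch inference with vLLM's Python API; the model name and sampling settings are illustrative assumptions, not project specifics.

```python
# Minimal vLLM sketch: load an LLM and run high-throughput batch generation.
# The model name and sampling values below are illustrative placeholders.
from vllm import LLM, SamplingParams

prompts = [
    "Summarize the benefits of LoRA fine-tuning in one sentence.",
    "Explain what a CI/CD pipeline does for ML model deployment.",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

# vLLM batches requests and manages GPU memory internally for throughput.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```

For production serving, vLLM also provides an OpenAI-compatible HTTP server (e.g., via `python -m vllm.entrypoints.openai.api_server`), which is the more typical path for scalable, high-throughput deployments.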

Qualifications
- Bachelor's degree in Computer Science/Engineering, IT, or a related field.
- 3-5 years of experience in a DevOps, AIOps, or Backend Engineering role.
- Expert-level knowledge of Python and standard ML frameworks (e.g., PyTorch, TensorFlow).
- Strong proficiency in PHP, Bash, HTML, and JavaScript for full-stack integration.
- Extensive, hands-on experience with Linux environments and command-line operations.
- Proven experience deploying and managing generative AI models, specifically Stable Diffusion, FLUX models, and various LLMs.
- Demonstrated experience with the entire model customization lifecycle: building datasets, training diffusion models, and fine-tuning LLMs.
- Practical experience creating and applying LoRA adapters for model adaptation and using serving frameworks like vLLM (a short LoRA sketch follows this list).
- A love for making things efficient and reliable, and a proactive mindset that takes initiative.
- Experience with containerization and orchestration tools like Docker and Kubernetes.
- Familiarity with building and managing CI/CD pipelines and using infrastructure-as-code tools (e.g., Terraform).
- Industry certifications (e.g., AWS, GCP, Azure, CKA/CKAD).
- A professional context switcher who can move adeptly between different projects and technologies.
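
As a rough illustration of the LoRA adaptation experience described above, the sketch below attaches LoRA adapters to a Hugging Face causal LM using the PEFT library; the base model, rank, and target modules are assumptions chosen for demonstration only.

```python
# Minimal LoRA sketch with Hugging Face PEFT: wrap a base causal LM with
# low-rank adapters so only a small set of parameters is trained.
# Model name, rank, and target modules are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # confirms only adapter weights are trainable
```

In practice, the wrapped model is then trained on a curated dataset, and the resulting adapter weights are saved and either merged into the base model or loaded alongside it at serving time.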