Selby Jennings | Chicago, IL, USA

High Performance Computing Specialist | Chicago, IL, USA | Hybrid

Description

An elite Montreal-based trading firm is seeking an HPC Systems Specialist to join a team responsible for designing and operating high performance GPU platforms that support advanced AI and machine learning workloads. This role sits at the intersection of infrastructure engineering, distributed systems, and performance tuning, with ownership spanning from physical hardware through large-scale model serving. You will work closely with ML practitioners and infrastructure peers to build reliable, scalable, and highly optimized compute environments.

What You'll Do

• Build, operate, and continuously improve GPU-based compute platforms supporting large-scale inference and ML workloads
• Design and deploy distributed model serving architectures across multi-node, multi-GPU environments
• Operate and evolve Kubernetes clusters with GPU scheduling for AI and ML use cases
• Configure and tune networking components such as load balancers, firewall rules, and high-throughput interconnects for GPU clusters
• Develop and optimize storage solutions for model artifacts, checkpoints, and inference caches
• Diagnose and resolve performance and stability issues across hardware, drivers, networking, and application layers
• Partner with ML engineers to benchmark models, analyze performance characteristics, and apply inference acceleration strategies
• Evaluate new GPU hardware, serving frameworks, and infrastructure patterns to improve efficiency and scalability
• Improve system reliability through observability, alerting, capacity planning, and on-call/incident response processes
• Automate provisioning and lifecycle management using infrastructure-as-code and scripting

What You Bring

• Bachelor's or Master's degree in Computer Science, Engineering, or a related discipline
• 5+ years of experience managing high performance computing environments
• Hands-on experience operating GPU compute environments for ML inference or training
• Familiarity with modern model serving frameworks (e.g., vLLM, SGLang, or similar) and GPU driver/runtime management
• Strong Linux systems expertise, including networking, storage, and kernel-level performance considerations
• Practical experience running GPU workloads on Kubernetes at scale
• Experience with infrastructure automation tools such as Terraform, Ansible, or equivalent
• Solid understanding of distributed systems concepts, networking fundamentals (TCP/IP, HTTP/2), and load-balancing strategies
• Proficiency in Python and shell scripting for tooling and automation
• Experience with monitoring and observability platforms such as Prometheus, Grafana, or comparable tools

This is a hybrid role in the firm's Montreal office, requiring 3 days per week onsite and 2 days remote.

Skills

Machine Learning, Shell, Terraform, Prometheus, Linux, Ansible, Grafana, Kubernetes, Python, ML, AI
