Systems Engineer (HPC)
Description
We are seeking a highly skilled Systems Engineer specializing in High Performance Computing (HPC) to support, maintain, and optimize our HPC infrastructure. The ideal candidate has deep technical expertise, hands-on experience with HPC environments, and a strong understanding of performance engineering, systems operations, and automation.
Project Start: ASAP
Project Duration: Until December 2026
Location: Remote (with on-site onboarding in Cologne)
English: Fluent
German: A plus
Responsibilities
Incident & Service Operations
- Incident Management: Respond to, diagnose, and resolve HPC-related incidents to ensure system stability and minimize downtime.
- Service Request Management: Process and fulfill service requests related to HPC resources, tooling, and services.
Technical Tasks
- Troubleshooting: Investigate and resolve complex technical issues across HPC clusters, applications, networking, and performance workflows.
- Testing & Validation: Develop, execute, and document test plans to validate system reliability, scalability, and performance.
- Documentation: Create and maintain detailed documentation on system architecture, configurations, workflows, and optimizations.
- Manage, monitor, and optimize HPC clusters, job scheduling systems, and related infrastructure.
- Analyze performance bottlenecks and apply optimization techniques across compute, memory, and networking layers.
- Support software development, integration, and deployment workflows within HPC environments.
Required Qualifications
- Minimum 3 years of experience in software development and/or systems engineering with a strong focus on HPC environments.
- Expertise in Linux operating systems, specifically Red Hat Enterprise Linux (RHEL).
- Strong programming/scripting skills: C, C++, Python, Bash, Ansible.
- Hands-on experience with parallel computing frameworks: MPI, OpenMP, CUDA.
- Solid knowledge of computer architecture, performance tuning, and system optimization.
- Experience managing HPC clusters, including job schedulers (e.g., Slurm, PBS, LSF).
- Strong networking knowledge, particularly InfiniBand.
- Understanding of ITIL best practices, especially Incident Management, Service Management, and Process Optimization.
Soft Skills
- Strong analytical and problem-solving capabilities
- Ability to work in distributed, remote teams
- Clear communication and documentation skills
- Proactive, structured, and solution-oriented mindset