Supercomputer System Engineer
Description
Job Overview

We are looking for an experienced HPC Systems Engineer to support and operate large-scale Linux-based High-Performance Computing (HPC) environments. This role focuses on maintaining reliable, secure, and high-performance computing platforms that support research, academic, and enterprise workloads. The role involves close collaboration with researchers, engineers, and IT teams to ensure smooth day-to-day operations, resolve technical issues, and optimize system performance.

Key Responsibilities

HPC Operations and Support
- Operate, administer, and maintain Linux-based HPC infrastructure, including compute nodes, storage platforms, and high-speed networks
- Ensure system availability, stability, and performance through proactive monitoring and maintenance
- Perform patching, upgrades, and capacity planning activities

Cluster, Scheduler, and Storage Management
- Support and manage HPC workload schedulers and resource management platforms
- Maintain and support parallel and high-performance file systems used in HPC environments
- Manage cluster provisioning, configuration, and lifecycle activities

Incident and Escalation Handling
- Investigate and resolve infrastructure issues across hardware, operating systems, applications, and networks
- Participate in on-call or escalation support rotations as required
- Work closely with software engineering and desktop support teams to address user-related issues

User Enablement and Application Support
- Provide technical guidance to users on running, debugging, and optimizing HPC workloads
- Support compute-intensive, AI, and data-driven applications
- Advise users on best practices for application parallelization and performance optimization

Training and Knowledge Management
- Conduct user briefings or training sessions on HPC usage and operational best practices
- Develop and maintain technical documentation, guidelines, and operational procedures
- Contribute to continuous improvement initiatives and knowledge sharing within the team
Requirements

Education and Experience
- Bachelor’s degree in Computer Science, Engineering, or a related discipline
- Typically 5 or more years of experience supporting or operating HPC or large-scale Linux environments

Technical Skills
- Strong hands-on experience with Linux operating systems
- Experience with HPC schedulers and resource management tools
- Exposure to parallel or distributed file systems
- Understanding of HPC performance monitoring, tuning, and optimization concepts

Added Advantage
- Experience with HPC application optimization or parallel computing approaches
- Familiarity with programming languages or libraries commonly used in HPC environments
- Exposure to scientific, simulation, or compute-intensive workloads

Attributes and Soft Skills
- Strong analytical and troubleshooting skills
- Self-driven with the ability to work independently and collaboratively
- Clear written and verbal communication skills
- Ability to explain complex technical concepts to non-technical users
- Commitment to continuous