Location: Jerusalem

Senior SRE & Linux Infrastructure Engineer - ML Platform

Project-Based

Description

Posted 22 hours ago by a confidential company.

Our ML platform group builds and operates the core infrastructure that powers large-scale AI workloads. We manage a massive, high-performance environment consisting of both multi-cloud clusters and on-prem bare-metal nodes optimized with AI accelerators.

We are looking for a highly experienced Senior SRE / Linux systems engineer who thrives on managing complex, low-level infrastructure. This isn't just a cloud-configuration role: you will be responsible for the health and performance of expensive, high-density hardware. You must be an expert at troubleshooting open-source systems and "living" inside Linux environments to ensure our AI clusters run at peak efficiency.

What will your job look like?

- Build and maintain infrastructure for large-scale AI and HPC workloads across on-prem and cloud environments
- Operate and enhance our multi-cloud, multi-cluster scheduling platform
- Troubleshoot complex issues across the stack: from kernel-level tuning and drivers to networking, storage, and distributed system bottlenecks
- Ensure the reliability of critical platform services: queuing systems, time-series databases, and logging pipelines
- Develop deeply integrated automation and tooling
- Collaborate with ML engineers and IT engineers to optimize hardware utilization for data-intensive workloads
- Drive best practices in system design, observability, and infrastructure-as-code

Requirements:

All you need is:

- 10+ years of hands-on experience in SRE, Linux administration, or systems engineering
- Expert-level Linux knowledge: deep understanding of system internals, debugging, performance tuning, and the ability to solve failures where hardware meets software
- Kubernetes expertise: proven experience managing K8s at scale (both managed EKS and bare-metal deployments)
- Distributed systems mastery: hands-on experience debugging and maintaining:
  - Queuing systems: RabbitMQ or similar
  - Metrics/observability stacks: Prometheus, Thanos, and Grafana, or similar
  - Logging: Elasticsearch or similar
  - Relational databases: PostgreSQL or similar
- Infrastructure-as-code: proficiency with Terraform, Helm, and configuration management
- Networking & scripting: strong fundamentals in networking and proficiency in Bash
- Familiarity with GPU/accelerator scheduling and AI/ML pipelines
- Experience with multi-cloud architectures and hybrid environments
- Experience with workflow orchestration tools (e.g., Argo Workflows)

What we offer:

- Impact: support the engineering that advances our AI and global transportation safety
- Cutting-edge hardware: work with high-value, AI-optimized bare-metal clusters at a massive scale
- Technical depth: a highly technical environment focused on solving deep systems engineering challenges
- Collaboration: work alongside elite ML, software, and systems engineers

We change the way we drive, from preventing accidents to semi- and fully autonomous vehicles. If you are an excellent, bright, hands-on person with a passion to make a difference, come to lead the revolution!

This position is open to all candidates.

Skills

Kubernetes, RabbitMQ, Bash, Grafana, Prometheus, Terraform, Linux, Machine Learning, PostgreSQL, Elasticsearch, Helm
