Arc · Multiple locations

Mid-level Software Engineer (Generative AI Cloud Infrastructure) - Perm - US/UK/Europe

Description

Mid-Level Software Engineer

  • AI Cloud & LLM Infrastructure · Full-Time · Remote or Hybrid · Founding Team Opportunity

About Us

We are building a Gen AI Acceleration Cloud: an end-to-end platform for the full generative AI lifecycle. Our focus is delivering blazing-fast LLM inference, scalable fine-tuning, and modern AI cloud infrastructure built on GPUs, SmartNICs/DPUs, and ultra-fast networking fabrics.

Our platform powers mission-critical workloads with:

  • On-demand & managed Kubernetes clusters
  • Slurm-based training clusters
  • High-performance inference services
  • Distributed fine-tuning and eval pipelines
  • Global data centers & heterogeneous GPU fleets

We are looking for a junior-to-mid-level Software Engineer to design, build, and scale the core systems behind our AI cloud.

What You’ll Work On

AI Cloud Infrastructure

  • Develop and maintain reliable backend services running across cloud data centers.
  • Assist in building automation for GPU management, VM provisioning, and high-throughput storage systems.
  • Contribute to distributed systems and pipelines that support AI workloads.

LLM & GPU Virtualization Platform

  • Help build the software layer for GPU clusters with modern accelerators (H100, GB200, GB300).
  • Work on GPU virtualization and management (PCIe passthrough, MIG, SR-IOV) under guidance.
  • Support scaling and optimization of storage and data systems for AI training datasets.

Observability, Reliability & Automation

  • Contribute to monitoring and observability stacks (Prometheus, Grafana, OpenTelemetry).
  • Help implement automated node lifecycle management for distributed training and inference.
  • Assist in building testing frameworks for resiliency and fault tolerance.

Core Platform Engineering

  • Contribute to internal and open-source platform components.
  • Build developer tooling, SDKs, and documentation for platform services.
  • Support research and implementation for decentralized AI workloads under senior guidance.

Requirements

  • 2–5 years of production software engineering experience.
  • Proficiency in at least one backend language (Golang preferred; Python or Rust also valued).
  • Experience contributing to distributed systems or high-performance services.

Cloud & Systems Knowledge

  • Familiarity with cloud platforms (AWS, GCP, or Azure) and distributed microservices.
  • Understanding of concurrency, memory management, and high-performance I/O.
  • Exposure to system design and reliability concepts.

Infrastructure / DevOps Skills (Plus)

  • Experience with Kubernetes, Docker, or similar container orchestration.
  • Familiarity with Terraform, Ansible, CI/CD pipelines, and monitoring tools.

Virtualization & Compute (Optional / Nice to Have)

  • Exposure to GPU virtualization, CUDA, or distributed ML training stacks.
  • Basic understanding of hypervisors or PCIe passthrough.

Networking (Optional / Nice to Have)

  • Familiarity with VLAN/VXLAN, RDMA/Infiniband, or high-performance networking concepts.

Responsibilities

  • Build and maintain backend and infrastructure components for AI workloads.
  • Collaborate with senior engineers on GPU clusters, storage systems, and virtualization platforms.
  • Assist in end-to-end service delivery from design to operation.
  • Contribute to testing frameworks and automation for reliability.
  • Work closely with cross-functional teams including ML engineers, product, and hardware teams.

Who You Are

  • A technically curious engineer who enjoys complex systems work.
  • Able to communicate ideas clearly and document work for others.
  • Motivated by building infrastructure that supports cutting-edge AI.
  • Collaborative, adaptable, and comfortable in a fast-moving startup environment.

Skills

OpenTelemetry · Machine Learning · LLM · Kubernetes · Platform Engineering · CI/CD · AI · ML · Azure · Terraform · Prometheus · Ansible · DevOps · Docker · Microservices · Python · Grafana · AWS · Rust · System Design · GCP · Go