Senior Software Engineer (Generative AI Cloud Infrastructure) - Perm - US/UK/Europe
Description
Full-Time · Remote or Hybrid · Founding Team Opportunity
About Us
We are building a Gen AI Acceleration Cloud: an end-to-end platform for the full generative AI lifecycle. Our focus is to deliver blazing-fast LLM inference, scalable fine-tuning, and modern AI cloud infrastructure built on GPUs, SmartNICs/DPUs, and ultra-fast networking fabrics.
Our platform powers mission-critical workloads with:
● On-demand & managed Kubernetes clusters
● Slurm-based training clusters
● High-performance inference services
● Distributed fine-tuning and eval pipelines
● Global data centers & heterogeneous GPU fleets
We are looking for a Senior Software Engineer to design, build, and scale the core systems behind our AI cloud.
What You’ll Work On
High-Performance AI Cloud Infrastructure
● Design and maintain fault-tolerant, high-availability backend services running across global data centers.
● Build operators and automation systems for:
○ GPU management
○ InfiniBand partitioning
○ VM provisioning
○ High-throughput storage provisioning
LLM & GPU Virtualization Platform
● Build the IaaS software layer for new GPU clusters with thousands of next-gen accelerators (H100, GB200, GB300).
● Work on scalable GPU virtualization (PCIe passthrough, MIG, SR-IOV, VFIO).
Massive-Scale Storage & Data Systems
● Contribute to a global multi-exabyte, high-performance object store optimized for pretraining datasets.
● Build distributed data loaders, caching layers, metadata services, and throughput-optimized pipelines.
Observability, Reliability & Automation
● Develop advanced observability stacks (Prometheus, Grafana, OpenTelemetry).
● Design automated node lifecycle management for large-scale distributed training and inference.
● Build robust testing frameworks for resiliency, failover, and fault tolerance.
Core Platform Engineering
● Contribute to core internal and open-source platform components.
● Write tooling, SDKs, and documentation for developer-facing services.
● Research decentralized AI workloads and build reference architectures.
Requirements
Fundamentals
● 5+ years of production software engineering experience.
● Strong proficiency in one or more backend languages (Golang highly preferred; Rust/Python also valued).
● 5+ years building high-performance, well-tested, production-grade distributed services.
Cloud & Systems Experience
● Experience with distributed microservices across AWS/GCP/Azure.
● Deep understanding of systems fundamentals:
○ Concurrency
○ Memory management
○ High-performance I/O
○ Distributed consensus
○ Large-scale system design
Kubernetes / Infrastructure Expertise (Big Plus)
● Kubernetes internals: custom operators, CRDs, schedulers, or networking/storage plugins.
● Experience with Cluster API, KubeVirt, or similar orchestration tooling.
Virtualization / Compute (Big Plus)
● Experience with hypervisors (QEMU/KVM, cloud-hypervisor).
● PCIe passthrough, SR-IOV, GPU virtualization, MIG, NVLink topologies.
● Experience with DPUs/SmartNICs.
Networking (Big Plus)
● InfiniBand / RDMA
● VLAN/VXLAN/VPC
● OVS/OVN
● High-performance data center networking
High-Performance Compute (Plus)
● CUDA, NCCL, GPU drivers, parallel training stacks
● Experience with GPU scheduling, workloads, and distributed ML
Infrastructure Automation & Tooling (Expected)
● Terraform, Ansible, CI/CD
● GitHub Actions, ArgoCD
● Prometheus, Grafana, ELK, OpenTelemetry
Preferred Experience
● Built or operated IaaS/PaaS systems
● Experience with large-scale storage systems (Ceph, Lustre, or custom object stores)
● Knowledge of vLLM, TensorRT-LLM, TGI, or other LLM-serving frameworks
● Experience building infrastructure for ML training, inference, or fine-tuning
Responsibilities
● Perform architecture & research for distributed and decentralized AI workloads.
● Build and maintain foundational infrastructure powering training, inference, and fine-tuning.
● Contribute to core, open-source platform components.
● Own end-to-end services from design → implementation → operations.
● Create testing frameworks for robustness, failover, and performance.
● Collaborate across hardware, product, and ML teams to design next-gen infrastructure.
Who You Are
● A deeply technical engineer who thrives in complex systems work.
● Strong communicator who writes clear design docs.
● Curious, low-ego, and great at collaborating with cross-functional teams.
● Motivated by building world-class AI infrastructure from the ground up.
● Thrives in zero-to-one, fast-moving startup environments.
Compensation
● Competitive salary
● Meaningful early equity
● Benefits
● Salary determined by experience and location.