Arc · Multiple locations

Mid-level Software Engineer (Generative AI Cloud Infrastructure) - Perm - US/UK/Europe

Description

Mid-Level Software Engineer

  • AI Cloud & LLM Infrastructure · Full-Time · Remote or Hybrid · Founding Team Opportunity

About Us

We are building a Gen AI Acceleration Cloud: an end-to-end platform for the full generative AI lifecycle. Our focus is delivering blazing-fast LLM inference, scalable fine-tuning, and modern AI cloud infrastructure built on GPUs, SmartNICs/DPUs, and ultra-fast networking fabrics.

Our platform powers mission-critical workloads with:

  • On-demand & managed Kubernetes clusters
  • Slurm-based training clusters
  • High-performance inference services
  • Distributed fine-tuning and eval pipelines
  • Global data centers & heterogeneous GPU fleets

We are looking for a junior-to-mid-level Software Engineer to design, build, and scale the core systems behind our AI cloud.

What You’ll Work On

AI Cloud Infrastructure

  • Develop and maintain reliable backend services running across cloud data centers.
  • Assist in building automation for GPU management, VM provisioning, and high-throughput storage systems.
  • Contribute to distributed systems and pipelines that support AI workloads.

LLM & GPU Virtualization Platform

  • Help build the software layer for GPU clusters with modern accelerators (H100, GB200, GB300).
  • Work on GPU virtualization and management (PCIe passthrough, MIG, SR-IOV) under guidance.
  • Support scaling and optimization of storage and data systems for AI training datasets.

Observability, Reliability & Automation

  • Contribute to monitoring and observability stacks (Prometheus, Grafana, OpenTelemetry).
  • Help implement automated node lifecycle management for distributed training and inference.
  • Assist in building testing frameworks for resiliency and fault tolerance.

Core Platform Engineering

  • Contribute to internal and open-source platform components.
  • Build developer tooling, SDKs, and documentation for platform services.
  • Support research and implementation for decentralized AI workloads under senior guidance.

Requirements

  • 2–5 years of production software engineering experience.
  • Proficiency in at least one backend language (Golang preferred; Python or Rust also valued).
  • Experience contributing to distributed systems or high-performance services.

Cloud & Systems Knowledge

  • Familiarity with cloud platforms (AWS, GCP, or Azure) and distributed microservices.
  • Understanding of concurrency, memory management, and high-performance I/O.
  • Exposure to system design and reliability concepts.

Infrastructure / DevOps Skills (Plus)

  • Experience with Kubernetes, Docker, or similar container orchestration.
  • Familiarity with Terraform, Ansible, CI/CD pipelines, and monitoring tools.

Virtualization & Compute (Optional / Nice to Have)

  • Exposure to GPU virtualization, CUDA, or distributed ML training stacks.
  • Basic understanding of hypervisors or PCIe passthrough.

Networking (Optional / Nice to Have)

  • Familiarity with VLAN/VXLAN, RDMA/Infiniband, or high-performance networking concepts.

Responsibilities

  • Build and maintain backend and infrastructure components for AI workloads.
  • Collaborate with senior engineers on GPU clusters, storage systems, and virtualization platforms.
  • Assist in end-to-end service delivery from design to operation.
  • Contribute to testing frameworks and automation for reliability.
  • Work closely with cross-functional teams including ML engineers, product, and hardware teams.

Who You Are

  • A technically curious engineer who enjoys complex systems work.
  • Able to communicate ideas clearly and document work for others.
  • Motivated by building infrastructure that supports cutting-edge AI.
  • Collaborative, adaptable, and comfortable in a fast-moving startup environment.

Skills

OpenTelemetry · Machine Learning · LLM · Kubernetes · Platform Engineering · CI/CD · AI · ML · Azure · Terraform · Prometheus · Ansible · DevOps · Docker · Microservices · Python · Grafana · AWS · Rust · System Design · GCP · Go