Arc · Multiple locations

CLI Test and Integration Engineer - Perm - US/UK/Western Europe

Description

CLI Test and Integration Engineer (Chaos Testing, Integration Tests)

Full-Time · Remote or Hybrid · High-Impact Role

About Odyn

Odyn is at the forefront of AI innovation, building transformative AI solutions through cutting-edge, high-performance infrastructure. We're seeking a CLI Test and Integration Engineer to design and execute chaos engineering experiments, build integration test suites, and ensure our GPU infrastructure withstands real-world failure scenarios.

What You'll Do

Chaos Engineering & Testing

● Build hypothesis-driven chaos experiments using Gremlin, Chaos Monkey, LitmusChaos, or AWS FIS to inject controlled failures across GPU infrastructure, schedulers, API gateways, and storage layers.
● Design automated integration tests for distributed AI infrastructure components and end-to-end workflows.
● Build CLI testing frameworks for developer and operator tools, validating behavior across environments and edge cases.

CI/CD & System Validation

● Embed chaos, integration, and CLI tests into CI/CD pipelines (GitHub Actions, GitLab CI, ArgoCD, Jenkins) with intelligent orchestration and automated rollback.
● Test platform behavior under network partitions, node failures, high-load scenarios, and degraded performance.
● Validate failover mechanisms, data replication, and observability systems during failures.

Collaboration & Culture

● Partner with SRE, infrastructure, and backend teams to improve system resilience and testability.
● Conduct architecture reviews to identify weaknesses and create incident response documentation.

What We're Looking For

Must-Have

● 5–7+ years in test automation, chaos engineering, SRE, or distributed systems testing.
● Hands-on chaos engineering experience (Gremlin, Chaos Monkey, LitmusChaos, AWS FIS).
● Strong integration testing experience with distributed systems and cloud-native architectures.
● Proficiency in Python and/or Go; deep experience with pytest, Robot Framework, Playwright, or similar.
● Kubernetes expertise and cloud platform experience (AWS/GCP/Azure).
● CI/CD pipeline integration and strong Linux/Unix skills.

Nice-to-Have

● GPU workload, HPC, or AI/ML infrastructure testing experience.
● High-performance networking (InfiniBand, RoCE, NVLink) or GPU schedulers (Kubernetes, Slurm, Ray).
● Observability stacks (Prometheus, Grafana, OpenTelemetry) or infrastructure-as-code (Terraform, Ansible).
● Prior experience at Netflix, Google, AWS, or AI infrastructure startups.

Why Join Us

● Shape the reliability of a cutting-edge AI infrastructure platform from the ground up.
● Work at the frontier of chaos engineering applied to GPU infrastructure and distributed AI systems.
● Collaborate with world-class SRE and infrastructure teams.
● Competitive compensation + remote flexibility.

We strongly encourage applications from candidates with chaos engineering or distributed systems testing experience on GPU clusters, Kubernetes, or AI/ML platforms.

Skills

GitLab CI, GitLab, OpenTelemetry, Machine Learning, Integration Testing, Kubernetes, CI/CD, AI, Playwright, ML, Azure, API, SRE, Terraform, Prometheus, Unix, Ansible, Go, Jenkins, ArgoCD, Pytest, Python, Grafana, AWS, GitHub Actions, GCP, Linux, GitHub