Site Reliability Engineer
Description
Senior Site Reliability Engineer, Containerisation, Pipelines, GCP, Cloud
We're seeking an experienced Site Reliability Engineer to join the Cloud Enabling team and play a crucial role in maturing our SRE capability and contributing to the resiliency, availability, and security of our infrastructure and software.
Day to day:
Support customer-facing systems that handle billions of requests each month, ensuring availability, scalability, and resiliency.
Act as a key technical contributor, liaising with SRE guilds to drive improvements in cloud deployments, monitoring solutions, CI/CD pipelines, and cost optimisation.
Drive innovation by exploring new technologies and methodologies to enhance SRE capabilities, including AI tooling and automation opportunities.
Manage high-throughput systems in production to deliver customer value beyond proofs of concept.
Implement SLAs/SLOs/SLIs for software and data teams.
Develop tooling for efficient incident triage, granular alerting, well-defined runbooks, and auto-resolving mechanisms.
Serve as a subject matter expert in engineering conversations related to site reliability, fostering a culture of continuous learning and development.
Requirements:
Proven hands-on experience in software development, testing, monitoring, and operational stability at scale.
Production experience with Kubernetes and monitoring tools such as Datadog or Dynatrace.
Strong knowledge of automation, CI/CD, and best practices.
Experience running postmortems, defining SLAs/SLIs/SLOs, and participating in support rotas.
Coding/scripting experience (Python/Bash) in a commercial setting.
Knowledge of databases, streaming and batch operations, and API design.
Strong background in Kubernetes (ideally with microservice architectures using the Istio service mesh).
Extensive experience with cloud-native solutions (ideally Google Cloud).
Solid understanding of cloud storage, networking, and resource provisioning.
McGregor Boyall is an equal opportunity employer and does not discriminate on any grounds.