Site Reliability Engineer (Azure)
Description
ELEKS is looking for a Site Reliability Engineer (Azure).
ABOUT CLIENT
The Shared Support team is a specialized, advanced group that provides high-quality support services to various small and midsize organizations worldwide. Our department’s culture prioritizes proactiveness, professionalism, transparency, continuous learning, flexibility, and respect. We aim to establish long-term relationships with our customers, offering dependable support and development practices to improve system reliability and adapt to evolving business needs.
REQUIREMENTS
- 4+ years of relevant experience
- Strong knowledge of Azure
- Hands-on experience maintaining SQL databases
- Experience with Kubernetes (deployment, scaling, maintenance)
- Experience with Linux administration
- Experience with monitoring tools (CloudWatch, Datadog)
- Web server configuration skills (Nginx, Apache)
- Experience with CI/CD
- Strong network troubleshooting skills
- Experience with Terraform or CloudFormation (as a plus)
- Experience with CI/CD pipeline tools: Jenkins, GitLab CI, or similar (as a plus)
- Upper-Intermediate English, both written and spoken
RESPONSIBILITIES
- Manage complex tickets for timely resolution in accordance with SLA
- Investigate and replicate customer-reported issues, collaborating with cross-functional teams for resolution
- Proactively install, configure, and monitor IT systems, performing daily maintenance tasks
- Provide daily maintenance: backup and restore, new version deployment and rollback, periodic system cleanup, OS and component upgrades, security patching, etc.
- Carry out task XXXX XXXX continuous improvement, ensuring software system reliability
- Detect, diagnose, and resolve incidents promptly, and conduct post-incident analysis for continuous improvement
- Implement and maintain robust monitoring and alerting systems, setting up alerts for proactive response (APM, DB monitoring, and K8s)
- Continually analyze system performance in EKS for efficiency, optimizing application code, database queries, and infrastructure settings
- Use Infrastructure as Code (IaC) tools to define and provision infrastructure, ensuring consistency and reproducibility
- Create and maintain the solution documentation, templates, runbooks, and DRPs
- Be available for on-call shifts and ready to help restore the system if the need arises (compensated with a day off or overtime pay for the dispatch)
- Conduct knowledge sharing for junior staff