2nd Level Service Operations Expert/Site Reliability Engineer (m/f/d) DevOps / Kubernetes - Remote
Description
For our customer in the energy sector, we are looking for experienced support as a T2 Service Operations Expert (m/f/d) starting in May. The work takes place remotely and occasionally on site or in Frankfurt by arrangement.

General Description
Operations within the program is responsible for the day-to-day operational tasks and activities of a hybrid data platform spanning private cloud and public cloud, moving towards a future KRITIS infrastructure. T2 Service Operations plays a vital and central role in maintaining availability and performance. The responsibility lies in managing incidents and service requests efficiently and effectively, often in critical situations. It involves technical coordination across the project with stakeholders such as customer success, platform delivery, software engineering and others. The focus is on continuous improvement and ensuring high performance of the services. The role manages both event-driven and planned operations activities, ensuring that incidents are swiftly resolved, problems are addressed at their root, and changes are implemented without causing unplanned downtime.

Objective: Ensure Stability and Reliability of Hybrid Cloud Operations
Tasks:
• Monitoring and managing day-to-day operations of hybrid data platforms (private and public cloud, future KRITIS infrastructure).
• Handling incidents and service requests promptly and with high quality, especially in critical situations.
• Acting as the link between Tier 1 (T1) Support and Tier 3 (T3) Operations.

Objective: Manage and Resolve Operational Incidents
Tasks:
• Identifying and managing major incidents, including leading Incident Response Teams (IRTs).
• Coordinating root cause analysis and ensuring sustainable problem resolution.
• Expediting, coordinating, and escalating critical situations across different product lines and departments.
Objective: Drive Continuous Improvement in Service Operations
Tasks:
• Contributing to service monitoring enhancements, automation, and orchestration.
• Promoting and delivering continuous service improvements through dedicated plans.
• Reducing unplanned downtime by managing changes and implementing preventative measures.

Objective: Maintain Service Knowledge and Onboarding Procedures
Tasks:
• Recording operational knowledge (in KDB) and maintaining up-to-date procedures.
• Ensuring efficient onboarding and offboarding of clients to services.

Profile Requirements
• Experience in an operational role in vital environments with applications or systems built on state-of-the-art solutions (containerized and distributed), ideally in the role of a T2 Service Operations Manager.
• Experience with containerization and container management, including the tools and methods for operating containers.
• Experience with ITSM frameworks, especially the following processes: incident management, service request management, change management, event management.
• Experience with analysis methods (business analytics, metric analysis, KPI management, SLA management).
• Experience with automation, orchestration, scheduling and monitoring.
• Experience in large-scale on-prem cloud projects and in coordinating with different stakeholders.
• Experience in troubleshooting and problem-solving, with a focus on root cause analysis and sustainable solutions.

Must-have language skills
• Fluent English in speech and writing (at least C1).

Preferred experience
• Exposure to ITSM tools in an enterprise IT context.
• Good understanding of IT infrastructure (network, SAN, virtualization, hyperscale).
• Good understanding of Kubernetes.