Hadoop + Production Support
Description
Monitor Hadoop ecosystem (HDFS, Hive, Spark, Oozie, HBase, Kafka, Yarn, MapReduce) using enterprise monitoring tools (e.g., Nagios, Grafana, Cloudera Manager, Ambari). Perform health checks for Hadoop clusters, nodes, and jobs to ensure high availability and stability. Respond to alerts, incidents, and service requests, ensuring timely resolution or escalation to L2/L3 teams as per runbooks. Support ETL batch processing, job scheduling (Control-M, CA7, Oozie, Airflow), and resolve basic job failures. Execute predefined scripts, queries, and operational procedures for routine tasks. Document incidents, root causes, and resolutions in knowledge base for continuous improvement. Maintain SLA adherence for P1–P4 incidents, including proactive follow-ups and communication with stakeholders. Collaborate with cross-functional teams (L2 engineers, developers, DBAs, infra teams) to support production stability. Provide daily/weekly reports on incident trends, recurring issues, and cluster performance.
Skills:
Grafana, Hadoop Ecosystem (HDFS), Incident Management