Monitoring tools Engineer
Description
A monitoring tools engineer designs, implements, and maintains systems that monitor the performance and health of networks, applications, and infrastructure. Key responsibilities include configuring monitoring tools, creating dashboards and alerts, automating processes, analyzing data to identify issues, and providing technical support and documentation. This role requires strong technical skills in scripting, observability (metrics, logs, traces), and a range of operating systems and cloud environments.
Core responsibilities
. Tool management: Install, configure, and maintain monitoring tools and platforms across different environments (e.g., cloud, on-premises).
. Monitoring and alerting: Establish comprehensive monitoring to track system performance, application availability, and infrastructure health, including setting up actionable alerts.
. Automation and scripting: Develop scripts and automated processes to streamline tasks like agent deployment, data collection, and reporting.
. Data analysis: Analyze metrics, logs, and traces to identify performance bottlenecks, troubleshoot issues, and perform root cause analysis for incidents.
. Incident response: Act as a point of escalation for monitoring-related issues, collaborating with other teams to resolve major incidents and ensure proper documentation and follow-up.
. Documentation: Create and maintain detailed documentation, standard operating procedures (SOPs), and knowledge base articles.
. Collaboration and support: Work with application and infrastructure teams to define monitoring requirements, integrate observability into CI/CD pipelines, and provide training and support to other teams.
. Reporting: Generate and distribute performance and status reports based on collected monitoring data.
Required skills and qualifications
. Technical expertise: Experience with monitoring tools such as SolarWinds, Orion, Azure native tools, and Infoblox for DDI (DNS, DHCP, IPAM), as well as operating systems (Windows, Linux), networking protocols (TCP/IP, SNMP), and cloud platforms (AWS, Azure).
. Scripting and automation: Proficiency in scripting languages such as Python or PowerShell.
. Observability: Experience with Application Performance Monitoring (APM) and the three pillars of observability: metrics, logs, and traces.
. Troubleshooting: Strong analytical and problem-solving skills to diagnose and resolve complex technical issues.
. Communication: Excellent verbal and written communication skills for interacting with technical teams and stakeholders.
. Soft skills: Ability to work independently, manage time effectively, and collaborate with others in a team environment.
. Certifications: ITIL, Windows, or cloud-related certifications are a plus.
Skills:
Incident Response, Network