Job Description

Roles & Responsibilities

Responsible for ensuring robust service observability across all squads, enabling proactive monitoring, incident detection, and performance optimization. This role leads reliability initiatives across teams, fostering a culture of resilience and operational excellence. By helping development teams gain deep visibility into systems and user journeys, the SRE helps identify bottlenecks, improve system behavior, and enhance customer experience. The role emphasizes proactive optimization, problem solving, and continuous improvement of reliability, scalability, and availability across the technology landscape.

Collaborate with engineering, operations, and other stakeholders to understand monitoring requirements and performance goals.
Support teams in defining key performance indicators (KPIs) and metrics, diagnosing issues, and proactively identifying areas for optimization.
Develop and implement observability processes to enable comprehensive monitoring, logging, and tracing of systems and applications across all teams.
Provide proactive approaches to monitoring problems by utilizing existing observability tools and domain expertise.
In-depth knowledge of application performance metrics, monitoring, and troubleshooting.
Provide expertise in problem detection, isolation, and root cause analysis (RCA) during incident management, with relevant data and artifacts from observability tools and corresponding systems.
Provide timely and accurate reports on application performance, highlighting key insights and trends.
Collaborate with digital squads to implement performance improvements, including configuration optimizations and infrastructure adjustments.
Offer guidance and training to end users and internal teams on best practices for APM and for optimizing application performance.

8+ years of overall experience with IT infrastructure and applications.
5+ years of hands-on experience in observability and continuous integration.
2 years of programming background in Java or relevant technologies.
Deep knowledge of AppDynamics, Grafana, and similar tools.
Expertise with ELK tools, Splunk, Loki, or OpenSearch.
Skilled in service and trace correlation.
Cloud-native and Kubernetes observability.
Knowledge of cloud infrastructure (Azure) and cluster management tools like Kubernetes.
Strong communication skills, with the ability to align the organization on complex technical decisions.
Bachelor's or master's degree in Information Technology, Computer Science, or a related quantitative discipline.

Senior Site Reliability Engineer
Mashreq Corporate & Inve...
