Lead Site Reliability Engineer/ Expert
SITA
Posted 3 hrs ago
Experience
8 - 13 Years
Education
Bachelor of Science (Computers), Master of Science
Nationality
Any Nationality
Gender
Not Mentioned
Vacancy
1 Vacancy
Job Description
Roles & Responsibilities
WHAT YOU'LL DO
Reliability Engineering:
- Design and maintain resilient systems, ensuring high availability, scalability, and fault tolerance.
- Ensure effective Disaster Recovery (DR), failover strategies, and resilience engineering across environments.
- Improve platform reliability, observability, and performance across cloud and on-premises systems.
- Establish and maintain SLIs, SLOs, and error budgets to measure and govern service reliability.
- Take ownership of production availability, capacity planning, performance tuning, and long-term reliability initiatives.
Automation, DevOps & NetOps:
- Drive automation for infrastructure provisioning, deployment, monitoring, and operational workflows.
- Develop and implement auto-remediation and self-healing solutions to reduce manual intervention.
- Manage CI/CD pipelines and Infrastructure as Code (IaC) frameworks for secure, repeatable deployments.
- Implement and manage zero-downtime deployment strategies (blue-green, canary, rolling).
- Support containerized and cloud-native platforms including Kubernetes, Docker, and distributed systems.
- Support NetOps tooling and network observability, ensuring visibility into network performance, events, and operational health.
Incident, Problem & Event Management:
- Perform incident management and production troubleshooting, and lead RCA/PMIR (Postmortem) for critical outages.
- Proactively identify reliability gaps, performance bottlenecks, and operational risks.
- Optimize incident, event, and problem management processes to reduce MTTR and improve operational efficiency.
- Define and maintain the event catalog, thresholds, and remediation workflows.
- Develop event response protocols and ensure teams are trained for rapid incident handling.
Observability & Monitoring:
- Build and maintain observability solutions using monitoring, logging, tracing, and alerting platforms.
- Implement APM, distributed tracing, and proactive alerting to detect issues early.
- Integrate network telemetry and NetOps monitoring tools into the overall observability stack.
- Collaborate with stakeholders to improve event coverage and post-event learning.
- Experience with AI-assisted observability, anomaly detection, and predictive alerting.
Deployment & Operational Readiness:
- Own the quality of new release deployments for the PSO.
- Conduct operational readiness assessments and manage deployment risk.
- Ensure supportability for new applications, platform releases, and infrastructure changes.
- Coordinate with internal/external stakeholders to drive continuous service improvement.
Cross-Functional Collaboration:
- Work closely with Development, Platform Engineering, Product, T&E ICE, and Service Support Architects to embed reliability best practices.
- Collaborate with vendors and engineering teams to enhance system reliability and operational excellence.
- Support new product productization as SGS technical expert and ensure operational readiness.
Desired Candidate Profile
Education and Professional Qualifications:
- Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field. Master's degree preferred for senior roles.
- Relevant certifications such as ITIL, CCNP/CCIE, Palo Alto Security, SASE, SD-WAN, Juniper Mist/Aruba, CompTIA Security+, or Certified Kubernetes Administrator (CKA).
- Certifications in cloud platforms (AWS, Azure, Google Cloud) or DevOps methodologies.
- Certifications in automation and IaC tools (Ansible, Terraform).
- Certifications in observability and monitoring platforms (Dynatrace, Prometheus, Grafana, ELK).
- Certifications in ServiceNow, Jira, or other operational tooling.
Experience:
- 8+ years in IT operations, service management, or infrastructure reliability, including roles such as Site Reliability Engineer, Problem Manager, or DevOps Engineer.
- Strong experience with high-availability systems, resilience engineering, and DR readiness.
- Deep expertise in RCA, incident management, PMIR, and implementing permanent fixes for recurring issues.
- Hands-on experience with CI/CD, automation, IaC, and self-healing/auto-remediation workflows.
- Proficiency in observability platforms (APM, logging, tracing, alerting) and integrating network telemetry / NetOps monitoring.
- Experience defining and governing SLIs, SLOs, and error budgets to improve service reliability.
- Experience with Kubernetes, containerized workloads, and distributed systems.
- Experience managing deployments, operational readiness, risk assessments, and improving event/problem management processes.
- Strong cross-functional collaboration with Development, Operations, Engineering, Product, T&E ICE, and SSA.
- Familiarity with cloud platforms, scalable architectures, and zero-downtime deployment strategies.
Technical Skills:
- Cloud Infrastructure: AWS/Azure, Linux, virtualization, HA/DR architecture.
- Automation & IaC: Ansible, Terraform, CI/CD pipelines, self-healing workflows.
- Observability & Monitoring: APM, logging, tracing, alerting, Dynatrace, Prometheus, Grafana, ELK.
- NetOps Monitoring: network telemetry, event monitoring, and operational visibility tools.
- Containerization & Orchestration: Docker, Kubernetes, distributed systems.
- Deployment & Release Engineering: zero-downtime strategies (blue-green, canary), operational readiness.
- Programming & Scripting: Python, Bash, PowerShell for automation and tooling.
- Reliability Engineering: SLIs/SLOs, error budgets, capacity planning, performance tuning.
Company Industry
- IT - Software Services
Department / Functional Area
- IT Software
Keywords
- Lead Site Reliability Engineer/ Expert
SITA
At SITA, we keep airports moving, airlines flying smoothly, and borders open. Our technology and communication innovations power the success of the global air travel industry. You'll find us in 95% of international airports, working closely with over 2,500 transportation and government clients. Each partnership brings unique challenges, and we thrive on delivering fresh solutions and cutting-edge tech to keep operations running like clockwork.
We don't just move the world forward; we're proud to be recognized as a Great Place to Work by 79% of our employees and certified in most of our growing locations. Here, we feel empowered, supported, and inspired to grow. Are you ready to love your job? The adventure begins right here, with you, at SITA.
ABOUT THE ROLE & TEAM
Responsible for ensuring highly reliable, scalable, and resilient production systems across cloud and on-prem environments. Ensures high availability, disaster recovery readiness, and continuous improvement of service performance. Leads automation initiatives for provisioning, deployment, monitoring, and self-healing to reduce manual effort and improve stability. Owns the event catalog, operational readiness, and reliability engineering practices to prevent recurrence of incidents and strengthen system resilience. Drives collaboration across Product, Engineering, T&E ICE, and Service Support Architects to ensure provider-grade reliability and seamless operational integration of new releases.
https://careers.sita.aero/jobs/10468?lang=en-us&previousLocale=en-US