Nationality: Any Nationality
Gender: Not Mentioned
Vacancy: 1 Vacancy
Job Description
Roles & Responsibilities
Reliability & Incident Management
- Lead high-severity incident response and drive post-incident reviews.
- Troubleshoot complex issues across applications, infrastructure, and networks.
- Improve MTTR through better monitoring, alerts, and diagnostic tooling.
- Participate in the on-call rotation supporting production systems.
Performance & Scalability
- Identify and resolve performance bottlenecks and scaling challenges.
- Conduct load testing and capacity planning for high-traffic scenarios.
Infrastructure & Operations
- Enhance cloud-native infrastructure, deployment processes, and automation.
- Improve resilience, fault-tolerance, and recovery mechanisms across systems.
Observability
- Build and refine dashboards, alerts, metrics, logs, and traces.
- Define SLIs/SLOs and improve visibility into system behavior.
Tooling & Automation
- Develop tools that reduce operational toil and increase reliability.
- Contribute to infrastructure-as-code, CI/CD pipelines, and GitOps workflows.
Collaboration
- Work closely with engineering teams to ensure services are robust and production-ready.
- Mentor engineers on reliability, debugging, and operational best practices.
Required Skills
- Strong experience with Kubernetes, service mesh technologies, and cloud platforms (AWS/GCP/Azure).
- Deep understanding of Linux, networking, distributed systems, and load balancers.
- Hands-on experience with Terraform or similar IaC tools.
- Experience with Prometheus, Grafana, Loki, Mimir, Elastic, or similar observability tools.
- Proficiency in scripting/programming (Bash, Python, Go).
- Experience with CI/CD and GitOps.
- Strong debugging, incident response, and performance analysis skills.
Bonus Skills
- Background in large-scale, high-traffic systems.
- Experience with fault-tolerant design, DR, and HA patterns.
- Familiarity with SLOs, SLIs, and error budgets.
Company Industry
- Internet
- E-commerce
- Dotcom
Department / Functional Area
- IT Software
Keywords
- Senior Site Reliability Engineer (SRE)