Experience
3 - 8 Years
Job Location
Education
Bachelor of Technology/Engineering(Computers)
Nationality
Any Nationality
Gender
Not Mentioned
Vacancy
1 Vacancy
Job Description
Roles & Responsibilities
You'll be responsible for outcomes, not just tasks. Here's what success looks like in this role:
You'll make reliability the default
- You'll design and maintain infrastructure that is highly available, fault-tolerant, and scalable
- You'll proactively identify and eliminate single points of failure before they become incidents
- You'll ensure our production systems remain stable, even under increasing scale and load
You'll own and optimize our cloud environments
- You'll manage and continuously improve workloads across AWS, GCP, or Azure
- You'll use Infrastructure as Code (Terraform) to standardize and scale infrastructure
- You'll optimize resource usage to balance performance and cost
You'll run and improve Kubernetes in production
- You'll operate and scale Kubernetes clusters (EKS, GKE, etc.) with confidence
- You'll troubleshoot issues quickly and ensure smooth deployments and upgrades
- You'll ensure our containerized workloads perform reliably at scale
You'll build strong observability and respond to incidents
- You'll implement and refine monitoring systems using tools like Prometheus, Grafana, Datadog, or ELK
- You'll define alerting that is meaningful, not noisy
- You'll respond to incidents, lead root cause analysis, and ensure we learn from every failure
You'll automate everything that shouldn't be manual
- You'll write scripts and build tooling to eliminate repetitive operational work
- You'll continuously improve infrastructure efficiency through automation
- You'll promote a culture where manual work is a temporary state, not the norm
You'll collaborate to improve the entire system
- You'll work closely with DevOps and engineering teams to solve performance bottlenecks
- You'll contribute to CI/CD improvements and deployment reliability
- You'll help shape reliability best practices across the organization
What success looks like (First 90 Days)
First 30 days:
- You've built a strong understanding of our infrastructure, systems, and workflows
- You're contributing to day-to-day operations with support from the team
- You've started identifying areas for improvement in automation and reliability
By 90 days:
- You're independently managing infrastructure tasks and troubleshooting issues
- You're actively contributing to reliability and scalability improvements
- You've taken ownership of parts of our infrastructure and are improving them
Desired Candidate Profile
This is what will make you successful in this role:
- You've spent ~3 years working in SRE, DevOps, or infrastructure engineering, and you've seen what breaks at scale
- You're comfortable working in cloud environments like AWS, GCP, or Azure, and you understand how distributed systems behave
- You've worked hands-on with Kubernetes in production and know how to troubleshoot it when things go wrong
- You don't just fix issues - you ask why they happened and make sure they don't happen again
Technically, you likely:
- Use Terraform (or similar IaC tools) to manage infrastructure
- Work confidently with Docker and Kubernetes
- Write scripts in Python, Bash, or similar to automate workflows
- Understand CI/CD pipelines (Jenkins, GitHub Actions, Bitbucket, etc.)
- Have a solid grasp of networking, load balancing, and high-availability design
When it comes to monitoring:
- You've implemented tools like Prometheus, Grafana, Datadog, or ELK
- You know the difference between useful alerts and noise
- You focus on signals that actually drive action
What sets you apart:
- You take ownership - you don't wait to be told something is broken
- You're calm under pressure and methodical during incidents
- You simplify complexity instead of adding to it
- You communicate clearly, even when explaining deeply technical issues
- You care about building systems that make other engineers more effective
Nice to Have (but not required)
- Experience with RabbitMQ or Redis in production
- Familiarity with Ansible or AWX
- Exposure to multi-cloud or hybrid environments
- Cloud certifications (AWS, GCP) or Linux certifications
- Background from ITI (Information Technology Institute)
Company Industry
- IT - Software Services
Department / Functional Area
- IT Software
Keywords
- Site Reliability Engineer
Lucidya
Lucidya is an AI-native platform for customer experience (CX) intelligence that manages entire customer lifecycles autonomously, from initial engagement through retention and growth. Unlike platforms that only surface insights and leave the action to you, Lucidya closes the loop with proprietary NLU technology built in-house and trained on millions of multilingual conversations. This enables marketing, support, CX, and research teams to deliver personalized experiences that drive measurable improvements in customer satisfaction, retention, and lifetime value. As we continue scaling globally, the reliability, performance, and resilience of our infrastructure become mission-critical to everything we do.