Site Reliability Engineer - Lucidya

Experience

3 - 8 Years

Job Location

Riyadh - Saudi Arabia

Education

Bachelor of Technology/Engineering (Computers)

Nationality

Any Nationality

Gender

Not Mentioned

Vacancy

1 Vacancy

Job Description

Roles & Responsibilities

You'll be responsible for outcomes, not just tasks. Here's what success looks like in this role:

You'll make reliability the default

  • You'll design and maintain infrastructure that is highly available, fault-tolerant, and scalable
  • You'll proactively identify and eliminate single points of failure before they become incidents
  • You'll ensure our production systems remain stable, even under increasing scale and load

You'll own and optimize our cloud environments

  • You'll manage and continuously improve workloads across AWS, GCP, or Azure
  • You'll use Infrastructure as Code (Terraform) to standardize and scale infrastructure
  • You'll optimize resource usage to balance performance and cost

You'll run and improve Kubernetes in production

  • You'll operate and scale Kubernetes clusters (EKS, GKE, etc.) with confidence
  • You'll troubleshoot issues quickly and ensure smooth deployments and upgrades
  • You'll ensure our containerized workloads perform reliably at scale

You'll build strong observability and respond to incidents

  • You'll implement and refine monitoring systems using tools like Prometheus, Grafana, Datadog, or ELK
  • You'll define alerting that is meaningful, not noisy
  • You'll respond to incidents, lead root cause analysis, and ensure we learn from every failure

You'll automate everything that shouldn't be manual

  • You'll write scripts and build tooling to eliminate repetitive operational work
  • You'll continuously improve infrastructure efficiency through automation
  • You'll promote a culture where manual work is a temporary state, not the norm

You'll collaborate to improve the entire system

  • You'll work closely with DevOps and engineering teams to solve performance bottlenecks
  • You'll contribute to CI/CD improvements and deployment reliability
  • You'll help shape reliability best practices across the organization

What success looks like (First 90 Days)

First 30 days:

  • You've built a strong understanding of our infrastructure, systems, and workflows
  • You're contributing to day-to-day operations with support from the team
  • You've started identifying areas for improvement in automation and reliability

By 90 days:

  • You're independently managing infrastructure tasks and troubleshooting issues
  • You're actively contributing to reliability and scalability improvements
  • You've taken ownership of parts of our infrastructure and are improving them

Desired Candidate Profile

This is what will make you successful in this role:

  • You've spent ~3 years working in SRE, DevOps, or infrastructure engineering, and you've seen what breaks at scale
  • You're comfortable working in cloud environments like AWS, GCP, or Azure, and you understand how distributed systems behave
  • You've worked hands-on with Kubernetes in production and know how to troubleshoot it when things go wrong
  • You don't just fix issues - you ask why they happened and make sure they don't happen again

Technically, you likely:

  • Use Terraform (or similar IaC tools) to manage infrastructure
  • Work confidently with Docker and Kubernetes
  • Write scripts in Python, Bash, or similar to automate workflows
  • Understand CI/CD pipelines (Jenkins, GitHub Actions, Bitbucket, etc.)
  • Have a solid grasp of networking, load balancing, and high-availability design

When it comes to monitoring:

  • You've implemented tools like Prometheus, Grafana, Datadog, or ELK
  • You know the difference between useful alerts and noise
  • You focus on signals that actually drive action

What sets you apart:

  • You take ownership - you don't wait to be told something is broken
  • You're calm under pressure and methodical during incidents
  • You simplify complexity instead of adding to it
  • You communicate clearly, even when explaining deeply technical issues
  • You care about building systems that make other engineers more effective

Nice to Have (but not required)

  • Experience with RabbitMQ or Redis in production
  • Familiarity with Ansible or AWX
  • Exposure to multi-cloud or hybrid environments
  • Cloud certifications (AWS, GCP) or Linux certifications
  • Background from ITI (Information Technology Institute)

Keywords

  • Site Reliability Engineer

Lucidya

Lucidya is an AI-native platform for customer experience (CX) intelligence that manages entire customer lifecycles autonomously, from initial engagement through retention and growth.

Unlike platforms that only surface insights and leave the action to you, Lucidya closes the loop with proprietary NLU technology built in-house and trained on millions of multilingual conversations. This enables marketing, support, CX, and research teams to deliver personalized experiences that drive measurable improvements in customer satisfaction, retention, and lifetime value.

As we continue scaling globally, the reliability, performance, and resilience of our infrastructure become mission-critical to everything we do.

https://apply.workable.com/lucidya/j/B519978733/