Principal Engineer - ML Ops EDGE Group PJSC


Experience: 8 - 11 Years
Education: Bachelor of Science (Computers)
Nationality: Any Nationality
Gender: Not Mentioned
Vacancy: 1 Vacancy

Job Description

Roles & Responsibilities


Job Purpose (specific to this role)
The Principal MLOps Engineer / AI Infrastructure Architect defines and leads the design, automation, and governance of KATIM s AI infrastructure powering mission-critical, secure communications products. This role drives end-to-end MLOps strategy from model governance and deployment automation to compliance enforcement ensuring every AI capability adheres to zero-trust and sovereign-data principles. This role bridges applied machine learning, software engineering, and DevSecOps ensuring that AI models, pipelines, and applications are deployed, monitored, and scaled securely across environments.

You will operate in an AI-augmented product development environment, where infrastructure provisioning, model training, deployment, and monitoring are automated or AI-assisted. Your mission is to build systems that empower product teams to deliver production-grade AI capabilities with the speed, reliability, and compliance standards required by government and enterprise clients.

You will collaborate closely with cross-functional teams across platform engineering, product security, and product management to design MLOps pipelines that embody resilience, traceability, and zero-trust security principles. This role is ideal for engineers passionate about building intelligent automation frameworks, scalable AI pipelines, and secure data infrastructures that form the foundation of mission-critical AI systems.

AI-Augmented Product Development Model (Context for the Role)
We are a lean, AI-augmented product development team delivering production-ready, enterprise-grade secure applications with the quality and velocity that would typically require an organization 3-4x larger. Every role in the traditional SDLC is either AI-augmented or automated, allowing teams to focus on design, architecture, and high-impact decision-making.

Core Principles
Security is integrated into every decision, from architecture to deployment.
Repetitive tasks are automated; human effort is focused on strategy and problem-solving.
Quality is measurable, enforced, and automated at every stage.
All system behaviors, including AI-assisted outputs, must be traceable, reviewable, and explainable. We do not ship black-box functionality.
Design and implementation decisions prioritize resilience, maintainability, and interoperability in real customer environments.

Key Responsibilities

AI MLOps Architecture & Governance (30%)

Define the MLOps architecture and governance framework across products.
Design secure, scalable AI platform blueprints covering the data, training, serving, and monitoring layers.
Standardize model registries, artifact signing, and deployment processes for air-gapped and on-prem environments.
Lead architectural designs and reviews for AI pipelines.
Design and maintain LLM inference infrastructure.
Manage model registries and versioning (MLflow, Weights & Biases); see the registry sketch after this list.
Implement model serving solutions (TensorFlow Serving, TorchServe, vLLM).
Optimize model performance and cost (quantization, caching, batching).
Build and maintain vector databases (Pinecone, Weaviate, Chroma).
Maintain awareness of hardware and inference optimization trade-offs.
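
For the registry and versioning item above, a minimal sketch of registering and tagging a model version with MLflow. The tracking URI, experiment name, and model name are illustrative assumptions, not existing KATIM systems; a small scikit-learn model stands in for real training.

```python
import mlflow
from mlflow.tracking import MlflowClient
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical on-prem tracking server; in an air-gapped deployment this
# would be an internally hosted MLflow instance.
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("threat-classifier")

# Stand-in training step for the sketch.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = LogisticRegression(max_iter=200).fit(X, y)

with mlflow.start_run() as run:
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, artifact_path="model")

# Register the logged artifact as a new model version.
version = mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name="katim-threat-classifier",  # illustrative model name
)

# Attach provenance metadata so the version can be verified before promotion.
client = MlflowClient()
client.set_model_version_tag(
    name="katim-threat-classifier",
    version=version.version,
    key="signed_by",
    value="release-pipeline",
)
```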

Agent & Tool Development (25%)

Develop and maintain AI agents for automation (e.g., automated test generation, log intelligence, and anomaly detection).
Build AI-assisted DevSecOps utilities to automatically enforce compliance, logging, and audit policies.
Build tool integrations for LLM agents (function calling, APIs).
Implement retrieval-augmented generation (RAG) pipelines; see the sketch after this list.
Create prompt management and versioning systems.
Monitor and optimize agent performance.
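
As a concrete illustration of the RAG item above, a small sketch that retrieves context from a Chroma collection and grounds a chat completion served through a vLLM OpenAI-compatible endpoint. The base URL, model name, and document contents are placeholders.

```python
import chromadb
from openai import OpenAI

# In-memory Chroma instance for illustration; production use would point at a
# persistent, access-controlled deployment.
chroma = chromadb.Client()
docs = chroma.get_or_create_collection("runbooks")
docs.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Rollbacks are triggered automatically when error rates exceed 2%.",
        "All model artifacts must be signed before promotion to production.",
    ],
)

# vLLM exposes an OpenAI-compatible API; base_url and model id are assumptions.
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def answer(question: str) -> str:
    # Retrieve the most relevant documents, then ground generation on them.
    hits = docs.query(query_texts=[question], n_results=2)
    context = "\n".join(hits["documents"][0])
    response = llm.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("When do we roll back a deployment?"))
```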

CI/CT/CD Pipelines (20%)

Build continuous integration pipelines for models and code.
Implement continuous training (CT) workflows.
Automate model deployment with rollback capabilities.
Create staging and production deployment strategies.
Integrate AI-assisted code review into CI/CD.
Build a continuous evaluation loop for deployed models; see the sketch after this list.
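
One way to read the continuous evaluation loop above is as a promotion gate that compares a candidate model against the production baseline and fails the pipeline stage on regression. A framework-agnostic sketch with hypothetical metric sources and thresholds:

```python
import sys

# Hypothetical metric fetchers; in practice these would read from the
# experiment tracker (e.g., MLflow) and the production monitoring stack.
def candidate_metrics() -> dict:
    return {"accuracy": 0.91, "p95_latency_ms": 180.0}

def production_metrics() -> dict:
    return {"accuracy": 0.90, "p95_latency_ms": 200.0}

# Example promotion policy: quality must not regress beyond a small tolerance
# and latency must stay within budget.
MAX_ACCURACY_DROP = 0.01
MAX_P95_LATENCY_MS = 250.0

def evaluate_gate(candidate: dict, production: dict) -> bool:
    accuracy_ok = candidate["accuracy"] >= production["accuracy"] - MAX_ACCURACY_DROP
    latency_ok = candidate["p95_latency_ms"] <= MAX_P95_LATENCY_MS
    return accuracy_ok and latency_ok

if __name__ == "__main__":
    if evaluate_gate(candidate_metrics(), production_metrics()):
        print("gate passed: promote candidate model")
    else:
        print("gate failed: keep production model, review candidate")
        sys.exit(1)  # non-zero exit fails the CI/CD stage
```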

Infrastructure & Automation (15%)

Manage cloud infrastructure (Kubernetes, serverless).
Implement Infrastructure as Code (Terraform, Pulumi).
Build monitoring and observability systems (Prometheus, Grafana, DataDog); see the metrics sketch after this list.
Automate operational tasks with AI agents.
Ensure security and compliance (OWASP, SOC 2), including AI-specific security controls.
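
For the monitoring and observability item above, the Python prometheus_client library is one common way to expose inference metrics for Prometheus to scrape and Grafana to chart. The metric and model names here are illustrative only.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metrics; real services would also label by model version.
REQUESTS = Counter("inference_requests_total", "Total inference requests", ["model"])
LATENCY = Histogram("inference_latency_seconds", "Inference latency in seconds", ["model"])

def handle_request(model: str) -> None:
    REQUESTS.labels(model=model).inc()
    with LATENCY.labels(model=model).time():
        # Stand-in for a real model call.
        time.sleep(random.uniform(0.01, 0.05))

if __name__ == "__main__":
    # Expose /metrics on port 9100 for Prometheus to scrape.
    start_http_server(9100)
    while True:
        handle_request("threat-classifier")
```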

Developer Enablement (10%)

Provide tools and libraries for engineers to adopt AI-augmented workflows securely (an example utility is sketched after this list).
Document AI/ML best practices and patterns.
Conduct training on MLOps tools and workflows.
Support engineers with AI integration challenges.
Maintain development environment parity.
Champion AI privacy, governance, and compliance practices across product teams.
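
Enablement tooling of the kind listed above often takes the shape of small shared utilities. A hedged sketch of one such helper, a prompt redactor that strips likely secrets before text leaves the host; the patterns and function names are assumptions, not an existing library.

```python
import re

# Illustrative patterns only; a real library would maintain a vetted,
# security-reviewed pattern set and log redaction events for audit.
_SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key id shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?-----END [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"(?i)(password|api[_-]?key|token)\s*[:=]\s*\S+"),
]

def redact(prompt: str) -> str:
    """Replace likely secrets with a placeholder before the prompt leaves the host."""
    for pattern in _SECRET_PATTERNS:
        prompt = pattern.sub("[REDACTED]", prompt)
    return prompt

if __name__ == "__main__":
    print(redact("deploy with api_key = sk-12345 to cluster A"))
    # -> "deploy with [REDACTED] to cluster A"
```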

Education and Minimum Qualification

BS in Computer Science, Computer Engineering, Mathematics, or a related field, or equivalent industry experience; Master's degree preferred.

8+ years in DevOps, SRE, or platform engineering
5+ years hands-on experience with ML/AI systems in production
Deep understanding of LLMs and their operational requirements
Experience building and maintaining CI/CD pipelines
Strong Linux/Unix systems knowledge
Cloud platform expertise (AWS, GCP, or Azure)
Experience with container orchestration (Kubernetes)

Key Skills

MLOps & AI:
LLM Integration: OpenAI API, Anthropic API, HuggingFace, Azure OpenAI
Model Serving: TensorFlow Serving, TorchServe, vLLM, Ollama
Experiment Tracking: MLflow, Weights & Biases, Neptune.ai
Model Registries: MLflow, Kubeflow, AWS SageMaker
Vector Databases: Pinecone, Weaviate, Chroma, Milvus
Agent Frameworks: LangChain, LlamaIndex, AutoGPT, Semantic Kernel
Fine-tuning: LoRA, QLoRA, prompt tuning

Data Engineering:
Pipelines: Airflow, Prefect, Dagster
Processing: Spark, Dask, Ray
Streaming: Kafka, Pulsar, Kinesis
Data Quality: Great Expectations, dbt
Feature Stores: Feast, Tecton

DevOps & Infrastructure:
Containers: Docker, Kubernetes, Helm
Cloud Platforms: AWS (SageMaker, Lambda, ECS) OR GCP (Vertex AI, Cloud Run) OR Azure (ML Studio)


