DevOps and Site Reliability Engineer
Experience: 2–5 Years

Location: Onsite/Hybrid

About Lyzr
At Lyzr, we aren't just building infrastructure; we're architecting the backbone of the GenAI revolution. We are looking for a Cloud Reliability & DevOps Architect who thrives at the intersection of automation and operational excellence.

Role Overview

This role on our engineering team combines infrastructure automation (DevOps) with operational excellence (SRE) and suits an engineer who adapts quickly as priorities change.

You will be responsible for architecting resilient multi-cloud environments, automating complex delivery pipelines, and ensuring the absolute reliability and cost-efficiency of our production systems. From writing modular Terraform code to leading deep-dive Root Cause Analysis (RCA), you will own the entire lifecycle of our infrastructure.

Key Responsibilities

1. IaC & Automation Architecture

Advanced Development: Architect and maintain complex infrastructure using Terraform (multi-cloud) and AWS CloudFormation.
Modular Design: Create reusable, version-controlled modules to standardize deployments and eliminate code duplication.
Eliminate Toil: Apply SRE principles to automate repetitive operational tasks and manual provisioning through Python, Bash, or Go.

2. Multi-Cloud Operations & Connectivity

Core Management: Optimize production environments across AWS (EC2, EKS, Lambda, VPC) and Azure (VMs, VNet, Functions).
Cross-Cloud Networking: Design secure connectivity between cloud providers and on-premises systems.

3. System Reliability & Observability

End-to-End Ownership: Own the health of production systems, ensuring High Availability (HA) and meeting strict SLOs measured against well-defined SLIs.
Incident Management: Lead the RCA process for outages and implement architectural changes to prevent recurrence.
Observability Frameworks: Build and maintain comprehensive monitoring and alerting (Prometheus, Grafana, ELK Stack, CloudWatch) for early anomaly detection.

4. Security, Compliance & FinOps

Security by Design: Build infrastructure with strict IAM roles, secret management (HashiCorp Vault/KMS), and automated compliance checks (SOC2/ISO).
Cost Optimization: Actively drive FinOps initiatives: rightsizing instances, managing Reserved and Spot instances, and identifying idle resources to reduce waste.
Disaster Recovery: Design and lead periodic DR failover drills to ensure business continuity.

5. CI/CD & Performance Tuning

Pipeline Ownership: Design end-to-end CI/CD pipelines (GitHub Actions, GitLab CI, or Jenkins) for seamless delivery.
Self-Healing Systems: Implement auto-remediation workflows to resolve common system issues without human intervention.

Technical Qualifications

Must-Have Skills:

Experience: 2–5 years in SRE, DevOps, or Cloud Engineering roles.
Cloud Mastery: Hands-on experience managing production workloads in AWS (Expert level) and Azure.
IaC Proficiency: Expert-level knowledge of Terraform (state management, modules) and CloudFormation.
Scripting: Strong automation skills in Python and Bash.
Monitoring: Hands-on experience with Grafana, Prometheus, or Datadog.

Preferred Qualifications:

Containers: Experience with container orchestration on Kubernetes (EKS/AKS).
Certifications: HashiCorp Certified: Terraform Associate, AWS Certified DevOps Engineer - Professional, or Microsoft Certified: DevOps Engineer Expert.
Data: Understanding of database administration (PostgreSQL, MySQL, or DynamoDB).

Work Environment & Soft Skills

Global Flexibility: We support clients across IST, GMT, and EST. You must be flexible with working hours for deployments and on-call rotations.
Detective Mindset: You are relentless in debugging and won't stop until you find the root cause of a distributed system issue.
Financial Awareness: You treat cloud resources as real money and take pride in running a lean, efficient infrastructure.
Tech Agility: You are not married to one tool; you use the best tool for the job and pivot as technology evolves.
