💼 Professional Experience

🚀 Work History

👨‍💻 Principal Engineer – BookMyShow

BigTree Entertainment Pvt. Ltd. | April 2025 – Present

Driving technical excellence at the intersection of SRE, Platform Engineering, and AI: building intelligent systems that reduce toil, accelerate incident response, and future-proof infrastructure.

🤖 AI-Powered Automated RCA

Built an agentic AI system that performs end-to-end Root Cause Analysis autonomously: given a deployment name and a time range, the agent does the rest.

  • Architected an agentic AI pipeline using LangChain integrated with MCP (Model Context Protocol) servers to connect with Kubernetes, Prometheus, Elastic APM, and Coralogix.
  • The agent autonomously fetches pod status, queries metrics, pulls APM traces, and correlates Kubernetes events to pinpoint the root cause, with no manual investigation needed.
  • Reduced mean-time-to-identify (MTTI) from 45+ minutes to under 5 minutes for common failure patterns.
  • Applied prompt engineering to refine the agent's reasoning chains and keep correlation accurate across noisy data sources.
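
For illustration only, the correlation step can be pictured as placing signals from different sources on one timeline. The sketch below is not the production agent: the Signal type, the candidate_root_causes helper, the 15-minute lookback, and the sample data are all hypothetical, and "earliest anomaly first" is a simplifying assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Signal:
    """One observation pulled from a data source (all fields are illustrative)."""
    source: str          # e.g. "kubernetes", "prometheus", "apm"
    timestamp: datetime
    summary: str


def candidate_root_causes(signals, incident_start, lookback=timedelta(minutes=15)):
    """Return the signals inside the lookback window, earliest first.

    Assumption: the earliest anomaly before the incident is the most likely
    trigger, so it is listed first for the agent (or an engineer) to review.
    """
    window_start = incident_start - lookback
    in_window = [s for s in signals if window_start <= s.timestamp <= incident_start]
    return sorted(in_window, key=lambda s: s.timestamp)


# Made-up sample data, not real incident records.
incident = datetime(2025, 1, 1, 12, 0)
signals = [
    Signal("apm", incident - timedelta(minutes=2), "error rate spike on checkout"),
    Signal("kubernetes", incident - timedelta(minutes=6), "OOMKilled: payments pod"),
    Signal("prometheus", incident - timedelta(minutes=4), "memory usage above limit"),
    Signal("kubernetes", incident - timedelta(hours=3), "routine node drain"),
]
ranked = candidate_root_causes(signals, incident)
for s in ranked:
    print(s.timestamp.time(), s.source, s.summary)
```

In this toy run the node drain from three hours earlier falls outside the window, and the OOMKilled event is listed first as the likeliest trigger.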

⚡ Streamlining SRE with AI

  • Smart runbook execution: When an alert fires (e.g. "Pod CrashLoopBackOff"), the AI agent reads the runbook, checks logs and metrics, and suggests the fix; once an engineer approves, the agent executes it.
  • Capacity planning before big events: Before a major sale (e.g. Coldplay concert, IPL final), the agent pulls past traffic patterns for similar event types, factors in current ticket inventory and sale velocity, and generates a scaling baseline. This gives engineers a data-backed starting point instead of hours of manual spreadsheet analysis. Final numbers are still tuned by the team, since every event's scale is unique.
  • Faster incident triage: Built reusable, prompt-engineered AI workflows that auto-classify incidents, assess change risk, and draft post-incident summaries in minutes instead of hours.
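
As a rough sketch of how a scaling baseline like the one described above could be derived (an assumed formula, not the agent's actual logic): scale the average peak of comparable past events by how fast tickets are selling now, add headroom, and convert to a replica count. The function name, its parameters, the 1.25 headroom factor, and the sample numbers are all hypothetical.

```python
import math
from statistics import mean


def scaling_baseline(past_peak_rps, current_sale_velocity, typical_sale_velocity,
                     rps_per_pod, headroom=1.25):
    """Estimate a starting replica count for an upcoming event.

    past_peak_rps:          peak requests/sec seen at similar past events
    current_sale_velocity:  tickets/min being sold for this event
    typical_sale_velocity:  tickets/min that was typical for those past events
    rps_per_pod:            requests/sec one pod can serve comfortably
    headroom:               safety multiplier (1.25 is an arbitrary example)
    """
    if not past_peak_rps:
        raise ValueError("need at least one comparable past event")
    if typical_sale_velocity <= 0 or rps_per_pod <= 0:
        raise ValueError("velocities and per-pod capacity must be positive")
    demand_ratio = current_sale_velocity / typical_sale_velocity
    expected_peak = mean(past_peak_rps) * demand_ratio
    return math.ceil(expected_peak * headroom / rps_per_pod)


# Made-up inputs: three past events, tickets selling 1.5x faster than usual.
replicas = scaling_baseline([8000, 10000, 12000], 1500, 1000, rps_per_pod=250)
print(replicas)  # prints 75
```

The result is only a starting point; the team would still adjust it for the specifics of each event.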

🔄 Legacy Bamboo → Argo CD Migration

  • Led the migration from legacy Atlassian Bamboo CI/CD to Argo CD for GitOps-native continuous delivery.
  • Designed a phased rollout strategy, migrating 100+ pipelines with zero deployment downtime.
  • Introduced ApplicationSets and Helm-based Argo CD patterns for standardized, self-service deployments across teams.

Streamlining IaC: AWS to Terraform/Terragrunt

  • Existing AWS infrastructure (VPCs, EKS clusters, RDS instances, S3 buckets, IAM roles) had been created manually or via CloudFormation, so there was no single source of truth and drift was hard to track.
  • Used terraform import to bring existing resources under Terraform state, then layered Terragrunt on top for DRY configurations across multiple environments and accounts.
  • Every infra change now goes through code review and a plan/apply pipeline, with no more ClickOps or undocumented manual changes in the AWS console.

💰 ARM64 Migration for Cost Optimization

  • Spearheaded migration of Kubernetes workloads from x86 to AWS Graviton (ARM64) instances.
  • Rebuilt multi-arch container images, validated compatibility across services, and rolled out progressively using canary deployments.
  • Achieved ~30% infrastructure cost reduction with no regression in performance benchmarks.

🛡️ Disaster Recovery & Real-Time Data Resilience

  • Always-on critical data: Optimized the DR setup so that critical components (payments, inventory, booking state) replicate data across regions in real time. If a region goes down, users don't see stale data or failed transactions.
  • RDS cross-region read replicas with near-zero lag: Tuned replication to keep lag within seconds for business-critical databases, so the failover site always has a current view of live bookings and transactions.
  • Automated failover validation: Built health-check agents that continuously verify DR readiness (data sync status, connection pool health, and cache warm-up state) instead of relying on quarterly manual drills alone.
  • Business continuity focus: Shifted DR thinking from "recover the infra" to "keep the business running", so that even during partial failures users can still browse, book, and pay without noticing a thing.
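
A minimal sketch of the kind of readiness check described above, assuming the three inputs have already been collected (for RDS, the lag figure could come from the CloudWatch ReplicaLag metric). The DrSnapshot type, the failed_dr_checks helper, and every threshold are hypothetical, not the values used in production.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrSnapshot:
    """Point-in-time view of the standby region (field names are illustrative)."""
    replica_lag_seconds: float
    healthy_pool_connections: int
    cache_hit_ratio: float


def failed_dr_checks(snapshot, max_lag_seconds=5.0, min_connections=10,
                     min_cache_hit_ratio=0.8):
    """Return the names of failed readiness checks; an empty list means ready."""
    failures = []
    if snapshot.replica_lag_seconds > max_lag_seconds:
        failures.append("data_sync")
    if snapshot.healthy_pool_connections < min_connections:
        failures.append("connection_pool")
    if snapshot.cache_hit_ratio < min_cache_hit_ratio:
        failures.append("cache_warmup")
    return failures


# Made-up snapshots for demonstration.
healthy = DrSnapshot(replica_lag_seconds=1.2, healthy_pool_connections=40, cache_hit_ratio=0.93)
degraded = DrSnapshot(replica_lag_seconds=42.0, healthy_pool_connections=40, cache_hit_ratio=0.35)
print(failed_dr_checks(healthy))   # prints []
print(failed_dr_checks(degraded))  # prints ['data_sync', 'cache_warmup']
```

Returning the names of the failed checks, rather than a single boolean, lets an alert say what is out of shape before a failover is attempted.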

🏗️ Technical Leadership

  • System design: Owning architecture decisions for services handling millions of daily transactions, from database sharding strategies to API gateway patterns.
  • Growing the team: Mentoring senior engineers through hands-on pairing sessions on Kubernetes debugging, observability stack setup, and building their first AI agents.
  • Championing AI-first SRE: Driving adoption of AI tools across the org, e.g. replacing manual post-incident reviews with AI-generated RCA reports that teams can iterate on.

👩‍💻 SRE III – BookMyShow

BigTree Entertainment Pvt. Ltd. | Apr 2021 – Present

As a senior SRE, I've been instrumental in designing scalable infrastructure, ensuring high availability for large-scale events, and embedding reliability across the CI/CD lifecycle.

🛠️ CI/CD Architecture & Release Automation

  • Standardized CI/CD across teams using GitLab, Bitbucket, and Bamboo.
  • Integrated SonarQube for quality gates; cut production issues by 30%.
  • Enabled reusable deployment templates with safe rollback support.

☁️ Cloud Migration & Infra Modernization

  • Migrated core workloads from VMware & GCP to AWS with EKS, EC2, RDS.
  • Replaced JFrog with Amazon ECR for better cost and container management.
  • Automated infra provisioning via CloudFormation & Ansible.

🌐 Disaster Recovery Implementation

  • Built a multi-region DR architecture with RDS cross-region replication, S3 backups, and Route 53 failover.
  • Authored DR runbooks and executed regular failover drills.
  • Reduced RTO from 4h to <30 mins across critical services.

📈 Scalability for High-Traffic Events

  • Handled peak loads (5x+ traffic) during the Cricket World Cup and concerts.
  • Tuned EKS with HPA, disruption budgets, and circuit-breakers.
  • Monitored with Grafana, Prometheus, synthetic testing, and APM tools such as New Relic and Elastic APM.

🧩 Istio-Based Service Mesh Deployment

  • Introduced advanced traffic routing, retries, mirroring, and observability for microservices.
  • Improved service resilience and debugging via sidecar telemetry and distributed tracing with Jaeger.

🧠 Reliability Culture & Team Enablement

  • Led incident response for P0s, with detailed postmortems and RCA reviews.
  • Trained new SREs on Kubernetes, observability tools, and CI/CD platforms.
  • Documented internal architecture and DR knowledge base.

⚙️ DevOps Engineer II – HERE Technologies

Nov 2019 – Mar 2021

  • 📊 CI Observability Dashboards: Built real-time Grafana dashboards for CI pipelines using Python & MySQL; reduced build failures by 25%.
  • ⚡ GitLab Runner Optimization: Improved flaky job reliability and cut CI build time by 30%.
  • 🛠️ Infrastructure Provisioning: Automated AWS infra via Ansible, significantly reducing manual errors.
  • 💰 Cloud Cost Optimization: Cut AWS costs by 20% using RIs and right-sized resources.
  • 🚀 Automated Deployment Pipelines: Rolled out Jenkins pipelines to improve deployment speed and consistency.

🔧 DevOps Deployment Engineer – Zycus

Oct 2017 – Nov 2019

Zycus is a global leader in Source-to-Pay procurement software, empowering enterprises with automation-driven solutions.

  • 🐍 GitLab Access Automation: Automated GitLab access control using Python scripts for streamlined onboarding.
  • 🧪 CI Pipeline Hardening: Integrated GitLab CLI with Jenkins, SonarQube, and Nexus to enforce quality gates.
  • 🚀 Release Management: Coordinated deployment planning across multiple non-prod and prod environments.
  • ☁️ Hybrid Cloud Management: Provisioned and maintained infra on AWS, Navisite, and VMware platforms.
  • 📦 AWS Services Integration: Deployed VPC, EC2, ALB, Auto Scaling, and S3 in scalable infra setups.
  • 🐳 Docker-Based Dev Envs: Enabled isolated dev/testing using Docker Compose and shared base images.
  • 🔄 Developer Enablement: Guided devs in creating Dockerfiles and containerizing local apps.
  • 🔧 Ansible Configuration: Managed infra configuration and app deployments via Ansible roles/playbooks.
  • 🧭 Consul for Service Discovery: Leveraged Consul to manage dynamic service configurations.
  • 🌐 Web Server Config: Served applications via Apache, Nginx, and HAProxy for high availability.
  • 🛡️ CI/CD Quality Assurance: Ensured stable deployments through robust infra testing and rollout strategies.

🩺 DevOps Engineer – Doctor Insta (via OpsTree Solutions)

June 2016 – September 2017

Doctor Insta is a telehealth platform offering digital primary care and remote doctor consultations across India.

  • 🔧 Infra Automation with Ansible: Created reusable roles/playbooks for consistent infra provisioning.
  • ☁️ AWS Infrastructure Management: Managed EC2, RDS, VPC, S3, and Route 53 across environments.
  • 📦 Region Migration: Led successful production migration across AWS regions with minimal downtime.
  • 🔐 Git Hosting via Gitolite: Deployed and maintained secure, internal Git repositories.
  • 🚀 CI/CD with Jenkins: Automated deployments and app builds using job-based pipelines.
  • 💾 DB Backup Automation: Scheduled daily backups with secure uploads to AWS S3.
  • 📈 Monitoring with Zabbix: Implemented end-to-end infra and app monitoring for alerts and health checks.
  • 🐍 Python App Deployment: Deployed apps in isolated virtual environments to ensure dependency consistency.