Principal Engineer - BookMyShow
BigTree Entertainment Pvt. Ltd. | April 2025 - Present
Driving technical excellence at the intersection of SRE, Platform Engineering, and AI: building intelligent systems that reduce toil, accelerate incident response, and future-proof infrastructure.
AI-Powered Automated RCA
Built an agentic AI system that performs end-to-end Root Cause Analysis autonomously: just provide a deployment name and time range, and the agent does the rest.
- Architected an agentic AI pipeline using LangChain integrated with MCP (Model Context Protocol) servers to connect with Kubernetes, Prometheus, Elastic APM, and Coralogix.
- The agent autonomously fetches pod status, queries metrics, pulls APM traces, and correlates Kubernetes events to pinpoint the root cause, with zero manual investigation needed.
- Reduced mean-time-to-identify (MTTI) from 45+ minutes to under 5 minutes for common failure patterns.
- Applied prompt engineering to refine agent reasoning chains, ensuring accurate correlation across noisy data sources.
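The correlation step described above can be illustrated with a minimal Python sketch. This is not the production agent: the Signal type, the severity scale, the ten-minute window, and the ranking rule are all assumptions made for illustration, and the sample data is invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Signal:
    source: str          # illustrative labels: "kubernetes", "prometheus", "apm"
    timestamp: datetime
    summary: str
    severity: int        # hypothetical scale: 1 (info) to 5 (critical)


def rank_candidates(signals, incident_start, window=timedelta(minutes=10)):
    """Keep signals close to the incident start and rank likely causes first.

    Ranking rule (an assumption): higher severity first, then earlier
    timestamp, since the earliest severe signal is often nearest the cause.
    """
    nearby = [s for s in signals if abs(s.timestamp - incident_start) <= window]
    return sorted(nearby, key=lambda s: (-s.severity, s.timestamp))


# Invented sample data standing in for what the data sources would return.
t0 = datetime(2025, 1, 1, 12, 0)
signals = [
    Signal("prometheus", t0 + timedelta(minutes=2), "p99 latency spike", 3),
    Signal("kubernetes", t0 - timedelta(minutes=1), "OOMKilled in pod api-7f9c", 5),
    Signal("apm", t0 + timedelta(minutes=3), "timeouts calling payments", 4),
    Signal("kubernetes", t0 - timedelta(hours=2), "node scaled up", 1),
]
ranked = rank_candidates(signals, t0)
print(ranked[0].summary)
```

In a real pipeline the ranked list would be handed to the model as context rather than treated as the final answer.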
Streamlining SRE with AI
- Smart runbook execution: When an alert fires (e.g. "Pod CrashLoopBackOff"), the AI agent reads the runbook, checks logs and metrics, and suggests the fix; engineers just approve and it executes.
- Capacity planning before big events: Before a major sale (e.g. Coldplay concert, IPL final), the agent pulls past traffic patterns for similar event types, factors in current ticket inventory and sale velocity, and generates a scaling baseline, giving engineers a data-backed starting point instead of hours of manual spreadsheet analysis. Final numbers are still tuned by the team since every event's scale is unique.
- Faster incident triage: Used prompt engineering to build reusable AI workflows that auto-classify incidents, assess change risk, and draft post-incident summaries in minutes instead of hours.
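As a rough illustration of the capacity-planning baseline, the Python sketch below turns past peaks into a starting replica count. The formula, the inventory_ratio input, the 30% headroom, and all sample numbers are assumptions for illustration, not the agent's actual model.

```python
import math


def scaling_baseline(past_peaks_rps, inventory_ratio, rps_per_pod, headroom=1.3):
    """Suggest a starting replica count from past peaks of similar events.

    past_peaks_rps: peak requests/sec seen at comparable past events
    inventory_ratio: current ticket inventory relative to those events
    rps_per_pod: load one pod can serve (assumed to be known from load tests)
    headroom: safety multiplier (hypothetical default of 30%)
    """
    if not past_peaks_rps:
        raise ValueError("need at least one past event to build a baseline")
    expected_peak = max(past_peaks_rps) * inventory_ratio
    return math.ceil(expected_peak * headroom / rps_per_pod)


# Invented numbers: three past events, 1.5x the inventory, 400 rps per pod.
replicas = scaling_baseline([12000, 18000, 15000], inventory_ratio=1.5, rps_per_pod=400)
print(replicas)
```

The result is only a starting point; as noted above, the team still tunes the final numbers per event.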
Legacy Bamboo to Argo CD Migration
- Led the migration from legacy Atlassian Bamboo CI/CD to Argo CD for GitOps-native continuous delivery.
- Designed a phased rollout strategy, migrating 100+ pipelines with zero deployment downtime.
- Introduced ApplicationSets and Helm-based Argo CD patterns for standardized, self-service deployments across teams.
Streamlining IaC: AWS to Terraform/Terragrunt
- Existing AWS infrastructure (VPCs, EKS clusters, RDS instances, S3 buckets, IAM roles) had been created manually or via CloudFormation, so there was no single source of truth and drift was hard to track.
- Used terraform import to bring existing resources under Terraform state, then layered Terragrunt on top for DRY configurations across multiple environments and accounts.
- Now every infra change goes through code review and a plan/apply pipeline, with no more ClickOps or undocumented manual changes in the AWS console.
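A minimal Python sketch of the import step, assuming a hand-maintained mapping from Terraform resource addresses to existing AWS identifiers. The resource names and IDs are hypothetical; the only real interface used is the terraform import ADDRESS ID command form.

```python
import shlex


def import_commands(resources):
    """Build one `terraform import ADDRESS ID` command per existing resource.

    resources maps a Terraform resource address to the identifier of the
    already-existing cloud resource. Sorting keeps the output reproducible.
    """
    return [
        ["terraform", "import", address, resource_id]
        for address, resource_id in sorted(resources.items())
    ]


# Hypothetical inventory of manually created resources.
existing = {
    "aws_s3_bucket.assets": "example-assets-bucket",
    "aws_db_instance.bookings": "bookings-db",
}
commands = import_commands(existing)
for cmd in commands:
    print(shlex.join(cmd))  # print for review instead of executing blindly
```

Printing the commands rather than running them keeps a human review step in place before any state is modified.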
ARM64 Migration for Cost Optimization
- Spearheaded migration of Kubernetes workloads from x86 to AWS Graviton (ARM64) instances.
- Rebuilt multi-arch container images, validated compatibility across services, and rolled out progressively using canary deployments.
- Achieved ~30% infrastructure cost reduction while maintaining identical performance benchmarks.
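One way to validate multi-arch images before a canary rollout is to check that the image's manifest list covers both architectures. The Python sketch below assumes the manifest JSON has already been fetched (for example with docker manifest inspect); the sample JSON is invented and trimmed to the fields the check reads.

```python
import json


def missing_architectures(index_json, required=("amd64", "arm64")):
    """Return the required CPU architectures absent from an image index.

    index_json is the JSON text of a manifest list / OCI image index,
    where each entry under "manifests" carries a "platform" object.
    """
    index = json.loads(index_json)
    present = {
        entry.get("platform", {}).get("architecture")
        for entry in index.get("manifests", [])
    }
    return [arch for arch in required if arch not in present]


# Invented sample: an image that was only built for x86.
sample = json.dumps({
    "manifests": [
        {"platform": {"architecture": "amd64", "os": "linux"}},
    ]
})
print(missing_architectures(sample))
```

An empty list means the image can be scheduled on both x86 and Graviton nodes.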
Disaster Recovery & Real-Time Data Resilience
- Always-on critical data: Optimized the DR setup so that critical components (payments, inventory, booking state) have real-time data replication across regions. If a region goes down, users don't see stale data or failed transactions.
- RDS cross-region read replicas with near-zero lag: Tuned replication to keep lag within seconds for business-critical databases, so the failover site always has a current view of live bookings and transactions.
- Automated failover validation: Built health-check agents that continuously verify DR readiness (data sync status, connection pool health, and cache warm-up state) instead of relying on quarterly manual drills alone.
- Business continuity focus: Shifted DR thinking from "recover the infra" to "keep the business running", ensuring that even during partial failures, users can still browse, book, and pay without noticing a thing.
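The readiness checks listed above (data sync, connection pool health, cache warm-up) could be combined along these lines. This is a Python sketch under stated assumptions: the DrSnapshot fields and every threshold are hypothetical, and the real health-check agents would collect these values from live systems.

```python
from dataclasses import dataclass


@dataclass
class DrSnapshot:
    replica_lag_seconds: float   # replication lag observed on the DR replica
    healthy_connections: int     # pool connections passing a health probe
    total_connections: int
    cache_warm_ratio: float      # fraction of hot keys already in the DR cache


def readiness_failures(snap, max_lag=5.0, min_pool_health=0.9, min_cache_warm=0.8):
    """Return the names of failed checks; an empty list means DR-ready.

    All thresholds are illustrative defaults, not measured values.
    """
    failures = []
    if snap.replica_lag_seconds > max_lag:
        failures.append("replica lag")
    pool_health = (
        snap.healthy_connections / snap.total_connections
        if snap.total_connections else 0.0
    )
    if pool_health < min_pool_health:
        failures.append("connection pool")
    if snap.cache_warm_ratio < min_cache_warm:
        failures.append("cache warm-up")
    return failures


print(readiness_failures(DrSnapshot(2.0, 95, 100, 0.9)))
print(readiness_failures(DrSnapshot(12.0, 50, 100, 0.9)))
```

Returning the list of failed checks, rather than a single boolean, tells the on-call engineer what to fix before a failover is safe.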
Technical Leadership
- System design: Owning architecture decisions for services handling millions of daily transactions, from database sharding strategies to API gateway patterns.
- Growing the team: Mentoring senior engineers through hands-on pairing sessions on Kubernetes debugging, observability stack setup, and building their first AI agents.
- Championing AI-first SRE: Driving adoption of AI tools across the org, e.g. replacing manual post-incident reviews with AI-generated RCA reports that teams can iterate on.
SRE III - BookMyShow
BigTree Entertainment Pvt. Ltd. | Apr 2021 - Present
As a senior SRE, I've been instrumental in designing scalable infrastructure, ensuring high availability for large-scale events, and embedding reliability across the CI/CD lifecycle.
CI/CD Architecture & Release Automation
- Standardized CI/CD across teams using GitLab, Bitbucket, and Bamboo.
- Integrated SonarQube for quality gates; cut production issues by 30%.
- Enabled reusable deployment templates with safe rollback support.
Cloud Migration & Infra Modernization
- Migrated core workloads from VMware & GCP to AWS with EKS, EC2, RDS.
- Replaced JFrog with Amazon ECR for better cost and container management.
- Automated infra provisioning via CloudFormation & Ansible.
Disaster Recovery Implementation
- Built a multi-region DR architecture with RDS cross-region replication, S3 backups, and Route 53 failover.
- Authored DR runbooks and executed regular failover drills.
- Reduced RTO from 4h to <30 mins across critical services.
Scalability for High-Traffic Events
- Handled peak loads (5x+ traffic) during the Cricket World Cup and concerts.
- Tuned EKS with HPA, disruption budgets, and circuit-breakers.
- Monitored with Grafana, Prometheus, synthetic testing, and APM tools such as New Relic and ELK Stack APM.
Istio-Based Service Mesh Deployment
- Introduced advanced traffic routing, retries, mirroring, and observability for microservices.
- Improved service resilience and debugging via sidecar telemetry and distributed tracing with Jaeger.
Reliability Culture & Team Enablement
- Led incident response for P0s, with detailed postmortems and RCA reviews.
- Trained new SREs on Kubernetes, observability tools, and CI/CD platforms.
- Documented internal architecture and DR knowledge base.
DevOps Deployment Engineer - Zycus
Oct 2017 - Nov 2019
Zycus is a global leader in Source-to-Pay procurement software, empowering enterprises with automation-driven solutions.
- GitLab Access Automation: Automated GitLab access control using Python scripts for streamlined onboarding.
- CI Pipeline Hardening: Integrated GitLab CLI with Jenkins, SonarQube, and Nexus to enforce quality gates.
- Release Management: Coordinated deployment planning across multiple non-prod and prod environments.
- Hybrid Cloud Management: Provisioned and maintained infra on AWS, Navisite, and VMware platforms.
- AWS Services Integration: Deployed VPC, EC2, ALB, Auto Scaling, and S3 in scalable infra setups.
- Docker-Based Dev Envs: Enabled isolated dev/testing using Docker Compose and shared base images.
- Developer Enablement: Guided devs in creating Dockerfiles and containerizing local apps.
- Ansible Configuration: Managed infra configuration and app deployments via Ansible roles/playbooks.
- Consul for Service Discovery: Leveraged Consul to manage dynamic service configurations.
- Web Server Config: Served applications via Apache, Nginx, and HAProxy for high availability.
- CI/CD Quality Assurance: Ensured stable deployments through robust infra testing and rollout strategies.
DevOps Engineer - Doctor Insta (via OpsTree Solutions)
June 2016 - September 2017
Doctor Insta is a telehealth platform offering digital primary care and remote doctor consultations across India.
- Infra Automation with Ansible: Created reusable roles/playbooks for consistent infra provisioning.
- AWS Infrastructure Management: Managed EC2, RDS, VPC, S3, and Route 53 across environments.
- Region Migration: Led successful production migration across AWS regions with minimal downtime.
- Git Hosting via Gitolite: Deployed and maintained secure, internal Git repositories.
- CI/CD with Jenkins: Automated deployments and app builds using job-based pipelines.
- DB Backup Automation: Scheduled daily backups with secure uploads to AWS S3.
- Monitoring with Zabbix: Implemented end-to-end infra and app monitoring for alerts and health checks.
- Python App Deployment: Deployed apps in isolated virtual environments to ensure dependency consistency.