Staff Site Reliability Engineer (8+ years)

Visa • Hybrid • Bangalore • Full-time
  • Expert-level proficiency operating large-scale, distributed, mission-critical systems: designing for high availability, multi-region resiliency, low latency, and predictable performance under extreme load. 

  • SRE fundamentals at Staff level: defines and drives SLOs/SLIs, error budgets, availability targets, and capacity guardrails; codifies reliability requirements into design reviews and change-management gates.

  • Deep hands-on with Kubernetes and container platforms: multi-cluster operations, workload placement, HPA/VPA, pod disruption budgets, network policies, admission control, service mesh (Istio/Linkerd), and progressive delivery (blue/green, canary, feature flags).

  • Infra as Code and GitOps: Terraform (and/or Pulumi), Helm/Kustomize, Argo CD/Flux; builds reusable modules, policy-as-code (OPA/Conftest), environment drift detection, and automated remediation.

  • Observability at scale: OpenTelemetry instrumentation/tracing, metrics (Prometheus), logging (ELK/OpenSearch), distributed tracing (Jaeger/Tempo/Zipkin), dashboards and SLO burn-rate alerts (Grafana); designs actionable alerts with runbook automation.

  • Proven incident leadership: serves as Incident Commander for P0/P1 events, coordinates cross-functional response, stabilizes systems, restores service quickly, and drives blameless postmortems with measurable follow-through. 

  • Performance engineering and capacity planning: load and resilience testing, GC/heap and thread tuning (for JVM services), profiling (CPU, memory, IO), caching strategies, queue backpressure, and cost-aware capacity models.

  • Strong systems and networking: Linux internals, filesystems, TCP/UDP, TLS/mTLS, HTTP/2/3, DNS, BGP/Anycast concepts, L4–L7 load balancing (Envoy/HAProxy/NGINX), CDN/edge (Cloudflare/Fastly/Akamai), WAF, and DDoS mitigation.
  • Data/store reliability: operational experience with relational (PostgreSQL/MySQL/Oracle) and NoSQL (Cassandra/DynamoDB/MongoDB), streaming platforms (Kafka/Pulsar/Kinesis), and distributed caches (Redis/Hazelcast); backup/restore, consistency models, compaction/retention tuning, and multi-AZ/region failover.
  • Cloud and platform engineering: AWS/Azure/GCP core services, VPC design, IAM/RBAC, KMS, secrets management (Vault), service catalog, golden images/base containers, and paved-road platforms for developers.
  • Release engineering and CI/CD: Jenkins/GitHub Actions/GitLab CI, artifact/signing/SBOM, canary analysis, automated rollbacks, deployment safety checks, and change failure rate/MTTR improvements. 
  • Reliability-by-design partnership: participates in and leads architecture/design reviews, threat modeling, and resilience patterns (bulkheads, circuit breakers, idempotency, retry/backoff, dead-letter handling). 
  • Disaster recovery and business continuity: RTO/RPO objectives, runbooks, game days/chaos experiments (Litmus/Gremlin), regional evacuation, and active-active/active-passive strategies. 
  • Security in depth for production systems: least privilege, workload identity, image and dependency scanning, supply-chain hardening (SLSA), SBOM, network segmentation/zero trust, and PCI-DSS-aligned operational controls. 
  • Strong programming and automation: production-grade Go and/or Python (plus Bash), contributing SRE tooling, controllers/operators, and APIs; code reviews, testing, and docs-as-code.
  • Effective communicator and influencer: aligns reliability strategy with business outcomes, mentors engineers, challenges assumptions with data, and proposes pragmatic, incremental improvements. 
  • Experience leveraging GenAI/LLMs as copilots: accelerating runbook authoring, alert triage, knowledge retrieval, and post-incident synthesis with appropriate guardrails and data security. 
  • Nice to have: JVM and Node.js runtime tuning experience; traffic engineering at Internet scale; mobile edge/network reliability considerations.

This is a hybrid position. Expectation of days in office will be confirmed by your hiring manager.