Expert-level proficiency operating large-scale, distributed, mission-critical systems: designing for high availability, multi-region resiliency, low latency, and predictable performance under extreme load.
SRE fundamentals at Staff level: defines and drives SLOs/SLIs, error budgets, availability targets, and capacity guardrails; codifies reliability requirements into design reviews and change-management gates.
Deep hands-on with Kubernetes and container platforms: multi-cluster operations, workload placement, HPA/VPA, pod disruption budgets, network policies, admission control, service mesh (Istio/Linkerd), and progressive delivery (blue/green, canary, feature flags).
Infra as Code and GitOps: Terraform (and/or Pulumi), Helm/Kustomize, Argo CD/Flux; builds reusable modules, policy-as-code (OPA/Conftest), environment drift detection, and automated remediation.
Observability at scale: OpenTelemetry instrumentation, metrics (Prometheus), logging (ELK/OpenSearch), distributed tracing (Jaeger/Tempo/Zipkin), dashboards and SLO burn-rate alerts (Grafana); designs actionable alerts with runbook automation.
Proven incident leadership: serves as Incident Commander for P0/P1 events, coordinates cross-functional response, stabilizes systems, restores service quickly, and drives blameless postmortems with measurable follow-through.
Performance engineering and capacity planning: load and resilience testing, GC/heap and thread tuning (for JVM services), profiling (CPU, memory, IO), caching strategies, queue backpressure, and cost-aware capacity models.
Strong systems and networking: Linux internals, filesystems, TCP/UDP, TLS/mTLS, HTTP/2/3, DNS, BGP/Anycast concepts, L4–L7 load balancing (Envoy/HAProxy/NGINX), CDN/edge (Cloudflare/Fastly/Akamai), WAF, and DDoS mitigation.
Data/store reliability: operational experience with relational (PostgreSQL/MySQL/Oracle) and NoSQL (Cassandra/DynamoDB/MongoDB) databases, streaming platforms (Kafka/Pulsar/Kinesis), and distributed caches (Redis/Hazelcast); backup/restore, consistency models, compaction/retention tuning, and multi-AZ/region failover.
Cloud and platform engineering: AWS/Azure/GCP core services, VPC design, IAM/RBAC, KMS, secrets management (Vault), service catalog, golden images/base containers, and paved-road platforms for developers.
Release engineering and CI/CD: Jenkins/GitHub Actions/GitLab CI, artifact/signing/SBOM, canary analysis, automated rollbacks, deployment safety checks, and change failure rate/MTTR improvements.
Reliability-by-design partnership: participates in and leads architecture/design reviews and threat modeling, and applies resilience patterns (bulkheads, circuit breakers, idempotency, retry/backoff, dead-letter handling).
Disaster recovery and business continuity: RTO/RPO objectives, runbooks, game days/chaos experiments (Litmus/Gremlin), regional evacuation, and active-active/active-passive strategies.
Security in depth for production systems: least privilege, workload identity, image and dependency scanning, supply-chain hardening (SLSA), SBOM, network segmentation/zero trust, and PCI-DSS-aligned operational controls.
Strong programming and automation: production-grade Go and/or Python (plus Bash), contributing SRE tooling, controllers/operators, and APIs; code reviews, testing, and docs-as-code.
Effective communicator and influencer: aligns reliability strategy with business outcomes, mentors engineers, challenges assumptions with data, and proposes pragmatic, incremental improvements.
Experience leveraging GenAI/LLMs as copilots: accelerating runbook authoring, alert triage, knowledge retrieval, and post-incident synthesis with appropriate guardrails and data security.
Nice to have: JVM and Node.js runtime tuning experience; traffic engineering at Internet scale; mobile edge/network reliability considerations.

This is a hybrid position. Expectation of days in office will be confirmed by your hiring manager.

Staff Site Reliability Engineer (8+ years)