Site Reliability Engineer - CTJ - Poly

Microsoft • Onsite • Redmond • Full-time

Leverages end-to-end technical expertise in large-scale distributed systems' infrastructure, code, inter- and intra-service dependencies, and operations to proactively and continuously improve the reliability, performance, efficiency, latency, and scalability of services and/or products operating at scale. Partners with software engineering product teams by suggesting scalable ways to optimize code, sharing expertise and insights drawn from working across related services or products, and participating in incident response throughout development and operations lifecycles. Develops code, scripts, systems, and/or tools that reduce operational burden by automating complex and repetitive tasks, enable product engineering teams to increase the velocity at which they can safely deploy changes to production, and monitor the effects of changes across systems, services, and/or products. Analyzes telemetry data to develop capacity planning models, identify patterns and trends that drive continuous improvement, and highlight opportunities to deploy automation to monitor and manage services and/or products. Participates in on-call rotations to resolve live site incidents, minimize customer impact, and document solutions and insights that inform ongoing improvements to infrastructure, code, tools, and/or processes that prevent the recurrence of similar issues.

 

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.