Site Reliability Engineer
Stitch Fix
Full Time
Mid-level (3 to 4 years)
Candidates should have 3+ years of experience in Site Reliability Engineering, DevOps, or similar roles, with proficiency in at least one high-level programming language such as Python, Go, or JavaScript. Experience deploying and managing complex applications on AWS, hands-on use of Infrastructure as Code tools such as Terraform or SST, and expertise in monitoring and observability systems such as Prometheus or Grafana are essential. Knowledge of CI/CD pipelines and modern deployment strategies, strong problem-solving skills, and a systems-thinking approach are also required. A passion for leveraging AI in infrastructure is a plus.
The Site Reliability Engineer will design and evolve tooling that empowers development squads to own their systems, build self-service infrastructure platforms for teams, and implement monitoring, alerting, and auto-scaling systems to ensure platform reliability. They will also automate operations, develop advanced observability solutions, expand Infrastructure as Code capabilities, and mentor development teams in adopting SRE practices and tools.