Site Reliability Engineer
Stitch Fix
Full Time
Mid-level (3 to 4 years)
Candidates should have 7+ years of experience in software engineering, with a strong focus on production systems and distributed architectures, and should thrive in high-leverage roles that improve how everyone else builds, ships, and fixes software. They should have led or played a significant role in incident response and in building systems and culture around continuous improvement. They should be excited by complexity rather than afraid of it, and deeply motivated to make systems safer and teams faster. Also required are experience working on distributed systems at scale; familiarity with Kafka/Redpanda, PostgreSQL or other SQL databases, MongoDB or other NoSQL databases, and ClickHouse or other OLAP databases; and a deep understanding of release automation, CI/CD, and code lifecycle management.
The Production Engineer will drive reliability and observability improvements across large-scale distributed systems and serve as a force multiplier across all engineering teams by reducing downtime, improving tooling, and freeing senior engineers from firefighting. They will own and evolve the company’s incident review process, leading postmortems and embedding learnings into tools, practices, and culture. They will collaborate with teams to improve release hygiene, including automating release gating, preventing code from stagnating in staging environments, and implementing pre-production automated test pipelines. They will also build and maintain Nominal’s gRPC middleware to ensure safe, observable, and performant service communication, and improve alerting, debugging, and monitoring to ensure production health and rapid root cause analysis.
Software tools for engineering hardware systems
Nominal.io provides software tools designed specifically for engineering teams working with complex hardware systems. Its platform allows these teams to test and deploy hardware systems significantly faster than traditional methods, making it particularly beneficial for industries such as aerospace, defense, energy, and telecommunications, where hardware performance is critical. The platform consolidates data from various sources, enabling engineers to monitor and analyze their systems effectively in a secure environment. Unlike many competitors, Nominal.io focuses on a niche market with high demands for reliability, offering a software-as-a-service (SaaS) model that gives clients continuous access to the latest features. The company's goal is to enhance the resilience and performance of hardware systems, positioning itself as a key partner for engineering teams looking to improve their deployment processes.