Site Reliability Engineer
Stitch Fix
Full Time
Mid-level (3 to 4 years)
Candidates should have 5+ years of experience in a large-scale production environment and expertise with on-premises and cloud deployments, including scaling and maintaining CI/CD tools (Bazel, GitHub Actions, Jenkins), IaC provisioning tools (Ansible, Chef, Puppet, Salt, Terraform), source code management services (Bitbucket, GitLab, GitHub), and monitoring/observability tooling (Prometheus/Grafana, Datadog, Honeycomb, New Relic). Experience deploying applications on Kubernetes at scale is required, along with proficiency in configuring and optimizing load balancers (NGINX, HAProxy, Envoy) and database technologies (relational and non-relational) at web scale. A thorough understanding of SLIs and SLOs and how to apply them to increase service reliability is also necessary.
The Engineer III - Reliability will build software and systems to manage platform infrastructure and applications, support CrowdStrike's primary CI/CD build tools, and develop automation solutions for service deployment. Responsibilities include monitoring availability; improving system reliability, quality, and serviceability; and architecting highly available services at enterprise scale. The role involves working with internal customers to understand their needs and develop solutions, championing Incident Response and Production Readiness Reviews, and gathering and analyzing metrics for performance tuning and root cause analysis. Additionally, the engineer will be responsible for resource, capacity, and license forecasting; partnering with and mentoring other engineers; and configuring and optimizing load balancers, databases, key-value stores, and message brokers.
Cloud-native endpoint security solutions provider
CrowdStrike specializes in cybersecurity, focusing on protecting businesses from cyber threats through cloud-native endpoint security solutions. Their main product, the Falcon platform, includes services like Falcon Pro, which replaces traditional antivirus with next-generation antivirus that integrates threat intelligence, Falcon Insight for endpoint detection and response, and Falcon Device Control to manage connected devices. Unlike many competitors, CrowdStrike's services are subscription-based, allowing clients to choose different levels of protection based on their needs. The company serves a diverse clientele, including many Fortune 100 companies, and is recognized as a leader in the cybersecurity field, known for its effectiveness in threat detection and response.