Site Reliability Engineer
Stitch Fix
Full Time
Mid-level (3 to 4 years)
Candidates must have over 5 years of experience with enterprise monitoring tools such as Prometheus, LogicMonitor, Datadog, ThousandEyes, or Zscaler Digital Experience (ZDX). Strong proficiency in a scripting language such as Python, Bash, or PowerShell for automation is required, along with experience in log management platforms (ELK stack, Splunk, LogScale) and cloud services monitoring (AWS CloudWatch, GCP). Knowledge of SRE principles, SLOs, error budgets, incident management, automated alerting, remediation workflows, and CI/CD pipeline monitoring is essential. Familiarity with Infrastructure as Code (Terraform, Ansible) and containerization (Docker, Kubernetes) is also required.
The IT Monitoring Engineer/Site Reliability Engineer will design, implement, and maintain monitoring solutions for critical IT infrastructure and applications, ensuring their reliability, availability, and performance. Responsibilities include configuring alerting thresholds, defining and tracking SLOs, creating real-time health dashboards, and conducting system reliability reviews. The role involves participating in on-call rotations, leading incident response efforts, conducting post-incident reviews, and documenting resolutions. Additionally, the engineer will develop scripts and automation for monitoring tasks, create self-healing systems, integrate monitoring tools, and collaborate with development, infrastructure, and security teams. Staying current with industry trends, analyzing monitoring data for improvements, and contributing to the organization's monitoring strategy are also key duties.
Cloud-native endpoint security solutions provider
CrowdStrike specializes in cybersecurity, protecting businesses from cyber threats through cloud-native endpoint security solutions. Its main product, the Falcon platform, includes Falcon Pro, which replaces traditional antivirus with next-generation antivirus that integrates threat intelligence; Falcon Insight, for endpoint detection and response; and Falcon Device Control, for managing connected devices. Unlike many competitors, CrowdStrike sells its services by subscription, allowing clients to choose the level of protection that fits their needs. The company serves a diverse clientele, including many Fortune 100 companies, and is recognized as a leader in the cybersecurity field, known for its effectiveness in threat detection and response.