Senior Engineering Manager, Site Reliability
Ditto, Full Time
Expert & Leadership (9+ years)
Candidates must hold a Bachelor's degree in Computer Science or a related field, or have equivalent work experience. The role requires a minimum of 7 years of engineering experience, including at least 2 years in a production role and 2 years of hands-on management experience leading engineering teams. Demonstrated success in driving complex technical initiatives across organizational boundaries, solid design and problem-solving skills, and a passion for engineering excellence, quality, security, and performance are essential. Strong cross-group collaboration and interpersonal communication skills are necessary, along with experience leading teams that work with technologies such as Linux, VMware, FreeBSD, and Storage Area Networks. Experience leading distributed teams in a remote-first environment and proficiency in hybrid/on-prem cloud environments are also required, as is a deep understanding of distributed systems and reliability engineering principles. Bonus points are awarded for strategic thinking, stakeholder management, successful cross-functional collaboration, strong change management skills, and the ability to build consensus.
The SRE Manager will lead a team of site reliability engineers, overseeing day-to-day operations, infrastructure management, and operational excellence in a production environment. Responsibilities include managing and mentoring the team, conducting performance reviews, planning career development, and hiring. The role involves fostering a culture of reliability, automation, and continuous improvement, and coordinating cross-functional collaboration with Engineering, Security, and Operations teams. Key duties include overseeing 24/7 monitoring and incident response, driving SLI/SLO definition and monitoring, leading post-incident reviews, and implementing preventive measures. The manager will also ensure compliance with security and regulatory requirements, develop and execute reliability roadmaps, manage capacity planning and infrastructure scaling, and make technology evaluation and adoption decisions. The role also covers budget planning and resource allocation, championing infrastructure-as-code and automation initiatives, establishing and improving operational procedures, driving the adoption of observability and monitoring practices, and running table-top exercises and disaster recovery tests.
Cloud-native endpoint security solutions provider
CrowdStrike specializes in cybersecurity, focusing on protecting businesses from cyber threats through cloud-native endpoint security solutions. Their main product, the Falcon platform, includes services like Falcon Pro, which replaces traditional antivirus with next-generation antivirus that integrates threat intelligence, Falcon Insight for endpoint detection and response, and Falcon Device Control to manage connected devices. Unlike many competitors, CrowdStrike's services are subscription-based, allowing clients to choose different levels of protection based on their needs. The company serves a diverse clientele, including many Fortune 100 companies, and is recognized as a leader in the cybersecurity field, known for its effectiveness in threat detection and response.