Site Reliability Engineer
Keeper Security - Full Time
- Junior (1 to 2 years)
Candidates should possess 8+ years of experience in infrastructure, DevOps, or Site Reliability Engineering roles with increasing responsibility, along with proven experience scaling distributed systems in high-availability production environments. Expertise with Kubernetes, Terraform, containerization, and at least one major cloud provider (AWS preferred) is required, as is strong knowledge of system design, networking, and reliability principles.
The Head of Site Reliability Engineering will build and lead the SRE function at Shakudo, setting goals and technical direction and shaping team culture. They will own uptime, reliability, and incident response for the platform; architect scalable infrastructure using Kubernetes, cloud-native tooling, and automation frameworks; and create and enforce best practices for CI/CD, disaster recovery, and service-level objectives (SLOs). They will also partner closely with engineering and product teams to ensure new features are reliable and production-ready, mentor engineers, and instill a culture of operational excellence.
End-to-end platform for AI projects
Shakudo offers a platform designed to support organizations in developing and managing AI and data-intensive products. Its main product, the Hyperplane platform, supports the entire workflow of AI projects, from initial ideas to deployment. The platform automates resource optimization, helping teams select the best configurations without complex setup, so data scientists and AI teams can focus on building and maintaining their models. Shakudo differentiates itself from competitors by providing a subscription-based service with tiered pricing, allowing organizations to choose the level of access that suits their needs. Shakudo's goal is to simplify the AI development process, enabling organizations to implement their AI projects more quickly and reliably.