Site Reliability Engineer
Keeper Security - Full Time
- Junior (1 to 2 years)
Candidates should possess a Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field. A minimum of 3 years of professional work experience in a fast-paced, high-growth environment is required. Extensive experience with Kubernetes and with infrastructure-as-code tools such as Terraform, CloudFormation, or Pulumi is necessary, along with familiarity with CI/CD tooling such as GitHub Actions or Jenkins. Experience with open-source observability tools such as Prometheus or Grafana is a plus. Candidates should demonstrate the ability to own projects end-to-end and be open to learning about machine learning.
As a Site Reliability Engineer, you will build and maintain scalable infrastructure to support the deployment and operation of machine learning models. You will establish standards and best practices for reliability and performance across the infrastructure and automate processes, particularly for managing CI/CD pipelines. The role involves owning products and projects end-to-end, collaborating with cross-functional teams to translate project requirements into technical solutions, mentoring junior team members, and navigating ambiguity while exercising good judgment on tradeoffs and tools needed to solve problems.
Platform for deploying and managing ML models
Baseten provides a platform for deploying and managing machine learning (ML) models, aimed at simplifying the process for businesses. Users can select from a library of open-source foundation models and deploy them with just two clicks, making it easier to implement ML solutions. The platform features autoscaling, which adjusts resources based on demand, and comprehensive monitoring tools for tracking performance and troubleshooting. A key differentiator is Baseten's open-source model packaging framework, Truss, which allows users to package and deploy custom models easily. The company operates on a usage-based pricing model, where clients pay only for the time their models are actively deployed, helping them manage costs effectively.