Shakudo

Head of Site Reliability Engineering

Toronto, Ontario, Canada

Compensation: Not Specified
Experience Level: Senior (5 to 8 years), Mid-level (3 to 4 years)
Job Type: Full Time
Visa: Unknown
Industries: Cloud Computing, DevOps, Software Engineering

Job Description

Position Overview

  • Location Type: Remote
  • Employment Type: Full Time
  • Salary: Not specified

Shakudo is building the world’s first operating system for data and AI. We use the term operating system in the truest sense of the word. Like iOS, Windows and Linux, Shakudo’s end-to-end OS offers ever-evolving, automatically operated, best-of-breed open-source components tailored to each business's unique needs.

This role is ideal for someone who thrives on solving infrastructure challenges, scaling cloud-native systems, and building high-performance teams. You will work cross-functionally with engineering, product, and customer success to make Shakudo’s platform rock-solid and resilient for our customers around the world.

Requirements

  • 8+ years of experience in infrastructure, DevOps, or SRE roles with increasing responsibility
  • Proven experience scaling distributed systems in a high-availability, production environment
  • Expertise with Kubernetes, Terraform, containerization, and at least one major cloud provider (AWS preferred)
  • Strong knowledge of system design, networking, and reliability principles
  • Experience with observability tools (e.g., Prometheus, Grafana, Datadog) and incident response practices
  • Strong leadership and communication skills, with a hands-on, collaborative approach

Responsibilities

  • Build and lead the SRE function at Shakudo, setting goals and technical direction and shaping team culture
  • Own uptime, reliability, and incident response for our platform
  • Architect scalable infrastructure using Kubernetes, cloud-native tooling, and automation frameworks
  • Lead the design of observability, monitoring, and alerting systems to proactively detect and prevent issues
  • Create and enforce best practices for CI/CD, disaster recovery, and service-level objectives (SLOs)
  • Partner closely with engineering and product to ensure new features are reliable and production-ready
  • Mentor engineers and help instill a culture of operational excellence

Nice to Have

  • Experience supporting data pipelines, ML workloads, or complex orchestration systems
  • Familiarity with the data/ML tooling ecosystem (e.g., Airflow, dbt, Spark, Dremio)
  • Previous experience in a startup or high-growth environment

Company Information

Shakudo is an equal opportunity employer and encourages candidates of all backgrounds to apply. We foster diversity and inclusivity and welcome applications from a broad range of backgrounds and experiences.

Skills

Kubernetes
Terraform
Containerization
AWS
Prometheus
Grafana
Datadog
System Design
Networking
Reliability
Incident Response
CI/CD
Disaster Recovery
SLOs

Shakudo

End-to-end platform for AI projects

About Shakudo

Shakudo offers a platform designed to support organizations in developing and managing AI and data-intensive products. Their main product, the Hyperplane platform, facilitates the entire workflow of AI projects, from initial ideas to deployment. It automates the optimization of resources, helping teams select the best configurations without the need for complex setups. This makes it easier for data scientists and AI teams to focus on building and maintaining their models efficiently. Shakudo differentiates itself from competitors by providing a subscription-based service with tiered pricing, allowing organizations to choose the level of access that suits their needs. The goal of Shakudo is to simplify the AI development process, enabling organizations to implement their AI projects more quickly and reliably.

Headquarters: Toronto, Canada
Year Founded: 2021
Total Funding: $9.8M
Company Stage: Series A
Industries: Enterprise Software, AI & Machine Learning
Employees: 11-50

Risks

Increased competition from established AI platform providers like DataRobot.
Potential customer resistance due to learning curve and integration challenges.
Rapid AI advancements may outpace Shakudo's platform updates, risking obsolescence.

Differentiation

Shakudo offers unique compatibility across best-of-breed data tools.
Their Hyperplane platform automates resource optimization for AI projects.
Shakudo provides DevOps-friendly GraphQL APIs for seamless AI solution interaction.

Upsides

Increased demand for AI model interpretability tools enhances Shakudo's platform.
Rise of AI model marketplaces boosts Hyperplane's user engagement and revenue.
Growing trend of federated learning aligns with Shakudo's data stack compatibility.
