10+ years of experience in software engineering, DevOps, or systems engineering, with at least 5 years in a senior SRE or systems architecture role
Expert-level knowledge of at least one major cloud provider (AWS, GCP, or Azure), including core services like compute, storage, networking, and managed databases
Deep, hands-on experience designing and managing large-scale Kubernetes clusters and container-based microservices architectures
Proven expertise in architecting infrastructure with Terraform
Proficiency with configuration management tools like Ansible, Chef, or Puppet
Extensive experience designing and implementing monitoring and observability solutions using tools like Prometheus, Grafana, OpenTelemetry, Jaeger, and the ELK Stack (Elasticsearch, Logstash, Kibana), or comparable commercial tools (e.g., Datadog, New Relic)
Strong proficiency in a high-level programming language such as Go or Python
Responsibilities
Architectural Design & Strategy: Design robust, scalable, and fault-tolerant infrastructure and application services on public cloud platforms (AWS, GCP, Azure). Define the long-term vision for system reliability and performance
Reliability Frameworks: Establish and govern the standards for Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets across all engineering teams
Observability & Telemetry: Architect a comprehensive observability strategy. Design the systems for logging, metrics, tracing, and alerting to provide deep insights into system health and facilitate rapid incident response
Automation & Infrastructure as Code (IaC): Lead the strategy for automation and IaC. Design reusable patterns and frameworks using tools like Terraform and Ansible to ensure consistent, repeatable, and secure infrastructure provisioning
Resilience & Chaos Engineering: Proactively identify and mitigate reliability risks. Design and champion the implementation of resilience patterns, disaster recovery plans, and chaos engineering experiments to validate system robustness
Technical Leadership & Mentoring: Act as a thought leader and subject matter expert in reliability engineering. Mentor SREs and developers, evangelize best practices, and lead architectural review sessions to ensure reliability is a core component of every feature
Incident Management Evolution: Analyze major incidents to identify architectural weaknesses and drive the necessary design changes to prevent recurrence. Help evolve postmortem culture and incident response capabilities