Site Reliability Engineering (SRE) Architect - CRL - Germany at Infosys

Munich, Bavaria, Germany

Compensation: Not specified
Experience Level: Senior (5 to 8 years), Expert & Leadership (9+ years)
Job Type: Full Time
Visa: Unknown
Industries: Technology

Requirements

  • 10+ years of experience in software engineering, DevOps, or systems engineering, with at least 5 years in a senior SRE or systems architecture role
  • Expert-level knowledge of at least one major cloud provider (AWS, GCP, or Azure), including core services like compute, storage, networking, and managed databases
  • Deep, hands-on experience designing and managing large-scale Kubernetes clusters and container-based microservices architectures
  • Proven expertise in architecting infrastructure with Terraform
  • Proficiency with configuration management tools like Ansible, Chef, or Puppet
  • Extensive experience designing and implementing monitoring and observability solutions using tools like Prometheus, Grafana, OpenTelemetry, Jaeger, and the ELK Stack (Elasticsearch, Logstash, Kibana) or similar commercial tools (e.g., Datadog, New Relic)
  • Strong proficiency in a high-level programming language such as Go or Python

Responsibilities

  • Architectural Design & Strategy: Design and architect robust, scalable, and fault-tolerant infrastructure and application services on public cloud platforms (AWS, GCP, Azure). Define the long-term vision for system reliability and performance.
  • Reliability Frameworks: Establish and govern the standards for Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets across all engineering teams.
  • Observability & Telemetry: Architect a comprehensive observability strategy. Design the systems for logging, metrics, tracing, and alerting to provide deep insight into system health and enable rapid incident response.
  • Automation & Infrastructure as Code (IaC): Lead the strategy for automation and IaC. Design reusable patterns and frameworks using tools like Terraform and Ansible to ensure consistent, repeatable, and secure infrastructure provisioning.
  • Resilience & Chaos Engineering: Proactively identify and mitigate reliability risks. Design and champion the implementation of resilience patterns, disaster recovery plans, and chaos engineering experiments to validate system robustness.
  • Technical Leadership & Mentoring: Act as a thought leader and subject matter expert in reliability engineering. Mentor SREs and developers, evangelize best practices, and lead architectural review sessions to ensure reliability is a core component of every feature.
  • Incident Management Evolution: Analyze major incidents to identify architectural weaknesses and drive the design changes needed to prevent recurrence. Help evolve postmortem culture and incident response capabilities.

Skills

AWS
GCP
Azure
Terraform
SLOs
SLIs
Error Budgets
Observability
Logging
Metrics
Tracing
Alerting
IaC

Infosys

Global consulting & IT services

About Infosys

Headquarters: N/A
Year Founded: 1981
Company Stage: N/A
Employees: 10,001+
