Senior Site Reliability Engineer at Arcadia

Chennai, Tamil Nadu, India

Compensation: Not Specified
Experience Level: Senior (5 to 8 years)
Job Type: Full Time
Visa: Unknown
Industries: Clean Energy, Technology, SaaS

Requirements

  • Experienced Senior Site Reliability Engineer (L3) with a proven track record of managing production-grade AWS infrastructure, Kubernetes clusters, CI/CD pipelines, and cloud security
  • Hands-on expertise in designing, building, and maintaining AWS infrastructure (EKS, VPC, RDS, IAM, CloudWatch, CloudTrail, GuardDuty, Load Balancers, S3, CloudFront) using Terraform and CloudFormation
  • Deep knowledge of Kubernetes operations including cluster upgrades, performance tuning, CNI troubleshooting, workload scaling, Helm chart packaging, and GitOps deployments
  • Proficiency in CI/CD ecosystems such as Jenkins (Groovy scripting), GitHub Actions, AWS CodePipeline, ArgoCD, and FluxCD
  • Skills in automation and scripting (Python/Bash) to reduce operational toil and improve platform reliability
  • Experience with observability tools like Prometheus, Grafana, Loki, Tempo, Datadog, and CloudWatch for alerting, dashboards, and SLOs/SLIs
  • Knowledge of FinOps practices for cost optimization, tagging, budgeting, and resource right-sizing
  • Expertise in database operations for MySQL and PostgreSQL (backups, performance tuning, replication, runbooks)
  • Familiarity with secret management tools like Vault, AWS Secrets Manager, and Parameter Store
  • Strong cloud security skills including IAM least privilege, CSPM reviews, GuardDuty/CloudTrail monitoring, and environment hardening
  • Ability to troubleshoot complex production issues across networking, Kubernetes, compute, databases, and CI/CD systems
  • Self-starter, hands-on engineer able to dive deep into complex distributed systems and collaborate daily with US-based engineering teams

Responsibilities

  • Design, build, and maintain AWS infrastructure (EKS, VPC, RDS, IAM, CloudWatch, CloudTrail, GuardDuty, Load Balancers, S3, CloudFront) using Terraform and CloudFormation
  • Lead all aspects of Kubernetes operations including cluster upgrades, performance tuning, CNI troubleshooting, workload scaling, Helm chart packaging, and GitOps deployments
  • Own and evolve CI/CD ecosystem across Jenkins (Groovy scripting), GitHub Actions, AWS CodePipeline, ArgoCD, and FluxCD
  • Improve platform reliability by reducing operational toil through automation, scripting (Python/Bash), and proactive system hardening
  • Implement and enhance observability across Prometheus, Grafana, Loki, Tempo, Datadog, and CloudWatch, ensuring actionable alerting, dashboards, and metrics aligned with SLOs/SLIs
  • Drive FinOps initiatives, identifying cost inefficiencies and working with engineering teams to implement best practices, tagging standards, budgeting, and resource right-sizing
  • Manage database operations across MySQL and PostgreSQL including backups, performance tuning, replication, and operational runbooks
  • Maintain and improve secret management using Vault, AWS Secrets Manager, and Parameter Store
  • Strengthen cloud security posture with IAM least privilege, CSPM reviews, audit readiness, GuardDuty/CloudTrail monitoring, and environment hardening
  • Troubleshoot complex production issues across networking, Kubernetes, compute, databases, and CI/CD systems
  • Collaborate daily with US-based teams on incident management

Skills

Key technologies and capabilities for this role

AWS, Kubernetes, SRE, Platform Engineering, Automation, Infrastructure, Observability, DevOps, Security

Questions & Answers

Common questions about this position

Is this Senior Site Reliability Engineer role remote or based in a specific location?

The role is based in India and involves daily collaboration with US-based engineering teams, with the company HQ in Greenwood Village, Colorado. Remote work details are not specified.

What key skills and technologies are required for this Senior SRE position?

Required skills include experience with AWS infrastructure (EKS, VPC, RDS, IAM, etc.), Kubernetes operations (cluster upgrades, performance tuning, GitOps), and CI/CD tools like Jenkins, GitHub Actions, AWS CodePipeline, ArgoCD, and FluxCD. Proficiency in Terraform and CloudFormation for infrastructure management is also essential.

What is the compensation or salary for this role?

This information is not specified in the job description.

What is the company culture like at Arcadia?

Arcadia fosters out-of-the-box thinking and diverse perspectives from different backgrounds, industries, and educational experiences, building a team passionate about clean energy and decarbonization.

What makes a strong candidate for this Senior Site Reliability Engineer role?

A strong candidate is a self-starter and hands-on engineer with a proven track record in managing production AWS infrastructure, Kubernetes, and CI/CD, who can dive deep into distributed systems, automate processes, and collaborate with cross-functional teams.

Arcadia

Data-driven healthcare solutions and analytics

About Arcadia

Arcadia focuses on improving healthcare outcomes through data-driven solutions in the healthcare sector, particularly in population health management. The company analyzes and manages the health outcomes of groups of people, serving clients such as healthcare providers, insurance companies, and government agencies. Its main product is a data platform that uses big data technology to process and store large volumes of healthcare data, allowing organizations to access and analyze this information effectively. This leads to better decision-making and enhanced patient care. Unlike many competitors, Arcadia offers a comprehensive suite of tools and consulting services that help clients optimize their use of the platform, particularly in areas like STARS HEDIS and risk adjustment accuracy. The goal of Arcadia is to improve efficiency in healthcare delivery, reduce disparities, and achieve better health outcomes for populations.

Headquarters: Boston, Massachusetts
Year Founded: 2002
Total Funding: $28.7M
Company Stage: DEBT
Industries: Consulting, Healthcare
Employees: 501-1,000

Benefits

Flexible Work Hours
Unlimited Paid Time Off

Risks

Integration challenges from CareJourney acquisition may disrupt operations.
Departure of former CTO Jonathan Cook could impact Arcadia's innovation.
Intensifying competition in healthcare data analytics threatens Arcadia's market share.

Differentiation

Arcadia integrates CareJourney's market intelligence for comprehensive healthcare insights.
Arcadia's platform offers real-time data analysis for improved healthcare decision-making.
Arcadia's generative AI assistant enhances care team efficiency and reduces burnout.

Upsides

Arcadia's acquisition of CareJourney expands its customer portfolio to nearly 200.
The healthcare data market is projected to triple by 2030, benefiting Arcadia.
Arcadia's AI assistant boosts productivity by reducing data interpretation time.
