Site Reliability Engineer
Goodleap - Full Time
- Junior (1 to 2 years)
Candidates should have 5+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles, with deep expertise in Kubernetes administration and architecture, a strong track record of leading CI/CD and platform engineering initiatives, and experience working on cloud-native infrastructure such as AWS, GCP, or Azure. Advanced experience with monitoring, observability, and logging platforms such as DataDog, New Relic, SumoLogic, or Splunk is required, along with proficiency in at least one programming language such as Python, Ruby, or Go.
The Senior Site Reliability Engineer 3 will lead the design and implementation of complex platform engineering solutions and drive architectural decisions for the CI/CD infrastructure and Kubernetes platform. The role mentors junior team members, provides technical leadership, and develops and implements strategic initiatives to improve developer experience and platform reliability. Responsibilities also include designing and implementing scalable infrastructure automation using Terraform and other IaC tools, leading post-incident reviews and driving systematic improvements, collaborating with other engineering teams globally, championing observability and monitoring best practices, and participating in a 24/7 on-call rotation using PagerDuty.
Incident management and response platform
PagerDuty specializes in incident management and response, providing a platform that helps organizations quickly address IT issues to minimize operational disruptions. The platform integrates with various monitoring tools to detect incidents in real time and alert the right personnel for swift action. This helps reduce downtime and maintain service quality across sectors such as technology, finance, healthcare, and retail. PagerDuty operates on a subscription-based model, with pricing tiers based on user count and feature levels, which provides a steady revenue stream. The company also offers premium support and professional services. Overall, PagerDuty aims to help organizations manage and resolve IT incidents efficiently, ensuring the reliability of their digital services.