Principal Site Reliability Engineer at Global Payments

Dallas, Texas, United States

Compensation: Not Specified

Experience Level: Expert & Leadership (9+ years)

Job Type: Full Time

Visa: Unknown

Industries: Payments Technology, Financial Services

Requirements

Candidates should possess a BS in Computer Science, Information Technology, Business/Management Information Systems, or a related field. At least 8 years of professional experience in coding, designing, developing, and analyzing data is required, along with proficiency in at least two modern enterprise programming languages, experience with various APIs and external services, and familiarity with both relational and NoSQL databases. Experience with public and private clouds, Jenkins, Terraform, Ansible, OpenShift, Kubernetes, or AWS EKS is also required. Preferred qualifications include 10+ years of professional experience and experience with IBM Rational Tools.

Responsibilities

The Principal Site Reliability Engineer will be responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of systems. They will apply a software engineering mindset to system administration, splitting time between operations/on-call duties and developing systems to enhance site reliability and performance. This role involves collaborating with DevOps, Development, and Business partners to gather requirements, participating in architecture and R&D discussions for new technologies, and implementing chaos engineering practices to identify and remediate system failures. The engineer will push systems to their performance limits, design solutions for improvement, and use DevOps and GitOps practices for automation and self-service. Key duties include safeguarding reliability through high availability, disaster resilience, self-monitoring, and self-healing systems; running game days to test reliability assumptions; reviewing designs for platform stability; building systems for proactive monitoring; improving monitoring and alerting systems; troubleshooting system and network issues; mentoring other engineers in reliability skills; evolving the SDLC and tooling for SRE best practices; and developing runbooks and documentation.