In-depth experience, knowledge, and skills in own discipline (Site Reliability Engineering)
Ability to work with limited supervision and direction
Ability to independently determine/develop approaches for non-routine solutions
Ability to exercise independent judgment and discretion in matters of significance
Ability to determine own work priorities
Regular, consistent, and punctual attendance
Ability to work nights, weekends, variable schedules, and on-call shifts as necessary
Understanding of Operating Principles and commitment to customer experience, teamwork, and Net Promoter System
Responsibilities
Engineer technical solutions for infrastructure and application management, monitoring, and operations, with a focus on standardization and automation
Collaborate with cross-functional teams to identify and address reliability and performance issues
Provide cybersecurity support including vulnerability cleanup, secure server configuration, testing and validation, technical controls implementation, and incident remediation
Work closely with developers to ensure software releases are well designed, planned, implemented, deployed, and monitored
Measure and improve reliability, quality, and efficiency of platforms
Support incident prevention, response, and retrospectives, and work on-call shifts
Perform complex analytical duties in planning, deployment, testing, and evaluation of products
Contribute to the design and implementation of reliable and scalable infrastructure solutions, applying best practices, appropriate tooling, and quality assurance
Monitor system performance and implement improvements to optimize reliability, availability, production quality, operational efficiency, and engineering productivity
Develop and maintain tools for monitoring, deployment, and operations
Provide subject matter expertise, resolve complex break/fix scenarios, and engage broader teams as necessary
Partner with engineering, vendors, and client services to deliver successful technical solutions
Take responsibility for platform availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning
Design, analyze, and troubleshoot large-scale distributed systems; debug and optimize code; automate routine tasks
Perform other duties and responsibilities as assigned
Skills
SRE
Distributed Systems
Monitoring
Automation
Linux
Python
AWS
Capacity Planning
Troubleshooting
Change Management
Infrastructure as Code
Performance Optimization
Comcast
Comcast Corporation is a global media and technology company.