Experience Level: Mid-level (3 to 4 years), Senior (5 to 8 years)
Job Type: Full Time
Visa: Unknown
Industries: Telecommunications, Media, Technology
Requirements
2-5 years of hands-on experience working in Incident & Problem Management
Familiarity with Site Reliability Engineering (SRE) principles
Exceptional written and oral communication skills, with the ability to articulate complex, emergent situations clearly to all levels of the organization
Hands-on experience configuring alerts using platforms like Grafana, Prometheus, and Kibana
Familiarity with technologies like IP networking, databases, application architectures, load balancers, APIs, microservices, web services (SOAP/XML), and Python
Experience working in a public cloud environment (AWS, Azure) is helpful
Ability to work in a fast-paced 24x7 technical operations environment
Compulsory adherence to Return to Office Policy
Must be able to work variable schedule(s) & days as necessary
Attains all relevant industry-standard technical certifications
Responsibilities
Serves as an effective Incident Manager during outages who has the respect of management and engineering fix agents
Runs effective bridge management and incident communications, driving the incident toward mitigation
Identifies opportunities to close observability gaps and works with DevOps on alert configuration and fine-tuning, alert onboarding, and alert noise suppression
Acts as a thought leader, technical expert and first point of reference for leading practices in reliability engineering
Develops strong technical knowledge of the applications and services
Partners with Engineering and Deployment peers to drive rigorous root cause investigations and action items which improve system availability and resiliency
Collaborates with development teams to understand application changes and identify potential issues that may arise, create implementation and back-out plans, and oversee the implementation during the scheduled maintenance window
Analyzes data and metrics, identifies problem areas and provides actionable insight to management
Provides input to engineering and vendors on defects and required enhancements
Performs complex and routine maintenance tests for designated areas of engineering. Identifies and isolates issues. Ensures that all maintenance is properly validated to reduce subscriber impact, ideally to zero
Skills
Incident Management
Problem Management
Change Management
Bridge Management
Incident Communications
Observability
OSS/BSS
Middleware
Reliability Engineering
Technical Leadership
Comcast
Comcast Corporation is a global media and technology company.