In-depth experience, knowledge, and skills in own discipline (system analysis, machine learning, AI)
Experience planning and designing new software and web applications
Experience analyzing, testing, and integrating new applications
Experience documenting development activity and training non-technical personnel
Ability to determine own work priorities and act as a resource for colleagues
Proficiency in machine learning models for big data processing in cloud environments (AWS, Google Cloud, Azure)
Proficiency in NLP and ML techniques for logs and unstructured data
Proficiency in Python, SQL, and time-series query languages (e.g., PromQL)
Experience with real-time and batch data pipelines for alerts, metrics, traces, and logs
Experience deploying models via API integrations and automating workflows
Experience with A/B testing, offline validation, and performance monitoring of ML models
Experience building dashboards, visualizations, and reporting
Proactive, solution-oriented mindset with the ability to navigate ambiguity and learn quickly
Willingness to participate in on-call rotations
Responsibilities
Analyze, design, and tune machine learning models for big data processing in cloud environments (AWS, Google Cloud, Azure), using system analysis methods aligned with established design patterns
Perform system testing and quality assurance with oversight of quality engineering
Apply NLP and ML techniques to classify and structure logs and unstructured alert messages
Develop and maintain real-time and batch data pipelines to process alerts, metrics, traces, and logs
Use Python, SQL, and time-series query languages (e.g., PromQL) to manipulate and analyze operational data
Collaborate with engineering teams (Platform Engineers, SREs, Incident Managers, Operators, Developers) to deploy models via API integrations, automate workflows, and ensure production readiness
Contribute to self-healing automation, diagnostics, and ML-powered decision triggers
Design and validate entropy-based prioritization models to reduce alert fatigue and elevate critical signals
Conduct A/B testing, offline validation, and live performance monitoring of ML models
Build and share clear dashboards, visualizations, and reporting views for SREs, engineers, and leadership
Research and diagnose complex application problems and identify system improvements in an enterprise environment
Test systems regularly to ensure quality and function, and write instruction manuals
Collaborate on design of hybrid ML/AI + rule-based systems for dynamic correlation and intelligent alert grouping
Document business processes and change algorithms to support continuous improvement and to assess complexity in patterns
Prepare cost-benefit analyses of system platforms, features, and the value chain, with recommendations on underused features
Skills
System Analysis
Software Design
Web Applications
Application Integration
Testing
Machine Learning
Artificial Intelligence
Observability
SRE
Anomaly Detection
Event Correlation
IT Operations
Comcast
Comcast Corporation is a global media and technology company.