Network Proficiency: Cisco DevNet certification or equivalent. Strong understanding of static routing and BGP
Cloud Experience: Demonstrated experience with at least one public cloud provider, preferably Azure or AWS
CI/CD & Release Management: Proficiency with CI/CD pipeline management, particularly using Azure DevOps or other Git-based tools for continuous integration, delivery, and deployment
Infrastructure Provisioning: Extensive experience with Terraform for automated infrastructure provisioning. Familiarity with Ansible for automated infrastructure configuration is desirable
Scripting and Automation: Skilled in scripting with Python to automate network tasks, build integrations, and manage workflows. Strong understanding of REST APIs to create and integrate network automation solutions. Understanding of Docker, NETCONF/YANG, Linux, API programming, JSON, XML, and GitHub
Monitoring and Observability: Strong understanding of monitoring, logging, and observability tools
Responsibilities
Lead the SRE function, applying agile and SRE principles to deliver automation of on-prem and cloud infrastructure
Collaborate closely with the Engineering teams to ensure smooth integration of network services into broader infrastructure pipelines
Be hands-on, improve operational efficiency, and develop a vision that leads to our DevOps team's long-term success
Design and manage a network configuration code library that deploys secure network infrastructure via CI/CD pipelines
Define, plan, and execute strategic roadmaps for self-service, highly scalable, cost-efficient, observable, auditable, and reliable infrastructure services as standard practice, including DevOps and automation
Develop scripts to streamline and automate network tasks and augment our monitoring tools, using Python
Implement and manage CI/CD pipelines, with a strong focus on automation, using Azure DevOps
Develop, test, and manage infrastructure as code (IaC) and maintain accurate configuration management using best practices
Provide support for network-related incidents, identify root causes, and implement preventative measures with automation
Incorporate SRE-centric principles to ensure the reliability, performance, and scalability of large-scale infrastructure
Develop an SRE program geared towards reducing incident count and achieving 100% network uptime
Automate disaster recovery plans
Keep systems up and reliable, mitigate failures, and prevent future disruptions
Maintain production stability and respond to on-call incidents
Partner with Engineering teams to develop post-change and post-build validation and checkout processes