Site Reliability Consultant
Location: Costa Rica | Remote | Work from Home
Employment Type: Full Time
Position Overview
Pythian is building a next-generation Site Reliability Engineering team and is seeking motivated and talented individuals to join. As a Site Reliability Consultant, you will act as a technology leader and advisor for clients, as well as a mentor for other team members. Projects will focus on infrastructure architecture, automation, and intelligent monitoring systems, from design through implementation. If you are passionate about data and eager to advance your career, this role is for you.
Responsibilities
- Operate, maintain, and administer solutions to enhance customer infrastructure's operational efficiency, availability, and visibility.
- Plan maintenance activities, create design documentation, and develop standard procedures.
- Provide Root Cause Analysis reports for outages/incidents (ITIL - Problem Management).
- Observe and provide feedback on the current state of client infrastructure, identifying opportunities for improvement in resiliency, incident reduction, and automation of repetitive tasks.
- Contribute to, improve, and maintain team documentation regarding client systems, infrastructure, procedures, policies, and schedules.
- Gather and document information about client environments through audit activities, analyzing it to identify improvement opportunities and best practices.
- Collaborate with teammates to foster continuous improvement in the team's working culture.
- Act as a technology leader for clients and drive discussions on technology roadmaps.
- Participate in an on-call rotation in an escalation capacity.
Requirements
- Experience with Google Cloud and AWS, including infrastructure-as-code deployment (CloudFormation, Terraform, OpsWorks, etc.).
- Proficiency in scripting and automation of administrative tasks using Python and Scala.
- Solid understanding of microservices architecture and container technologies (Kubernetes is a must; Docker, LXC, etc. are also relevant).
- Clear understanding of software development lifecycles and best practices from an infrastructure perspective (PRs, merge, rebase, etc.).
- Understanding of end-to-end operations of a ‘Business System’ versus its components.
- Comprehensive systems hardware and network troubleshooting experience.
- Experience with installation, configuration, performance tuning, and cloud migration of common Linux distributions.
- Knowledge of TCP/IP networking, NIC bonding, and network services configuration (DNS, NTP, DHCP, SMTP, etc.).
- Experience with the operation and administration of virtual infrastructure, including at least one hypervisor (VMware, Hyper-V, KVM, etc.).
- Ability to describe IaaS, PaaS, SaaS, their pros and cons, and use cases for virtualization and cloud.
- Experience with administration of web servers and supporting technologies, including network load balancers.
- Experience in the design, development, and deployment of Puppet-based configuration management.
- Experience investigating system and application errors and troubleshooting access/availability issues, including deep multi-system root cause analysis.
- Experience managing networking devices, such as switches and firewalls, from various vendors.
- Solid understanding of DevOps tools, processes, and culture.
- Ability to quickly learn new technologies.
- Ability to provide accurate schedules and task estimates for work delivery.
What You Get in Return
- Love your career: Competitive total rewards package. Opportunities to blog during work hours and take time off to volunteer for your favorite charity.
- Love your work/life balance: Flexible remote work from home with no daily travel requirements to an office. All you need is a stable internet connection.