DevOps Engineering Manager
Employment Type: Full-time
About the Engineering Organization
The Engineering Team at Loop is built on the pillars of agility, consistency, and performance. These principles enable the team to consistently deliver value that matters to our customers. Staying close to our customers allows our engineering teams to be the best in our space and bring innovative ideas to market.
About the Role
We are seeking a DevOps Engineering Manager to lead a high-performing DevOps group while maintaining technical engagement. Your primary objective is to foster self-service, empowering every engineer to deploy, observe, and recover their services without requiring ticket escalations. You will be responsible for the evolution of our AWS-based platform, including the design of multi-region architectures, advanced deployment patterns (blue/green, canary, feature flags), and resilient machine learning pipelines. This will ensure Loop can continue to scale traffic and transaction volume efficiently. In collaboration with Staff and Directors, you will help define the vision, mentor the team, and actively contribute by writing Terraform, tuning Kubernetes, and resolving production incidents.
Our Blended Work Environment
Loop uses a Blended Work Environment to optimize our work. We operate from our HQ in Columbus, OH, or from our Hub or Secluded locations, with team members distributed across the United States, select Canadian provinces, and the United Kingdom. For this position, we are looking for candidates located in Columbus, OH; Chicago, IL; Austin, TX; or Los Angeles, CA, or candidates who are fully remote.
Our Tech Stack
- Cloud: AWS Cloud (Kubernetes, Serverless architecture, Redis, Aurora, DynamoDB, identity management)
- Containerization: Docker
- MLOps: MLflow
- CI/CD & Orchestration: GitLab, Airflow
- Programming Languages: PHP/Laravel
- Operating Systems: Linux
- Infrastructure as Code: Terraform
- Monitoring & Observability: Datadog
- Data Warehousing & Transformation: Snowflake, dbt
What You’ll Do
- Drive self-service and automation at Loop by designing golden-path workflows that enable product teams to provision infrastructure, integrate monitoring, and release safely.
- Lead the strategy and execution for scaling to multi-region by implementing active-active/active-standby architectures, cross-region data replication, and global traffic management.
- Champion the evolution of deployment patterns, including blue/green, canary, feature-flag, and immutable-infra releases, to minimize risk and Mean Time To Recovery (MTTR).
- Implement Service Level Objectives (SLOs), error budgets, chaos testing, and auto-remediation playbooks to enhance reliability, and manage the infrastructure on-call rotation and its culture.
- Mentor and develop a diverse team of DevOps engineers, DBAs, and MLOps Engineers, fostering the growth of future engineering leaders.
- Collaborate closely with Product Engineering Teams, our Data Team, and other stakeholders to align roadmaps and improve development velocity.
- Contribute hands-on by writing Terraform modules, optimizing Helm charts, reviewing merge requests, assisting the team with reactive tasks, and participating in high-severity incident calls.
Your Experience
- Leadership: 4+ years of proven leadership experience managing DevOps or Site Reliability Engineering (SRE) teams.
- Hands-on Experience: 7+ years in hands-on platform or infrastructure roles.
- Self-Service Track Record: Demonstrated success in delivering internal platforms or portals that empowered hundreds of engineers to deploy autonomously.
- AWS Expertise: Extensive experience with large-scale AWS, including designing and operating multi-region, high-throughput systems supporting over $100 million in annual Gross Merchandise Value (GMV).
- CI/CD & IaC: Expertise in advanced CI/CD pipelines and Infrastructure as Code (IaC), proficient with tools such as GitLab CI (or similar), Terraform, and Kubernetes. Comfortable introducing progressive delivery, policy-as-code, and secrets management at scale.
- Observability: Deep understanding of metrics, tracing, logging, and alerting, with experience using Datadog or comparable stacks.
- Security: Familiarity with cloud security best practices.