About QuickNode
QuickNode is a cloud-based infrastructure company that powers the blockchain ecosystem. Our mission is to be the indispensable utility that empowers companies and innovators globally to build next-generation, Web3-enabled businesses and applications using blockchain technology. QuickNode is backed by some of the world's best investors, including Tiger Global, Y Combinator, SoftBank, and the Seven Seven Six Fund. The QuickNode team has over 120 people maintaining high-performance global data infrastructure that serves billions of requests daily for our customers. We are a global remote company with an HQ in Miami, Florida.
The Role
We’re seeking a seasoned Technical Operations Engineer to ensure the stability, reliability, and performance of our production systems. In this key role, you’ll leverage deep technical expertise, particularly in Web3/blockchain technologies, to manage, optimize, and enhance our platform infrastructure. You’ll drive operational excellence through proactive monitoring, meticulous incident management, innovative problem-solving, and collaborative cross-team initiatives.
What You'll Do
- Blockchain Network Management: Lead the deployment, optimization, and operational management of new blockchain networks. Conduct thorough testing, benchmarking, and continuous improvement of chain reliability and performance.
- Complex Web3 Issue Resolution: Address high-impact Web3 incidents through rigorous troubleshooting, detailed log analysis, JSON-RPC response debugging, and direct coordination with blockchain foundations and ecosystem partners.
- Proactive System Monitoring: Develop and maintain comprehensive monitoring and alerting solutions using advanced dashboards (e.g., Grafana, Datadog), identifying trends, anomalies, and performance bottlenecks before they become critical.
- Incident & SLO Management: Define, implement, and enforce service-level objectives (SLOs) and service-level agreements (SLAs), ensuring measurable standards of system reliability and performance are consistently met.
- Automation & Optimization: Implement and maintain automation solutions (Ansible, Terraform, Kubernetes) to streamline deployments, reduce manual tasks, and optimize cloud infrastructure cost and efficiency.
- Technical Collaboration: Actively collaborate with Tier-1 support, infrastructure, and development teams, ensuring alignment on system improvements, rapid issue resolution, and operational knowledge sharing.
- On-Call Support: Participate in a rotating 24/7 on-call schedule to swiftly address critical system incidents, maintain continuous service delivery, and uphold customer trust.
What You'll Bring
- Minimum of 5 years of experience in Technical Operations, Site Reliability Engineering (SRE), or related roles.
- Proven Linux/Unix system administration and advanced troubleshooting capabilities.
- Deep experience managing complex Web3 infrastructures (RPC services, validator setups, node operations).
- Skilled in interpreting blockchain logs, JSON-RPC responses, and debugging intricate Web3 protocol issues.
- Solid hands-on experience with configuration management and infrastructure automation tools (Helm, Terraform, Ansible, Consul) and with containerization (Docker, Kubernetes), including managing and scaling services in cloud environments.
- Competency in scripting/programming languages (Python, Go, JavaScript).
- Advanced proficiency in monitoring and analytics platforms (Grafana, Datadog), enabling proactive and data-driven operational decision-making.
- Demonstrated ability to identify performance patterns, forecast potential issues, and implement preventive solutions.
- Strong track record of defining, measuring, and maintaining SLAs/SLOs, plus experience with incident response tooling and processes (PagerDuty) that support quick resolution and systematic root-cause analysis.
- Willingness to travel on a limited basis for conferences, offsites, and/or meetings, generally fewer than 10 days per year.
- Exceptional interpersonal and communication skills.
Employment Details
- Employment Type: Full-time
- Location Type: Remote