Site Reliability Engineer
Stitch Fix
Full Time
Mid-level (3 to 4 years)
Candidates should hold a Bachelor's or Master's degree in Computer Science or a related field and have at least 5 years of experience in reliability engineering, QA, or customer-facing engineering. Prior experience operating ClickHouse or another SQL database in production is required, along with a strong understanding of SQL and of distributed database internals, particularly ClickHouse's. Proficiency in scripting with Shell or Python, the ability to read C++ code, and knowledge of a cloud computing platform such as AWS, Azure, or GCP are essential. Strong problem-solving and production debugging skills are also necessary, as are excellent communication and a sense of responsibility, ownership, and accountability.
The Database Reliability Engineer will build and lead processes that ensure and improve the reliability, availability, scalability, and performance of ClickHouse core. This includes collaborating with various teams to guide the implementation of ClickHouse, owning engineering escalation management and response, conducting investigations and post-mortem analyses, and continuously improving how ClickHouse is run and optimized in the cloud. Key duties include enhancing metrics and alerts to prevent production issues, investigating customer-reported problems to identify root causes and submit fixes, refining incident response and post-mortem processes, planning and driving chaos engineering initiatives, and managing on-call processes to resolve performance and reliability issues.
High-speed column-oriented database management system
ClickHouse provides a high-speed, column-oriented database management system designed for developers and businesses that manage large-scale data. Its primary product processes analytical queries quickly by storing the values of each column together, which makes it significantly faster than traditional row-oriented databases in Online Analytical Processing (OLAP) scenarios. ClickHouse stands out from competitors by offering a free, open-source database that can be deployed on local machines or in the cloud, along with a fully managed service on AWS, GCP, and Microsoft Azure. The company's goal is to deliver a cost-effective solution that simplifies data management for its clients, and user feedback highlights substantial cost savings.