Proven experience with chaos engineering or fault injection, ideally in distributed, production-scale environments
Comfortable with iOS platforms and mobile networking, with an understanding of how client-side failures impact backend systems
Strong experience with Swift programming
Strong understanding of resilience patterns (e.g., circuit breakers, bulkheads, timeouts, retries) and system failure modes
Prior involvement in incident postmortems, war games, or reliability reviews
Comfortable building tools or scripts to automate chaos experiments and analyze system behavior under stress
Scientific mindset: you love forming hypotheses, testing limits, and uncovering how systems really behave at the edge
Excited to build a program from scratch, not just join one
Responsibilities
Define the chaos engineering strategy at Goodnotes, including tools, safety practices, and long-term roadmap
Design and run fault injection experiments across mobile and backend systems, targeting failure points in user flows, APIs, and infrastructure components to surface hidden risks
Simulate real-world issues like latency spikes, dependency outages, cascading failures, and resource exhaustion
Build and scale tooling for automating experiments, tracking outcomes, and improving observability
Establish clear guardrails and blast radius controls to ensure experiments are safe, measured, and reversible
Collaborate across engineering teams to identify critical flows, formulate hypotheses, and stress-test assumptions
Facilitate resilience drills and chaos game days, driving cross-team engagement and response readiness
Document findings, communicate insights, translate chaos learnings into actionable improvements, and influence our engineering teams to enact recommended changes
Help shape the future of the chaos engineering function, including mentoring and hiring as the team grows