This track covers SLOs and error budgets, incident response, observability/APM, chaos engineering, and capacity planning for keeping systems reliable at scale. It teaches how to define SLOs and error budgets, instrument and observe systems, run effective on-call rotations, and use chaos engineering and capacity planning to prevent outages. The goal is to build resilient services while maintaining delivery speed.
Target Audience
SREs, platform/ops engineers, backend engineers, tech leads, engineering managers, incident commanders.