Solving an Internal Real-World SRE Issue with Eclipse Trace Compass
Presented by Matthew Khouzam (Ericsson AB) at EclipseCon 2022.
This talk walks through the steps we took to solve a real SRE (Site Reliability Engineering) issue. The problem slowed down several thousand developers but caused no service loss. We will show how good logging and tracing strategies, together with pre-emptive log post-mortems, can save a company hundreds of hours.
This talk recounts real events; names, timestamps, file paths, and IP (Internet Protocol) addresses have been changed for privacy reasons. The issue itself, however, is unchanged and still visible.
The talk is broken down into the following steps:
Get a multi-gigabyte log
Create a parser for it
Save the parser, and share it with a colleague
Analyze the data
Create a custom analysis using Trace Compass’s custom analysis parser
Share the results
Modify Trace Compass slightly to highlight the site issue
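To make the parse-then-analyze steps above concrete, here is a minimal Python sketch of the same idea outside Trace Compass. The log format, field names, and sample lines are hypothetical, invented purely for illustration; the abstract does not describe the real log or the parser used in the talk. It matches each line against a regular expression, then sums request time per path to show where the time goes.

```python
import re
from collections import defaultdict

# Hypothetical line layout (an assumption, not the format from the talk):
#   <timestamp> <client-ip> <METHOD> <path> <status> <duration>ms
LINE_RE = re.compile(
    r"(?P<ts>\S+) (?P<ip>\S+) (?P<method>[A-Z]+) (?P<path>\S+) "
    r"(?P<status>\d{3}) (?P<ms>\d+)ms$"
)


def parse_line(line):
    """Return one event as a dict, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    event["status"] = int(event["status"])
    event["ms"] = int(event["ms"])
    return event


def total_ms_by_path(lines):
    """Sum the request duration per path, skipping unparseable lines."""
    totals = defaultdict(int)
    for line in lines:
        event = parse_line(line)
        if event is not None:
            totals[event["path"]] += event["ms"]
    return dict(totals)


# Made-up sample data standing in for a real log file.
SAMPLE = [
    "2022-10-25T12:00:01Z 10.0.0.1 GET /repo/a.git 200 150ms",
    "2022-10-25T12:00:02Z 10.0.0.2 GET /repo/b.git 200 40ms",
    "this line is noise and is skipped",
    "2022-10-25T12:00:03Z 10.0.0.1 GET /repo/a.git 200 350ms",
]

if __name__ == "__main__":
    ranked = sorted(total_ms_by_path(SAMPLE).items(), key=lambda kv: -kv[1])
    for path, ms in ranked:
        print(path, ms)
```

Because `total_ms_by_path` accepts any iterable of lines, an open file object can be passed directly and is read one line at a time, which matters when the log is several gigabytes.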
By the end, attendees should be able to use Trace Compass to understand problems at the site level, and potentially contribute steps toward solving them.
This talk is not addressed only to system administrators: developers, dev(sec)ops engineers, managers, and anyone interested in knowing how their systems are being used are welcome to attend.