Let's walk through how you, as an operations engineer, can identify and solve five real problems using bpftrace, to kickstart you on the way to using this technology to solve your next mystery. Along the way you will learn a "fun" level of detail about how traditional and bpftrace-based tools work at a low level, and how their performance impact compares as a result.
Perhaps even more than you, as a Support Engineer at Canonical I often do not have the luxury of modifying the software under observation. Additionally, many difficult-to-diagnose problems manifest as outlier cases, requiring us to statistically measure and correlate requests in ways that existing tools do not already offer.
Historically this has been challenging, as sampling those outliers requires specific debug/analysis code to be added and systems to be restarted. Instead, dynamic runtime tracing combined with BPF, in the form of bpftrace, allows us to load very small and fast programs into the kernel that run in the hot path and summarise or analyse exactly the events we need, transmitting only a very small amount of data out of the kernel to be analysed in userspace. This instrumentation is installed at runtime, with no changes to the system and without a substantial impact on system latency or performance.
While such programs can be written as more complex and clunky C+Python scripts, 'bpftrace' allows us to write these in a nice Domain Specific Language (DSL) that combines the in-kernel data collection and userspace analysis components into a single coherent script that is writable and understandable even by those who may not be software engineers or kernel experts.
Session author's bio
Trent Lloyd is a member of Canonical's Support Sustaining Engineering team, assisting Ubuntu users on a wide variety of cases, specialising in Ceph, OpenStack and Networking. A long-time passionate speaker and member of the Linux & Open Source community, he spent 9 years in the Global MySQL Support Team before joining Canonical in 2016.
Level of Difficulty: Intermediate