Recently, the team at ContainIQ has been working hard to release an eBPF-based profiler for a number of different languages. Continuous profiling is an important practice because it can help engineering teams spot performance bottlenecks and troubleshoot issues faster.
In building our eBPF-based profiler, we learned a number of new techniques that might be interesting to people who want to implement something similar or who simply want to get started writing eBPF-based programs.
In this post, we’ll recap our methodology and process for building our profiler. The techniques used to profile differ based on the language you’re targeting; this post has a section on techniques for compiled languages and a section on techniques for interpreted languages.
Techniques for Compiled Languages
The process for compiled languages is relatively straightforward and relies on well-known Linux subsystems. This section walks through the steps we used to support compiled languages.
Using the Linux perf_event subsystem, we open a software event (PERF_TYPE_SOFTWARE) with the cpu_clock config (PERF_COUNT_SW_CPU_CLOCK) and attach our eBPF function to it:
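As a minimal sketch of the event setup, the user-space C below fills in a perf_event_attr for CPU-clock sampling. The function name make_cpu_clock_attr and its sample_hz parameter are hypothetical and chosen only for illustration; the struct fields and constants come from the Linux perf_event UAPI header.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>

/* Build the attributes for a CPU-clock sampling event.
 * sample_hz is a hypothetical parameter: samples per second, per CPU. */
static struct perf_event_attr make_cpu_clock_attr(unsigned long long sample_hz)
{
    struct perf_event_attr attr;

    assert(sample_hz > 0);
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;        /* kernel software event */
    attr.config = PERF_COUNT_SW_CPU_CLOCK; /* the cpu_clock config */
    attr.freq = 1;                         /* treat sample_freq as a rate in Hz */
    attr.sample_freq = sample_hz;
    return attr;
}
```

In a real profiler, a struct like this would typically be passed to the perf_event_open system call once per CPU, and the returned file descriptor handed to libbpf (for example through bpf_program__attach_perf_event) so that the eBPF program runs on every sample. The sampling rate itself is a tuning choice, not something prescribed here.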
After attaching our function to the perf event, we can grab the stack and generate a unique stack ID for counting via the bpf_get_stackid helper:
We save the stack trace to a BPF stack trace map (BPF_MAP_TYPE_STACK_TRACE):
And then we use the stack ID to increment a counter for that stack:
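The bookkeeping in the last three steps can be modeled in plain user-space C. This is not BPF code and not our production implementation: the real program calls bpf_get_stackid against a stack trace map and keeps counts in a second BPF map. The names profile, get_stackid, and record_sample, the table sizes, and the FNV-1a hash are all assumptions made for this sketch (the kernel uses its own hash).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_DEPTH 16   /* assumed cap on frames per trace (model only) */
#define NUM_BUCKETS 64 /* assumed number of stack IDs (model only) */

struct trace {
    uint32_t depth;
    uint64_t ips[MAX_DEPTH]; /* instruction pointers, innermost frame first */
};

struct profile {
    struct trace traces[NUM_BUCKETS]; /* stands in for the stack trace map */
    uint8_t used[NUM_BUCKETS];
    uint64_t counts[NUM_BUCKETS];     /* stands in for the counter map */
};

/* FNV-1a over the raw address bytes. */
static uint32_t hash_trace(const uint64_t *ips, uint32_t depth)
{
    uint32_t h = 2166136261u;

    for (uint32_t i = 0; i < depth; i++) {
        for (int b = 0; b < 8; b++) {
            h ^= (uint8_t)(ips[i] >> (8 * b));
            h *= 16777619u;
        }
    }
    return h;
}

/* Return a stack ID for the trace, storing the trace on first sight.
 * Returns -1 if a different trace already occupies the bucket. */
static int get_stackid(struct profile *p, const uint64_t *ips, uint32_t depth)
{
    assert(p != NULL && depth <= MAX_DEPTH);
    uint32_t id = hash_trace(ips, depth) % NUM_BUCKETS;
    struct trace *t = &p->traces[id];

    if (p->used[id]) {
        if (t->depth != depth ||
            memcmp(t->ips, ips, depth * sizeof(*ips)) != 0)
            return -1; /* hash collision with another trace */
        return (int)id;
    }
    t->depth = depth;
    memcpy(t->ips, ips, depth * sizeof(*ips));
    p->used[id] = 1;
    return (int)id;
}

/* One profiler sample: look up the stack ID, then bump its counter. */
static void record_sample(struct profile *p, const uint64_t *ips, uint32_t depth)
{
    int id = get_stackid(p, ips, depth);

    if (id >= 0)
        p->counts[id]++;
}
```

The model expects a zero-initialized struct profile. The point of keying the counter by stack ID rather than by the full trace is that identical stacks collapse into a single small integer, so each sample costs one lookup and one increment.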
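Lastly, in user space, we use a technique called stack walking to resolve the addresses in each stack to symbols, and then convert the result to folded format (one line per stack, with frames separated by semicolons and followed by the sample count) for further analysis.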
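To make the folded format concrete, here is a small sketch that turns one already-symbolized stack into a folded line. It assumes the frames arrive innermost first, which is the order in which stack addresses are recorded, and reverses them so the line reads from the outermost caller to the innermost function. The function name fold_stack and its signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Write "outer;...;inner count" into out.
 * frames[] is innermost-first. Returns 0 on success, -1 if out is too small. */
static int fold_stack(const char **frames, size_t n, unsigned long count,
                      char *out, size_t out_size)
{
    size_t used = 0;

    assert(frames != NULL && n > 0 && out != NULL && out_size > 0);
    for (size_t i = n; i-- > 0;) {
        size_t len = strlen(frames[i]);

        if (used + len + 1 >= out_size) /* frame plus one separator */
            return -1;
        memcpy(out + used, frames[i], len);
        used += len;
        /* ';' between frames, a single space before the count */
        out[used++] = (i > 0) ? ';' : ' ';
    }
    int w = snprintf(out + used, out_size - used, "%lu", count);
    if (w < 0 || (size_t)w >= out_size - used)
        return -1;
    return 0;
}
```

For example, the frames bar, foo, main (innermost first) with a count of 3 produce the line "main;foo;bar 3", which is the shape that flame graph tooling expects as input.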
Stack Walking Examples
Loop through all of the stacks:
Loop through all of the addresses within a given trace:
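The inner loop boils down to mapping each address to the symbol whose range contains it. The sketch below assumes a symbol table that is already sorted by start address (in practice it would usually be built from the binary's ELF symbols, adjusted for where the binary is loaded); the types and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct symbol {
    uint64_t start;   /* first address covered by the symbol */
    uint64_t size;    /* number of bytes covered */
    const char *name;
};

static const char UNKNOWN[] = "[unknown]";

/* Binary search over syms (sorted by start) for the symbol containing addr. */
static const char *resolve(const struct symbol *syms, size_t n, uint64_t addr)
{
    size_t lo = 0, hi = n;

    assert(syms != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (syms[mid].start <= addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* lo is now the index of the first symbol that starts after addr */
    if (lo == 0)
        return UNKNOWN;
    const struct symbol *s = &syms[lo - 1];
    return (addr - s->start < s->size) ? s->name : UNKNOWN;
}

/* Resolve every address of one trace; names[] receives depth entries. */
static void symbolize_trace(const struct symbol *syms, size_t n,
                            const uint64_t *ips, size_t depth,
                            const char **names)
{
    for (size_t i = 0; i < depth; i++)
        names[i] = resolve(syms, n, ips[i]);
}
```

Addresses that fall in a gap between symbols, or below the first symbol, come back as "[unknown]" rather than being attributed to a neighboring function.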
Techniques for Interpreted Languages
Things are more interesting with interpreted languages, where symbol resolution isn’t as easy. Existing profiling solutions for interpreted or JIT-compiled languages usually require one of two things: either the runtime generates a perf map file that correlates symbol addresses with their human-readable names, or the profiler reads the process memory directly, mapping addresses to language-specific structs whose layout differs between versions.
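For context, a perf map file is plain text with one symbol per line in the form "START SIZE name", where START and SIZE are hexadecimal without a 0x prefix and the name is the rest of the line. A minimal parser for one such line might look like the sketch below; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

struct perf_map_entry {
    unsigned long long start; /* start address of the JIT-compiled code */
    unsigned long long size;  /* length of the code in bytes */
    const char *name;         /* points into the parsed line */
};

/* Parse one "START SIZE name" line. Returns 0 on success, -1 otherwise. */
static int parse_perf_map_line(const char *line, struct perf_map_entry *e)
{
    int off = 0;

    assert(line != NULL && e != NULL);
    /* %n records how many characters were consumed before the name */
    if (sscanf(line, "%llx %llx %n", &e->start, &e->size, &off) != 2)
        return -1;
    if (off == 0 || line[off] == '\0')
        return -1; /* no symbol name present */
    e->name = line + off;
    return 0;
}
```

If the line was read with fgets, the name still carries its trailing newline, which a caller would want to strip before use.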
Using eBPF, we can take another approach by using language-specific USDT (user statically-defined tracing) probes. USDT probes are an often, though not always, low-overhead way of deriving specific insights from the application you’re instrumenting. The code below shows how you can leverage USDT probes for specific languages to build out a complete stack.
The first step is to attach to the probe exposed by the language runtime you’d like to instrument. In the examples below, I’m using libbpf to attach to the probes and Ruby as the target language:
After hooking the method entry, we now need to build the stack frame and add it to our maps:
Then we push the stack to user space for further analysis:
When the method returns, we pop it off the stack:
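The push and pop logic can be modeled in plain C as a per-thread shadow stack. This is a user-space sketch of the idea, not the BPF program itself: in a real implementation the state would live in a BPF map (keyed, for instance, by thread ID), and the frame name would be read from the probe's arguments. The type and function names, the size limits, and the "Class#method" naming are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FRAMES 64 /* assumed cap on recorded call depth */
#define NAME_LEN 64   /* assumed cap on a "Class#method" name */

struct frame {
    char name[NAME_LEN];
};

struct shadow_stack {
    size_t depth; /* true call depth; may exceed MAX_FRAMES */
    struct frame frames[MAX_FRAMES];
};

/* Method-entry probe: push a frame. */
static void on_method_entry(struct shadow_stack *s, const char *name)
{
    assert(s != NULL && name != NULL);
    if (s->depth < MAX_FRAMES) {
        size_t len = strlen(name);

        if (len >= NAME_LEN)
            len = NAME_LEN - 1; /* truncate over-long names */
        memcpy(s->frames[s->depth].name, name, len);
        s->frames[s->depth].name[len] = '\0';
    }
    s->depth++; /* keep counting so returns stay balanced past the cap */
}

/* Method-return probe: pop a frame. */
static void on_method_return(struct shadow_stack *s)
{
    assert(s != NULL);
    if (s->depth > 0) /* ignore returns seen without a matching entry */
        s->depth--;
}
```

Tracking the true depth even when frames are no longer recorded keeps entries and returns paired up in deep recursion. Ignoring unmatched returns covers the case where the profiler attaches while methods are already on the stack.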
Once we have the information in user space, we iterate over the stacks to generate counts and convert them into folded format for further analysis.
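One simple way to generate the counts is to sort the folded stack strings and count runs of equal entries. The sketch below assumes each sample has already been rendered as a semicolon-separated string; the names folded_count and count_stacks are hypothetical, and a hash map would work just as well.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct folded_count {
    const char *stack;   /* "outer;...;inner" */
    unsigned long count; /* number of samples with this exact stack */
};

static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort stacks in place, then write one entry per distinct stack to out,
 * which must have room for n entries. Returns the number of distinct stacks. */
static size_t count_stacks(const char **stacks, size_t n,
                           struct folded_count *out)
{
    size_t distinct = 0;

    assert(stacks != NULL && out != NULL);
    qsort(stacks, n, sizeof(*stacks), cmp_str);
    for (size_t i = 0; i < n; i++) {
        if (distinct > 0 &&
            strcmp(out[distinct - 1].stack, stacks[i]) == 0) {
            out[distinct - 1].count++; /* same stack as the previous sample */
        } else {
            out[distinct].stack = stacks[i];
            out[distinct].count = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Printing each entry as the stack, a space, and the count yields the folded output described above.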
The performance overhead of the function entry and exit probes is, as expected, relatively high, because a probe fires on every method call and return. Without further modifications, the code above can slow your application down significantly.
Internally, we developed several methods that make the above implementation more performant. In the next post in this series, we’ll outline the performance improvements necessary to make this implementation feasible.
Questions, comments, or improvements? Reach out to email@example.com.