Debugging Production: eBPF Chaos

Key Takeaways

  • eBPF provides access to observability data that is otherwise hard to fetch in microservice container environments.
  • Developers benefit from auto-instrumentation for performance monitoring, profiling, and tracing.
  • Tools and platforms need verification: break production in a controlled way with chaos engineering to verify eBPF-enabled workflows.
  • Security observability with eBPF is a cat-and-mouse game, with malicious actors learning to circumvent security policies.
  • Chaos engineering can benefit from using eBPF probes to manipulate behavior.

This article shares insights into learning eBPF, a new cloud-native technology that aims to improve Observability and Security workflows. The entry barriers can feel huge, and many steps can be required before eBPF tools help with debugging in production. Learn how to practice with existing tools and tackle challenges with observability storage, alerts, and dashboards. The best tools are similar to a backup - if not verified to be working, they are useless. You’ll learn how chaos engineering can help, and get insight into eBPF-based observability and security use cases. Breaking these tools in a professional way also inspires new ideas for chaos engineering itself. Remaining work and risks are discussed too, followed by a wishlist for future improvements to debugging in production.

Getting started with eBPF

There are different ways to start using eBPF programs and tools, and the many articles and suggestions can feel overwhelming. As a first step, define a use case and a problem to solve. Would it be helpful to get more low-level monitoring metrics to troubleshoot production incidents faster? Maybe there is uncertainty about the security in a Kubernetes cluster; are there ways to observe and mitigate malicious behaviors? Last but not least, consider microservice container environments: eBPF is a good way to inspect behavior and data there, where gaining access is often more complicated than reading files on a virtual machine running a monolithic application.

Debug and troubleshoot production: eBPF use cases

Let's look into practical use cases and tools to get inspired for debugging situations, and also figure out how to verify that they are working properly.

Observability can benefit from eBPF with additional data collection that gets converted into metrics and traces. Low-level kernel metrics can be collected with a custom Prometheus Exporter, for example. A more developer-focused observability approach is auto-instrumentation of source code to gain performance insights into applications; Pixie provides this functionality through a scriptable language for developers. Coroot implements Kubernetes service maps using eBPF, tracking the traffic between containers, and provides Ops/SRE production health insights. Continuous Profiling with Parca is made possible with function symbol unwinding techniques to trace calls and performance inside the application code.

Security is another key area where eBPF can help: detecting unauthorized syscalls, connections that call home, or even syscalls hooked by malicious actors like Bitcoin miners or rootkits. A few examples: Cilium provides security observability and prevention with Tetragon. Tracee from Aqua Security can be used on the CLI and in CI/CD to detect malicious behavior. Falco is a Kubernetes threat detection engine, and is likely the most mature solution in the cloud-native ecosystem. A special use case was created by the GitLab security team to scan package dependency install scripts (package.json with Node.js, for example) for malicious behavior to help prevent supply chain attacks.

SRE/DevOps teams require the right tools to debug fast, and to see root causes from a more global view. Toolsets like Inspektor Gadget help in Kubernetes to trace outgoing connections, debug DNS requests, and more. Caretta can be used to visualize service dependency maps. Distributing eBPF programs in a safe way also requires new ideas: Bumblebee makes it easier by leveraging the OCI container format to package the eBPF programs for distribution.

Observability challenges

There are different observability data types that help debug production incidents, analyze application performance, answer known questions, and uncover potential unknown unknowns. This requires different storage backends, leading to DIY fatigue. eBPF probe data introduces additional types and sources that cannot be converted to existing data formats, which adds complexity. Unified Observability datastores that scale to store large amounts of data will be needed.

What data do I really need to solve this incident? Troubleshoot a software regression? Finding the best retention period is hard. Self-hosting the storage backend is hard, too, but SaaS might be too expensive. Additionally, cost efficiency and capacity planning will be needed. This can also help estimate the storage growth for Observability data in the future. The GitLab infrastructure team built the open-source project Tamland to help with capacity planning and forecasting.
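
To make the storage growth estimate more tangible, here is a minimal sketch that fits a straight line to daily usage samples and estimates when a volume would fill up. It does not describe how Tamland works; the function names, the sample numbers, and the assumption of linear growth are illustrative only.

```python
from statistics import mean


def fit_linear_trend(daily_usage_gb):
    """Least-squares fit of usage = intercept + slope * day_index."""
    n = len(daily_usage_gb)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    days = range(n)
    day_mean = mean(days)
    usage_mean = mean(daily_usage_gb)
    covariance = sum(
        (d - day_mean) * (u - usage_mean) for d, u in zip(days, daily_usage_gb)
    )
    variance = sum((d - day_mean) ** 2 for d in days)
    slope = covariance / variance
    intercept = usage_mean - slope * day_mean
    return slope, intercept


def days_until_full(daily_usage_gb, capacity_gb):
    """Days after the last sample until the trend line reaches capacity.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = fit_linear_trend(daily_usage_gb)
    if slope <= 0:
        return None
    last_day = len(daily_usage_gb) - 1
    day_full = (capacity_gb - intercept) / slope
    return max(0.0, day_full - last_day)


# Hypothetical samples: observability storage grows by 10 GB per day.
remaining = days_until_full([100, 110, 120, 130, 140], capacity_gb=200)
```

A straight line is the simplest possible model; real retention policies, compaction, and seasonal traffic would need a more careful forecast.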

More benefits and insights gained from eBPF data also require integration into alerts and dashboards. How can we reduce the number of public-facing incidents, and detect and fix problems with more insights available? The overall health state, cross-references to incident management data, and "all production data" are needed to be able to run anomaly detection, forecast, and trend calculations - because in the world of microservices, there is no answer to, "Is my service OK?" anymore.

A green dashboard with all thresholds showing "OK" does not prove that alerts are working, or that critical situations can be resolved fast enough from dashboard insights. The data collection itself might be affected or broken too, with an eBPF program not behaving as intended, or a quirky kernel bug hitting production. This brings new and old ideas to simulate production problems: Chaos Engineering. Break things in a controlled way, and verify the service level objectives (SLOs), alerts, and dashboards. The chaos frameworks can be extended with your own experiments, for example Chaos Mesh for cloud-native chaos engineering, or the Chaos Toolkit integrated into CI/CD workflows. The patterns from chaos engineering can also be used for injecting unexpected behavior, and running security tests, because everything behaves differently with random chaos, even security exploits.
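
The verification step can be sketched as a small check that runs after a chaos experiment: did the SLO get breached, and if so, did an alert fire in time? The following is a minimal sketch under stated assumptions; the `ChaosRun` record, the default SLO target, and the alert delay threshold are hypothetical and not taken from any chaos framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChaosRun:
    total_requests: int
    failed_requests: int
    chaos_started_at: float          # seconds on the experiment clock
    alert_fired_at: Optional[float]  # None if no alert fired


def evaluate_run(run, slo_target=0.999, max_alert_delay_s=120.0):
    """Summarize whether the SLO held and whether alerting reacted in time."""
    if run.total_requests <= 0:
        raise ValueError("the experiment recorded no requests")
    availability = 1.0 - run.failed_requests / run.total_requests
    slo_breached = availability < slo_target
    alert_in_time = (
        run.alert_fired_at is not None
        and 0.0 <= run.alert_fired_at - run.chaos_started_at <= max_alert_delay_s
    )
    return {
        "availability": availability,
        "slo_breached": slo_breached,
        "alert_in_time": alert_in_time,
        # A breached SLO without a timely alert is the gap a green dashboard hides.
        "alerting_gap": slo_breached and not alert_in_time,
    }


# Hypothetical run: 5% of requests failed and nobody was alerted.
report = evaluate_run(ChaosRun(1000, 50, chaos_started_at=10.0, alert_fired_at=None))
```

In this hypothetical run, `report["alerting_gap"]` is true, which is exactly the finding a chaos experiment should surface before a real incident does.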

Let’s break eBPF Observability

Tools and platforms based on eBPF provide great insights, and help debug production incidents. These tools and platforms need to prove their strengths and unveil their weaknesses, for example, by attempting to break or attack the infrastructure environments and observing the tool/platform behavior. As a first step, let’s focus on Observability and chaos engineering. The Golden Signals (Latency, Traffic, Errors, Saturation) can be verified using existing chaos experiments that inject CPU/Memory stress tests, TCP delays, random DNS responses, etc.

Another practical example is the collection of low-level system metrics which includes CPU, IO, memory. This can be performed using the eBPF events to Prometheus exporter. In order to verify the received metrics, existing chaos experiments for each type (CPU, IO, memory stress, delays) can help to see how the system behaves, and if the collected data is valid.
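
One way to check whether the collected data is valid is to compare a scrape taken before the experiment with one taken while the stress test runs. The sketch below parses a simplified subset of the Prometheus text format (one sample per line, no timestamps) and lists the series that did not react. The metric names and the 1.5x threshold are hypothetical, and the ratio check only makes sense for gauge-like values.

```python
def parse_exposition(text):
    """Parse a simplified subset of the Prometheus text format into {series: value}.

    Assumes one sample per line without timestamps; comment lines are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is everything after the last space on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def series_that_did_not_react(baseline, during_chaos, expected_series, min_ratio=1.5):
    """Return the expected series that are missing or did not grow by min_ratio."""
    silent = []
    for series in expected_series:
        before = baseline.get(series)
        after = during_chaos.get(series)
        if before is None or after is None:
            silent.append(series)
        elif after <= before or after < before * min_ratio:
            silent.append(series)
    return silent


# Hypothetical scrapes before and during a CPU stress experiment.
baseline = parse_exposition(
    '# HELP demo_cpu_busy_ratio hypothetical gauge\n'
    'demo_cpu_busy_ratio{cpu="0"} 0.2\n'
    'demo_io_wait_ratio 0.05\n'
)
during = parse_exposition(
    'demo_cpu_busy_ratio{cpu="0"} 0.9\n'
    'demo_io_wait_ratio 0.05\n'
)
silent = series_that_did_not_react(
    baseline, during, ['demo_cpu_busy_ratio{cpu="0"}', 'demo_io_wait_ratio']
)
```

A series that stays silent under a matching experiment is a hint that either the experiment or the data collection is not doing what it should.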

Developers can benefit from Pixie, which provides auto-instrumentation of application code, and also creates service maps inside a Kubernetes cluster. In order to verify that the maps show correct data and that the traces reveal performance bottlenecks, add chaos experiments that run stress tests and network attacks. Then it is possible to see specifically how the service maps and traces change over time, and take action on identified problematic behavior, before a production incident with user-facing problems unveils it.

For SREs, Kubernetes troubleshooting can be helped by installing the Inspektor Gadget tool collection to overcome the limitations of container runtime blackboxes. Inspektor Gadget uses eBPF to collect events and metrics from container communication paths, and maps low-level Linux resources to high-level Kubernetes concepts. There are plenty of gadgets available for DNS, network access, and out-of-memory analysis. The tool’s website categorizes them into advise, audit, profile, snapshot, top, and trace gadgets. You can even visualize the usage and performance of other running eBPF programs using the "top ebpf" gadget. The recommended way of testing their functionality is to isolate a tool/command, and run a matching chaos experiment, for example a DNS chaos experiment that returns random or no responses to DNS requests.

More visual Kubernetes troubleshooting and observability can be achieved by installing Coroot, and using its service map auto-discovery feature. The visual service map uses eBPF to trace container network connections, and aggregates individual containers into applications by using metrics from the kube-state-metrics Prometheus exporter. The service map in Coroot is a great target for chaos experiments - consider simulating broken or delayed TCP connections that cause services to vanish from the service map, or increase the bandwidth with network attack simulation and verify the dashboards under incident conditions. OOM kills can be detected by Coroot too, using the underlying Prometheus monitoring metrics - a perfect candidate for applications that leak memory. A demo application that leaks memory, but only when DNS fails, is available in my "Confidence with Chaos for your Kubernetes Observability" talk to test this scenario specifically.
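
Whether services vanish from a service map as expected can be checked by comparing the map's edges before and during the experiment. The sketch below is not tied to Coroot's API or data model; it assumes the edges have already been exported as (source, destination) pairs, and the service names are hypothetical.

```python
def diff_service_map(before, after):
    """Compare two sets of (source, destination) edges."""
    vanished = sorted(before - after)
    appeared = sorted(after - before)
    return vanished, appeared


def verify_experiment(before, after, expected_vanished):
    """Check that exactly the connections targeted by the experiment disappeared."""
    vanished, appeared = diff_service_map(before, after)
    return {
        # Edges that disappeared although the experiment did not target them.
        "unexpected_losses": [e for e in vanished if e not in expected_vanished],
        # Targeted edges that are still shown on the map.
        "not_observed": [e for e in sorted(expected_vanished) if e not in vanished],
        "new_edges": appeared,
    }


# Hypothetical service map before and during a TCP chaos experiment on redis.
before = {("frontend", "cart"), ("cart", "redis"), ("frontend", "catalog")}
after = {("frontend", "cart"), ("frontend", "catalog")}
result = verify_experiment(before, after, expected_vanished={("cart", "redis")})
```

Empty lists in all three fields mean the map reflected the experiment and nothing else changed.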

Continuous Profiling with Parca uses eBPF to auto-instrument code, so that developers don’t need to modify the code to add profiling calls, helping them to focus. The Parca agent generates profiling data insights into callstacks, function call times, and generally helps to identify performance bottlenecks in applications. Adding CPU/Memory stress tests influences the application behavior, can unveil race conditions and deadlocks, and helps to get an idea of what we are actually trying to optimize.

OpenTelemetry supports metrics next to traces and logs as a data format specification. There is a new project that provides eBPF collectors for the kernel, inside a Kubernetes cluster, or on a hyperscaler cloud. The different collectors send the metrics events to the Reducer (data ingestor), which either exposes the metrics as a scrape endpoint for Prometheus, or sends them using gRPC to an OpenTelemetry collector endpoint. Chaos experiments can be added in the known ways: stress testing the systems to see how the metrics change over time.

Last but not least - some use cases involve custom DNS servers running as eBPF programs in high-performance networks. Breaking DNS requests can help shed light on their behavior too - it is always DNS.

Changing sides: breaking eBPF security

Let’s change sides and try to break the eBPF security tools and methods. One way is to inject behavioral data that simulates privilege escalation, and observe how the tools react. Another idea involves exploiting multi-tenancy environments that require data separation, by simulating unwanted access.

Impersonating the attacker is often hard, and when someone mentioned "tracing syscalls, hunting rootkits event", this got my immediate attention. There are a few results when searching for Linux rootkits, and it can be helpful to understand their methods to build potential attack impersonation scenarios. Searching the internet for syscall hooking leads to more resources, including a talk by the Tracee maintainers about hunting rootkits with Tracee, with practical insights using the Diamorphine rootkit.

Before you continue reading and trying the examples, don’t try this in production. Create an isolated test VM, download and build the rootkit, and then load the kernel module. It will hide itself and do everything to compromise the system. Delete the VM after tests.

The Tracee CLI was able to detect the syscall hooking. The following command runs Tracee in Docker, in a privileged container that requires a few mounted volumes and environment variables, and the event to trace, 'hooked_syscalls'.

$ docker run \
  --name tracee --rm -it \
  --pid=host --cgroupns=host --privileged \
  -v /etc/os-release:/etc/os-release-host:ro \
  -v /sys/kernel/security:/sys/kernel/security:ro \
  -v /boot/config-`uname -r`:/boot/config-`uname -r`:ro \
  -e LIBBPFGO_OSRELEASE_FILE=/etc/os-release-host \
  -e TRACEE_EBPF_ONLY=1 \
  aquasec/tracee:0.10.0 \
  --trace event=hooked_syscalls

The question is how to create a chaos experiment from a rootkit? It is not a reliable chaos test for production environments, but the getdents syscall hooking method from the Diamorphine rootkit could be simulated in production, to verify if alarms are triggered.

Cilium Tetragon works in similar ways to detect malicious behavior with the help of eBPF, and showed new insights into the rootkit behavior. The detection and rules engine was able to show that the rootkit’s overnight activities expanded to spawning randomly named processes on a given port.

$ docker run --name tetragon \
   --rm -it -d --pid=host \
   --cgroupns=host --privileged \
   -v /sys/kernel/btf/vmlinux:/var/lib/tetragon/btf \
   quay.io/cilium/tetragon:v0.8.3 \
   bash -c "/usr/bin/tetragon"

Let’s imagine a more practical scenario: a bitcoin miner malware that runs in cloud VMs, Kubernetes clusters in production, and in CI/CD deployment environments for testing, staging, etc. Detecting these patterns is one part of the problem - intrusion prevention is another, as yet unanswered question. Installing the rootkit as a production chaos experiment is still not recommended - but mimicking the syscall loading overrides in an eBPF program can help with testing.

This brings me to a new proposal: create chaos test rootkits that do nothing but simulate. For example, hook into the getdents syscalls for directory file listings, and then verify that all security tools detect the simulated security issue. If possible, simulate more hooking attempts from previously learned attacks. This could also be an interesting use case for training AI/ML models, and provide additional simulated attacks to verify eBPF security tools and platforms.
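
To illustrate the idea without touching the kernel, the sketch below simulates the effect of a hooked directory listing purely in user space: entries with a marker prefix are filtered from the view, and a cross-view comparison reveals what was hidden. It is neither an eBPF program nor a kernel module and hooks nothing; the prefix and the file names are hypothetical.

```python
HIDDEN_PREFIX = "chaos_hidden_"  # hypothetical marker, not taken from a real rootkit


def hooked_listing(entries, prefix=HIDDEN_PREFIX):
    """Mimic the effect of a hooked getdents: drop entries with the marker prefix."""
    return [entry for entry in entries if not entry.startswith(prefix)]


def detect_hidden_entries(raw_entries, observed_entries):
    """Cross-view check: entries in the raw view that the observed view lacks."""
    return sorted(set(raw_entries) - set(observed_entries))


# Hypothetical directory content as the filesystem would report it.
raw = ["app.log", "chaos_hidden_payload", "config.yaml"]
observed = hooked_listing(raw)
hidden = detect_hidden_entries(raw, observed)
```

A chaos test built on this pattern would count as passed only if the security tool under test raises an alarm for the simulated hiding.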

How Chaos engineering can benefit from eBPF

While working on my QCon London talk, I thought of eBPF as a way to collect and inject data for chaos experiments. If eBPF allows us to access low-level in-kernel information, we can also change the data and simulate production incidents. There is a research paper, "Maximizing error injection realism for chaos engineering with system calls", which introduces the Phoebe project. It captures and overrides system calls with the help of eBPF.

Existing chaos experiments happen on the user level. DNS Chaos in Chaos Mesh, for example, is injected into CoreDNS, which handles all DNS requests in a Kubernetes cluster. What if there was an eBPF program sitting on the kernel level that hooks into DNS requests before they reach the user space? It could perform DNS request analysis and inject chaos by returning wrong responses for the resolver requests. Some work has already been done with the Xpress DNS project, which is an experimental DNS server written in BPF for high-throughput and low-latency DNS responses. The user space application can add/change DNS records in a BPF map which is read by the kernel eBPF program. This can be an entry point for a new chaos experiment with DNS and eBPF.
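
The division of labor described above can be modeled in a few lines: a lookup function stands in for the kernel-side program, and a user-space controller rewrites the shared record map to inject chaos and restores it afterwards. This is a plain Python simulation, not the Xpress DNS API; a dict stands in for the BPF map, and all names and addresses are hypothetical.

```python
import random


def respond(record_map, name):
    """Stand-in for the kernel-side lookup: mapped address, or None for no answer."""
    return record_map.get(name)


class DnsChaos:
    """User-space controller that temporarily rewrites records in the map."""

    def __init__(self, record_map, seed=None):
        self.record_map = record_map
        self._saved = {}
        self._rng = random.Random(seed)

    def _remember(self, name):
        # Keep only the first original so repeated injections restore correctly.
        if name not in self._saved:
            self._saved[name] = self.record_map.get(name)

    def inject_wrong_answer(self, name):
        """Point the name at a random address from the 203.0.113.0/24 documentation range."""
        self._remember(name)
        self.record_map[name] = f"203.0.113.{self._rng.randint(1, 254)}"

    def inject_no_answer(self, name):
        """Remove the record so lookups return nothing."""
        self._remember(name)
        self.record_map.pop(name, None)

    def restore(self):
        """Put every touched record back to its original state."""
        for name, original in self._saved.items():
            if original is None:
                self.record_map.pop(name, None)
            else:
                self.record_map[name] = original
        self._saved.clear()


# Hypothetical record map shared between "user space" and the "kernel" lookup.
records = {"api.internal.example": "10.0.0.5"}
chaos = DnsChaos(records, seed=1)
```

Calling `chaos.inject_wrong_answer("api.internal.example")` changes what `respond` returns until `chaos.restore()` is called, which keeps the experiment bounded and reversible.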

Summarizing all the ideas with eBPF chaos injection, new chaos experiments can be created to simulate rootkit behavior and call-home connections to verify security observability and enforcement. Intercepting traffic to cause delays or wrong answers for TCP/UDP and DNS, together with CPU stress ideas, can help verify reliability and observability.

Chaos eBPF: we’ve got work to do

The advantages and benefits of eBPF sound clear, but what is missing, where do we need to invest in the future? Which risks should we be aware of?

One focus area is DevSecOps and the SDLC: treat eBPF programs like any other code that needs to be compiled, tested, and validated with code quality checks, security scanning, and checks for potential performance problems. We also need to avoid potential supply chain attacks. Given the complex nature of eBPF, users will follow installation guides and may apply curl | bash command patterns without verifying what will happen on a production system.

Testing eBPF programs automatically in CI/CD pipelines is tricky, because the kernel verifies the eBPF programs at load time and rejects potentially unsafe programs. There are attempts to move the eBPF verifier outside of the kernel, and allow testing eBPF programs in CI/CD.

There are risks with eBPF, and one is clear: root access to everything on the kernel level. You can hook into TLS-encrypted traffic just after the TLS library function calls have the raw string available. There are also real-world exploits, rootkits, and vulnerabilities that use eBPF to bypass eBPF. Some research has been conducted on special programming techniques for exploits and data access that go undetected by eBPF security enforcement tools. The cat-and-mouse game will continue ...

A wishlist for eBPF includes:

  • Sleepable eBPF programs, to pause the context, and continue at a later point (called "Fiber" in various programming languages).
  • eBPF programs that observe other eBPF programs for malicious behavior, similar to monitor-the-monitor problems in Ops life.
  • More getting started guides, learning resources, and also platform abstraction. This reduces the entry barrier into this new technology so that everyone can contribute.

Conclusion

eBPF is a new way to collect Observability data; it helps with network insights and security observability and enforcement. We can benefit from it when debugging production incidents. Chaos engineering helps to verify observability and eBPF programs, and new ideas for eBPF probes in chaos experiments will allow chaos engineering to iterate faster. Additionally, we are able to benefit from more data sources beyond traditional metrics monitoring - correlate, verify, and observe production environments. This paves the path to DataOps, MLOps/AIOps - AllOps.

Developers can benefit from auto-instrumentation for Observability-driven development; DevOps/SRE teams verify reliability with chaos engineering, and DevSecOps will see more cloud-native security defaults. eBPF program testing and verification in CI/CD is a big to-do, next to bringing all ideas upstream and lowering the entry barrier to using and contributing to eBPF open-source projects.
