Development and Production gaps in Yocto


Yocto is a great tool for building customized Linux distributions from source code. It is also very complex, since it must build a huge variety of software projects from source. It often happens that a Linux distribution image needs to be configured differently for a developer build than for production. The production version is usually built in an automated way through a CI (continuous integration) server. This workflow gap between production and development has been a major driver of developer dissatisfaction. I am going to share my insights on the parts and drivers of this dissatisfaction, but also provide some hints on what works, from my experience.

The usual scenario triggering conflicts is: “It works on my machine but the verification step in the CI fails”. Such a scenario can have multiple origins:

  • Build time:
    • The developer is building with different configurations than the CI
    • A given recipe produces different package contents every time it is built
  • Run time:
    • A program has inherent non-deterministic behavior, even with the exact same binary. A garden-variety bug
    • The verification step is broken

Divergent configurations at build time

This situation is mostly unavoidable in the real world. The most common reasons for different configurations are (a small Yocto illustration follows the list):

  • Debug symbols
  • Debug tools
  • Builds with debug flags enabled
  • Secrets or security only enabled in production
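To make this concrete in Yocto terms, a developer's build configuration often carries a few extra lines like the ones below, which the production CI configuration does not. The feature names are standard Yocto IMAGE_FEATURES, but the exact set and the recipe name are assumptions on my part:

# Typical developer-only additions to conf/local.conf (illustrative)
EXTRA_IMAGE_FEATURES += "debug-tweaks dbg-pkgs tools-debug"
# Build one hypothetical recipe ("myapp") with debug optimization flags
DEBUG_BUILD:pn-myapp = "1"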

Debug symbols

The presence of debug symbols is mostly benign and rarely leads to technical issues. Even so, it can result in:

  • Intellectual property leakage. This is not relevant from the point of view of product operation, but it can have devastating legal and business impacts
  • Excessively large images. Some embedded devices do not have sufficient space, and insufficient space can manifest as boot failures which are hard to explain and very hard to debug, due to bootloader or flash physical constraints

Debug tools

Images with debug tools are also pretty common. Unless the debug tools are part of a critical function, they too very rarely lead to verification issues. They deserve consideration though, as they can affect Key Performance Indicators (KPIs) of development, deployment, and per-unit device cost.

The main issues associated with including debug tools can be summarized as:

  • Excessively long build times. The more tools you put in your image, the more recipes and associated packages need to be built and installed. Iteration speed can slow to a crawl
  • Very hard real-world diagnostics due to a lack of tools/APIs. This is especially true in embedded devices due to their single-purpose nature, enforced by cryptographically verified read-only flash. It means that a developer needs to completely regenerate an image with the required diagnostic tools
  • Increased security risks due to bigger system tampering surface
  • Increased costs due to redundant storage. In volume, extra storage can lead to massive expenses

Builds with debug flags enabled

Debug build flags can significantly impact system and application behavior, resulting in side effects like the stereotypical “but it works on my machine”:

  • Increased logging and diagnostics change the dynamics of concurrent execution, often masking issues in concurrent code that only become visible in production
  • Release functionality depending on diagnostic flags, which in turn brings unvetted and untested states. Such unvetted situations can disable or bypass security features
  • Different optimization levels exposing or hiding latent bugs, resulting in a mismatch between verification and development
(Inception meme: “So I saw you broke the build… but it worked on my machine”)

This situation is the most damaging from the organizational point of view. In the end, software is made by people, and the feeling that testing is unreliable may lead to the following negative outcomes:

  • QA team vs Dev vs Customer Support blame games, leading to increased bug-fix turnaround times
  • Decreased morale in the developer team, due to the belief that the system is rigged against their productivity
  • Decreased confidence that the QA team is doing its job correctly and that quality checks are meaningful

Secrets only known in production

Generally, the mismatch of secrets between developer and release builds does not cause issues, although I had the misfortune of hitting a situation where it did: in development, it is normal for security systems to be disabled. Unfortunately, in production, the addition of an actual key to some bootloader security internals led to a buffer overflow that was, very luckily, detected in time. My changes were in a completely unrelated part, so I did not even suspect I could trigger it.

Divergent behavior at run time

The following is not necessarily Yocto-specific and is valid for software engineering in general.

Normal bugs visible only at release

Often it is hard to know whether a behavior seen for the first time in a production release is due to the different configurations used for release, or whether it was just bad luck that it became evident in the release. These are the lessons I learned:

  • A bug’s manifestation is not the end. Fixing the bug is.
  • The identification of a bug often starts with analyzing its manifestation, which can be made difficult by not knowing what triggers it. This is where the development/release mismatch adds another layer of uncertainty and effort.

Incorrect verification/test steps

This one highlights the old “who tests the tests?” question. My experience is: no one, unless:

  • They fail to catch a bug
  • They fail when they should succeed
  • They fail in the release verification flow and succeed in the developer’s flow, or the other way around.

I am going to only write about the last point where a flow mismatch leads to different test and verification outcomes.

There should be a single command that performs all the setup, building, and testing, and that owns the solutions to all mismatches between the developer and release/CI flows. There are two reasons for this:

  • A single command means there is a technically enforced agreement and connection between release owners and developers.
  • There is only one way to run verification.

In the real world, running the whole verification flow is often too time-consuming for developers, but this fact already provides insights and avenues for improvement (a minimal sketch of such a single entry point follows the list below):

  • The single verification command logic could be smart enough to detect which tests are required. For example, there is no need to build any software if only the testing repository changed
  • Logical steps of the whole verification flow are visible on the CI dashboard and callable locally.
  • A tool where the whole organization can give feedback:
    • The release team giving feedback on verification gaps or reliability
    • The developer team giving feedback on improved modularization
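To make this concrete, here is a minimal sketch of what such a single entry point could look like; the step names and the commands behind them are assumptions, not an existing tool:

#!/bin/bash
# Single verification entry point shared by developers and the CI (sketch).
set -euo pipefail

step="${1:-all}"

case "$step" in
    build) echo "building image..." ;;       # e.g. bitbake <image>
    test)  echo "running test suite..." ;;   # e.g. run the suite against the built image
    all)   "$0" build; "$0" test ;;
    *)     echo "usage: $0 [build|test|all]" >&2; exit 1 ;;
esac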

Final notes

This topic became quite a bit bigger than initially intended and much broader than what I wanted to write about, so if something is confusing or you have suggestions, let me know, as I would like to improve it.

Also, my experience is mostly in embedded devices, but I have a feeling that most of this is relevant to other areas of software engineering. It was also not easy to write, as it collects many disconnected experiences from my career and projects.

How to write to a docker container’s log from another process

I try hard to make my integration tests run in docker containers, to guarantee they are not mistakenly dependent on my system and that they test the system as if a real deployment were in place.

Sometimes a test depends on some startup string existing in the logs. An example: my server emits a temporary administrator password the first time it runs. This one-time administrator password is written to stdout, and docker logs <container> captures it. Unfortunately for my testing setup, when the container is destroyed and set up again the logs are reset, losing that one-time password needed by the tests. That destroy/re-setup cycle is often manual and not part of the expected flow, but it happens frequently due to manual tuning.

I used to just hardcode the temporary password in the tests while manual iteration was ongoing, but this was annoying because I had to be careful not to commit the temporary change, and because it meant I always carried a diff with that workaround. The temporary password would be meaningless if committed and would break the actual automated testing. Instead of changing code, with the drawbacks above, I found a way to manually write the temporary password back to the docker logs.

The docker logs --follow command will continue streaming the new output from the container’s STDOUT and STDERR.

https://docs.docker.com/engine/reference/commandline/logs/#extended-description

The question is: how can we write to the container’s log if the container’s process itself will not do it?

By default, docker spawns the container’s main process with process ID 1. In Linux, you can write to a process’ file descriptors as long as you have the right permissions, and stdout is file descriptor number 1. This means:

docker exec -it <container name> bash
echo 'magic string' > /proc/1/fd/1 # /proc/<PID=1>/fd/<fd=1> is the main process' stdout
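If you prefer not to open an interactive shell, the same write can be done in a single non-interactive step (the container name is a placeholder):

docker exec <container name> sh -c "echo 'magic string' > /proc/1/fd/1"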

Make sure your container is running when you do this, otherwise it will not work:

The docker exec command runs a new command in a running container.

https://docs.docker.com/engine/reference/commandline/exec/

PS: I could not find a primary source documenting that PID 1 is the default PID of a newly spawned container’s main process. Given the meaning of PID 1 in Linux, either it is baffling that this is undocumented or I am making a very wrong assumption.

Scanning and removing secrets from git repositories

Recently I had a situation where I wanted to turn a private git repository public. Immediately, the problem of secrets in the repository’s history was raised. I know that secrets should not be committed but, for practical reasons, it often happens. This means that before publishing a repository, its secrets need to be sanitized from all commits.

Finding secrets in a repository with thousands of commits is impractical if done by hand. To my pleasant surprise, there is an automated tool called GitGuardian which offers a service that scans repositories for secrets. It successfully discovered API keys and PEM keys in a variety of files. I was really amazed, especially as some of the keys were in obscure formats. Even inside JavaScript it successfully found API keys.

An example of the dashboard for the scan results of a repository

There are some false positives, but the interface is really friendly and one can easily mark issues to be ignored.

The next step is to either revoke the keys or expunge all secrets from history. Expunging the keys requires rewriting history and is potentially destructive, but that is the point. To achieve it, I found a StackOverflow post that suggested a tool called git-filter-repo.

To install the tool just run:

# Make sure that user specific binaries are in your PATH
# In my case the following command put git-filter-repo in
# $HOME/.local/bin/
# This means you might need to expand your PATH like
# export PATH=$HOME/.local/bin:$PATH
python3 -m pip install --user git-filter-repo

Because git-filter-repo modifies history, I advise performing the operation on a fresh clone set up specifically for this purpose.

git-filter-repo takes replacement rules in a very simple file format, with glob support. Here is an example with a glob rule and a literal rule:

glob:Key: "*"==>Key: "@@PLACEHOLDER@@"
9yli7hjr8y0swt1==>XXXX

The first line will replace any occurrence of the string Key: "*" with Key: "@@PLACEHOLDER@@", where the * can be any text.

The second line just turns any literal occurrence of 9yli7hjr8y0swt1 into XXXX.

As you can surmise, ==> is the string that delimits the pattern from its replacement.
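To actually run the rewrite, save the rules into a file (I will assume it is called replacements.txt) and invoke the tool on the fresh clone:

git filter-repo --replace-text replacements.txt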

Finally, keep in mind the following aspects:

  • GitGuardian is not infallible. If letting a secret slip is something you cannot afford then:
    • Do not publish anything at all.
    • Squash all your history into one commit. This will help for manual inspection.
  • You are rewriting history. If you make a mistake in the replacement rules you may permanently damage your repository.

Creating an IConfigurationBuilder method extension to load env files

I recently needed some boilerplate code to load env files into IConfigurationBuilder.

ASP.NET is notorious for having amazingly fragmented APIs, so I needed three classes simply for one extension method that reads the env file and adds it to the configuration.

The extension carries the same license as dotnet, since I lifted the code from dotnet itself.

//
// Licensed under the Apache License, Version 2.0.
// See https://github.com/aspnet/Configuration/blob/master/LICENSE.txt for license information.

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using DotNetEnv;
using Microsoft.Extensions.Configuration;

namespace DroHub.ProgramExtensions {
    // Extension entry point that registers the env file as a configuration source.
    public static class EnvFileExtension {
        public static IConfigurationBuilder LoadFromEnvFile(this IConfigurationBuilder builder,
            string path, string prefix, LoadOptions options) {
            return builder.Add(new EnvFileConfigurationSource(path, prefix, options));
        }
    }

    // Describes where the env file lives and how to load it, and builds the provider.
    public class EnvFileConfigurationSource : IConfigurationSource {
        private readonly LoadOptions _load_options;
        private readonly string _path;
        private readonly string _prefix;

        public EnvFileConfigurationSource(string path, string prefix, LoadOptions load_options) {
            _load_options = load_options;
            _path = path;
            _prefix = prefix;
        }

        public IConfigurationProvider Build(IConfigurationBuilder builder)
        {
            return new EnvFileConfigurationProvider(_path, _prefix, _load_options);
        }
    }
    
    // Reads the env file, normalizes the keys and keeps only those with the given prefix.
    public class EnvFileConfigurationProvider : ConfigurationProvider {
        private readonly LoadOptions _options;
        private readonly string _path;
        private readonly string _prefix;

        public EnvFileConfigurationProvider(string path, string prefix, LoadOptions options) {
            _options = options;
            _path = path;
            _prefix = prefix;
        }

        public override void Load() {
            var env = Env.Load(_path, _options).ToDictionary();

            var data = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            var filtered_variables = env
                .SelectMany(NormalizeKey)
                .Where(entry => ((string)entry.Key).StartsWith(_prefix, StringComparison.OrdinalIgnoreCase));

            foreach (var envVariable in filtered_variables)
            {
                var key = ((string)envVariable.Key)[_prefix.Length..];
                data[key] = (string)envVariable.Value;
            }

            Data = data;
        }

        private static IEnumerable<DictionaryEntry> NormalizeKey(KeyValuePair<string, string> e) {
            var (key, value) = e;
            yield return new DictionaryEntry() {
                Key = NormalizeKey(key),
                Value = value
            };
        }

        private static string NormalizeKey(string key)
        {
            return key.Replace("__", ConfigurationPath.KeyDelimiter);
        }
    }
}

Environmental variables for secrets. A simple threat analysis

One of my favorite ways of passing secrets to a program without leaking them is environmental variables. The rationale is:

  1. Reading from a plain text file with no permissions set lets any user in the system read the file, which is unacceptable.
  2. Reading from a plain text file with permissions restricted to the user running the process may lead to the file being committed by mistake to the version control system. It also means that any part of the process can call a function to read the file. Better, but still easy to make mistakes with.
  3. Passing the secrets as command line arguments leads to their complete exposure. Every user is able to read any process’ command line. This means that even a low-privileged user can in theory access /proc/<pid>/cmdline and inspect the secrets passed through command line arguments. This is further illustrated in another post: Accessing a process command line without ps. This is unacceptable.
  4. The remaining way is to pass environmental variables to the application. Often the environmental variables are themselves set from a file or from the command line, which obviously leads to the problems described above. There are safer ways to pass environmental variables, and I will talk about them.

From the above methods, we can extract at least two requirements:

  1. The secrets should not be accessible to arbitrary parts of the process, at arbitrary states.
  2. The secrets should not be accessible by any other means than through the process.
    • The secret should not be accessible by any other user in the system, except for the root user. Without security enclaves it is not possible to hide anything from a root user. I will give some of the reasons further down.
    • The secret should not be accessible by the process’ user, except through the process that is supposed to use it.

Dockerfiles/docker-compose.yml files and systemd units are often interpreted in a privileged or sandboxed environment, so they provide a medium where you can write down your secrets and have them safely passed as environmental variables to the process consuming them. Naturally, you need to make sure that such systemd units or Dockerfiles are not readable by anyone but the root user. This ensures compliance with the second requirement, but not with (1).
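As a minimal sketch of the systemd case (the unit, file, and variable names are made up), a root-only drop-in can hold the secret and hand it to the service as an environmental variable:

sudo mkdir -p /etc/systemd/system/myapp.service.d
sudo tee /etc/systemd/system/myapp.service.d/secrets.conf > /dev/null <<'EOF'
[Service]
Environment=MYAPP_DB_PASSWORD=example-secret
EOF
sudo chmod 600 /etc/systemd/system/myapp.service.d/secrets.conf
sudo systemctl daemon-reload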

The non-compliance with the first requirement is where I believe this post has something to offer.

I have seen a lot of support for passing secrets with environmental variables, but not much advice on their compartmentalization inside the process. This is especially important because environmental variables are, in fact, global variables of a process: any arbitrary part of the process can access them at any time. If for any reason a process is compromised, an attacker can just spawn a child process or call the relevant standard library function and read the current process-wide environment. Most process spawns pass on the current process’ environmental variables, so if there is a shell injection issue in your code your secrets will be exposed.

Similarly, if an attacker can make your application dump its current environmental variables, your secrets will also be exposed. A common situation is a variable expansion issue like the ones possible in PHP, where $_ENV can be accessed and your secrets revealed. As stated, environmental variables are super-global information.

One might argue that both spawning a process and running environmental-variable-related code require a compromised process, and that in such a situation one has lost any ability to protect oneself. That is not true. Defense in depth is a thing, and I propose mitigating actions that can fulfill requirement (1) (a small sketch follows the list):

  • Retrieve the secrets from the environmental variables as early as possible in the startup of your process.
  • Remove the environmental variables holding the secrets, or set them to blank.
  • Reduce the availability of secrets throughout your application, making them available only to the parts that need them. This may be hard to achieve in terms of architecture.
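In shell terms, the first two actions look like the sketch below (the variable name is made up); the same pattern applies to any language that exposes its environment:

#!/bin/bash
# Minimal sketch: capture the secret early, then wipe it from the environment.
secret="${MYAPP_SECRET:?MYAPP_SECRET is not set}"
unset MYAPP_SECRET
# Anything spawned from this point onwards no longer inherits the secret:
sh -c 'env | grep -c MYAPP_SECRET || true' # prints 0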

The above actions mean that shell injection attacks will be unable to read secrets inherited from the parent process. They also mean that if a local code path allows inspection of global variables or code execution, an inspection of the environmental variables will not reveal any secrets.

Naturally, if your application’s memory is compromised your secrets can be revealed, but that is beside the point of this post, as is the case where the root account is compromised. If an attacker is the root user they can inspect arbitrary parts of the system’s memory, and if internal program memory is compromised an attacker can just read the secrets from memory directly.

In sum, to use environmental variables effectively for secret sharing:

  • Pass secrets as environmental variables from a more privileged user than the one running the process requiring the secrets.
  • Wipe the environmental variables as soon as you capture them into memory.
  • Keep availability of secrets as restricted as possible.

bash: infinite history and a nice prompt

Recently I deleted my .bashrc and .bash_history by mistake. I lost a good number of dear commands and aliases, and needed to come up with another .bashrc to get infinite history and a nice prompt.

There are often one-off commands for very specific tasks that are rarely used but very useful. Such one-offs are often forgotten and not explicitly saved. When they are needed though, I am very happy to just hit Ctrl-r in my terminal, type a few fragments that I remember, and they come up. It is like my own toolbox. For that, I rely on my .bash_history being infinite.

The prompt is also very useful in my day-to-day. Often I have very long-running terminal sessions that have commands whose timestamps and exit codes matter to me over the course of my tasks. Timestamps help me get the duration of the command by checking the timestamp of the command vs the next prompt. The exit code helps me understand if the command was successful, normally much later after it ran, and when I no longer remember some task’s details.

In the course of redoing my .bashrc I leaned on this StackOverflow post and a really nice site that lets one generate a custom PS1 (prompt). The result is this:

# Eternal bash history.
# ---------------------
# Undocumented feature which sets the size to "unlimited".
# http://stackoverflow.com/questions/9457233/unlimited-bash-history
export HISTFILESIZE=
export HISTSIZE=
export HISTTIMEFORMAT="[%F %T] "
# Change the file location because certain bash sessions truncate .bash_history file upon close.
# http://superuser.com/questions/575479/bash-history-truncated-to-500-lines-on-each-login
export HISTFILE=~/.bash_eternal_history
# Force prompt to write history after every command.
# http://superuser.com/questions/20900/bash-history-loss
PROMPT_COMMAND="history -a; $PROMPT_COMMAND"
export PS1="[\t \$?] \h:\w \[$(tput sgr0)\]"

Edit: You may need to manually sanitize your history from time to time if you mistakenly type a secret into the command line. Such a secret will be visible in plain text. Model your threats.

Validating global systemd defaults through properties

On the topic of Infinite loops on crashing systemd units, I got to the point where I needed an automatic way to verify that my systems had the correct systemd unit defaults. Let’s assume we have the following configuration:

$ cat /etc/systemd/system.conf
[Manager]
DefaultRestartSec=1min
$

It proved to not be straightforward because the documentation at the time was wrong. Quoting it:

For the manager itself, systemctl show will show all available properties. Those properties are documented in systemd-system.conf(5).

https://www.freedesktop.org/software/systemd/man/systemctl.html (2021/11/14)

This meant that for DefaultRestartSec, configured above and documented in systemd-system.conf(5), we should be able to get its value directly with the following command:

$ systemctl show -p DefaultRestartSec
$

It returns nothing, with exit code 0, indicating there was no error. Unfortunately, things are not as the manual stated. Luckily, a test using the above command would compare its (empty) output to our desired value and would therefore fail.

We established that the manual was wrong at the time of writing, but we still have the goal of testing that our system conforms to our mission’s defaults. Let’s check whether there are systemd properties similar to, or derived from, DefaultRestartSec:

$ systemctl show | grep DefaultRestart
DefaultRestartUSec=60sec
$

Notice the equivalent property! The only difference is that the property’s suffix is USec instead of Sec. Also notice that the scale suffix in the name of a property or option may be inconsistent with the scale that is actually set or displayed. While not intellectually friendly, the different scales in the names exist for historical and code reasons.

Properties are internally represented as D-Bus properties, and this sets limits on the available data types as well as on backward compatibility. Properties are part of a public API, so renaming them or syncing their names with the systemd-system.conf option names is not really possible. I tried to inspect systemctl-show.c for a way to match properties to options, but that would require hard-coding such a mapping, making it a maintenance issue.

Options, on the other hand, are user-facing, so the scale suffix in their names is naturally human-friendly. With that said, given that a system architect can set options in a variety of scales that are automatically converted to internal representations, it is baffling to me why at least the time-related option and property names do not match.

With all that said, I submitted a clarification to systemctl‘s manual, which was accepted and which summarizes what an architect needs to know when mapping between systemd‘s configuration options and properties:

systemctl show will show all available properties, most of which are derived or closely match the options described in systemd-system.conf

https://github.com/systemd/systemd/blob/10b1c3cd24f5f95e6e72caebdd6896e2eaf8b853/man/systemctl.xml#L1664

Taking the configuration options mentioned in this post as examples, the option-to-property mapping is (a small verification sketch follows):

  • DefaultRestartSec → DefaultRestartUSec
  • DefaultStartLimitIntervalSec → DefaultStartLimitIntervalUSec
  • DefaultStartLimitBurst → DefaultStartLimitBurst
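Going back to the original goal of automatically validating the defaults, a check along the following lines can be added to a verification suite. It is only a sketch; the expected string is taken from the output shown earlier and its formatting may differ between systemd versions:

# Verify the manager's DefaultRestartUSec against the mission default
expected="60sec"
actual="$(systemctl show --value -p DefaultRestartUSec)"
if [ "$actual" != "$expected" ]; then
    echo "DefaultRestartUSec is '$actual', expected '$expected'" >&2
    exit 1
fi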

Accessing a process command line without ps

I work on embedded boards quite a lot, and often the ps utility is a bare-bones, stripped-down busybox version without options. One of the issues I faced while using it is that it truncates the process command line, and the options that would allow displaying untruncated output are not available.

I remembered that every active process has its entry in /proc/<PID>/cmdline, and I tried to use it directly. Unfortunately, the output of /proc/<PID>/cmdline contains the individual arguments separated by null characters (\000). The \000 character is printed as @ in my terminal, which makes it useless for copy/paste, and I did not have a hexadecimal dumper to know exactly what the character was. Fortunately, StackOverflow came to the rescue:

cat /proc/$PID/cmdline | tr '\000' ' ' # The solution I would have come up with
cat /proc/$PID/cmdline | xargs -0 echo # xargs showing its versatility.

Infinite loops on crashing systemd units.

When using systemd it is often wise to set mission-specific “global defaults”. In this article we will look at the case of “how often a unit can fail in a given amount of time” and its impacts.

systemd experts will create their units with these properties set to the correct values for their specific units, but often the properties are not obvious or are forgotten by the unit’s creators. systemd comes with defaults, as explained in the systemd-system.conf manual, but that does not mean they should not be evaluated for suitability to the system’s constraints and mission.

Let’s approach the requirement that we do not want a crashing service respawning infinitely fast and thus compromising the whole system. This requirement is valid for any type of computing, be it embedded, server, or even desktop.

systemd gives us quite a few options to achieve that requirement, but let’s focus on DefaultRestartSec, DefaultStartLimitIntervalSec and DefaultStartLimitBurst. These system-wide defaults can be overridden in individual units; in the unit files, the Default prefix is simply dropped. From the manual:

DefaultStartLimitIntervalSec=, DefaultStartLimitBurst=

Configure the default unit start rate limiting, as configured per-service by StartLimitIntervalSec= and StartLimitBurst=. See systemd.service(5) for details on the per-service settings. DefaultStartLimitIntervalSec= defaults to 10s. DefaultStartLimitBurst= defaults to 5.

This means that the unit will be allowed to start at most DefaultStartLimitBurst times per DefaultStartLimitIntervalSec seconds. If a unit start is unsuccessful and matches the criteria described, the unit will be marked as failed and will not be started again for the defined period. Failed is not the same as disabled. DefaultRestartSec states how quickly the service is restarted within the period defined by DefaultStartLimitIntervalSec.
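As an illustration only (the values are placeholders, not a recommendation), a more conservative set of mission defaults could look like this:

$ cat /etc/systemd/system.conf
[Manager]
DefaultRestartSec=5s
DefaultStartLimitIntervalSec=30s
DefaultStartLimitBurst=3
$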

With good values one can make sure the system will never be starved of resources by an infinite crash loop. The “factory” default that comes from the systemd code is 5 (times) / 10 (seconds). This default by itself may be fine, but I personally find the “factory” DefaultRestartSec=100ms not very reasonable. In a worst-case scenario, it means the unit will fail 5 times in half a second, followed by a “resting” period for the rest of the 10 seconds.

Worst-case timeline for the systemd restart factory defaults. Each letter F is a failure that takes 0.1s. Every 10 seconds (DefaultStartLimitIntervalSec) the illustrated failure pattern can repeat itself.

If the unit’s start has side effects on the system, the system can become unusable. From a security point of view this can lead to a denial of service or infrastructure damage. Damage scenarios are:

  • A unit start side effect taking more than 2 seconds to recover from, leaving the system in a death spiral (with the factory defaults, 5 restarts times 2 seconds of cleanup already fill the whole 10-second interval).
  • IO exhaustion, disk space exhaustion or extra infrastructure charges

To sum up, system architects should look at the worst-case side-effect cleanup/handling of the units running on their systems. With this data, they can define good defaults that allow the system to keep operating, degraded, even when an infinite crash loop exists. Degraded here means that the service is still operating or has the capacity to execute remedial measures.

u-boot exposing mass storage to usb interface

As previously stated, this blog is a place for me to record notes on things that I learn in my day-to-day work.

One of the things I learned, and wished I had known earlier when doing embedded development, was how to easily flash specific files onto an embedded device’s storage card without needing to flash a whole image.

It turns out to be very simple if your embedded board has USB support in u-boot. Running the following command in the u-boot command line:

ums 0 mmc 1

Leads to the exposure of mmc device 1 through USB controller 0 as a USB mass storage gadget. See https://u-boot.readthedocs.io/en/latest/usage/ums.html for more details.

When successful, your development computer should detect a new mass storage device.
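From there, updating individual files is just a matter of mounting the exposed device on the development computer; the device name and paths below are assumptions:

lsblk                            # identify the newly exposed device, e.g. /dev/sdX
sudo mount /dev/sdX2 /mnt        # mount the partition holding the files to update
sudo cp my-binary /mnt/usr/bin/  # copy only what changed
sudo umount /mnt                 # flush everything before rebooting the board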