Poky version-specific variables

From time to time I need code that only applies to a certain version of Yocto. Each Yocto release updates the DISTRO_VERSION variable, so we can use it to track poky upgrades from our layer. This is useful because the Yocto poky distribution is often updated independently of other layers.

Imagine a bug fix in a newer upstream version that you do not want to backport to your local poky. Instead, you want to override that code or variable in your own independent layer. To do that you need a bit of Python code. Let’s imagine you want to change ERROR_QA specifically for poky DISTRO_VERSION=4.0.1:

# local.conf
python() {
  # Check if we have the correct DISTRO_VERSION and send a fatal error if we do not.
  if d.getVar("DISTRO_VERSION") != "4.0.1":
    bb.fatal(f"Change of ERROR_QA is not validated for {d.getVar('DISTRO_VERSION')}")

  # Set ERROR_QA to a special value only if poky is at our desired version
  d.setVar("ERROR_QA", "dev-so")
}

The above code generates a fatal error if DISTRO_VERSION is different from what we originally intended. The fatal error makes sure that, after a poky upgrade, you either remove the workaround or validate it again.

This also works for overriding entire functions in your layer, as long as the function name and signature match.
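
As a minimal sketch, assuming the original recipe defines a Python do_deploy task (the recipe and function names here are hypothetical), a .bbappend in your layer can redefine the task; because your layer is parsed later, its definition replaces the original, and the same DISTRO_VERSION guard can protect the override:

# my-recipe_%.bbappend (hypothetical recipe that originally defines do_deploy)
python() {
  # Guard the override against unvalidated poky upgrades
  if d.getVar("DISTRO_VERSION") != "4.0.1":
    bb.fatal(f"Override of do_deploy is not validated for {d.getVar('DISTRO_VERSION')}")
}

python do_deploy() {
  # Same name and signature as the original task; this later definition wins
  bb.note("Running the do_deploy override from my layer")
}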

Generate multiple images at the same time in Yocto

A common task when using Yocto is to create multiple Linux images for a single target’s software load. A very simple example is an emergency partition used for fallback.

The easiest way is to go to your image recipe and add:

# my-image.bb
do_rootfs[depends] += "my-other-image:do_image_complete"

In plain English, this means the do_rootfs task now depends on the my-other-image recipe’s do_image_complete task. All images that inherit image.bbclass have the do_rootfs and do_image_complete tasks, so it is safe for you to depend on them.

Now, every time you build your main image, you also build “my-other-image”.

PS: A kind of top-level “meta” image recipe could be created that installs nothing and just depends on the other images, so that IMAGE_CMD or wic can take those images and generate a target-readable blob. Maybe one day I will post an example of such a recipe.
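
In the meantime, here is a minimal sketch of what such a recipe could look like (the recipe and image names are hypothetical, and the exact variables you clear may differ):

# meta-aggregate-image.bb (hypothetical name)
SUMMARY = "Empty aggregate image that only pulls in the real images"
LICENSE = "MIT"

inherit image

# Install nothing of its own
IMAGE_INSTALL = ""
IMAGE_FEATURES = ""
IMAGE_LINGUAS = ""

# Build the images we actually care about
do_rootfs[depends] += "my-image:do_image_complete my-other-image:do_image_complete"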

PPS: If you are thinking of fallback, do not reinvent the wheel and have a look at RAUC.

Development and Production gaps in Yocto

Yocto is great for building customized Linux distributions from source code. It is a very complex tool, since it needs to build a huge variety of software projects from source. It often happens that a Linux distribution image needs to be configured differently when building a developer version versus a production version. The production version is usually built in an automated way through a CI (continuous integration) server. This workflow gap between production and development has been a major driver of developer dissatisfaction. I am going to share my insights on the parts and drivers of this dissatisfaction, and also provide some hints on what works, from my experience.

The usual scenario triggering conflicts is: “It works on my machine but the verification step in the CI fails”. Such a scenario can have multiple origins:

  • Build time:
    • The developer is building with different configurations than the CI
    • A given recipe produces different package contents every time it is built
  • Run time:
    • A program has inherently non-deterministic behavior, even with the exact same binary: a garden-variety bug
    • The verification step is broken

Divergent configurations at build time

This situation is mostly unavoidable in the real world. The most common reasons for different configurations are listed below (a sketch of typical developer-only settings follows the list):

  • Debug symbols
  • Debug tools
  • Builds with debug flags enabled
  • Secrets or security only enabled in production
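
In Yocto, most of these differences show up as a handful of developer-only settings, often kept in local.conf. A minimal illustration, assuming the stock image features and variables of recent Yocto releases (the exact set you enable will differ):

# local.conf (developer build only)
# Ship debug symbols and debugging tools in the image
EXTRA_IMAGE_FEATURES += "dbg-pkgs tools-debug"
# Convenience tweaks such as a passwordless root login
EXTRA_IMAGE_FEATURES += "debug-tweaks"
# Compile with debug optimization flags (-O -g style) instead of the release flags
DEBUG_BUILD = "1"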

Debug symbols

The presence of debug symbols is mostly benign and rarely leads to technical issues. Even so, it can result in:

  • Intellectual property leakage. This is not relevant from the point of view of product operation, but it can have devastating legal and business impacts
  • Excessively large images. Some embedded devices do not have sufficient space, and insufficient space can manifest as boot failures that are hard to explain and very hard to debug, due to bootloader or flash physical constraints.

Debug tools

Images with debug tools are also pretty common. Unless the debug tools are part of a critical function, they also very rarely lead to verification issues. They deserve consideration though, as they can affect key performance indicators (KPIs) of development, deployment, and physical device unit cost.

The main issues associated with including debug tools, or leaving them out, can be summarized as:

  • Excessively long build times. The more tools you put in your image, the more recipes and associated packages need to be built and installed. Iteration speed can slow to a crawl
  • Very hard real-world diagnostics due to a lack of tools/APIs. This is especially true on embedded devices because of their single-purpose nature, enforced by cryptographically verified read-only flash. It means a developer needs to regenerate a complete new image containing the required diagnostic tools
  • Increased security risks due to bigger system tampering surface
  • Increased costs due to redundant storage. In volume, extra storage can lead to massive expenses

Builds with debug flags enabled

Debug build flags can significantly change system and application behavior, resulting in side effects like the stereotypical “but it works on my machine”:

  • Increased logging and diagnostics change timing and concurrency dynamics, often masking issues in concurrent code that only become visible in production
  • Release functionality depending on diagnostic flags, which in turn brings unvetted and untested states. Such unvetted states can disable or bypass security features
  • Different optimization levels exposing or hiding latent bugs, resulting in a mismatch between verification and developer results

This situation is the most damaging from the organizational point of view. In the end, software is made by people, and the feeling that testing is unreliable may lead to the following negative outcomes:

  • Blame games between QA, development, and customer support, leading to increased bug-fix turnaround times
  • Decreased morale in the developer team, due to the belief that the system is rigged against their productivity
  • Decreased confidence that the QA team is doing its job correctly and that quality checks are meaningful

Secrets only known in production

Generally, the mismatch of secrets between developer and release builds does not cause issues, although I had the misfortune of being hit by it once: in development, it is normal for security systems to be disabled. Unfortunately, in production, the addition of an actual key to some bootloader security internals led to a buffer overflow that was, very luckily, detected in time. My changes were in a completely unrelated part, so I did not even know I would trigger it.

Divergent behavior at run time

The following is not necessarily Yocto-specific and is valid for software engineering in general.

Normal bugs visible only at release

Often it is hard to know whether a behavior seen for the first time in a production release is due to the different configurations used for release, or whether it was just bad luck that it became evident in the release. These are the lessons I learned:

  • A bug’s manifestation is not the end. Fixing the bug is.
  • Identifying the bug often starts with analyzing its manifestation, which can be made difficult by not knowing what triggers it. This is where the development/release mismatch adds another layer of uncertainty and effort.

Incorrect verification/test steps

This one highlights the old “who tests the tests?” question. My experience is that no one does, unless:

  • They fail to catch a bug
  • They fail when they should succeed
  • They fail in the release verification flow but succeed in the developer’s flow, or vice versa.

I am only going to write about the last point, where a flow mismatch leads to different test and verification outcomes.

There should be a single command that performs all the setup, building, and testing, and that owns the solutions to all mismatches between the developer and release/CI flows. There are two reasons for this:

  • A single command means there is a technically enforced agreement and connection between release owners and developers.
  • There is only one way to run verification.

In the real world, it often happens that the whole verification flow would be too time-consuming for developers, but that fact already provides insights and avenues for improvement:

  • The single verification command logic could be smart enough to detect which tests are required. For example, there is no need to build any software if only the testing repository changed (see the sketch after this list)
  • Logical steps of the whole verification flow should be visible on the CI dashboard as well as callable locally.
  • A tool where the whole organization can give feedback:
    • The release team giving feedback on verification gaps or reliability
    • The developer team giving feedback on improved modularization
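
A minimal sketch, in Python, of what such a single entry point could look like. The repository layout, image name, and test command are all hypothetical assumptions, not a prescription:

# verify.py — hypothetical single entry point shared by developers and CI
import subprocess
import sys

def changed_paths(base="origin/main"):
    """Return the paths touched since the given base revision."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.splitlines()

def main():
    paths = changed_paths()
    # Skip the expensive image build if only the test suite changed
    needs_build = any(not p.startswith("tests/") for p in paths)
    if needs_build:
        subprocess.run(["bitbake", "my-image"], check=True)  # hypothetical image name
    subprocess.run(["pytest", "tests/"], check=True)
    return 0

if __name__ == "__main__":
    sys.exit(main())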

Final notes

This topic became quite a bit bigger than initially intended. It was also a much broader topic than I planned to write about, so if something is confusing or you have suggestions, let me know, as I would like to improve it.

Also, my experience is mostly in embedded devices, but I have a feeling that most of this is relevant to other areas of software engineering. It was also not easy to write, as it collects many disconnected experiences from my career and projects.