Torizon Updates Technical Overview
Introduction
In this article, you will learn about the key software components that Torizon OS uses to provide reliable and secure updates. It covers how OSTree is used for OS updates. It also describes how Aktualizr-Torizon implements checks and rollbacks and manages packages and updates according to the Uptane standard.
Key Software Components: OSTree and Aktualizr-Torizon
OSTree and Aktualizr-Torizon are complementary and form the foundation for OTA (over-the-air) and offline update capabilities on the device.
- Torizon OS is built with OSTree, a shared library and suite of command-line tools that combines a "git-like" model for committing and downloading bootable filesystem trees, along with a layer for deploying them and managing the bootloader configuration.
- Aktualizr-Torizon is a fork of Aktualizr, a "daemon-like" open-source implementation of the Uptane SOTA standard that secures updates from end to end.
On the server side, Toradex provides a cloud-based hosted service that, together with Torizon OS, forms a complete OTA and offline update solution.
This article complies with the Typographic Conventions for Torizon Documentation.
OSTree
Please refer to the OSTree article for a brief overview and a demonstration of how to use it.
Our implementation of remote and offline operating system updates can update the following components:
- Kernel
- Device tree and device tree overlays
- initramfs
- Root filesystem
  - /usr is updated
  - /var and /home are ignored; they keep their content over updates
  - /etc is updated via a 3-way merge
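The 3-way merge of /etc can be pictured with a small Python model. This is a simplified sketch, not OSTree's implementation: merge_etc is a hypothetical helper, and each argument is assumed to be a plain mapping from file path to file content. The idea is that local changes (edits, additions, deletions) are replayed on top of the defaults shipped by the update.

```python
def merge_etc(old_default, new_default, local):
    """Simplified 3-way merge of configuration files (illustrative model only).

    old_default -- defaults shipped by the currently running deployment
    new_default -- defaults shipped by the update
    local       -- the current /etc content on the device
    """
    merged = dict(new_default)  # start from the defaults of the update
    for path in set(old_default) | set(local):
        before = old_default.get(path)
        current = local.get(path)
        if current == before:
            continue  # untouched locally: whatever the update ships wins
        if current is None:
            merged.pop(path, None)  # deleted locally: stays deleted
        else:
            merged[path] = current  # added or modified locally: local version wins
    return merged


old_default = {"hostname": "torizon", "motd": "hello"}
new_default = {"hostname": "torizon", "motd": "hello v2", "issue": "new"}
local = {"hostname": "my-device", "motd": "hello"}

result = merge_etc(old_default, new_default, local)
```

In this example the locally changed hostname is preserved, while motd and issue come from the update because they were never touched on the device.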
Application updates are handled with containers. Bootloader artifacts can also be updated. Learn more in the Software Package Management article.
Uptane
Uptane is a de facto automotive SOTA standard, held by a non-profit consortium named Uptane Alliance under the IEEE/ISTO Federation. Its focus is to enable secure and resilient over-the-air software updates. It relies on multiple servers to provide security by validating data before a download starts and by ensuring that even an offline attack that compromises a single server is not enough to compromise the system's security. Uptane is an enhancement of TUF (The Update Framework), a widely used security framework for securing software and package updates on computers and smartphones. The motivations for extending TUF are described in detail on the Uptane Design page, and a helpful explanation of TUF is available in the docs page Understand the Notary service architecture.
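The multi-server idea can be illustrated with a short Python sketch. In the Uptane standard, metadata comes from two repositories (the Director and the Image repository), and a client accepts a target only if both describe it identically. The dictionary layout and the function names below are assumptions made for illustration, and signature verification is deliberately left out.

```python
import hashlib


def target_is_consistent(name, director_targets, image_repo_targets):
    """Return True only if both metadata sources describe the same target."""
    a = director_targets.get(name)
    b = image_repo_targets.get(name)
    if a is None or b is None:
        return False  # a target known to only one repository is rejected
    return a["sha256"] == b["sha256"] and a["length"] == b["length"]


def payload_matches(payload, entry):
    """Check downloaded bytes against the length and hash from the metadata."""
    return (
        len(payload) == entry["length"]
        and hashlib.sha256(payload).hexdigest() == entry["sha256"]
    )


payload = b"example image"
entry = {"sha256": hashlib.sha256(payload).hexdigest(), "length": len(payload)}
director = {"my-os-image": entry}
image_repo = {"my-os-image": dict(entry)}
tampered = {"my-os-image": {"sha256": "0" * 64, "length": len(payload)}}
```

With this model, an attacker who controls only one of the two metadata sources (the tampered dictionary above) cannot get a different image accepted, because the two descriptions no longer agree.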
Aktualizr-Torizon
Aktualizr-Torizon is Torizon OS's client implementation of Uptane (forked from Aktualizr's default client). It is written in C++ and is responsible for communicating with an Uptane-compatible server. It checks whether new updates are available, installs those updates on the system, and reports the status to the server while guaranteeing the integrity and confidentiality of OTA and offline updates. Aktualizr handles Docker image updates by using Docker Compose YAML files.
How to Use Aktualizr-Torizon
Aktualizr - Modifying the Settings of Torizon OTA Client is a dedicated article that covers the practical aspects, including its usage.
Update Rollbacks
There are cases where the system may fail to boot, or the boot process is considered unsuccessful, either due to a kernel panic or a failure to start a critical user-space application. Developers can handle these issues during development, but they become a serious problem if the solution is already deployed and a bad update causes them.
OS Updates
Torizon OS and Aktualizr-Torizon can recover from unsuccessful OS updates by:
- Identifying unsuccessful updates and rebooting the device when they occur.
- Rolling back to the previous operating system version after 3 unsuccessful boots.
Identifying Unsuccessful OS Updates
In Torizon OS, the Linux kernel is configured to panic and reboot in case of freezes or crashes. This helps to recover from bad kernel updates.
At the user-space level, systemd hardware watchdog integration is enabled by default in Torizon OS. That means systemd will regularly ping the watchdog hardware, and if systemd (or the kernel) hangs, this ping will not happen anymore and the hardware will automatically reboot the device. This helps to recover from bad updates when the kernel or the initialization daemon (systemd) is not able to run.
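The watchdog mechanism can be pictured with a toy Python model. This is only an illustration of the timing logic, not the systemd or kernel watchdog API: WatchdogModel is a hypothetical class, and the 60-second timeout is an arbitrary value chosen for the example.

```python
class WatchdogModel:
    """Toy model of a hardware watchdog timer (illustration only)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_ping = 0.0

    def ping(self, now):
        """Called regularly by the init system while it is alive."""
        self.last_ping = now

    def would_reset(self, now):
        """True once no ping has arrived within the timeout."""
        return now - self.last_ping > self.timeout_s


wd = WatchdogModel(timeout_s=60)
wd.ping(now=10)                         # systemd is alive at t=10 s
alive = wd.would_reset(now=50)          # 40 s since the last ping: no reset
hung = wd.would_reset(now=75)           # 65 s since the last ping: reset
```

As long as pings keep arriving, the hardware stays quiet; a hang in systemd or the kernel stops the pings and the device is reset.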
Lastly, Torizon OS considers a boot successful if the boot-complete systemd target is reached successfully.
This is because the main operating system services required for proper operation, including the Docker daemon, are part of boot-complete.target.
If boot-complete.target fails during an update, Torizon OS will automatically reboot.
This helps Torizon OS to recover from bad updates when critical processes from the base operating system don't run as expected.
A Torizon OS user may also define their own "rules" to validate an update using the Greenboot framework. Greenboot (Generic Health Check Framework) is a Fedora project that helps manage the health of systemd services, and Torizon OS uses Greenboot as a framework to make update checks more flexible and manageable for the user. With Greenboot, you can define a shell script that performs additional checks on the system and forces a reboot if needed. For more information about how to use Greenboot, have a look at Update Checks and Rollbacks.
Maintaining a Working OS Version
As mentioned above, Torizon OS will automatically roll back after 3 unsuccessful reboots.
The automatic rollback feature relies on Aktualizr-Torizon's rollback support and U-Boot's bootcount feature.
Torizon OS uses Aktualizr-Torizon with rollback_mode set to uboot_masked. This enables Aktualizr-Torizon's U-Boot bootcount integration:
- After an update, Aktualizr-Torizon enables boot counting by setting U-Boot's environment variables upgrade_available to 1 and bootlimit to 3.
- In case of a bad update, the system will reboot and U-Boot will increment the bootcount environment variable. After three times (when bootcount is greater than bootlimit), the system will roll back to the previously installed OS version.
- In case of a good update, Aktualizr-Torizon is started normally and the U-Boot environment variables upgrade_available and bootcount are set back to 0.
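The bootcount behaviour described above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: UBootEnv, start_update, boot, and mark_boot_successful are hypothetical names, and the real logic lives in U-Boot and Aktualizr-Torizon. The model assumes that U-Boot increments bootcount on every boot while upgrade_available is 1 and selects the rollback once bootcount exceeds bootlimit.

```python
from dataclasses import dataclass


@dataclass
class UBootEnv:
    """The three U-Boot environment variables involved in rollback."""
    upgrade_available: int = 0
    bootcount: int = 0
    bootlimit: int = 3


def start_update(env):
    """What the update client does after installing an update."""
    env.upgrade_available = 1
    env.bootlimit = 3


def boot(env):
    """Model one pass through the bootloader.

    Returns "default" when the newest deployment is booted and
    "rollback" when the previous deployment is selected instead.
    """
    if not env.upgrade_available:
        return "default"  # no update pending: boot counting is inactive
    env.bootcount += 1
    return "rollback" if env.bootcount > env.bootlimit else "default"


def mark_boot_successful(env):
    """What the update client does once the new system runs normally."""
    env.upgrade_available = 0
    env.bootcount = 0


# Bad update: every boot of the new deployment fails and the device reboots.
bad = UBootEnv()
start_update(bad)
bad_boots = [boot(bad) for _ in range(4)]

# Good update: the first boot succeeds and the counters are cleared.
good = UBootEnv()
start_update(good)
first_boot = boot(good)
mark_boot_successful(good)
```

In the bad-update case the new deployment gets three attempts, and the fourth pass through the bootloader selects the previous deployment.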
Torizon over-the-air and offline updates can roll back to the last installed update thanks to the OSTree-based root file system. OSTree also allows keeping multiple bootable deployments (kernel/initramfs/device-tree and the rootfs) on a system. The initial (factory) image has only a single deployment available and is assumed to be a working deployment (no rollback can be done at this point). After the first update has been rolled out, there will be two deployments on the system at all times. If a new deployment fails, the system will automatically roll back to the previous deployment.
When installing an update without Aktualizr-Torizon (e.g. using ostree admin directly), automatic rollback will not work. To use automatic rollback in a pure OSTree system, the steps above need to be executed manually as described in the OSTree article.
Application Updates
In addition to general OS updates, you can also separately update the containers (your application) on a Torizon OS device. Application updates use the same update framework as OS updates. However, there are relevant differences.
Identifying Unsuccessful Container Updates
Most important among these differences are the conditions for a successful container update. Unlike OS updates, a container update does not require a reboot, which eliminates the possibility of checks at boot. Furthermore, the use cases for containers are far more varied, which makes it difficult to have update checks that account for all possible cases.
Therefore, Torizon OS performs basic general checks upon update:
- Source the new container images. It runs docker-compose pull --no-parallel on the new docker-compose.yml to pull the new container images in case of a remote online update, or gets the files from the lockbox in case of an offline update.
- Run docker-compose -p torizon down on the old docker-compose.yml to stop and remove any container associated with the old docker-compose.yml.
- Run docker-compose -p torizon up --detach --remove-orphans on the new docker-compose.yml to bring up the new containers as defined by the parameters of the compose file.
If everything has been successful so far, the new docker-compose.yml overwrites and replaces the old docker-compose.yml file. Finally, the command docker system prune -a --force cleans up any unused containers, networks, and images from the device.
If any of the above checks "fails", the entire update is considered failed.
The failure mode is defined by the exit code returned from the command. The exit code 0 is considered a success while all other exit codes are failures.
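The exit-code rule can be sketched in Python as follows. This is an illustration of the logic for the online case, not the actual Aktualizr-Torizon code: update_steps and run_update are hypothetical helpers, and selecting the compose file with the -f option is an assumption about how the commands are pointed at each file.

```python
import subprocess


def update_steps(new_compose, old_compose):
    """The three commands from the list above, in order (online update)."""
    return [
        ["docker-compose", "-f", new_compose, "pull", "--no-parallel"],
        ["docker-compose", "-f", old_compose, "-p", "torizon", "down"],
        ["docker-compose", "-f", new_compose, "-p", "torizon",
         "up", "--detach", "--remove-orphans"],
    ]


def run_update(new_compose, old_compose, run=subprocess.call):
    """Run each step and stop at the first non-zero exit code.

    Returns (ok, failed_command); ok is True only if every step exited with 0.
    The run parameter exists so the logic can be exercised without Docker.
    """
    for cmd in update_steps(new_compose, old_compose):
        if run(cmd) != 0:
            return False, cmd
    return True, None
```

Calling run_update with two compose file paths executes the real commands through subprocess.call; passing a stub as run makes it possible to see how a failing step marks the whole update as failed.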
No further checks are made on the state of the container after being started. This can lead to instances where a container starts successfully and soon after exits due to an error. By the above-mentioned checks, this would still be considered a successful update. This means it is important that you verify the status of your containers after they have started.
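One way to perform such a check yourself is to query the container state with docker inspect some time after the update. The sketch below is a minimal example under stated assumptions: container_state and looks_healthy are hypothetical helpers, and the container name you pass in depends on your own compose file.

```python
import subprocess


def container_state(name, run=subprocess.run):
    """Return (status, exit_code) for a container, or None if it is unknown."""
    result = run(
        ["docker", "inspect", "--format",
         "{{.State.Status}} {{.State.ExitCode}}", name],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # docker inspect failed, e.g. no such container
    status, exit_code = result.stdout.split()
    return status, int(exit_code)


def looks_healthy(state):
    """A container that exited shortly after starting is not healthy."""
    return state is not None and state[0] == "running"
```

A container that started and then crashed would report a status such as exited together with a non-zero exit code, which looks_healthy treats as a failure even though the update itself was reported as successful.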
Maintaining a Working Application Container
If any of the checks returns an exit code other than 0, the previous application docker-compose.yml is used to deploy the last application version, and the failed update is reported in the Torizon Cloud.
This ensures that the last successfully deployed application continues running.
The rollback checks of remote and offline updates are not meant to replace health-monitoring best practices, which allow containers to be automatically restarted if they stop meeting user-defined health criteria.
Synchronous Updates
Synchronous updates tie the success and failure states of an OS update and an application update to one another. Because of this requirement, the process of a synchronous update differs quite a bit from a standard non-synchronous update. To debug possible update failures, it's important to understand the general process of a synchronous update:
- Update check: The process begins with checking the servers for new updates targeting the device. This is the same as in the non-synchronous case.
- Download: Next, if a new update has been confirmed, then the device begins fetching the images/firmware needed for the update.
- Download failure: If the download phase for OS or application fails, the entire download is considered a failure and the update process stops here.
- OS installation start: The installation phase starts as soon as the download phase succeeds. As with the non-synchronous case, the update will only be finalized on system reboot later on.
- Application installation start: After the OS installation, the application installation starts. The process from this point differs heavily from the non-synchronous case.
- Application pre-installation failure: If this fails, then a flag is set to roll back the OS update on reboot.
- Reboot and OS installation finish: At this point, the device has a new OS update pending on the next boot, and has the new set of container images downloaded. A reboot finalizes both updates.
- OS installation failure: If the OS update seems to have failed after the reboot, Aktualizr triggers a rollback to the previous OS version. Then, it removes the new docker-compose.yml and prunes the new container images from the system.
- Application installation finish: After reboot, if it appears that the OS update has succeeded, then Aktualizr attempts to bring down the current docker-compose.yml and bring up the new docker-compose.yml.
- Application installation failure: If the new docker-compose.yml fails to be brought up successfully, Aktualizr removes it and prunes the new container images from the system. A flag is then set telling the system to roll back to the previous OS. A reboot is then triggered to perform the OS rollback. The old docker-compose.yml and containers are still in the system, so the rollback sets the system to use them.
- Update successful and cleanup: If the new docker-compose.yml has been brought up successfully, Aktualizr removes the old docker-compose.yml, replacing it with the new one. Finally, Aktualizr prunes the system to clean up old containers and images left over by the previous docker-compose.yml.
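The synchronous flow can be summarized as a small decision function. This Python sketch only models the outcomes described in the list above; synchronous_update and its four boolean parameters are hypothetical, and the log strings are informal labels rather than real Aktualizr messages.

```python
def synchronous_update(download_ok, app_preinstall_ok, os_boot_ok, app_up_ok):
    """Return (os_version, app_version, log) for one synchronous update attempt."""
    if not download_ok:
        return "old", "old", ["download failed: update stops"]

    log = ["downloaded", "OS installation started",
           "application installation started"]
    if not app_preinstall_ok:
        log += ["application pre-installation failed",
                "OS rollback flag set", "reboot", "OS rolled back"]
        return "old", "old", log

    log.append("reboot")
    if not os_boot_ok:
        log += ["OS rolled back",
                "new compose file removed, new images pruned"]
        return "old", "old", log

    if not app_up_ok:
        log += ["new compose file removed, new images pruned",
                "OS rollback flag set", "reboot", "OS rolled back"]
        return "old", "old", log

    log += ["old compose file replaced by the new one",
            "old containers and images pruned"]
    return "new", "new", log


os_version, app_version, steps = synchronous_update(True, True, True, False)
```

Whatever combination of failures is passed in, the OS and the application end up on the same side: either both are updated or both stay on the previous version.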