Device Monitoring in Torizon OS
Introduction
Device monitoring with Torizon encompasses several different areas of functionality related to understanding the health, status, and performance of your devices.
When we think about device monitoring, we break it down into three types of data: metrics, logs, and alerts. A metric is a numerical value that we can measure and report at a regular interval over time, like memory usage, CPU temperature, or custom data from our users' applications. Logs are the log output that various parts of the system produce, including Docker container logs, kernel logs, and application logs from journald. Alerts are special events or errors that the device wants to raise in real time because they require attention or remediation, like when a critical application fails to launch, or the device is running out of storage space.
A monitoring agent is available that can collect all three types of data--metrics, logs, and alerts--and send it either to the Torizon Cloud Web Interface, or independently to other external services. Today, the Torizon Cloud supports metrics of all types. Out of the box, Torizon OS will report some basic system info, but you can also create and send your own custom metrics to build dashboards showing whatever data is most important to you. Log forwarding and real-time alerting are not available yet, but are planned for the future.
When investigating the best option for the monitoring agent in Torizon OS, we looked for an option that would be modular, event-driven, based on known and widely adopted standards, open-source, and with acceptable performance and resource usage on resource-constrained devices. Additionally, we wanted it to be flexible enough to handle all three types of device monitoring data. After considering all the options, we chose Fluent Bit as our monitoring agent.
Fluent Bit
Fluent Bit is an open-source log processor and forwarder that lets you collect data such as metrics and logs from different sources (hardware and software), enrich them with filters, and send them to multiple destinations.
In Fluent Bit, information is processed in a pipeline, with a very pluggable architecture. Data is collected with input plugins, filtered with filter plugins, and sent to remote servers with output plugins:
- Input plugins: gather and parse information from different sources (CPU, disk, memory, network, temperature, processes, kernel, logs, etc).
- Filter plugins: allow altering the data before delivering it to some destination (remove, add, change, nest, etc).
- Output plugins: allow defining a destination for the data (Prometheus, Amazon, Azure, Google Cloud, Datadog, Elasticsearch, HTTP(S), Kafka, etc).
Fluent Bit is written in C and designed with performance in mind (high throughput with low CPU and memory usage).
For more information about the project, see the Fluent Bit official documentation.
This article complies with the Typographic Conventions for the Toradex Documentation.
In this article, you will need to execute the commands as root. You can log in as root or (better) use the sudo command when logged in as a regular user.
Prerequisites
- A Toradex SoM with Torizon OS installed
- An account on the Torizon Cloud
Device Monitoring Implementation in Torizon OS
Fluent Bit is integrated and enabled in Torizon OS. By default, it's configured to monitor CPU, memory, temperature, the Docker daemon, and information related to eMMC health, and to send the information to the Torizon Cloud.
When you first boot Torizon OS, the Fluent Bit service will not start due to the absence of the /etc/fluent-bit/enabled file:
# systemctl status fluent-bit
* fluent-bit.service - Fluent Bit
     Loaded: loaded (/usr/lib/systemd/system/fluent-bit.service; enabled; vendor preset: enabled)
     Active: inactive (dead)
  Condition: start condition failed at Tue 2021-09-14 07:42:15 UTC; 5h 19min ago
             `- ConditionPathExists=/etc/fluent-bit/enabled was not met

Sep 14 07:42:15 colibri-imx6-10492785 systemd[1]: Condition check resulted in Fluent Bit being skipped.
As soon as the device is provisioned to the Torizon Cloud, the provisioning script will enable the Fluent Bit service by creating the /etc/fluent-bit/enabled file (unless you choose to disable it). After that, Fluent Bit starts collecting data and sends it to the Torizon Cloud through an API endpoint. The data transport is secured using the same TLS credentials as Aktualizr-Torizon.
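The start condition described above can be checked by hand. The following is a minimal sketch that assumes only the marker path quoted in the systemd output; the function name monitoring_marker_status is hypothetical and not part of Torizon OS:

```shell
#!/bin/sh
# Report whether the marker file that fluent-bit.service waits for exists.
# The default path is the one named in ConditionPathExists; a different
# path can be passed as the first argument (handy for testing).
monitoring_marker_status() {
    marker=${1:-/etc/fluent-bit/enabled}
    if [ -e "$marker" ]; then
        echo "marker present: start condition met"
    else
        echo "marker missing: service will be skipped"
    fi
}
```

Calling monitoring_marker_status and then systemctl is-active fluent-bit shows both the precondition and the resulting service state.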
Enabling Device Monitoring
Device monitoring is already enabled by default in Torizon OS. The only thing you need to do is provision the device to the Torizon Cloud and make sure the "Enable device metrics" box is checked. No extra steps are required.
Device monitoring is managed differently in previous versions. Please refer to the TorizonCore 5 Documentation to learn about it.
Disabling Device Monitoring
To disable device monitoring, stop and disable the Fluent Bit service:
# systemctl stop fluent-bit
# systemctl disable fluent-bit
Customizing Device Metrics for Torizon Cloud
Torizon OS 6.4.0 and earlier versions sent device monitoring data to Torizon Cloud via Aktualizr-Torizon rather than directly. In these versions, there were reports of data being dropped when sending multiple custom metrics within a short timeframe. If you experience this, please update to the latest Torizon OS version and try again.
Torizon OS ships with a default configuration file for Fluent Bit that allows it to collect five basic metrics:
- CPU usage
- Memory/swap usage
- CPU core temperature
- Docker daemon status
- eMMC health information (only available in Torizon OS 6.5.0 and newer)
However, you can modify this default configuration to send customized metrics, such as metrics from your own applications or from sensors connected to your board.
The Torizon Cloud can accept metrics of any kind, as long as they are formatted properly. In this section, you'll learn how to add a new input plugin to Fluent Bit, add a filter plugin that formats the data for the Torizon Cloud, and start sending data. If your use case isn't covered here, you can always consult the official Fluent Bit documentation to learn how to do more.
The default Fluent Bit configuration (available on your board in /etc/fluent-bit/fluent-bit.conf) has five key parts:
- A [SERVICE] section containing basic Fluent Bit options.
- Several [INPUT] sections enabling input plugins.
- Several [FILTER] sections formatting the data that those input plugins produce.
- An [OUTPUT] section that tells Fluent Bit how to send the data to Torizon Cloud.
- An include that pulls in user-developed configuration files (starting with Torizon OS v7.0.0) of the form /etc/fluent-bit/fluent-bit.d/custom-*.conf.
The included files are not loaded in any guaranteed order, so if you need to enforce ordering, you will need to include your files manually, in the required order, in /etc/fluent-bit/fluent-bit.conf.
Users migrating from v6.x to v7.y are advised to remove their customizations from /etc/fluent-bit/fluent-bit.conf and store them in an include file. Doing so will eliminate the need to merge changes in /etc/fluent-bit/fluent-bit.conf on each release of Torizon.
Additionally, interval_sec is an important parameter that defines the period, in seconds, between samples in [INPUT] sections.
The time interval for the metrics should not be less than 10 seconds. Although metrics can work with an interval of fewer than 10 seconds, this may result in unexpected and unreliable behavior.
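To catch intervals below the recommended minimum before restarting the service, a small check can scan the configuration files. This is a sketch rather than a Toradex tool: the function name check_intervals is hypothetical, and it only looks at lines whose first word is Interval_Sec (in any letter case).

```shell
#!/bin/sh
# Print every Interval_Sec setting below 10 seconds as "file:line: key value".
# Keys are compared case-insensitively. Returns 1 if any short interval is
# found and 0 otherwise.
check_intervals() {
    awk 'tolower($1) == "interval_sec" && ($2 + 0) < 10 {
             printf "%s:%d: %s %s\n", FILENAME, FNR, $1, $2
             bad = 1
         }
         END { exit bad + 0 }' "$@"
}
```

For example, check_intervals /etc/fluent-bit/fluent-bit.conf /etc/fluent-bit/fluent-bit.d/custom-*.conf lists any offending lines and returns a non-zero status if there are any.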
To add custom metrics, we don't need to change [SERVICE] or [OUTPUT]; we just need to add a new input plugin and filter. We need to set up our config file so that we get JSON-formatted output that looks like this:
{
  "custom": {
    "my_metric_1": 123.4,
    "my_metric_2": 567.8
  }
}
The specific requirements are:
- The nested object must be named custom
- It may contain any number of name/value pairs
- For each name/value pair, the value must be a number--i.e., no nested objects, strings, arrays, or booleans
- The name of the name/value pair will be the name of the metric that appears on the Torizon Cloud Web Interface
The simplest way to do this is with an input plugin that accepts raw JSON as an input, like the HTTP input plugin, and nest it under the custom key with the Nest filter plugin.
On Torizon OS 7, you can create a new file in /etc/fluent-bit/fluent-bit.d whose name starts with the string custom- and put your changes there. For example, try creating a new file named /etc/fluent-bit/fluent-bit.d/custom-mymetric.conf containing the lines below. Then restart Fluent Bit (systemctl restart fluent-bit):
[INPUT]
    name http
    host localhost
    port 9999
[FILTER]
    Name nest
    Match custom
    Operation nest
    Wildcard *
    Nest_under custom
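This will make Fluent Bit listen on port 9999 for HTTP POST requests, and any custom metrics you send to http://localhost:9999/custom will be nested inside the custom object--exactly what we need. Because the [INPUT] section sets no tag, the HTTP input plugin takes the tag from the request path, so data posted to /custom is tagged custom and matched by the filter.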
You can try it out by manually sending some JSON data to that port using curl. For example, the following curl command will send exactly the metrics given in the example JSON above:
# curl -X POST http://localhost:9999/custom \
-H 'Content-Type: application/json' \
-d '{"my_metric_1":123.4,"my_metric_2":567.8}'
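From an application or a periodic script, the same request can be wrapped in a pair of helper functions. This is a minimal sketch: the names make_metrics_json and send_metrics are hypothetical, the endpoint is the one configured in the example above, and restricting metric names to letters, digits, and underscores is a simplification made here to avoid JSON string escaping, not a Torizon Cloud rule.

```shell
#!/bin/sh
# Build a flat JSON object from name=value arguments and print it.
# Returns 1 if an argument has no "=", if a name contains characters
# outside [A-Za-z0-9_], or if a value is not a plain JSON number.
make_metrics_json() {
    json=""
    for pair in "$@"; do
        case "$pair" in
            *=*) ;;
            *) return 1 ;;
        esac
        name=${pair%%=*}
        value=${pair#*=}
        printf '%s\n' "$name" | grep -Eq '^[A-Za-z0-9_]+$' || return 1
        printf '%s\n' "$value" |
            grep -Eq '^-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?$' || return 1
        json="${json:+$json,}\"$name\":$value"
    done
    [ -n "$json" ] || return 1
    printf '{%s}\n' "$json"
}

# POST the metrics to the HTTP input configured above (localhost:9999/custom).
send_metrics() {
    payload=$(make_metrics_json "$@") || return 1
    curl -sS -X POST http://localhost:9999/custom \
        -H 'Content-Type: application/json' \
        -d "$payload"
}
```

Calling send_metrics my_metric_1=123.4 my_metric_2=567.8 sends the same body as the curl command above, while a non-numeric value such as my_metric_1=high is rejected before any request is made.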
This is just one basic example. As long as you have an input plugin that gives you the data you want, and a filter (or series of filters) to put the data into the format required, you can build in all kinds of metric reporting. A few other inputs that might be of interest:
- Serial interface for pulling data from a simple serial connection
- Process Metrics for reporting on one particular process
- Exec to periodically run a command and parse the output
Each of these will require some kind of filtering to get the data into the required format; for more information consult the Fluent Bit official documentation.
Example: Disk Usage Custom Metric
One common metric to monitor is disk usage. Fluent Bit doesn't offer an input plugin for this, but it's easy enough to get using the df command, the Exec input plugin, and a simple filter.
df -k gives us the total, used, and available space in kilobytes:
# df -k
Filesystem                 1K-blocks    Used Available Use% Mounted on
tmpfs                        1898992   25824   1873168   2% /run
devtmpfs                     1384624       0   1384624   0% /dev
/dev/disk/by-label/otaroot  15226800 1687972  12745640  12% /sysroot
Adding a | grep otaroot gives us just the disk we're looking for, and either awk or jq can give us the rest of what we need to parse that into JSON:
- jq example
# df -k | grep otaroot | jq -R -c -s 'gsub(" +"; " ") | split(" ") | { "otaroot_total": (.[1] | tonumber), "otaroot_used": (.[2] | tonumber), "otaroot_avail": (.[3] | tonumber)}'
{"otaroot_total":15226800,"otaroot_used":1687972,"otaroot_avail":12745640}
- awk example
# df -k | grep otaroot | awk '{print "{\"otaroot_total\":" $2 ",\"otaroot_used\":" $3 ",\"otaroot_avail\":" $4 "}"}'
{"otaroot_total":15226800,"otaroot_used":1687972,"otaroot_avail":12745640}
Now add one of these commands to the Fluent Bit configuration using the Exec input plugin and the Nest filter plugin. Disk usage doesn't change that frequently, so we'll only report these metrics once per hour by setting Interval_Sec to 3600. On Torizon OS 7, create a new file named /etc/fluent-bit/fluent-bit.d/custom-otaroot.conf with the following content:
[INPUT]
    Name exec
    Tag disksize
    Command df -k | grep otaroot | jq -R -c -s 'gsub(" +"; " ") | split(" ") | { "otaroot_total": (.[1] | tonumber), "otaroot_used": (.[2] | tonumber), "otaroot_avail": (.[3] | tonumber)}'
    Parser json
    Interval_Sec 3600
[FILTER]
    Name nest
    Match disksize
    Operation nest
    Wildcard *
    Nest_under custom
Restart the fluent-bit service:
# systemctl restart fluent-bit
Your device will now be reporting metrics named otaroot_total, otaroot_used, and otaroot_avail to the Torizon Cloud, and you can begin creating custom charts with those metrics.
Connecting Fluent Bit to Other Data Platforms
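If the one-liner becomes hard to maintain, the parsing step can be moved into a small script that the Exec plugin calls. The sketch below is one possible variant of the awk example: the function name df_to_json and the extra otaroot_use_pct metric (df's own Use% column with the percent sign removed) are additions made here for illustration.

```shell
#!/bin/sh
# Read `df -k` output on stdin and print one JSON object for the first line
# that mentions otaroot. All values are emitted as bare numbers.
df_to_json() {
    awk '/otaroot/ {
             pct = $5
             sub(/%/, "", pct)
             printf "{\"otaroot_total\":%s,\"otaroot_used\":%s,\"otaroot_avail\":%s,\"otaroot_use_pct\":%s}\n", $2, $3, $4, pct
             exit
         }'
}
```

A script that ends with df -k | df_to_json, saved at a path of your choosing, can then be used as the Command of the Exec input in place of the inline pipeline.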
Fluent Bit is a flexible tool that can send data to various other platforms. It is compatible with the most popular cloud providers and protocols (AWS, Microsoft Azure, Google Cloud, Datadog, Elasticsearch, etc). For more information about how to configure Fluent Bit, see the official documentation. Note that Fluent Bit can output to multiple destinations at once, so you can, for example, send device metrics to the Torizon Cloud for monitoring and use another platform for log storage.
Application Log Monitoring
As mentioned above, logs are another important type of device monitoring. Although Torizon Cloud does not yet support log aggregation, you can still use Fluent Bit to process and forward various types of logs, using another data sink for analyzing the data.
One common use case for Torizon OS is the monitoring of containers running user applications. Combined with the flexibility of Fluent Bit, this enables use cases like data aggregation, diagnostics, or basic monitoring of applications running on Torizon. The only requirement for this setup is that your applications must be running within a Docker container.
Application log monitoring is not yet available on the Torizon Cloud Web Interface. Therefore, receiving and processing this data is up to the customer's implementation.
Enabling Container Monitoring
To monitor Docker containers, we utilize the built-in Fluentd logging driver in Docker. This allows the logs of Docker containers to be sent to the Fluentd collector as structured log data. Fluentd is an open-source data collector for unified logging, similar to Fluent Bit.
In order to enable Fluentd logging for Docker containers, there are two methods:
- System-wide - setting Fluentd as the default logging driver for all Docker containers: you must edit the default Docker configuration. On Torizon OS, this config file is located at /etc/docker/daemon.json; you may need to create this file if it does not already exist. To enable Fluentd logging via the config file, consult the Docker documentation for the correct syntax and the available options.
- Per-container basis - adding a flag when you start your container: you can use it with either docker run or docker-compose. For docker run, you just need to add --log-driver=fluentd. For docker-compose, consult the Docker documentation. This method is useful if you only want to monitor specific containers.
Configuring Fluent Bit for Container Monitoring
Now Fluent Bit must be configured correctly to accept the data coming from the Fluentd logger in Docker. Fortunately Fluentd and Fluent Bit are compatible and complementary tools. All that's needed is to utilize the "Forward" input plugin for Fluent Bit. This input plugin is designed to accept data from a Fluentd stream. The only thing to be careful of is to make sure that both Docker and Fluent Bit are configured to use the same TCP ports.
As for output streams, as mentioned above our Torizon Cloud does not yet display or accept log data, though this feature is planned for the future. Therefore how the data is received and processed is up to user discretion.
Example
Open two terminals to your Torizon OS device. On the first terminal, start Fluent Bit:
# fluent-bit -i forward -o stdout -p format=json_lines -f 1
For demonstration, the output will just be set to stdout on the device.
Next, on the second terminal run a container like so:
# docker run --log-driver=fluentd debian echo "Testing a log message"
You should now see the following output in the first terminal running Fluent Bit:
{"date":1636585969.0,"source":"stdout","log":"Testing a log message","container_id":"eaf3a3e80ba3dd778ffd4c1057481cd889cbde75ed7b94ce3dddd5cc462a7c98","container_name":"/gracious_driscoll"}
The Fluentd Docker logging driver only captures data from the container's stdout or stderr. Keep this in mind for containers that you would like to monitor with this method.
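For anything longer than a quick test, the command-line options used in this example can be written as a configuration file instead. The sketch below generates a standalone file that mirrors the options from the first terminal; the function name write_forward_test_conf is hypothetical, and the Listen and Port values are the Forward plugin's defaults written out explicitly.

```shell
#!/bin/sh
# Write a standalone Fluent Bit configuration equivalent to
#   fluent-bit -i forward -o stdout -p format=json_lines -f 1
# to the path given as the first argument.
write_forward_test_conf() {
    cat > "$1" <<'EOF'
[SERVICE]
    Flush  1
[INPUT]
    Name   forward
    Listen 0.0.0.0
    Port   24224
[OUTPUT]
    Name   stdout
    Match  *
    Format json_lines
EOF
}
```

After write_forward_test_conf /tmp/forward-test.conf, running fluent-bit -c /tmp/forward-test.conf should behave like the command in the first terminal. If you change the port here, pass the matching address to Docker with --log-opt fluentd-address=host:port.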
Data Buffering
Another critical feature of Fluent Bit is buffering. In this context, buffering refers to Fluent Bit's ability to hold and store data when it cannot be delivered to the output, for example when the device is offline or cannot reach Torizon Cloud. With buffering configured, data that would normally be lost during such an event can be retained and delivered once the device is back online.
Buffering for the default metrics in Torizon OS is enabled by default starting from Torizon OS 6.5.0. See the /etc/fluent-bit/fluent-bit.conf file for the default buffering configuration.
The default settings for buffering are conservative and are meant to demonstrate buffering without causing a large impact on development. If you wish to actively make use of buffering, it is recommended to modify the configuration settings to fit your specific use case and needs.
For in-depth information about buffering and how to configure it, refer to the Fluent Bit official documentation. We will briefly cover some of the more important options as they pertain to embedded devices.
- Fluent Bit can buffer data either in memory or on the filesystem. The Fluent Bit default is memory, and it's our default as well.
  - If using memory buffering, make sure to limit how much memory can be used to buffer your data. Otherwise, excessive memory usage may negatively affect other parts of the system. Also keep in mind that memory is volatile, so this data is lost if the system loses power, crashes, or reboots.
  - If using filesystem buffering, make sure to limit how much data can be buffered on the filesystem to avoid running out of space on flash. Also be aware of how often this option writes to the flash, as frequent writes could degrade the flash over time. Finally, choose a storage path that is not affected by OTA updates, such as /var, so that your data is not lost or overwritten during an update.
- Related to buffering are retries, which control how many times and how often Fluent Bit tries to resend data that it could not deliver. Data that is still being retried is held in the buffer until it either reaches its maximum retry count or is successfully delivered.
  - Choose a retry limit that makes sense for your system. Too small and data could get discarded too quickly. Too large and pending data will build up in your buffer, possibly reaching the set buffer limit.
  - Fluent Bit has a more detailed document on configuring retries and the various intricacies involved.
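To show where these options live in a configuration file, the sketch below writes a standalone test configuration with filesystem buffering and a retry limit. It is an illustration under assumptions, not the Torizon OS default: the function name write_buffering_demo_conf, the storage directory /var/lib/fluent-bit-buffer, the tag buffer_demo, and all sizes and counts are placeholders chosen here. Because the stdout output never fails, the file demonstrates key placement rather than actual retry behavior.

```shell
#!/bin/sh
# Write a standalone Fluent Bit test configuration that uses filesystem
# buffering to the path given as the first argument.
write_buffering_demo_conf() {
    cat > "$1" <<'EOF'
[SERVICE]
    Flush        5
# Directory for buffered chunks (placeholder path under /var)
    storage.path /var/lib/fluent-bit-buffer
[INPUT]
    Name         cpu
    Tag          buffer_demo
    Interval_Sec 10
# "memory" is the Fluent Bit default; "filesystem" keeps chunks under storage.path
    storage.type filesystem
[OUTPUT]
    Name         stdout
    Match        buffer_demo
# Upper bound on buffered data queued on disk for this output
    storage.total_limit_size 5M
# Discard a chunk after this many failed delivery attempts
    Retry_Limit  5
EOF
}
```

The file can be tried out with fluent-bit -c on the generated path. For memory buffering, the per-input Mem_Buf_Limit option plays the role that storage.total_limit_size plays here.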
Device Monitoring Outliers
Torizon Cloud provides important details on your fleet, such as minimum, maximum, and average values of the metrics. However, we understand that there might be variations in the data that are not easily recognized. Previously, users were required to manually check each device, which was time-consuming and labor-intensive. This process becomes more efficient with the device monitoring outliers. After detecting a variation in the provided values, users can immediately use remote access to contact the device causing the variation, look into it, and fix it without the need for manual intervention.
For each metric, you will find the outliers by clicking the icon on the top right corner.
The icon will provide the top 10 devices on the metric data, and each device will have the option to open its device page or instantly start a remote access connection.
Webinars
Toradex has presented webinars about Device Monitoring and you can watch them on demand.
Secure Device Monitoring - Check Health, Resources and Performance
Learn more about this webinar on the landing page.
Blogs
You can learn about an interesting application for Device Monitoring in our Flash Health Monitoring Blog.