Watchdog in Linux
The Watchdog is a hardware-implemented system hang control scheme.
Watchdog timers are used in systems that must operate without human supervision. Such systems must be self-restoring without operator intervention.
Note
The JetHome JetHub controllers based on Amlogic processors support the hardware watchdog driver meson_wdt
.
Checking devices
With the watchdog module running correctly, /dev
should be able to see the devices /dev/watchdog
and /dev/watchdog0
:
$ ls -l /dev/watchdog*
crw-rw---- 1 root root 10, 130 2019-01-01 00:00 /dev/watchdog
crw-rw---- 1 root root 10, 130 2019-01-01 00:00 /dev/watchdog0
Installing the Watchdog service
Note
In firmware versions starting from Armbian 22.02 watchdog is pre-installed, manual installation is not required.
To install the watchdog service, run the following commands:
sudo apt-get update sudo apt-get install watchdog
Create a folder for watchdog log files:
sudo mkdir -p /var/log/watchdog
Check the settings for the service in the file /etc/default/watchdog
:
# Start watchdog at boot time? 0 or 1 run_watchdog=1 # Start wd_keepalive after stopping watchdog? 0 or 1 run_wd_keepalive=1 # Load module before starting watchdog watchdog_module="none" # Specify additional watchdog options here (see manpage). watchdog_options="-s -v -c /etc/watchdog.conf"
Configuration files
Make the necessary changes to the configuration file /etc/watchdog.conf
:
Uncomment the use of the device
/dev/watchdog
(otherwise watchdog service will not use the hardware timer to reboot the controller).Set the necessary checks and timeouts.
Below is an example of a configuration file, with a timeout set to 15 seconds for hanging:
Note
The value
watchdog-timeout
defines how long after the watchdog service failure the hardware timer will restart the controller.$ cat /etc/watchdog.conf #ping = 172.31.14.1 #ping = 172.26.1.255 #interface = eth0 #file = /var/log/messages #change = 1407 # Uncomment to enable test. Setting one of these values to '0' disables it. # These values will hopefully never reboot your machine during normal use # (if your machine is really hung, the loadavg will go much higher than 25) #max-load-1 = 24 #max-load-5 = 18 #max-load-15 = 12 # Note that this is the number of pages! # To get the real size, check how large the pagesize is on your machine. #min-memory = 1 #allocatable-memory = 1 #repair-binary = /usr/sbin/repair #repair-timeout = 60 #test-binary = #test-timeout = 60 # The retry-timeout and repair limit are used to handle errors in a more robust # manner. Errors must persist for longer than retry-timeout to action a repair # or reboot, and if repair-maximum attempts are made without the test passing a # reboot is initiated anyway. #retry-timeout = 60 #repair-maximum = 1 watchdog-device = /dev/watchdog # Defaults compiled into the binary #temperature-sensor = #max-temperature = 90 # Defaults compiled into the binary #admin = root #interval = 1 #logtick = 1 #log-dir = /var/log/watchdog # This greatly decreases the chance that watchdog won't be scheduled before # your machine is really loaded realtime = yes priority = 1 # Check if rsyslogd is still running by enabling the following line #pidfile = /var/run/rsyslogd.pid watchdog-timeout = 15
Autostart and service check
To enable the service autorun, run the following commands:
sudo systemctl enable watchdog sudo systemctl start watchdog
Check the serviceability of the service:
service watchdog statusApproximate conclusion:
● watchdog.service - watchdog daemon Loaded: loaded (/lib/systemd/system/watchdog.service; enabled; vendor pres> Active: active (running) since Mon 2022-02-21 17:29:24 UTC; 17h ago Process: 2718 ExecStartPre=/bin/sh -c [ -z "${watchdog_module}" ] || [ "${w> Process: 2720 ExecStart=/bin/sh -c [ $run_watchdog != 1 ] || exec /usr/sbin> Main PID: 2722 (watchdog) Tasks: 1 (limit: 977) Memory: 516.0K CPU: 3min 33.528s CGroup: /system.slice/watchdog.service └─2722 /usr/sbin/watchdog -s -v -c /etc/watchdog.conf Feb 22 10:52:23 jethubj100 watchdog[2722]: still alive after 62076 interval(s) Feb 22 10:52:24 jethubj100 watchdog[2722]: still alive after 62077 interval(s) Feb 22 10:52:25 jethubj100 watchdog[2722]: still alive after 62078 interval(s) Feb 22 10:52:26 jethubj100 watchdog[2722]: still alive after 62079 interval(s) Feb 22 10:52:27 jethubj100 watchdog[2722]: still alive after 62080 interval(s) Feb 22 10:52:28 jethubj100 watchdog[2722]: still alive after 62081 interval(s) Feb 22 10:52:29 jethubj100 watchdog[2722]: still alive after 62082 interval(s)
Note
A configured and running watchdog service constantly resets the hardware watchdog timer.
If it fails to do so (for example, if the system hangs or any other configured condition occurs), the timer will trigger and restart the controller.
Watchdog timer check
Warning
Be careful: the commands in this section cause the kernel to panic and stop the controller completely.
Use them only in a test environment!
The following command causes the linux kernel to crash artificially. If watchdog works correctly, it will automatically reboot the system after a timeout:
echo c > /proc/sysrq-trigger
Approximate conclusion:
[63168.053150] sysrq: Trigger a crash [63168.053204] Kernel panic - not syncing: sysrq triggered crash [63168.056648] CPU: 3 PID: 65544 Comm: bash Not tainted 5.15.24-meson64 #trunk.0045.jethome.0 [63168.064838] Hardware name: JetHome JetHub J100 (DT) [63168.069670] Call trace: [63168.072082] dump_backtrace+0x0/0x200 [63168.075706] show_stack+0x18/0x68 [63168.078982] dump_stack_lvl+0x68/0x84 [63168.082605] dump_stack+0x18/0x34 [63168.085882] panic+0x164/0x324 [63168.088901] sysrq_handle_crash+0x1c/0x20 [63168.092869] __handle_sysrq+0x8c/0x160 [63168.096577] write_sysrq_trigger+0x88/0x120 [63168.100719] proc_reg_write+0xac/0xf8 [63168.104340] vfs_write+0xbc/0x398 [63168.107618] ksys_write+0x68/0xf0 [63168.110895] __arm64_sys_write+0x1c/0x28 [63168.114776] invoke_syscall+0x44/0x108 [63168.118485] el0_svc_common.constprop.3+0x94/0xf8 [63168.123143] do_el0_svc+0x24/0x88 [63168.126420] el0_svc+0x20/0x50 [63168.129439] el0t_64_sync_handler+0x90/0xb8 [63168.133579] el0t_64_sync+0x180/0x184 [63168.137206] SMP: stopping secondary CPUs [63168.141091] Kernel Offset: disabled [63168.144533] CPU features: 0x00001001,00000846 [63168.148845] Memory Limit: none [63168.151871] ---[ end Kernel panic - not syncing: sysrq triggered crash ]--- AXG:BL1:d1dbf2:a4926f;FEAT:E0DC318C:2000;POC:F;EMMC:0;READ:0;0.0;CHK:0; sdio debug board detected TE: 33151Pause 15 seconds (according to the value
watchdog_timeout
in the configuration file), thenBL2 Built : 10:43:22, May 26 2021. axg g28b9431 - jenkins@walle02-sh set vcck to 1100 mv set vddee to 950 mv Board ID = 9 CPU clk: 1200MHz DDR low power enabled DDR3 chl: Rank0 16bit @ 912MHz bist_test rank: 0 1b 02 34 24 0b 3d 17 00 2f 27 0e 40 00 00 00 00 00 00 00 00 00 00 00 00 761 - PASS Rank0: 1024MB(auto)-2T-13 AddrBus test pass! eMMC boot @ 0 sw8 s storage init finish emmc switch 3 ok Authentication key not yet programmed get rpmb counter error 0x00000007 emmc switch 0 ok Load FIP TMP HDR from eMMC, src: 0x0000c200, des: 0x05100000, size: 0x00004000, part: 0 0001c000Load BL31 from eMMC, src: 0x0001c200, des: 0x05104000, size: 0x0002ac00, part: 0 bl2z: ptr: 05127358, size: 00001e18 Load FIP HDR from eMMC, src: 0x0000c200, des: 0x01700000, size: 0x00004000, part: 0 Load BL3x from eMMC, src: 0x00010200, des: 0x01704000, size: 0x0008e400, part: 0 NOTICE: BL31: v1.3(release):110e239 NOTICE: BL31: Built : 19:07:23, Jul 2 2018 NOTICE: BL31: AXG normal boot! NOTICE: BL31: BL33 decompress pass [Image: axg_v1.1.3326-d0bacc8 2018-07-05 11:21:34 jenkins@walle02-sh] OPS=0x43 25 0b 43 00 88 fc 1a 07 8d 24 0c 3a b2 65 16 59 bl30:axg ver: 9 mode: 0 bl30:axg thermal0 [0.015862 Inits done] secure task start! high task start! low task start! ERROR: Error initializing runtime service opteed_fast U-Boot 2022.01-armbian (Feb 08 2022 - 06:07:00 +0000) jethubj100