tegra: Add support for commit service and verifying root and boot slot alignment #1

Open: wants to merge 5 commits into base: dunfell

@dwalkes dwalkes commented Nov 6, 2020

Tegra platforms Xavier NX and TX2 support redundant boot with an A/B update scheme, using nvbootctrl as the mechanism to set up and control boot slots. In addition, the TX2 supports either a cboot or u-boot based boot scheme, with u-boot as the default for the MACHINE. For Xavier NX, cboot is the default bootloader. Nano supports only u-boot based boot, and nano boot redundancy is not currently supported by the meta-mender-community meta-mender-tegra layer.

With the cboot based implementations, the A/B slots used for the bootloader directly correspond to the root filesystem slot selected at boot time. The nvbootctrl slot is the single source of truth, and mender selects the active rootfs for update purposes through the fake libubootenv scripts.

With u-boot redundant boot implementation in TX2, the A/B slots used for the bootloader do not directly correspond to the root filesystem slot selected at boot time. The u-boot parameters are controlled by mender, while the boot slots are controlled by nvbootctrl. This can lead to a scenario where the NVIDIA boot components update the boot slot, based on Update State Machine, while the mender components do not change. This can result in a mismatch of the boot slot and root filesystem slot. This often manifests as failure during commit as discussed at OE4T#7 and this post on mender hub.

The easiest way to reproduce this problem on u-boot systems is by forgetting to run mender -commit after installing and rebooting in standalone install mode. When you forget this step, the NVIDIA bootloader starts a retry count, waiting for nvbootctrl mark-boot-successful to be set as described in bootloader documentation. After 7 boot attempts without committing, the NVIDIA bootloader rolls back to the previous slot. However, since mender and u-boot are not synchronized, the mender rootfs slot still references the wrong slot. If there's no mismatch between bootloader version and rootfs version you won't actually notice this difference unless you specifically look at the output of nvbootctrl get-current-slot. If you try another mender -install (or install through hosted mender) with a rootfs/boot slot mismatch, the artifact install step succeeds, however the mender -commit fails with messages like this:

---------- end of script output
ERRO[0000] ArtifactCommit_Enter script failed: statescript: error executing 'ArtifactCommit_Enter_80_bl-check-update': 1 : exit status 1
Rolling back Artifact...
INFO[0000] Setting partition for rollback: 33
ERRO[0001] statescript: error executing 'ArtifactCommit_Enter_80_bl-check-update': 1 : exit status 1

since the commit script detects a bootloader slot mismatch. At this point the only way to recover is to manually reset the boot slot to match the root partition slot using nvbootctrl set-active-boot-slot or changing the u-boot environment variables and associated root filesystem slot.
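
A minimal sketch of how the mismatch could be checked by hand before attempting another install, assuming a two-slot layout. The helper names `slot_for_part` and `slots_aligned` are hypothetical and not part of this PR, and the rootfs A/B partition numbers are platform specific, so they are passed in by the caller rather than hard-coded.

```shell
#!/bin/sh
# Hypothetical helpers, not part of this PR.

# Map a mender rootfs partition number to a boot slot (0 = A, 1 = B).
# $1 = partition to look up, $2 = rootfs A partition, $3 = rootfs B partition.
# The A/B partition numbers are an assumption supplied by the caller.
slot_for_part() {
    part="$1"; part_a="$2"; part_b="$3"
    if [ "$part" = "$part_a" ]; then
        echo 0
    elif [ "$part" = "$part_b" ]; then
        echo 1
    else
        echo "unknown rootfs partition: $part" >&2
        return 1
    fi
}

# Succeed only when the bootloader slot equals the slot implied by mender.
# $1 = slot reported by the bootloader, $2 = slot derived from mender.
slots_aligned() {
    [ "$1" = "$2" ]
}

# Possible use on a device (PART_A and PART_B are placeholders):
#   nv_slot=$(nvbootctrl get-current-slot)
#   mender_slot=$(slot_for_part "$(fw_printenv -n mender_boot_part)" "$PART_A" "$PART_B")
#   slots_aligned "$nv_slot" "$mender_slot" || echo "slot mismatch" >&2
```

Keeping the comparison separate from the commands that read the slots makes the logic easy to exercise off-target.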

To avoid this scenario, this PR proposes two changes:

  • A wrapper script which detects when mender is used in standalone mode and schedules an auto-commit on the next update.
    This makes it less likely that you end up in the scenario described above, which is probably the most common way to hit this problem. It is particularly important for u-boot platforms, but also useful for others.
  • Checks which validate and warn about boot and root filesystem slot mismatches at boot time for u-boot platforms, and which automatically:
    • Set the boot slot controlled by NVIDIA software to match the rootfs slot controlled by mender, using the mender designated rootfs slot as the source of truth (see the source-of-truth discussion in the comments on this PR).
    • Stop the mender-client service to disable upgrade attempts before the next reboot. This prevents a partial mender update from corrupting the boot or root filesystem slot and making it impossible to bring the boot slots back into alignment.

This makes the u-boot logic more closely match the cboot logic for mender update and tries to correct issues with boot slot and rootfs partition mismatch as they happen rather than waiting for the next update attempt.
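
The boot-time correction described in the second bullet could look roughly like the sketch below. It is not the script from this PR: the function name `align_boot_slot` and its return codes are hypothetical, and the caller is assumed to have already resolved both slot numbers. `nvbootctrl set-active-boot-slot` and the `mender-client` service name are taken from the description above.

```shell
#!/bin/sh
# Hypothetical sketch: make the NVIDIA boot slot follow the mender rootfs slot.
# $1 = slot implied by mender's rootfs selection
# $2 = slot reported by nvbootctrl
# Returns 0 when already aligned, 1 after correcting a mismatch (a reboot is
# then needed), 2 if one of the corrective commands failed.
align_boot_slot() {
    mender_slot="$1"; nv_slot="$2"
    if [ "$mender_slot" = "$nv_slot" ]; then
        echo "Verified"
        return 0
    fi
    echo "WARNING: boot slot $nv_slot does not match rootfs slot $mender_slot" >&2
    # The mender slot is treated as the source of truth.
    nvbootctrl set-active-boot-slot "$mender_slot" || return 2
    # Block further update attempts until the next reboot restores alignment.
    systemctl stop mender-client || return 2
    return 1
}
```

Returning a distinct status for "corrected, reboot required" lets a calling service decide whether to skip the commit step on that boot.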

@jajoosiddhant jajoosiddhant force-pushed the nvbootctrl-boot-slot-fix-wip branch from d47757d to 3b13255 Compare November 7, 2020 01:07
The mender boot slot values are not always in sync with the nvbootctrl slot values. The values are updated correctly in the mender_boot_part variable, but after a mender update nvbootctrl sometimes fails to sync up with the boot partition to use on the next reboot.
This commit rectifies the issue by identifying the mismatch, printing a warning, and dumping all the data for debugging.
At the same time it sets the active boot partition based on the value in mender_boot_part, so that there is no mismatch going forward.
The script runs as a one-shot service after every mender update.
@jajoosiddhant jajoosiddhant force-pushed the nvbootctrl-boot-slot-fix-wip branch from 3b13255 to 98e373c Compare November 7, 2020 01:14
Move verify boot slot alignment scripts into separate files so we
can support the client commit service in non redundant bootloader
cases like t210 and cboot cases where no verification is necessary

Signed-off-by: Dan Walkes <danwalkes@boulderai.com>
@dwalkes dwalkes force-pushed the nvbootctrl-boot-slot-fix-wip branch from 8a559c8 to 835fe55 Compare November 7, 2020 15:31
dwalkes commented Nov 7, 2020

@jajoosiddhant see changes in my most recent commit to support nano which doesn't have redundant bootloader support, as well as cboot cases. This way we can use the mender commit service on all platforms.

I've verified build on tegra-demo-distro, will verify it works correctly next.

@dwalkes dwalkes changed the title from "Adding mender tegra-state-script to align boot and rootfs partition" to "tegra: Add support for commit service and verifying root and boot slot alignment" Nov 7, 2020
dwalkes commented Nov 7, 2020

@jajoosiddhant I've reworked the description, please review and edit/add as needed.

dwalkes commented Nov 7, 2020

@jajoosiddhant as I wrote this up I was wondering if we should be using nvbootctrl based slots as the source of truth in a mismatch case rather than mender slots, in other words, write the u-boot parameters here to match the active boot slot from nvbootctrl instead of writing the nvbootctrl slot to match the mender slots. I think the reason we started with making the mender slot the source of truth was when we were thinking we could fix this in sync with mender update execution. The benefit of using nvbootctrl as the source of truth is this way cboot and u-boot platforms would work the same way in that respect.

dwalkes commented Nov 7, 2020

Added thread at https://hub.mender.io/t/auto-commit-for-standalone-mender-updates/2791 to ask about mender -commit strategy implemented here.

@jajoosiddhant

@dwalkes

@jajoosiddhant I've reworked the description, please review and edit/add as needed.

Way better than what I had written.

@jajoosiddhant as I wrote this up I was wondering if we should be using nvbootctrl based slots as the source of truth in a mismatch case rather than mender slots, in other words, write the u-boot parameters here to match the active boot slot from nvbootctrl instead of writing the nvbootctrl slot to match the mender slots.

But then the user would boot up from a different rootfs on reboot than he was working on in case of mismatch. We would not want to do that since we wouldn't want to rollback to a different rootfs since we do not guarantee if that is corrupted or not.
I think we only want to move to the other rootfs after an update succeeds, guaranteeing us that the bootloader and rootfs were both updated and thus it would be safe to switch.
That guarantee can be obtained only through mender slots since those are only modified after a successful mender update whereas we are not sure about when nvbootctrl changes its boot slots. We have seen cases where the nvbootctrl active boot slot does not change even after a successful mender update. If the nvbootctrl slot does not change and the mender update was successful, we can be sure to use the boot slot corresponding to mender_boot_part

dwalkes commented Nov 7, 2020

But then the user would boot up from a different rootfs on reboot than he was working on in case of mismatch. We would not want to do that since we wouldn't want to rollback to a different rootfs since we do not guarantee if that is corrupted or not.

I think we have the same scenario with the boot slot though. If we change the boot slot we can't guarantee the other one isn't corrupted. It's actually probably more likely that it is, given the fact that NVIDIA bootloader software switched it to begin with.

 * Use correct quoting on echo arguments so match succeeds
 * Move verify script into bin dir, rename with mender-tegra prefix
When boot slots aren't aligned (as detected/corrected by the verify
alignment script) we need to set a marker file in the volatile FS
to prevent future mender -install attempts from running until
the next reboot when boot slots are once again aligned.
dwalkes commented Nov 8, 2020

@jajoosiddhant the latest push appears to be working as tested on both jetson-nano-qspi-sd and jetson-tx2 uboot config on the dunfell branch of tegra-demo-distro.

I ran through the tests in https://github.com/OE4T/tegra-demo-distro/wiki/Mender-Integration-Tests

@@ -10,6 +10,11 @@ if [ $? -eq 0 ]; then
# Exit with failure and error message if we don't have alignment
# between boot slot and rootfs. It's not safe to update in this case
mender-tegra-verify-boot-rootfs-slot-alignment || exit 1
if [ -e /var/volatile/mender-tegra-boot-slot-mismatch-install-disabled ]; then

Suggest using /run instead of /var/volatile as the location for this sentinel file.

Author

Thanks for the suggestion! @jajoosiddhant can you please make this change before you test on Monday?

Sure

Verified
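
Following the review suggestion, a guard in the install wrapper might check for the sentinel under /run. This is a sketch only: the function name `install_allowed` and the `SENTINEL_DIR` override are hypothetical, and the file name simply reuses the one from the diff above.

```shell
#!/bin/sh
# Hypothetical guard, not the wrapper from this PR.
# SENTINEL_DIR defaults to /run, which is typically a tmpfs and is therefore
# emptied on every reboot (assumption about the target system).
SENTINEL_DIR="${SENTINEL_DIR:-/run}"
SENTINEL_NAME="mender-tegra-boot-slot-mismatch-install-disabled"

# Succeed only if no slot mismatch was recorded during this boot.
install_allowed() {
    if [ -e "$SENTINEL_DIR/$SENTINEL_NAME" ]; then
        echo "Boot slot mismatch was detected; reboot before installing" >&2
        return 1
    fi
    return 0
}

# Possible use in a wrapper:
#   install_allowed || exit 1
```

Making the directory overridable keeps the check testable without touching the real /run.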

manuel-wagesreither commented Nov 9, 2020

I applied this PR to zeus and tested it on one of our devices to see if it solves our problem as well.

Unfortunately, it doesn't. It does change the symptoms, though.

For reference, this is how the problem manifests on our devices, which are on some zeus branch. We flashed into slot 1. (When flashing into slot 0, the values are slightly different but the overall behaviour is the same.)

  | Priority Slot 0 | Priority Slot 1 | nvbootctrl get-current-slot | fw_printenv mender_boot_part |  
- +-----------------+-----------------+-----------------------------+------------------------------+
0 |        14       |        15       |              1              |              33              |
1 |        13       |        15       |              1              |              33              |
2 |        15       |        14       |              1              |              33              |
3 |        14       |        14       |              0              |              33              | 
4 |        14       |        15       |              0              |              33              |
5 |                               same as 1                                                        |
6 |                               same as 2                                                        |
7 |                               same as 3                                                        |
8 |                               and so on                                                        |

boot_successful is always 1
retry_count is always 7

With this PR applied, we observe the following:

  • Boot 0: Similar to boot 0 as described above.
  • Boot 1: Similar to boot 1 as described above.
  • Boot 2: Similar to boot 0 as described above, with the exception that
    • boot_successful of slot 1 is 0. (This is despite the system actually running from slot 1.)
  • Boot 3: Similar to boot 0 as described above, with the exception that
    • boot_successful of slot 1 is 0 and
    • retry_count of slot 1 is 6.
  • Boot 4 and subsequent: retry_count gets decreased with every boot.
  • Boot 10: Similar to boot 0 as described above, except that
    • boot_successful of slot 1 is 0 and
    • retry_count of slot 1 is 0 and
    • nvbootctrl get-current-slot returns 0. (Although it still runs from slot 1.)

We're not using standalone but hosted mender.

@jajoosiddhant

@manuel-wagesreither

This is what I got when I tried to reproduce your problem after applying the patch:

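Boot | Slot 0 (priority, retry_count, boot_successful) | Slot 1 (priority, retry_count, boot_successful) | nvbootctrl get-current-slot | fw_printenv -n mender_boot_part |
-----+-------------------------------------------------+-------------------------------------------------+-----------------------------+---------------------------------+
0    | 14,7,0                                          | 15,7,0                                          | 1                           | 33                              |
1    | 13,7,0                                          | 15,7,1                                          | 1                           | 33                              |
2    | 14,7,0                                          | 15,7,0                                          | 1                           | 33                              |
3    | 14,7,0                                          | 15,6,0                                          | 1                           | 33                              |
4    | 14,7,0                                          | 15,5,0                                          | 1                           | 33                              |
5    | 14,7,0                                          | 15,4,0                                          | 1                           | 33                              |
6    | 14,7,0                                          | 15,3,0                                          | 1                           | 33                              |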

Are you sure that the patch was applied in your build?
It seems that the service that was included in the patch did not deploy.
Can you please check the logs using journalctl -u mender-client-commit
Note: The service did detect a mismatch after boot2 which did not seem to happen as per your log.
I have tested this using standalone mender update. Maybe you should try reproducing your issue using standalone and I will verify via hosted mender.
These are what I got:

Boot 0:
-- Logs begin at Mon 2020-11-09 17:41:57 UTC, end at Mon 2020-11-09 17:47:49 UTC. --
Nov 09 17:42:18 dnncam-dev6000 systemd[1]: Starting Automatic Mender Commit Service...
Nov 09 17:42:18 dnncam-dev6000 mender-client-commit.sh[5005]: Verify if nvbootctrl slot matches rootfs partition
Nov 09 17:42:18 dnncam-dev6000 mender-client-commit.sh[5005]: Verified
Nov 09 17:42:18 dnncam-dev6000 mender-client-commit.sh[5057]: Starting mender wrapper
Nov 09 17:42:18 dnncam-dev6000 mender-client-commit.sh[5060]: time="2020-11-09T17:42:18Z" level=info msg="Loaded configuration file: /var/lib/mender/mender.conf"
Nov 09 17:42:18 dnncam-dev6000 mender-client-commit.sh[5060]: time="2020-11-09T17:42:18Z" level=info msg="Loaded configuration file: /etc/mender/mender.conf"
Nov 09 17:42:19 dnncam-dev6000 mender-client-commit.sh[5060]: time="2020-11-09T17:42:19Z" level=info msg="Mender running on partition: /dev/mmcblk0p33"
Nov 09 17:42:19 dnncam-dev6000 mender-client-commit.sh[5060]: time="2020-11-09T17:42:19Z" level=info msg="Update Module path \"/usr/share/mender/modules/v3\" could not be opened (open /usr/share/mender/modules/v3: no such file or directory). Update modules will not be available"
Nov 09 17:42:19 dnncam-dev6000 mender-client-commit.sh[5060]: Committing Artifact...
Nov 09 17:42:19 dnncam-dev6000 mender-client-commit.sh[5060]: time="2020-11-09T17:42:19Z" level=info msg="Committing update"
Nov 09 17:42:19 dnncam-dev6000 systemd[1]: mender-client-commit.service: Succeeded.
Nov 09 17:42:20 dnncam-dev6000 systemd[1]: Started Automatic Mender Commit Service.


Boot 1:
-- Logs begin at Mon 2020-11-09 17:50:09 UTC, end at Mon 2020-11-09 17:55:29 UTC. --
Nov 09 17:50:29 dnncam-dev6000 systemd[1]: Starting Automatic Mender Commit Service...
Nov 09 17:50:28 dnncam-dev6000 mender-client-commit.sh[5000]: Verify if nvbootctrl slot matches rootfs partition
Nov 09 17:50:29 dnncam-dev6000 mender-client-commit.sh[5000]: Verified
Nov 09 17:50:29 dnncam-dev6000 systemd[1]: mender-client-commit.service: Succeeded.
Nov 09 17:50:29 dnncam-dev6000 systemd[1]: Started Automatic Mender Commit Service.

Boot 2:
Nov 09 17:57:54 dnncam-dev6000 systemd[1]: Starting Automatic Mender Commit Service...
Nov 09 17:57:54 dnncam-dev6000 mender-client-commit.sh[4977]: Verify if nvbootctrl slot matches rootfs partition
Nov 09 17:57:55 dnncam-dev6000 mender-client-commit.sh[4977]: ********************* WARNING!!!!! *********************
Nov 09 17:57:55 dnncam-dev6000 mender-client-commit.sh[4977]: Partition mismatch for boot and rootfs on reboot
Nov 09 17:57:55 dnncam-dev6000 mender-client-commit.sh[4977]: mender boot slot: 1
Nov 09 17:57:55 dnncam-dev6000 mender-client-commit.sh[4977]: nvbootctrl slot: 0
Nov 09 17:57:55 dnncam-dev6000 mender-client-commit.sh[4977]: Setting the current nvbootctrl slot manually to the value of mender boot slot
Nov 09 17:57:55 dnncam-dev6000 mender-client-commit.sh[4977]: ********************* WARNING!!!!! *********************
Nov 09 17:57:55 dnncam-dev6000 mender-client-commit.sh[4977]: Dumping partition information
Nov 09 17:57:55 dnncam-dev6000 mender-client-commit.sh[4977]: 1
Nov 09 17:57:55 dnncam-dev6000 mender-client-commit.sh[4977]: magic:0x43424e00, version: 3 features: 3 num_slots: 2 slot: 0, priority: 14, suffix: _a, retry_count: 7, boot_successful: 0 slot: 1, priority: 15, suffix: _b, retry_count: 7, boot_successful: 0
Nov 09 17:57:55 dnncam-dev6000 mender-client-commit.sh[4977]: 33
Nov 09 17:57:55 dnncam-dev6000 mender-client-commit.sh[4977]: ********************* WARNING!!!!! *********************
Nov 09 17:57:55 dnncam-dev6000 mender-client-commit.sh[4977]: Disabling mender-client systemd service to prevent any more updates
Nov 09 17:57:55 dnncam-dev6000 mender-client-commit.sh[4977]: Reboot the device to set the active slot before proceeding for a mender -install or mender -commit
Nov 09 17:57:55 dnncam-dev6000 systemd[1]: mender-client-commit.service: Succeeded.
Nov 09 17:57:55 dnncam-dev6000 systemd[1]: Started Automatic Mender Commit Service.

Boot 3:
-- Logs begin at Mon 2020-11-09 18:01:32 UTC, end at Mon 2020-11-09 18:02:55 UTC. --
Nov 09 18:01:52 dnncam-dev6000 systemd[1]: Starting Automatic Mender Commit Service...
Nov 09 18:01:52 dnncam-dev6000 mender-client-commit.sh[4971]: Verify if nvbootctrl slot matches rootfs partition
Nov 09 18:01:53 dnncam-dev6000 mender-client-commit.sh[4971]: Verified
Nov 09 18:01:53 dnncam-dev6000 systemd[1]: mender-client-commit.service: Succeeded.
Nov 09 18:01:53 dnncam-dev6000 systemd[1]: Started Automatic Mender Commit Service.


Boot 4:
-- Logs begin at Mon 2020-11-09 18:04:40 UTC, end at Mon 2020-11-09 18:08:01 UTC. --
Nov 09 18:04:59 dnncam-dev6000 systemd[1]: Starting Automatic Mender Commit Service...
Nov 09 18:04:59 dnncam-dev6000 mender-client-commit.sh[4974]: Verify if nvbootctrl slot matches rootfs partition
Nov 09 18:04:59 dnncam-dev6000 mender-client-commit.sh[4974]: Verified
Nov 09 18:04:59 dnncam-dev6000 systemd[1]: mender-client-commit.service: Succeeded.
Nov 09 18:04:59 dnncam-dev6000 systemd[1]: Started Automatic Mender Commit Service.

Boot 5:
-- Logs begin at Mon 2020-11-09 18:09:28 UTC, end at Mon 2020-11-09 18:10:31 UTC. --
Nov 09 18:09:49 dnncam-dev6000 systemd[1]: Starting Automatic Mender Commit Service...
Nov 09 18:09:49 dnncam-dev6000 mender-client-commit.sh[5063]: Verify if nvbootctrl slot matches rootfs partition
Nov 09 18:09:49 dnncam-dev6000 mender-client-commit.sh[5063]: Verified
Nov 09 18:09:49 dnncam-dev6000 systemd[1]: mender-client-commit.service: Succeeded.
Nov 09 18:09:49 dnncam-dev6000 systemd[1]: Started Automatic Mender Commit Service.
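
To find the boots on which the service corrected a mismatch without reading every journal by hand, the output can be filtered for the warning text shown in the Boot 2 log. A small sketch: the function name `count_mismatch_warnings` is hypothetical, and the match string is copied from the log above.

```shell
#!/bin/sh
# Count mismatch warnings in journal text read from stdin.
# Prints the number of matching lines; prints 0 when there are none.
count_mismatch_warnings() {
    # grep -c exits non-zero when nothing matches, so mask that status.
    grep -c 'Partition mismatch for boot and rootfs' || true
}

# Possible use on a device (unit name as used in this thread):
#   journalctl -u mender-client-commit | count_mismatch_warnings
```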

@jajoosiddhant

I was able to reproduce the issue that @manuel-wagesreither had where the boot partition was switching if we don't have this patch applied.
Moreover, I was able to reproduce the same issue using cboot as well where the boot partition gets switched after the third reboot after an update, thus switching rootfs as well since in cboot the boot partition and rootfs are in sync.

For cboot it looks like this

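Boot | Priority Slot 0 | Priority Slot 1 | nvbootctrl get-current-slot |
-----+-----------------+-----------------+-----------------------------+
0    |        14       |        15       |              1              |
1    |        13       |        15       |              1              |
2    |        15       |        14       |              1              |
3    |        15       |        14       |              1              |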

manuel-wagesreither commented Nov 9, 2020

@jajoosiddhant

This is what I got when I tried to reproduce your problem after applying the patch:

[...]

Are you sure that the patch was applied in your build?
It seems that the service that was included in the patch did not deploy.
Can you please check the logs using journalctl -u mender-client-commit

I'm not sure if you're looking at the right part of my post. Please note that the table I posted shows our system behaviour without the patch applied. The behaviour with the patch applied is noted underneath. I'm wondering, because you phrase it as if our systems would show different behaviour, although the behaviour looks pretty similar to me:

  • Slot priorities change in the same way
  • Retry count of slot 1 starts to decrease with boot 3

Please note the decreasing retry count, which happens on your systems as well. I think this alone makes this PR not mergeable yet.

Note: The service did detect a mismatch after boot2 which did not seem to happen as per your log.

I do think this happened to our device as well. At boot 2 (which is the third boot) our system starts to show different behavior depending on whether the patch is applied or not. So I think it's your service which kicked in. I can't check as I currently haven't got access to the device.

Thank you for looking into this by the way. Your work is of great help to us.

@jajoosiddhant

I'm wondering, because you phrase it as if our systems would show different behaviour, although the behaviour looks pretty similar to me

Sorry for the miscommunication, kind of read your comment incorrectly. I can definitely see what you are seeing.

I tried to get rid of the decrementing retry count after an update by setting boot_successful=1 for the current boot slot on every reboot but that did not work at all. It just toggled between these two states giving me a partition mismatch error on every alternate reboot.

Boot 1:
root@dnncam-dev6000:~# /data/slot_info.sh
Dump Slot Info:
magic:0x43424e00,             version: 3             features: 3             num_slots: 2
slot: 0,             priority: 14,             suffix: _a,             retry_count: 7,             boot_successful: 1
slot: 1,             priority: 14,             suffix: _b,             retry_count: 7,             boot_successful: 0
Current Slot: 0
Mender Boot Part: 1

Boot 2:
root@dnncam-dev6000:~# /data/slot_info.sh
Dump Slot Info:
magic:0x43424e00,             version: 3             features: 3             num_slots: 2
slot: 0,             priority: 15,             suffix: _a,             retry_count: 7,             boot_successful: 0
slot: 1,             priority: 14,             suffix: _b,             retry_count: 7,             boot_successful: 0
Current Slot: 0
Mender Boot Part: 1
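
Comparing dumps like the two above across many reboots is easier with a small extractor. The sketch below is hypothetical (the name `slot_field` is not part of this PR or of nvbootctrl) and assumes the layout shown above, with one `slot:` line per slot made of comma-separated `key: value` pairs.

```shell
#!/bin/sh
# Print the value of one field for one slot from slot-dump text on stdin.
# $1 = slot number, $2 = field name (priority, suffix, retry_count,
# boot_successful). Assumes one "slot: N, key: value, ..." line per slot.
slot_field() {
    awk -v slot="$1" -v field="$2" '
        /^slot:/ {
            split("", kv)                       # reset the key/value map
            n = split($0, pairs, /,[ \t]*/)     # "key: value" chunks
            for (i = 1; i <= n; i++) {
                split(pairs[i], p, /:[ \t]*/)
                sub(/[ \t\r]+$/, "", p[2])      # drop trailing whitespace
                kv[p[1]] = p[2]
            }
            if (kv["slot"] == slot) print kv[field]
        }'
}

# Possible use, reusing the dump script from the comment above:
#   /data/slot_info.sh | slot_field 1 retry_count
```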

I guess the problem starts as soon as we see the priority number 13 on any of the slots. There have been cases where I was able to mender update smoothly without any issues. The NVIDIA documentation also does not mention any state with priority number 13 for any of the slots.

The issue can also be in updating to different BUPs but we have not gone down the road to investigate that for now.
It seems that for the short term, we will need to find a way to mark one slot bootable and the other invalid, based on the nvbootctrl documentation, and to set the priority of the bootable slot to 15 and of the invalid one to 14 so that it does not switch boot partitions on reboots. I am gonna go down that road next to see if I can get something to work for mender updates in the short term.

@jajoosiddhant

I also tried removing the ArtifactInstall_Leave_80_bl-update script so that we don't update the bootloader and just boot from slot 0 for either rootfs, which seemed to work.
The priorities never got messed up, and in the u-boot case I could reboot at least 6 times without booting from the other slot.
This seems to point to an issue with the way nv_update_engine updates the BUP and changes priorities for both slots.

dwalkes commented Nov 11, 2020

See OE4T#8 for the latest status of this.

4 participants