Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

save transaction: database is locked on podman exec #20809

Closed
jpalus opened this issue Nov 28, 2023 · 16 comments · Fixed by #20838
Closed

save transaction: database is locked on podman exec #20809

jpalus opened this issue Nov 28, 2023 · 16 comments · Fixed by #20838
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. sqlite Bugs pertaining to the sqlite database backend

Comments

@jpalus
Copy link

jpalus commented Nov 28, 2023

Issue Description

Upgraded to podman 4.8.0, did a system reset and took sqlite db for a spin. Started creation of 3 independent containers in parallel followed by some setup steps done with podman exec and one such invocation ended up with:

ERRO[0008] Container 68e95ee6822e1975764367770b5f9def2717de18a0624d71f1c857809303343d exec session cda744d36c8c72ce7f701cad851001f22e32432855dae12f1cda745bf4372aea error: saving container 68e95ee6822e1975764367770b5f9def2717de18a0624d71f1c857809303343d state: beginning container 68e95ee6822e1975764367770b5f9def2717de18a0624d71f1c857809303343d save transaction: database is locked

Steps to reproduce the issue

No reproducible steps I'm afraid. I assume it's a timing issue.

Describe the results you received

podman exec failed once with:

ERRO[0008] Container 68e95ee6822e1975764367770b5f9def2717de18a0624d71f1c857809303343d exec session cda744d36c8c72ce7f701cad851001f22e32432855dae12f1cda745bf4372aea error: saving container 68e95ee6822e1975764367770b5f9def2717de18a0624d71f1c857809303343d state: beginning container 68e95ee6822e1975764367770b5f9def2717de18a0624d71f1c857809303343d save transaction: database is locked

Describe the results you expected

No error should be present.

podman info output

host:
  arch: arm64
  buildahVersion: 1.33.2
  cgroupControllers:
  - cpu
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: Unknown
    path: /usr/bin/conmon
    version: 'conmon version 2.1.8, commit: 9aa8067313b1282097df84ab1a218149999397e4'
  cpuUtilization:
    idlePercent: 98.65
    systemPercent: 1.05
    userPercent: 0.29
  cpus: 8
  databaseBackend: sqlite
  distribution:
    distribution: pld
    version: "3.0"
  eventLogger: journald
  freeLocks: 2045
  hostname: rock
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.10.110-8-rockchip
  linkmode: dynamic
  logDriver: journald
  memFree: 9260699648
  memTotal: 16487534592
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: Unknown
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.9.0
    package: Unknown
    path: /usr/libexec/podman/netavark
    version: netavark 1.9.0
  ociRuntime:
    name: crun
    package: Unknown
    path: /usr/bin/crun
    version: |-
      crun version 1.12
      commit: ce429cb2e277d001c2179df1ac66a470f00802ae
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: ""
    package: ""
    version: ""
  remoteSocket:
    exists: false
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: ""
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: Unknown
    version: |-
      slirp4netns version 1.2.1
      commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.4
  swapFree: 8096051200
  swapTotal: 8242851840
  uptime: 1284h 47m 38.00s (Approximately 53.50 days)
  variant: v8
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /home/users/jan/.config/containers/storage.conf
  containerStore:
    number: 3
    paused: 0
    running: 3
    stopped: 0
  graphDriverName: btrfs
  graphOptions: {}
  graphRoot: /mnt/build-storage/jan/containers
  graphRootAllocated: 274877906944
  graphRootUsed: 12658704384
  graphStatus:
    Build Version: Btrfs v6.6.2
    Library Version: "102"
  imageCopyTmpDir: /home/users/jan/tmp
  imageStore:
    number: 3
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /mnt/build-storage/jan/containers/volumes
version:
  APIVersion: 4.8.0
  Built: 1701166825
  BuiltTime: Tue Nov 28 11:20:25 2023
  GitCommit: bf038720ca8b9301a3aa80966759567c41199dd0
  GoVersion: go1.21.4
  Os: linux
  OsArch: linux/arm64
  Version: 4.8.0

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

No response

Additional information

No response

@jpalus jpalus added the kind/bug Categorizes issue or PR as related to a bug. label Nov 28, 2023
@Luap99
Copy link
Member

Luap99 commented Nov 28, 2023

@mheon @edsantiago Did you see this error before?

@jpalus Any chance you can share what commands you were running? I agree that it looks like a race when running podman in parallel.

@Luap99 Luap99 added the sqlite Bugs pertaining to the sqlite database backend label Nov 28, 2023
@jpalus
Copy link
Author

jpalus commented Nov 28, 2023

@Luap99 I guess I could though I guess they won't be of much use. These are pretty specific containers. From a high level perspective:

  1. podman create of systemd enabled container running full init
  2. Followed by podman start
  3. Followed by multiple simple podman exec doing some package installations and simple setup tuning.

Command that failed with mentioned error was package installation -- after installation was done and podman was returning back to shell.

@edsantiago
Copy link
Member

Ugh.... no, I haven't seen "database is locked" since May. But we do zero parallel testing.

@mheon
Copy link
Member

mheon commented Nov 28, 2023

@Luap99 Yes - this is why we disabled WAL mode, which was enough to calm CI. Evidently not sufficient.

I think we may need some form of retry wrapper around SQLite operations, because that error should not be fatal - we ought to be able to spin until the DB is unlocked and continue without issue.

@mheon
Copy link
Member

mheon commented Nov 28, 2023

(As a bonus, we could investigate reenabling WAL mode if we had such a wrapper)

@jpalus
Copy link
Author

jpalus commented Nov 28, 2023

And a new occurrence when I wanted to remove those containers:

$ podman rm -f -a
68e95ee6822e1975764367770b5f9def2717de18a0624d71f1c857809303343d
ca58f9876ea2081d5d9881fefae04926a36444e1b85810b9b0a97229fba1824a
Error: database is locked

@Luap99
Copy link
Member

Luap99 commented Nov 28, 2023

Doesn't retry just hide the real issue? At least reading https://www2.sqlite.org/cvstrac/wiki?p=DatabaseIsLocked this seems like a logic bug in our code.

By looking at all the transaction code paths. I found at least one call that does not call Commit() on the transaction.

func (s *SQLiteState) RemovePodContainers(pod *Pod) (defErr error) {

But this is rare pod rm code paths lo I find it unlikely that it is hit here.

@Luap99
Copy link
Member

Luap99 commented Nov 29, 2023

I had 10+ podman run processes running in parallel for hours but never got any error, I also tried with exec. I think there might be more to the story here.

One difference is I am on x86_64 and you one aarch64. Also looking at performance during my stress testing podman was IO bound so I wonder if it has to do anything with the underlying fs/disk speed.
Another difference is kernel 5.10 vs 6.5 on my system.

@jpalus
Copy link
Author

jpalus commented Nov 29, 2023

Checked on arm64/kernel 6.6.3 as well as x86/kernel 6.6.0 with same outcome. I had more luck with reproducing the latter error during podman rm -f -a. For me following works pretty consistently (although not a 100% reproducer):

  1. podman system reset
  2. Create 100 containers possibly more
for i in `seq 1 100`; do podman create docker.io/fedora; done
  1. Remove all containers but slow it down with strace to possibly make timing issues more likely:
strace -o log -s1024 -f podman rm -a

@Luap99
Copy link
Member

Luap99 commented Nov 29, 2023

Thanks, running with strace does indeed let me reproduce.

@Luap99 Luap99 self-assigned this Nov 29, 2023
Luap99 added a commit to Luap99/libpod that referenced this issue Nov 29, 2023
Only one process can write to the sqlite db at the same time, if another
process tries to use it at that time it fails and a database is locked
error is returned. If this happens sqlite should keep retrying until it
can write. To do that we can just set the _busy_timeout option. A 100s
timeout should be enough even on slower systems but not to much in case
there is a deadlock so it still returns in a reasonable time.

[NO NEW TESTS NEEDED] I think we strongly need to consider some form of
parallel stress testing to catch bugs like this.

Fixes containers#20809

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@Luap99
Copy link
Member

Luap99 commented Nov 29, 2023

@jpalus Any chance you can rebuild your podman with #20838 and test again?

Luap99 added a commit to Luap99/libpod that referenced this issue Nov 30, 2023
Only one process can write to the sqlite db at the same time, if another
process tries to use it at that time it fails and a database is locked
error is returned. If this happens sqlite should keep retrying until it
can write. To do that we can just set the _busy_timeout option. A 100s
timeout should be enough even on slower systems but not to much in case
there is a deadlock so it still returns in a reasonable time.

[NO NEW TESTS NEEDED] I think we strongly need to consider some form of
parallel stress testing to catch bugs like this.

Fixes containers#20809

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@jpalus
Copy link
Author

jpalus commented Nov 30, 2023

A little late to the party but so far so good. Didn't notice any database is locked errors. However #20838 makes container creation (all the setup steps combined) take about 15% more time than BoltDB or SQLite without a fix.

@jpalus
Copy link
Author

jpalus commented Nov 30, 2023

One strange issue occurred just now, that is with a stock podman 4.8.0 without a fix and sqlite db backend. Not sure if it's hidden "database locked" issue or wrong ordering somewhere. podman system reset with 3 containers running failed with:

ERRO[0014] image used by 929b991f97c04ce7d5a203abb206626c2285e63fe45e9b690a44fce236006417: image is in use by a container: consider listing external containers and force-removing image 

@mheon
Copy link
Member

mheon commented Nov 30, 2023

That's probably separate; can you open a fresh bug for that? Probably missing a --force in the system reset logic somewhere.

@jpalus
Copy link
Author

jpalus commented Dec 19, 2023

Just got another instance of it with podman 4.8.1 which is supposed to fix the issue:

$ podman rm -f -a
3b804f42923c8b5674a20fc0a820d46d6a0afcbcd80883d3b7953c589a1d4c2e
8a19b804c4f766404697acf8a32e1065063802a13c04770564ba3faa1dd895c8
Error: cleaning up container 1e6a8fb3e45f5fc7139583632bff90cc048601cac11e28bf5e11816b77125e2b: unmounting container 1e6a8fb3e45f5fc7139583632bff90cc048601cac11e28bf5e11816b77125e2b storage: unmounting container 1e6a8fb3e45f5fc7139583632bff90cc048601cac11e28bf5e11816b77125e2b: saving container 1e6a8fb3e45f5fc7139583632bff90cc048601cac11e28bf5e11816b77125e2b state: beginning container 1e6a8fb3e45f5fc7139583632bff90cc048601cac11e28bf5e11816b77125e2b save transaction: database is locked

These 3 containers are slow to remove (lots of files, spinning storage) so not really sure if those transactions are really "short-lived".

@mheon
Copy link
Member

mheon commented Dec 19, 2023 via email

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Mar 19, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 19, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. sqlite Bugs pertaining to the sqlite database backend
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants