
Random XAmzContentSHA256Mismatch Errors #855

Closed
jtyoung84 opened this issue Mar 5, 2024 · 6 comments · Fixed by #858

@jtyoung84

Describe the bug
We have a script that had been working reliably for several months. Recently it started failing intermittently with: OSError: [Errno 5] An error occurred (XAmzContentSHA256Mismatch) when calling the PutObject operation: The provided 'x-amz-content-sha256' header does not match what was computed.

To Reproduce
Steps to reproduce the behavior:

  1. Installed s3fs on Python 3.11 using miniconda
  2. Ran the script in the Additional Context section, which mimics what we're doing
  3. At some point between 30GB and 200GB of uploads, the error occurs and the upload stops. We've run several tests, and there doesn't appear to be a reliable, easy way to reproduce the error.

Expected behavior
Ideally, the upload should succeed. We're looking into writing a custom retry step ourselves, since the failed upload does not appear to be retried (a rough sketch of such a wrapper follows the example script below).

Desktop (please complete the following information):

  • OS: CentOS 7 and other Linux distributions
  • s3fs: tried 2024.2.0, 2023.12.2, and 2023.9.0

Additional context

Example script (assuming the bucket already exists):

# conda create -n s3fs-test python=3.11
# conda activate s3fs-test
# pip install s3fs

import logging
import os

# Must be set before s3fs is imported so its logger picks it up.
os.environ["S3FS_LOGGING_LEVEL"] = "DEBUG"
import s3fs

logging.basicConfig(level=logging.INFO)

_MAX_S3_RETRIES = 2
_S3_RETRY_MODE = "adaptive"

output = "s3://some_bucket/some_prefix"

s3 = s3fs.S3FileSystem(
    anon=False,
    config_kwargs={
        # botocore retry settings, passed through to the underlying client
        "retries": {
            "total_max_attempts": _MAX_S3_RETRIES,
            "mode": _S3_RETRY_MODE,
        }
    },
    use_ssl=False,
)

store = s3fs.S3Map(root=output, s3=s3, check=False)
# Random data of ~11 MiB per value
random_bytes = bytearray(os.urandom(12_000_000))
# Create a values_dict with 200 keys and upload it to S3.
# Repeat 250 times: 250 * 200 * 12 MB is roughly 600 GB total.
for y in range(250):
    values_dict = {f"item_{y}_{x}": random_bytes for x in range(200)}
    store.setitems(values_dict)
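As a stopgap along the lines of the custom retry step mentioned under "Expected behavior", one could wrap setitems and retry on this specific failure. This is only a minimal sketch: the function name, attempt count, and backoff are made up for illustration, not a fix from s3fs itself.

import time

def setitems_with_retry(store, values_dict, attempts=3, backoff=2.0):
    # Retry on the OSError that s3fs raises for this failure; the puts are
    # idempotent here, so re-sending all keys after a failure is safe.
    for attempt in range(1, attempts + 1):
        try:
            store.setitems(values_dict)
            return
        except OSError as err:
            if attempt == attempts or "XAmzContentSHA256Mismatch" not in str(err):
                raise
            time.sleep(backoff * attempt)  # simple linear backoff before retrying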

Error Message:

2024-03-01 12:17:55,741 - s3fs - DEBUG - _error_wrapper -- Client error (maybe retryable): An error occurred (XAmzContentSHA256Mismatch) when calling the PutObject operation: The provided 'x-amz-content-sha256' header does not match what was computed.
DEBUG:s3fs:Client error (maybe retryable): An error occurred (XAmzContentSHA256Mismatch) when calling the PutObject operation: The provided 'x-amz-content-sha256' header does not match what was computed.
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/s3fs-test/lib/python3.11/site-packages/s3fs/core.py", line 113, in _error_wrapper
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/s3fs-test/lib/python3.11/site-packages/aiobotocore/client.py", line 408, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (XAmzContentSHA256Mismatch) when calling the PutObject operation: The provided 'x-amz-content-sha256' header does not match what was computed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
  File "/home/user/miniconda3/envs/s3fs-test/lib/python3.11/site-packages/fsspec/mapping.py", line 124, in setitems
    self.fs.pipe(values)
  File "/home/user/miniconda3/envs/s3fs-test/lib/python3.11/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/s3fs-test/lib/python3.11/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/home/user/miniconda3/envs/s3fs-test/lib/python3.11/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
                ^^^^^^^^^^
  File "/home/user/miniconda3/envs/s3fs-test/lib/python3.11/site-packages/fsspec/asyn.py", line 399, in _pipe
    return await _run_coros_in_chunks(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/s3fs-test/lib/python3.11/site-packages/fsspec/asyn.py", line 254, in _run_coros_in_chunks
    await asyncio.gather(*chunk, return_exceptions=return_exceptions),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/s3fs-test/lib/python3.11/asyncio/tasks.py", line 452, in wait_for
    return await fut
           ^^^^^^^^^
  File "/home/user/miniconda3/envs/s3fs-test/lib/python3.11/site-packages/s3fs/core.py", line 1109, in _pipe_file
    return await self._call_s3(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/s3fs-test/lib/python3.11/site-packages/s3fs/core.py", line 348, in _call_s3
    return await _error_wrapper(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/s3fs-test/lib/python3.11/site-packages/s3fs/core.py", line 140, in _error_wrapper
    raise err
OSError: [Errno 5] An error occurred (XAmzContentSHA256Mismatch) when calling the PutObject operation: The provided 'x-amz-content-sha256' header does not match what was computed.
@martindurant (Member) commented Mar 6, 2024

The simple fix may be to include this in the list of retriable errors; but no, I don't really know what might be going on here aside from network data corruption (which ought to be vanishingly rare over SSL).

Edit: is use_ssl=False a requirement?

@carshadi (Contributor) commented Mar 7, 2024

Hi @martindurant ,

use_ssl=False is not a requirement, but setting it to True did not prevent the error.

We tried adding this and a rarer (but also new to us) "400 Bad Request" error to the retriable errors, and that let the 120TB upload job finish successfully. See carshadi@4b305f7
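For readers following along, the idea is to treat these error codes as retryable inside s3fs's _error_wrapper retry loop. The sketch below only illustrates the approach and is not the actual patch; the code list, attempt count, and backoff are placeholders (see the commit above for the real change).

import asyncio
from botocore.exceptions import ClientError

# Placeholder list for illustration; the real change lives in s3fs/core.py.
RETRYABLE_ERROR_CODES = ["XAmzContentSHA256Mismatch", "400"]

async def call_with_retries(func, *args, retries=5, **kwargs):
    for attempt in range(retries):
        try:
            return await func(*args, **kwargs)
        except ClientError as err:
            code = err.response.get("Error", {}).get("Code")
            if attempt == retries - 1 or code not in RETRYABLE_ERROR_CODES:
                raise
            # capped exponential backoff before the next attempt
            await asyncio.sleep(min(1.7**attempt * 0.1, 15))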

We first noticed both of these errors on 2/27; the uploads had been working flawlessly before that.

The job uses a local SLURM cluster to upload an OME-Zarr dataset to S3 in parallel using zarr-python + dask. We used 52 compute nodes (256 CPUs total) for the job, and all nodes encountered the error multiple times, though there appears to be some structure, at least at the beginning of the job.

There were 483 occurrences of the hash error and only 1 bad request over an ~8-hour period.

[attached chart: error occurrences over time]

@carshadi (Contributor)

@martindurant, an update on this: it turns out that use_ssl=True does indeed resolve the error; sorry about the mishap. The weird thing is that we had been using use_ssl=False since November without issue. Do you think it makes sense to add these errors to the retry cases?
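For anyone hitting this before a retry-based fix lands: since use_ssl defaults to True in s3fs.S3FileSystem, the workaround is simply to drop the override.

import s3fs

# SSL stays on (the default); just omit use_ssl=False
s3 = s3fs.S3FileSystem(anon=False)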

@martindurant (Member)

> it makes sense to add these errors to the retry cases

Yes, I think so: your information confirms that this is essentially a connection error like any other. I cannot think of a reason it should only have started appearing recently.

@martindurant (Member)

> Do you think it makes sense to add these errors to the retry cases?

Sorry, I didn't clarify: do you plan to make this PR?

@carshadi (Contributor)

Hi @martindurant, yes, I just opened a PR (#858). Feedback appreciated, thanks.
