Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix certain hash miss matches in case of non weel installations #7594

Conversation

quantum-byte
Copy link
Contributor

@quantum-byte quantum-byte commented Mar 2, 2023

Addressed Issue:

  1. Have a a package which provides both wheel and non wheel archives (like a tar.gz)
  2. Force the use of the non wheel archive during installation:
export POETRY_INSTALLER_NO_BINARY=dagster-docker; poetry install dagster-docker
  1. Get error like this:
 RuntimeError

  Hash for dagster-docker (0.17.20) from archive dagster_docker-0.17.20-py3-none-any.whl not found in known hashes (was: sha256:f0369c37d4328bd140c727494f65b88c49df0999c1981912bc68274e8412d793)

This is due to the original wheel archive from the registry being overwritten in the poetry artifact cache by the newly created wheel archive from the sdist tar.gz archive. The new wheel archive though has minimal changes (bdist version etc.) and causes the sha265 hash miss match. (Overwrite does not happen but the wrong hash is checked. See comment )

Changes to fix the issue:

  • Adjusted Chef.prepare() to always write to a temporary directory. Otherwise we would overwrite an already existing cached wheel. This could then make the installation fail due to miss matched hashes.
  • Changed Executor._download_link() to validate package hashes via _populate_hashes_dict before building a new wheel in case we use a non wheel archive.

This eliminates the caching of the built wheel for non wheel archives but i did not see an easy way to keep this caching and fix the issue. Do you think this will impact performance in a meaningfull way ?

Pull Request Check List

Resolves: Unclear - I have not created a specific issue for my problem but it could address some issues related to hash miss matches.

  • Added tests for changed code.
  • Updated documentation for changed code.

@radoering
Copy link
Member

This eliminates the caching of the built wheel for non wheel archives but i did not see an easy way to keep this caching and fix the issue. Do you think this will impact performance in a meaningfull way ?

It will have a significant impact.

Copy link
Member

@radoering radoering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about the issue and trying a few things, I suspect the solution is much simpler: Don't move the _populate_hashes_dict(), but make it conditional:

if archive.suffix != ".whl":
    ...

elif link.is_wheel:
    self._populate_hashes_dict(archive, package)

We only want to populate the dict with (and check the hash of) the wheel, if it was downloaded from somewhere and not built from an sdist by poetry.

@radoering
Copy link
Member

This is due to the original wheel archive from the registry being overwritten in the poetry artifact cache by the newly created wheel archive from the sdist tar.gz archive.

By the way, that's not true. Example:

Cache after installing tomli from wheel:

$ find ~/.cache/pypoetry/artifacts -name tomli*
/home/randy/.cache/pypoetry/artifacts/62/12/b6/6db9ebb9c8e1a6c5aa8a92ae73098d8119816b5e8507490916621bc305/tomli-2.0.1-py3-none-any.whl

Cache after installing tomli from sdist afterwards:

$ find ~/.cache/pypoetry/artifacts -name tomli*
/home/randy/.cache/pypoetry/artifacts/aa/63/d4/79512b2ef845b845ecbc074b5689753b3fc1b4ec2048cfff310fb36ea0/tomli-2.0.1-py3-none-any.whl
/home/randy/.cache/pypoetry/artifacts/aa/63/d4/79512b2ef845b845ecbc074b5689753b3fc1b4ec2048cfff310fb36ea0/tomli-2.0.1.tar.gz
/home/randy/.cache/pypoetry/artifacts/62/12/b6/6db9ebb9c8e1a6c5aa8a92ae73098d8119816b5e8507490916621bc305/tomli-2.0.1-py3-none-any.whl

Artifact cache folders are per link. A downloaded sdist and a downloaded wheel have different links and thus are stored in different folders. The self-built wheel is put in the same folder as the sdist.

@radoering radoering added the area/installer Related to the dependency installer label Mar 4, 2023
@quantum-byte
Copy link
Contributor Author

quantum-byte commented Mar 4, 2023

This is due to the original wheel archive from the registry being overwritten in the poetry artifact cache by the newly created wheel archive from the sdist tar.gz archive.

By the way, that's not true. Example:

Wow my bad. I checked the file hashes for the filename but i missed comparing the path that it was actually the same file that i was looking at. I even saw the url logic for the cache path generation 😅. So much for jumping to conclusions.

This makes your solution work and we get the caching for generated wheels back 🥳.

Though i am not sure if its a good idea to skip validation for everything besides wheels. Should poetry not validate the downloaded and (cached) tar.gz file if there is a hash from the repo available ? Or is that done somewhere else that i did not see.

I provided a solution which still includes validation of non_wheel archives if possible. But then i am not sure what hash should be recorded in this case (the original e.g. tar.gz hash, the generated whl hash or none) ?

@radoering
Copy link
Member

Though i am not sure if its a good idea to skip validation for everything besides wheels. Should poetry not validate the downloaded and (cached) tar.gz file if there is a hash from the repo available ? Or is that done somewhere else that i did not see.

I provided a solution which still includes validation of non_wheel archives if possible. But then i am not sure what hash should be recorded in this case (the original e.g. tar.gz hash, the generated whl hash or none) ?

Good point. I've noticed this too, but would have put it off until later if you hadn't brought it up. I think you are on the right track so we can also tackle it now. 😄 It does not make sense to record the hash of the generated wheel because this hash is not known by anybody else and the wheel is only intermediate. We always have to record the original url and hash (tar.gz in this case).

If I see it correctly, there is still one flaw: The hash of the tar.gz is only recorded if there is not yet a cached wheel. In order to fix this, I assume we need two versions of get_cached_archive_for_link(): one that prefers cached wheels and one that only returns the original archive. So maybe, let's introduce a parameter strict where strict=False is the current implementation and strict=True returns the original archive or None if it is not cached.

Then, we can use strict=True in

archive = self._chef.get_cached_archive_for_link(link)

afterwards always call _populate_hashes_dict() (because we always have the original distribution) and at last call get_cached_archive_for_link() with strict=False to get a cached wheel if there exists one.

@quantum-byte
Copy link
Contributor Author

Though i am not sure if its a good idea to skip validation for everything besides wheels. Should poetry not validate the downloaded and (cached) tar.gz file if there is a hash from the repo available ? Or is that done somewhere else that i did not see.
I provided a solution which still includes validation of non_wheel archives if possible. But then i am not sure what hash should be recorded in this case (the original e.g. tar.gz hash, the generated whl hash or none) ?

Good point. I've noticed this too, but would have put it off until later if you hadn't brought it up. I think you are on the right track so we can also tackle it now. smile It does not make sense to record the hash of the generated wheel because this hash is not known by anybody else and the wheel is only intermediate. We always have to record the original url and hash (tar.gz in this case).

If I see it correctly, there is still one flaw: The hash of the tar.gz is only recorded if there is not yet a cached wheel. In order to fix this, I assume we need two versions of get_cached_archive_for_link(): one that prefers cached wheels and one that only returns the original archive. So maybe, let's introduce a parameter strict where strict=False is the current implementation and strict=True returns the original archive or None if it is not cached.

Ok that looks like a good idea. I dont have the time today but will have a look tomorrow to get a implementation setup up and improve the tests to cover it.

@quantum-byte quantum-byte force-pushed the fix-non-wheel-build-artifact-overwrite branch from 993361d to 21d6675 Compare March 5, 2023 22:27
…herwise we would overwrite an already existing cached wheel. This could then make the installation fail due to missmatched hashes.

Changed Executor._download_link() to validate package hashes via _populate_hashes_dict before building a new wheel incase we use a non wheel archive.
@quantum-byte quantum-byte force-pushed the fix-non-wheel-build-artifact-overwrite branch from a547904 to 87181dc Compare March 5, 2023 22:31
@quantum-byte
Copy link
Contributor Author

@radoering I did find some time already and the implementation including tests is complete as far as i can see. Let me know if there is still something missing and how to proceed from here :).

src/poetry/installation/executor.py Outdated Show resolved Hide resolved
src/poetry/installation/chef.py Outdated Show resolved Hide resolved
src/poetry/installation/executor.py Outdated Show resolved Hide resolved
tests/installation/test_executor.py Outdated Show resolved Hide resolved
tests/installation/test_executor.py Outdated Show resolved Hide resolved
@quantum-byte quantum-byte requested a review from radoering March 6, 2023 10:23
tests/installation/test_executor.py Outdated Show resolved Hide resolved
tests/installation/test_executor.py Outdated Show resolved Hide resolved
Co-authored-by: Randy Döring <30527984+radoering@users.noreply.github.com>
tests/installation/test_chef.py Outdated Show resolved Hide resolved
@radoering radoering added impact/backport Requires backport to stable branch backport/1.4 impact/changelog Requires a changelog entry labels Mar 7, 2023
@radoering radoering enabled auto-merge (squash) March 7, 2023 16:14
@radoering radoering merged commit 26fbfd6 into python-poetry:master Mar 7, 2023
poetry-bot bot pushed a commit that referenced this pull request Mar 7, 2023
* fix hash mismatch during installation with `no-binary` option for packages that provide suitable wheels
* do not skip hash check for sdists
* write hash for sdists in direct_url.json

(cherry picked from commit 26fbfd6)
@quantum-byte quantum-byte deleted the fix-non-wheel-build-artifact-overwrite branch March 8, 2023 03:36
radoering pushed a commit that referenced this pull request Mar 8, 2023
* fix hash mismatch during installation with `no-binary` option for packages that provide suitable wheels
* do not skip hash check for sdists
* write hash for sdists in direct_url.json

(cherry picked from commit 26fbfd6)
radoering pushed a commit that referenced this pull request Mar 8, 2023
* fix hash mismatch during installation with `no-binary` option for packages that provide suitable wheels
* do not skip hash check for sdists
* write hash for sdists in direct_url.json

(cherry picked from commit 26fbfd6)
@JoseHernandezTR
Copy link

Hello, would anything related to this fix cause this side effect?

_WheelFileValidationError
67
68 ["In /root/.cache/pypoetry/artifacts/23/11/79/b24ac52cf28d681dbe7414498ba63b1521494ec3907b422567db0dd630/pydata_sphinx_theme-0.13.1-py3-none-any.whl, hash / size of pydata_sphinx_theme/init.py didn't match RECORD", "In /root/.cache/pypoetry/artifacts/23/11/79/b24ac52cf28d681dbe7414498ba63b1521494ec3907b422567db0dd630/pydata_sphinx_theme-0.13.1-py3-none-any.whl, hash / size of pydata_sphinx_theme/assets/scripts/bootstrap.js didn't match RECORD"
...

This is confirmed to happen with 1.4.1 but not with 1.4.0. Thanks!

@heimalne
Copy link

Probably, here's the issue #7691 and the relevant MR to fix it #7694

@ronlut
Copy link

ronlut commented Aug 16, 2023

@quantum-byte I am pretty sure this causes a bug in a real-world scenario where the cache is empty (a.k.a no tar.gz and no .whl cached).
What happens is:

  1. tar.gz is downloaded
  2. tar.gz is converted into a wheel, overriding the original tar.gz archive
  3. Trying to populate hashes dict with the original archive (tar.gz) hash, but the file doesn't exist anymore, causing the following error:
5  src/poetry/installation/executor.py:803 in _download_link
       801│ 
       802│         # Use the original archive to provide the correct hash.
     → 803│         self._populate_hashes_dict(original_archive, package)
       804│ 
       805│         return archive

   4  src/poetry/installation/executor.py:809 in _populate_hashes_dict
       807│     def _populate_hashes_dict(self, archive: Path, package: Package) -> None:
       808│         if package.files and archive.name in {f["file"] for f in package.files}:
     → 809│             archive_hash = self._validate_archive_hash(archive, package)
       810│             self._hashes[package.name] = archive_hash
       811│ 

   3  src/poetry/installation/executor.py:814 in _validate_archive_hash
       812│     @staticmethod
       813│     def _validate_archive_hash(archive: Path, package: Package) -> str:
     → 814│         archive_hash: str = "sha256:" + get_file_hash(archive)
       815│         known_hashes = {f["hash"] for f in package.files if f["file"] == archive.name}
       816│ 

   2  src/poetry/utils/helpers.py:244 in get_file_hash
       242│ def get_file_hash(path: Path, hash_name: str = "sha256") -> str:
       243│     h = hashlib.new(hash_name)
     → 244│     with path.open("rb") as fp:
       245│         for content in iter(lambda: fp.read(io.DEFAULT_BUFFER_SIZE), b""):
       246│             h.update(content)

   1  ~/.pyenv/versions/3.8.8/lib/python3.8/pathlib.py:1221 in open
       1219│         if self._closed:
       1220│             self._raise_closed()
     → 1221│         return io.open(self, mode, buffering, encoding, errors, newline,
       1222│                        opener=self._opener)
       1223│ 

  FileNotFoundError

  [Errno 2] No such file or directory: '/Users/ronlut/Library/Caches/pypoetry/artifacts/42/7d/d9/438df92a60e31c7ad0a69f3daad7bc7d645f0ce467376afad763bfb8d0/<REDUCTED>-7.0.9.tar.gz'

I will continue the discussion in the issue I opened, would love your help on solving it there 🙏

Issue: #8320

Copy link

github-actions bot commented Mar 3, 2024

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 3, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area/installer Related to the dependency installer impact/backport Requires backport to stable branch impact/changelog Requires a changelog entry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants