Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check environment supports target device in Dataset constructor #243

Merged
merged 12 commits into from
Mar 22, 2023

Conversation

oliverholworthy
Copy link
Member

@oliverholworthy oliverholworthy commented Mar 13, 2023

Add checks of the environment to make sure the target device is supported based on the cpu parameter.

Current behaviour

cpu=False is treated as equivalent to cpu=None.

  • cpu=None or cpu=False
    • will use GPU device if cudf available, otherwise host CPU
  • cpu = True
    • use CPU host device

After this PR

cpu=False is changed to mean we'd like to use the GPU device only and if for some reason we can't, we'd like to know about this and why.

  • cpu = None
    • will use GPU device if available
  • cpu = False
    • use GPU device
    • raises RuntimeError if either device not detected with pynvml or cudf is not available.
  • cpu = True
    • use CPU host device

Motivation

Making device specified for Dataset clearer.

Introducing earlier failure in the case of missing cudf or device availability reduces the chance of subsequent later on when we try to use functions in merlin.core.dispatch that then fail with an error that may be less clear.

@oliverholworthy oliverholworthy added the bug Something isn't working label Mar 13, 2023
@oliverholworthy oliverholworthy added this to the Merlin 23.03 milestone Mar 13, 2023
@oliverholworthy oliverholworthy self-assigned this Mar 13, 2023
@oliverholworthy oliverholworthy changed the title Use HAS_GPU as part of Dataset device choice Check environment supports target device in Dataset constructor Mar 13, 2023
@oliverholworthy oliverholworthy added enhancement New feature or request and removed bug Something isn't working labels Mar 13, 2023
@karlhigley
Copy link
Contributor

This LGTM, but wonder if it’s worth adding some tests? They wouldn’t all be able to run in every environment but between the CPU and GPU CI, we should be able to test the four (?) possibilities.

@karlhigley
Copy link
Contributor

I can't completely tell if the downstream failures are related, but they look like they could be

@oliverholworthy
Copy link
Member Author

I can't completely tell if the downstream failures are related, but they look like they could be

Yes, the errors in NVTabular/dataloader are related. There are two more places that need to change (both are from test code where we pass cpu=False in environment without GPUs):

@karlhigley
Copy link
Contributor

Looks like there are still some downstream issues in NVT and the dataloaders, but less than there used to be 👍🏻

karlhigley added a commit that referenced this pull request Mar 22, 2023
…n detected (#236)

* Run tests in GPU environment with no GPUs visible

* Update TensorTable tests with checks for HAS_GPU

* Remove unused `_HAS_GPU` variable from `test_utils`

* Wrap cupy/cudf imports in HAS_GPU check in `compat`

* Update tests to use HAS_GPU from compat module

* Reformat test_tensor_table.py

* Move HAS_GPU import to compat module

* Add pynvml dependency

* Update functions in `dispatch` to not use HAS_GPU

* Raise RuntimeError in Dataset if we can't run on GPU when cpu=False

* Update `convert_data` to handle unavailable cudf and dask_cudf

* Remove use of `HAS_GPU` from dispatch

* Keep cudf and cupy values representing presence of package

* Revert changes to `dataset.py`. Now part of #243

* Revert changes to `dispatch.py`. Now part of #244

* Use branch-name action for branch selection

* Remove unused ref_type variable

* Extend reason in `test_tensor_column.py`

Co-authored-by: Karl Higley <kmhigley@gmail.com>

* Extend reason in `tests/unit/table/test_tensor_column.py`

Co-authored-by: Karl Higley <kmhigley@gmail.com>

* Remove cudf import from compat. Now unrelated to this PR

* Remove use of branch-name action. `docker` not available in runner

* Add HAS_GPU checks with cupy to support env without visible devices

* Correct value of empty visible devices

* Update deps for GPU envs to match others

* Update get_lib to account for missing visible GPU

* Check HAS_GPU in `make_df` to handle visible GPU devices

* Update Dataset to handle default case when no visible GPUs are found

* Update fixtures to handle cudf with no visible devices

* Update tests to handle case of no visible GPUs

---------

Co-authored-by: Karl Higley <kmhigley@gmail.com>
Co-authored-by: Karl Higley <karlb@nvidia.com>
@karlhigley karlhigley merged commit 9d1467c into NVIDIA-Merlin:main Mar 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants