Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

r.patch: Fix gathering statistics with multi-threading #2906

Merged
merged 12 commits into from
May 1, 2023

Conversation

jameshu15869
Copy link
Contributor

@jameshu15869 jameshu15869 commented Mar 28, 2023

Fixes #2411. Adds tests.

Original description:

This is my current draft implementation for making an array of statf structures for each thread.

Null count is being counted correctly in each thread and the values add up to what is expected, but I am unsure how to proceed for merging the trees.

Before merging:

  • add test specifically testing CELL rasters
  • run benchmark, compare with and without -s flag
  • run valgrind

@petrasovaa petrasovaa marked this pull request as draft March 28, 2023 14:59
raster/r.patch/main.c Outdated Show resolved Hide resolved
raster/r.patch/main.c Outdated Show resolved Hide resolved
@jameshu15869 jameshu15869 requested a review from petrasovaa March 28, 2023 15:45
@HuidaeCho
Copy link
Member

HuidaeCho commented Mar 28, 2023

I think this race condition was introduced by parallelism, but I haven't tried it yet. Looking at the code, have you tried:

#pragma omp parallel private(i, row) reduction(+:computed) if (nprocs > 1)

in line 208 in main.c? row should be private to avoid data races and computed should be reduced by addition. I think we can remove line 250 #pragma omp atomic update with this reduction.

@HuidaeCho HuidaeCho added bug Something isn't working raster Related to raster data processing labels Mar 28, 2023
@petrasovaa petrasovaa removed the bug Something isn't working label Mar 28, 2023
@petrasovaa
Copy link
Contributor

I think this race condition was introduced by parallelism, but I haven't tried it yet. Looking at the code, have you tried:

#pragma omp parallel private(i, row) reduction(+:computed) if (nprocs > 1)

in line 208 in main.c? row should be private to avoid data races and computed should be reduced by addition. I think we can remove line 250 #pragma omp atomic update with this reduction.

Please see the original problem in #2410 and the issue #2411 this PR is addressing. What you suggest may be a good improvement, but AFAIU it's not really related to this PR. We had an initial conversation about this on gitter.

@HuidaeCho
Copy link
Member

HuidaeCho commented Mar 28, 2023

I think this race condition was introduced by parallelism, but I haven't tried it yet. Looking at the code, have you tried:

#pragma omp parallel private(i, row) reduction(+:computed) if (nprocs > 1)

in line 208 in main.c? row should be private to avoid data races and computed should be reduced by addition. I think we can remove line 250 #pragma omp atomic update with this reduction.

Please see the original problem in #2410 and the issue #2411 this PR is addressing. What you suggest may be a good improvement, but AFAIU it's not really related to this PR. We had an initial conversation about this on gitter.

Oh, it was a different problem! Sorry for the noise.

@jameshu15869
Copy link
Contributor Author

jameshu15869 commented Mar 28, 2023

@petrasovaa
I have finished an initial draft implementation that that merges the statf[] array for each thread into the thread_statf[0] array (The statf array in Thread 0) before directly passing to support. I chose this approach because I did not want to modify support.c and introduce more possible problems. I've ran the new implemenation with nprocs = 1, 2, and 4 and the output on the GUI seems correct. r.category also produces the expected categories, though I'm not sure this is a surefire test because categories does not seem to rely on the stats structure. Currently I'm facing 2 obstacles.

First, the null_data_count values are not being merged at all - though the values in each thread add up to the correct number. Should Rast_update_cell_stats automatically merge these values? If not I can easily go back and manually update null_data_count when I'm merging the thread statf's. Based on the code it looks like Rast_update_cell_stats is modifying the null_data_count value but I am not seeing the change.

Second, I tried printing out the nodes of the *node for each statf object. Running nprocs=1 and nprocs=2 produce the right number of nodes and N, but it seems they are at different indexes and have different left and right attributes. I'm assuming this is merging from multiple threads would place the nodes at different locations. Is this a reasonable assumption? Or should we expect the contents of statf to be exactly the same after merging?

Would you be able to take a look? Thanks!

raster/r.patch/main.c Outdated Show resolved Hide resolved
raster/r.patch/main.c Outdated Show resolved Hide resolved
raster/r.patch/main.c Outdated Show resolved Hide resolved
Copy link
Contributor

@petrasovaa petrasovaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, I think this is close to finished! Please cleanup the code (address the issues, remove comments, old statf, print functions). The new code for merging should be ideally in a separate function (in support.c).

Before merging this, we should also write test and run benchmark to ensure this is correct and to see what impact this has on performance. But at this point I suggest you to finish the cleanup and then we brainstorm (on Gitter) the GSoC project and proposal if you are interested, the deadline is Apr 4th. This PR could be finalized as part of the project.

@jameshu15869
Copy link
Contributor Author

@petrasovaa
I've gone back and cleaned up the code. I also created a merge_threads function in support that will actually perform the merge. Regarding testing, I found a test in testsuite/test_rpatch_artificial.py and my implementation passed. I tried to run the benchmark file, but got an error. Is it possible that this error came from the benchmark itself and not my implementation?

Traceback (most recent call last):
  File "benchmark/benchmark_r_patch.py", line 52, in generate_map
    Module(
  File "/usr/local/grass83/etc/python/grass/pygrass/modules/interface/module.py", line 610, in __init__
    self.__call__(*args, **kargs)
  File "/usr/local/grass83/etc/python/grass/pygrass/modules/interface/module.py", line 634, in __call__
    return self.run()
  File "/usr/local/grass83/etc/python/grass/pygrass/modules/interface/module.py", line 823, in run
    self.wait()
  File "/usr/local/grass83/etc/python/grass/pygrass/modules/interface/module.py", line 844, in wait
    raise CalledModuleError(
grass.exceptions.CalledModuleError: Module run `r.surf.fractal dimension=2.05 number=0 output=r_patch_reference_tmp --o` ended with an error.
The subprocess ended with a non-zero return code: -9. See errors above the traceback or in the error output.

Please let me know if there's anything else I should add to this PR. Thanks!

@petrasovaa
Copy link
Contributor

@petrasovaa I've gone back and cleaned up the code. I also created a merge_threads function in support that will actually perform the merge. Regarding testing, I found a test in testsuite/test_rpatch_artificial.py and my implementation passed. I tried to run the benchmark file, but got an error. Is it possible that this error came from the benchmark itself and not my implementation?

Traceback (most recent call last):
  File "benchmark/benchmark_r_patch.py", line 52, in generate_map
    Module(
  File "/usr/local/grass83/etc/python/grass/pygrass/modules/interface/module.py", line 610, in __init__
    self.__call__(*args, **kargs)
  File "/usr/local/grass83/etc/python/grass/pygrass/modules/interface/module.py", line 634, in __call__
    return self.run()
  File "/usr/local/grass83/etc/python/grass/pygrass/modules/interface/module.py", line 823, in run
    self.wait()
  File "/usr/local/grass83/etc/python/grass/pygrass/modules/interface/module.py", line 844, in wait
    raise CalledModuleError(
grass.exceptions.CalledModuleError: Module run `r.surf.fractal dimension=2.05 number=0 output=r_patch_reference_tmp --o` ended with an error.
The subprocess ended with a non-zero return code: -9. See errors above the traceback or in the error output.

Looking at the code of the benchmark, there is a fallback if r.surf.fractal fails that runs r.random.surface instead, did the fallback work for you? I think the fallback is there because r.surf.fractal might require lot of memory, is it possible you are running out of memory? Try to lower the size or raster and see if that helps.

@jameshu15869
Copy link
Contributor Author

jameshu15869 commented Mar 30, 2023

Looking at the code of the benchmark, there is a fallback if r.surf.fractal fails that runs r.random.surface instead, did the fallback work for you? I think the fallback is there because r.surf.fractal might require lot of memory, is it possible you are running out of memory? Try to lower the size or raster and see if that helps.

@petrasovaa
I found the error - when attempting to generate a map with size 4000, both functions require around 8.5 GB of memory. My computer only has 8 GB so I suspect this is why r.surf.fractal and the fallback r.random.surface both failed. However, I was able to run benchmarks for size 7071, 14142, and 10000. The graph is attached below.

benchmarking without 2000

My laptop only has 4 threads, so should I ignore the time increase as threads increase? My best guess is that the program does not know how to assign 12 threads to 4 threads, for example. Other than that, this graph shows that the OpenMP does make a difference, right?

As a side note, running the benchmarks takes around 30-40 minutes for each run on my laptop. Is there a way to reduce the number of threads being tested so only nprocs=1,2,4 are run?

Thanks!

@neteler
Copy link
Member

neteler commented Mar 31, 2023

My computer only has 8 GB

FWIW, you can use a swap file (!) to "enlarge" the RAM if you have some spare GB of disk space:

# enable SWAP FILE, for more RAM
#
# forgot to create a swap partition at installation?
# solution: create a swap *file* (not partition):

sudo su

SWAPDIR=/home/swapfiledir  # use your preferred path here
GB=10  # extra "RAM" in Gigabytes

mkdir ${SWAPDIR}
echo "Generating <${SWAPDIR}/swap_file>...:"
dd if=/dev/zero of=${SWAPDIR}/swap_file bs=1M count=${GB}k
ls -la ${SWAPDIR}/swap_file
df -h ${SWAPDIR}/

chmod 0600 ${SWAPDIR}/swap_file
mkswap ${SWAPDIR}/swap_file

# add following entry for enabling at boot:
echo "${SWAPDIR}/swap_file swap swap defaults 0 0"

# edit /etc/fstab and add the swap_file line from above
vim /etc/fstab

# enable swap
swapon -a

# verify new RAM size with e.g. free or top
free
top

@petrasovaa
Copy link
Contributor

FWIW, you can use a swap file (!) to "enlarge" the RAM if you have some spare GB of disk space:

Good to know, but I wonder if this affects the performance.

@petrasovaa
Copy link
Contributor

Looking at the code of the benchmark, there is a fallback if r.surf.fractal fails that runs r.random.surface instead, did the fallback work for you? I think the fallback is there because r.surf.fractal might require lot of memory, is it possible you are running out of memory? Try to lower the size or raster and see if that helps.

@petrasovaa I found the error - when attempting to generate a map with size 4000, both functions require around 8.5 GB of memory. My computer only has 8 GB so I suspect this is why r.surf.fractal and the fallback r.random.surface both failed. However, I was able to run benchmarks for size 7071, 14142, and 10000. The graph is attached below.

I will try to run the benchmark myself, I have plenty of cores and RAM, so at this point don't worry about this. But it's good you were able to test this!

@petrasovaa petrasovaa added this to the 8.4.0 milestone Mar 31, 2023
@echoix
Copy link
Member

echoix commented Mar 31, 2023

FWIW, you can use a swap file (!) to "enlarge" the RAM if you have some spare GB of disk space:

Good to know, but I wonder if this affects the performance.

I'd assume a performance hit when using swap instead of RAM of course, thus the results you get shouldn't be considered "performance" metrics, but at least you are able to run it. The performance of the algorithm you are tested will depend on what other tasks are running on the system (and their memory usage, if they swap), and the disk I/O performance of your system.

If a final performance chart is wanted, maybe get someone with a better suited system to run the tests.

@jameshu15869
Copy link
Contributor Author

Sounds good. For now I will plan to ask others to help out with the testing.

One other thing I noticed was that the GitHub check is currently failing for Ubuntu on the test for t.rast.algebra. The error message is

Default TGIS driver / database set to:
driver: sqlite
database: $GISDBASE/$LOCATION_NAME/$MAPSET/tgis/sqlite.db
WARNING: Temporal database connection defined as:
/home/runner/nc_spm_full_v2alpha2/__temporal_t_rast_algebra_test_raster_algebra_fv_az574_655_32269/tgis/sqlite.db
But database file does not exist.
Creating temporal database: /home/runner/nc_spm_full_v2alpha2/__temporal_t_rast_algebra_test_raster_algebra_fv_az574_655_32269/tgis/sqlite.db

However, the Ubutnu 22.04 test passed for one of my previous commits (2d2daa0). It seems that the only change between that commit and the most recent failing commits is running the additional tests. I quickly scanned through the t.rast.algebra test and it doesn't seem like any of the code I worked on should be tested in this file. Could there be an issue with the GitHub CI itself? Or is there something else going on?

Thanks!

@jameshu15869 jameshu15869 marked this pull request as ready for review April 3, 2023 00:28
@petrasovaa
Copy link
Contributor

through the t.rast.algebra test and it doesn't seem like any of the code I worked on should be tested in this file. Could there be an issue with the GitHub CI itself? Or is there something else going on?

Yes, that's unrelated, sometimes it happens we are unsure why, usually we would rerun the workflow.

Copy link
Contributor Author

@jameshu15869 jameshu15869 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the changes have been resolved and I believe the PR is ready for a final review.

Copy link
Contributor Author

@jameshu15869 jameshu15869 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the files have been reviewed, but I'm not sure how to get rid of the "merging block"

@jameshu15869 jameshu15869 requested a review from petrasovaa April 12, 2023 02:34
Copy link
Contributor

@petrasovaa petrasovaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good, but I still need to run the benchmarks (I added tasks in the PR description), will get to it hopefully next week. You don't need to do anything right now.

@jameshu15869
Copy link
Contributor Author

The code looks good, but I still need to run the benchmarks (I added tasks in the PR description), will get to it hopefully next week. You don't need to do anything right now.

Sounds good. Thank you so much for all your help!

@petrasovaa
Copy link
Contributor

I ran a benchmark with and without s flag:

with s flag:
rpatch_benchmark_s

without (default):
rpatch_benchmark

So running with s flag is slightly faster, but that's expected. Otherwise the benchmark seems fine.

@jameshu15869
Copy link
Contributor Author

@petrasovaa
I'm glad that it works! Did testing reveal any issues with the implementation that need to be addressed?

@petrasovaa
Copy link
Contributor

@petrasovaa I'm glad that it works! Did testing reveal any issues with the implementation that need to be addressed?

I think it's all fine, ready to be merged. The failing mac test is unrelated.

But could you please look at Huidae's suggestion above and see if that works and if it improves anything? If yes, that would be a different PR.

@jameshu15869
Copy link
Contributor Author

jameshu15869 commented Apr 28, 2023

I think it's all fine, ready to be merged. The failing mac test is unrelated.

But could you please look at Huidae's suggestion above and see if that works and if it improves anything? If yes, that would be a different PR.

@HuidaeCho @petrasovaa
I've implemented Huidae's suggestion and modified the old #pragma omp atomic update and this was the graph I got after running only the 50M r.patch benchmark (Since I am working at school on my laptop). I was also able to pass all 4 test cases in testsuite/test_rpatch_artificial.py. It seems that there was some minor improvement, though I'm not exactly sure how to quantify it. I haven't committed the changes yet since Anna mentioned that this could be a separate PR. Should I move these onto a new branch/open a new PR specifically for this issue?

rpatch_benchmark

@petrasovaa
Copy link
Contributor

It seems that there was some minor improvement, though I'm not exactly sure how to quantify it. I haven't committed the changes yet since Anna mentioned that this could be a separate PR. Should I move these onto a new branch/open a new PR specifically for this issue?

Yes, please create a new PR and I will look at it. I will merge this one after the checks are done.

@jameshu15869
Copy link
Contributor Author

I've opened a new pull request for the reduction (#2941). However, it looks like all of my changes from this PR are still on the new PR. Would this potentially become an issue later on?

@petrasovaa petrasovaa merged commit 937cf6c into OSGeo:main May 1, 2023
@petrasovaa
Copy link
Contributor

I've opened a new pull request for the reduction (#2941). However, it looks like all of my changes from this PR are still on the new PR. Would this potentially become an issue later on?

Right, this needs to be fixed. I am not sure what is the easiest way, you can update your main, create a new branch, put your changes there and the force push (something like git push -f origin branch-with-new-changes:branch-you-are-pushing-to)

@jameshu15869 jameshu15869 changed the title DRAFT: r.patch race condition r.patch race condition May 2, 2023
@jameshu15869 jameshu15869 deleted the r.patch-race-condition branch May 2, 2023 03:09
@wenzeslaus wenzeslaus changed the title r.patch race condition r.patch: Fix gathering statistics with multi-threading Jun 6, 2023
neteler pushed a commit to nilason/grass that referenced this pull request Nov 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
raster Related to raster data processing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feat] r.patch: collect and apply statistics for CELL rasters with nprocs > 1
5 participants