Different Convergence Behavior Between GPU and CPU Runs #520

tan-nguyenxuan · 2025-02-16T12:17:40Z

Hi,

I tried running Meridian on a dataset with 177 weeks, 12 channels, and 1 control variable, and encountered an issue. I ran it with the following parameters, as per the guidelines:
%%time
mmm.sample_prior(500)
mmm.sample_posterior(n_chains=7, n_adapt=500, n_burnin=500, n_keep=1000)

I found that running Meridian on a GPU did not converge, whereas running on a CPU resulted in convergence. The parameter settings in both cases were exactly the same.
model_diagnostics = visualizer.ModelDiagnostics(mmm)
model_diagnostics.plot_rhat_boxplot()
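
For readers unfamiliar with the diagnostic plotted above: R-hat compares the variance between chains with the variance within each chain, and values close to 1 suggest the chains agree. The snippet below is a minimal sketch of the classic (non-split) Gelman-Rubin estimate for a single scalar parameter, using synthetic draws. It is an illustration only and is not necessarily the exact estimator Meridian uses; the function name `gelman_rubin_rhat` and the synthetic chains are made up for this example.

```python
import math
import random
from statistics import mean, variance


def gelman_rubin_rhat(chains):
    """Classic (non-split) Gelman-Rubin R-hat for one scalar parameter.

    `chains` is a list of equal-length lists of draws, one list per chain.
    """
    n = len(chains[0])
    if len(chains) < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")
    # W: average of the per-chain sample variances.
    within = mean([variance(c) for c in chains])
    # B: n times the sample variance of the per-chain means.
    between = n * variance([mean(c) for c in chains])
    # Pooled estimate of the posterior variance.
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


rng = random.Random(0)
# Seven chains drawing from the same distribution: R-hat should be near 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(7)]
# Seven chains stuck around different means: R-hat should be far above 1.
stuck = [[rng.gauss(3.0 * j, 1.0) for _ in range(1000)] for j in range(7)]

print("mixed chains:", gelman_rubin_rhat(mixed))
print("stuck chains:", gelman_rubin_rhat(stuck))
```

A run that "did not converge" in the sense used here corresponds to the second case: at least some parameters have chains that disagree with each other.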

I suspect this issue occurs because the initialization of the first parameter set for the MCMC algorithm differs between GPU and CPU runs.

I would appreciate an explanation for this behavior. Can I trust the results from the converged model in the CPU run?

Thanks in advance for your help!

cpulavarthi · 2025-02-19T05:14:52Z

Collaborator

Hello @tan-nguyenxuan,

Thank you for contacting us!

The behaviour you mentioned is unexpected. To identify the root cause of this issue, please share the details of the GPU you are using and the version of TensorFlow installed in your environment.
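
One possible way to collect this information in a notebook is sketched below. It assumes TensorFlow is importable in the runtime and falls back to empty values if it is not; the helper name `describe_environment` is made up for this example.

```python
import platform


def describe_environment():
    """Return the Python version, TensorFlow version, and visible GPUs."""
    info = {"python": platform.python_version(), "tensorflow": None, "gpus": []}
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this runtime; report what we can.
        return info
    info["tensorflow"] = tf.__version__
    for device in tf.config.list_physical_devices("GPU"):
        # get_device_details returns a dict that may include 'device_name'.
        details = tf.config.experimental.get_device_details(device)
        info["gpus"].append(details.get("device_name", device.name))
    return info


print(describe_environment())
```

On a CPU-only runtime the `gpus` list should simply be empty.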

Please also check our documentation on Debugging MCMC Convergence issues which lists some debugging steps that may help resolve this problem.

Please share the requested information and feel free to reach out if you need any further assistance.

Thank you

Google Meridian Support Team

2 replies

tan-nguyenxuan Feb 19, 2025
Author

Thank you. I am using a CPU and a T4 GPU on Colab with TensorFlow version 2.16.2.

cpulavarthi Feb 27, 2025
Collaborator

Hello @tan-nguyenxuan,

Thank you for sharing the requested details. I tried to replicate the difference in convergence behaviour between CPU and GPU runs, but was unable to reproduce it using the demo dataset.

Can you share more information about the dataset you are using? Are you using holdout sets during model training? Please also try reproducing this issue with a seed set during model training so that we can debug it further (both the sample_prior and sample_posterior methods have an argument for this). If possible, please also share the code you are using.
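
As a rough sketch of what a seeded run could look like, the helper below repeats the sampling calls from the original post with a fixed seed. It assumes `mmm` is an already-built Meridian model object and that the argument is passed as the keyword `seed`; the helper name `sample_with_seed` and the seed value are made up for this example.

```python
def sample_with_seed(mmm, seed):
    """Run prior and posterior sampling with one fixed seed.

    `mmm` is assumed to be an already-built Meridian model object.
    The `seed` keyword name is an assumption based on the reply above.
    """
    mmm.sample_prior(500, seed=seed)
    mmm.sample_posterior(
        n_chains=7, n_adapt=500, n_burnin=500, n_keep=1000, seed=seed
    )
    return mmm
```

Calling `sample_with_seed(mmm, seed=0)` once on the CPU runtime and once on the GPU runtime, then comparing the R-hat plots, would show whether the difference persists when both runs start from the same seed.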

Thank you

Google Meridian Support Team
