Multiple PyTorch engines across threads appear to be sharing native instance #2825
Comments
For PyTorch, our recommendation is to set both of them to 1 at the beginning.
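A minimal sketch of that recommendation, assuming "both of them" refers to the two thread-related system properties DJL exposes for PyTorch (`ai.djl.pytorch.num_threads` and `ai.djl.pytorch.num_interop_threads`); they need to be set before the PyTorch engine is loaded for the first time:

```java
// Sketch only: set both PyTorch thread settings to 1 before anything
// touches the PyTorch engine. The property names are DJL's PyTorch
// system properties; where exactly you call this in your app is up to you.
public final class PyTorchThreadConfig {
    public static void main(String[] args) {
        System.setProperty("ai.djl.pytorch.num_threads", "1");
        System.setProperty("ai.djl.pytorch.num_interop_threads", "1");
        // ... only after this point load models / create NDManagers
    }
}
```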
No, all are in the same class loader. I'm calling
Can you initialize PtEngine before you start the threads? It looks like there is a bug in the getEngine() call.
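To illustrate the suggested workaround (this is a sketch, not the actual code from the thread): initialize the PyTorch engine once on the main thread before any worker threads are started, then let each thread create its own NDManager.

```java
import ai.djl.engine.Engine;
import ai.djl.ndarray.NDManager;

public final class EngineWarmUp {
    public static void main(String[] args) throws InterruptedException {
        // Force one-time initialization of the native PyTorch engine
        // on the main thread (the workaround suggested above).
        Engine.getEngine("PyTorch");

        Runnable job = () -> {
            // Each worker gets its own NDManager for its independent work.
            try (NDManager manager = NDManager.newBaseManager()) {
                // ... per-thread inference / computation
            }
        };

        Thread t1 = new Thread(job);
        Thread t2 = new Thread(job);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```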
I created a PR to address your issue: #2826
That seems to have worked without issues. |
After upgrading to the latest version and removing the previously suggested workaround, I'm getting a different error when instantiating an NDManager in each thread:
Description
Running on a very large server with many cores and using the PyTorch engine on CPU, I'm trying to parallelize largely independent jobs across multiple instances of PtEngine/NDManager, allocated one per thread.
I assumed each engine was independent of the others and set the environment variable "ai.djl.pytorch.num_interop_threads" to limit the number of threads to 1, but got the following error message when creating subsequent NDManager instances.
It appears as if the underlying PtEngine created via PyTorchLibrary is shared, since subsequent creation of an NDManager throws an exception with the error below.
I couldn't find any documentation on how exactly resources are shared across threads in the same JVM/ClassLoader and would appreciate some guidance on this.
Expected Behavior
Each PyTorch engine instance should be completely independent of the others.
Error Message
How to Reproduce?
Set the following property and create an NDManager per thread:
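The original reproduction snippet was not preserved here; below is a hedged sketch of what it likely looked like, based on the description above: set `ai.djl.pytorch.num_interop_threads` to 1 and create an NDManager in each worker thread (the thread-pool size and task body are placeholders).

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import ai.djl.ndarray.NDManager;

public final class Reproducer {
    public static void main(String[] args) throws InterruptedException {
        // Property named in the description; setting it here, before any
        // engine use, is an assumption about the original setup.
        System.setProperty("ai.djl.pytorch.num_interop_threads", "1");

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> {
                // Creating an NDManager per thread; the second and later
                // creations reportedly fail with the error above.
                try (NDManager manager = NDManager.newBaseManager()) {
                    // ... independent job per thread
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```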