eQMARL: Entangled Quantum Multi-Agent Reinforcement Learning for Distributed Cooperation over Quantum Channels
This repository is the official implementation of "eQMARL: Entangled Quantum Multi-Agent Reinforcement Learning for Distributed Cooperation over Quantum Channels", published at the Thirteenth International Conference on Learning Representations (ICLR 2025).
The codebase is provided as an installable Python package called `eqmarl`. To install the package via `pip`, run:
# Navigate to `eqmarl` source folder.
$ cd path/to/eqmarl/
# Install `eqmarl` package.
$ python -m pip install .
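If you plan to modify the source, pip's editable mode can also be used (a standard pip option, not specific to this repository):

# Install `eqmarl` in editable/development mode so local changes take effect without reinstalling.
$ python -m pip install -e .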
You can verify the package was successfully installed by running:
$ python -c "import importlib.metadata; version=importlib.metadata.version('eqmarl'); print(version)"
1.0.0
If instead you just want to install the requirements without the package, you can run:
$ python -m pip install -r requirements.txt -r requirements-dev.txt
Installation of this repo can be a little finicky because of the requirements for `tensorflow-quantum` on various systems.
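As a quick sanity check, you can confirm that the `tensorflow-quantum` dependency resolved on your system in the same way as the package check above:

# Print the installed tensorflow-quantum version (this fails if the wheel could not be installed).
$ python -c "import importlib.metadata; print(importlib.metadata.version('tensorflow-quantum'))"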
If you are using Anaconda to manage Python on macOS, be aware that the version of Python may have been built using an outdated version of macOS. To check this, you can run:
$ python -c "from distutils import util; print(util.get_platform())"
macosx-10.9-x86_64
Notice that in the above example the Python installation was built against `macosx-10.9-x86_64`, whereas the wheel for `tensorflow-quantum` requires `macosx-12.1-x86_64` or later.
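To see which wheel platform tags your local pip will actually accept, you can use pip's built-in diagnostic command (the relevant part of the output is the list of compatible tags):

# List the platform/ABI tags this pip installation considers compatible.
$ python -m pip debug --verbose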
To circumvent this, you can download the wheel for `tensorflow-quantum==0.7.2` from https://pypi.org/project/tensorflow-quantum/0.7.2/#files and rename the file from `tensorflow_quantum-0.7.2-cp39-cp39-macosx_12_1_x86_64.whl` to `tensorflow_quantum-0.7.2-cp39-cp39-macosx_10_9_x86_64.whl`.
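A minimal sketch of that rename step, assuming the wheel was downloaded into the current directory:

# Rename the wheel so its platform tag matches the locally built Python (the binary itself is unchanged).
$ mv tensorflow_quantum-0.7.2-cp39-cp39-macosx_12_1_x86_64.whl tensorflow_quantum-0.7.2-cp39-cp39-macosx_10_9_x86_64.whl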
Once you've done that, you can install the wheel via:
# Activate your environment.
$ conda activate myenv
# Install wheel file manually.
$ python -m pip install tensorflow_quantum-0.7.2-cp39-cp39-macosx_10_9_x86_64.whl
To train using the frameworks in the paper, run this command:
$ python ./scripts/experiment_runner.py ./experiments/<experiment_name>.yml
This invokes the `experiment_runner.py` script, which runs experiments based on YAML configurations.
Note that the `-r`/`--n-train-rounds` option can be used to train over multiple seed rounds (defaults to 1 round).
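For example, to train the Ψ+ entangled eQMARL configuration for the CoinGame MDP setting over 10 seed rounds (the number of seeds used for the reported results), the invocation would look like:

# Train the eQMARL Ψ+ CoinGame MDP experiment over 10 seed rounds.
$ python ./scripts/experiment_runner.py ./experiments/coingame_maa2c_mdp_eqmarl_psi+.yml -r 10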
The experiment configuration for each of the frameworks discussed in the paper is described as a YAML file in the experiments folder.
The full list of experiments is as follows:
Experiment YAML File | Environment | Description |
---|---|---|
`coingame_maa2c_mdp_eqmarl_noentanglement.yml` | CoinGame | MDP experiment using eQMARL without entanglement. |
`coingame_maa2c_mdp_eqmarl_phi+.yml` | CoinGame | MDP experiment using eQMARL with Φ+ entanglement. |
`coingame_maa2c_mdp_eqmarl_phi-.yml` | CoinGame | MDP experiment using eQMARL with Φ- entanglement. |
`coingame_maa2c_mdp_eqmarl_psi+.yml` | CoinGame | MDP experiment using eQMARL with Ψ+ entanglement. |
`coingame_maa2c_mdp_eqmarl_psi-.yml` | CoinGame | MDP experiment using eQMARL with Ψ- entanglement. |
`coingame_maa2c_mdp_fctde.yml` | CoinGame | MDP experiment using fCTDE. |
`coingame_maa2c_mdp_qfctde.yml` | CoinGame | MDP experiment using qfCTDE. |
`coingame_maa2c_mdp_sctde.yml` | CoinGame | MDP experiment using sCTDE. |
`coingame_maa2c_pomdp_eqmarl_noentanglement.yml` | CoinGame | POMDP experiment using eQMARL without entanglement. |
`coingame_maa2c_pomdp_eqmarl_phi+.yml` | CoinGame | POMDP experiment using eQMARL with Φ+ entanglement. |
`coingame_maa2c_pomdp_eqmarl_phi-.yml` | CoinGame | POMDP experiment using eQMARL with Φ- entanglement. |
`coingame_maa2c_pomdp_eqmarl_psi+.yml` | CoinGame | POMDP experiment using eQMARL with Ψ+ entanglement. |
`coingame_maa2c_pomdp_eqmarl_psi-.yml` | CoinGame | POMDP experiment using eQMARL with Ψ- entanglement. |
`coingame_maa2c_pomdp_fctde.yml` | CoinGame | POMDP experiment using fCTDE. |
`coingame_maa2c_pomdp_qfctde.yml` | CoinGame | POMDP experiment using qfCTDE. |
`coingame_maa2c_pomdp_sctde.yml` | CoinGame | POMDP experiment using sCTDE. |
`coingame_maa2c_mdp_eqmarl_psi+_L2.yml` | CoinGame | MDP experiment using eQMARL with Ψ+ entanglement (L=2 variant). |
`coingame_maa2c_mdp_eqmarl_psi+_L10.yml` | CoinGame | MDP experiment using eQMARL with Ψ+ entanglement (L=10 variant). |
`coingame_maa2c_mdp_qfctde_L2.yml` | CoinGame | MDP experiment using qfCTDE (L=2 variant). |
`coingame_maa2c_mdp_qfctde_L10.yml` | CoinGame | MDP experiment using qfCTDE (L=10 variant). |
`coingame_maa2c_mdp_fctde_size3.yml` | CoinGame | MDP experiment using fCTDE (size 3 variant). |
`coingame_maa2c_mdp_fctde_size6.yml` | CoinGame | MDP experiment using fCTDE (size 6 variant). |
`coingame_maa2c_mdp_fctde_size24.yml` | CoinGame | MDP experiment using fCTDE (size 24 variant). |
`coingame_maa2c_mdp_sctde_size3.yml` | CoinGame | MDP experiment using sCTDE (size 3 variant). |
`coingame_maa2c_mdp_sctde_size6.yml` | CoinGame | MDP experiment using sCTDE (size 6 variant). |
`coingame_maa2c_mdp_sctde_size24.yml` | CoinGame | MDP experiment using sCTDE (size 24 variant). |
`coingame_maa2c_pomdp_eqmarl_psi+_L2.yml` | CoinGame | POMDP experiment using eQMARL with Ψ+ entanglement (L=2 variant). |
`coingame_maa2c_pomdp_eqmarl_psi+_L10.yml` | CoinGame | POMDP experiment using eQMARL with Ψ+ entanglement (L=10 variant). |
`coingame_maa2c_pomdp_qfctde_L2.yml` | CoinGame | POMDP experiment using qfCTDE (L=2 variant). |
`coingame_maa2c_pomdp_qfctde_L10.yml` | CoinGame | POMDP experiment using qfCTDE (L=10 variant). |
`coingame_maa2c_pomdp_fctde_size3.yml` | CoinGame | POMDP experiment using fCTDE (size 3 variant). |
`coingame_maa2c_pomdp_fctde_size6.yml` | CoinGame | POMDP experiment using fCTDE (size 6 variant). |
`coingame_maa2c_pomdp_fctde_size24.yml` | CoinGame | POMDP experiment using fCTDE (size 24 variant). |
`coingame_maa2c_pomdp_sctde_size3.yml` | CoinGame | POMDP experiment using sCTDE (size 3 variant). |
`coingame_maa2c_pomdp_sctde_size6.yml` | CoinGame | POMDP experiment using sCTDE (size 6 variant). |
`coingame_maa2c_pomdp_sctde_size24.yml` | CoinGame | POMDP experiment using sCTDE (size 24 variant). |
`cartpole_maa2c_mdp_eqmarl_psi+.yml` | CartPole | MDP experiment using eQMARL with Ψ+ entanglement. |
`cartpole_maa2c_mdp_fctde.yml` | CartPole | MDP experiment using fCTDE. |
`cartpole_maa2c_mdp_qfctde.yml` | CartPole | MDP experiment using qfCTDE. |
`cartpole_maa2c_mdp_sctde.yml` | CartPole | MDP experiment using sCTDE. |
`cartpole_maa2c_pomdp_eqmarl_psi+.yml` | CartPole | POMDP experiment using eQMARL with Ψ+ entanglement. |
`cartpole_maa2c_pomdp_fctde.yml` | CartPole | POMDP experiment using fCTDE. |
`cartpole_maa2c_pomdp_qfctde.yml` | CartPole | POMDP experiment using qfCTDE. |
`cartpole_maa2c_pomdp_sctde.yml` | CartPole | POMDP experiment using sCTDE. |
The actor-critic models trained using the frameworks described in the paper achieved the performance outlined in the sections below.
Pre-trained models can be found in the supplementary materials that accompany this repository, within a folder called `pre_trained_models/`.
The training result metrics for all models reported in the paper are listed under the `experiment_output` folder.
Each experiment was conducted over 10 seeds (using the `-r 10` option as discussed in the Training section).
All figures reported in the paper can be generated using the Jupyter notebook `figure_generator.ipynb`, which references the figure configurations outlined in the `figures` folder.
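Assuming Jupyter is installed in your environment, the notebook can be opened from the command line, e.g.:

# Launch the figure-generation notebook.
$ jupyter notebook figure_generator.ipynb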
The training results for the comparison of entanglement styles outlined in the paper are given in the table below:
Dynamics | Entanglement | Score: 20 | Score: 25 | Score: Max (value) |
---|---|---|---|---|
MDP | | 568 | 2332 | 2942 (25.67) |
MDP | | 595 | 1987 | 2849 (25.45) |
MDP | | 612 | 1883 | 2851 (25.51) |
MDP | | 691 | 2378 | 2984 (25.23) |
MDP | | 839 | 2337 | 2495 (25.12) |
POMDP | | 1049 | 1745 | 2950 (26.28) |
POMDP | | 1206 | 2114 | 2999 (25.95) |
POMDP | | 1269 | - | 2992 (24.1) |
POMDP | | 1838 | - | 2727 (22.8) |
POMDP | | 1069 | 1955 | 2841 (26.39) |
The figures that aggregate the metric performance for each of the experiments are given in the table below:
Figure | Dynamics | Metric |
---|---|---|
fig_maa2c_mdp_entanglement_compare-undiscounted_reward.pdf | MDP | Score |
fig_maa2c_mdp_entanglement_compare-coins_collected.pdf | MDP | Total coins collected |
fig_maa2c_mdp_entanglement_compare-own_coin_rate.pdf | MDP | Own coin rate |
fig_maa2c_mdp_entanglement_compare-own_coins_collected.pdf | MDP | Own coins collected |
fig_maa2c_pomdp_entanglement_compare-undiscounted_reward.pdf | POMDP | Score |
fig_maa2c_pomdp_entanglement_compare-coins_collected.pdf | POMDP | Total coins collected |
fig_maa2c_pomdp_entanglement_compare-own_coin_rate.pdf | POMDP | Own coin rate |
fig_maa2c_pomdp_entanglement_compare-own_coins_collected.pdf | POMDP | Own coins collected |
The training results for the comparison of the frameworks in the CoinGame environment are given in the table below:
Dynamics | Framework | Score: 20 | Score: 25 | Score: Max (value) | Own coin rate: 0.95 | Own coin rate: 1.0 | Own coin rate: Max (value) |
---|---|---|---|---|---|---|---|
MDP | | 568 | 2332 | 2942 (25.67) | 376 | 2136 | 2136 (1.0) |
MDP | | 678 | - | 2378 (23.38) | 397 | - | 2832 (0.9972) |
MDP | | 1640 | 2615 | 2631 (25.3) | 1511 | - | 2637 (0.9864) |
MDP | | 1917 | - | 2925 (23.67) | 1700 | - | 2909 (0.9857) |
POMDP | | 1049 | 1745 | 2950 (26.28) | 773 | - | 2533 (0.9997) |
POMDP | | 1382 | 2124 | 2871 (26.09) | 1038 | 2887 | 2887 (1.0) |
POMDP | | 1738 | 2750 | 2999 (25.33) | 1588 | - | 2956 (0.9894) |
POMDP | | 1798 | 2658 | 2824 (25.49) | 1574 | - | 2963 (0.9894) |
The figures that aggregate the metric performance for each of the experiments are given in the table below:
Figure | Dynamics | Metric |
---|---|---|
fig_maa2c_mdp-undiscounted_reward.pdf | MDP | Score |
fig_maa2c_mdp-coins_collected.pdf | MDP | Total coins collected |
fig_maa2c_mdp-own_coin_rate.pdf | MDP | Own coin rate |
fig_maa2c_mdp-own_coins_collected.pdf | MDP | Own coins collected |
fig_maa2c_pomdp-undiscounted_reward.pdf | POMDP | Score |
fig_maa2c_pomdp-coins_collected.pdf | POMDP | Total coins collected |
fig_maa2c_pomdp-own_coin_rate.pdf | POMDP | Own coin rate |
fig_maa2c_pomdp-own_coins_collected.pdf | POMDP | Own coins collected |
The training results for the comparison of the frameworks in the CartPole environment are given in the tables below:
Dynamics | Framework | Reward: Mean | Reward: Std. Dev. | Reward: 95% CI |
---|---|---|---|---|
MDP | | 79.11 | 50.62 | (77.40, 81.16) |
MDP | | 121.35 | 110.13 | (118.29, 125.12) |
MDP | | 16.38 | 35.97 | (16.29, 16.48) |
MDP | | 15.15 | 24.17 | (15.09, 15.22) |
POMDP | | 82.28 | 44.24 | (80.60, 83.89) |
POMDP | | 79.03 | 44.06 | (76.80, 80.98) |
POMDP | | 40.56 | 37.36 | (38.17, 43.70) |
POMDP | | 13.93 | 29.84 | (13.62, 14.19) |
Dynamics | Framework | Reward: Mean (value) | Reward: Max (value) |
---|---|---|---|
MDP | | 166 (79.11) | 555 (134.16) |
MDP | | 189 (121.35) | 810 (262.43) |
MDP | | 9 (16.38) | 931 (23.59) |
MDP | | 9 (15.15) | 38 (18.55) |
POMDP | | 251 (82.28) | 770 (127.6) |
POMDP | | 276 (79.03) | 648 (137.66) |
POMDP | | 680 (40.56) | 999 (167.32) |
POMDP | | 9 (13.93) | 999 (28.66) |
The figures that aggregate the metric performance for each of the experiments are given in the table below:
Figure | Dynamics | Metric |
---|---|---|
fig_cartpole_maa2c_mdp-reward_mean.pdf | MDP | Average reward |
fig_cartpole_maa2c_pomdp-reward_mean.pdf | POMDP | Average reward |
The training results for the comparison of the frameworks in the MiniGrid environment are given in the table below:
Dynamics | Framework | Reward: Mean (value) | Reward: 95% CI | Number of Trainable Critic Parameters |
---|---|---|---|---|
POMDP | | -63.04 | (-65.16, -61.06) | 29,601 |
POMDP | | -85.86 | (-87.03, -84.72) | 3,697 |
POMDP | | -88.02 | (-88.69, -87.10) | 29,801 |
POMDP | | -13.32 | (-14.68, -11.91) | 3,697 |
The figures that aggregate the metric performance for each of the experiments are given in the table below:
Figure | Dynamics | Metric |
---|---|---|
fig_minigrid-reward_mean.pdf | POMDP | Average reward |
The training results for the ablation experiments in the CoinGame environment are given in the table below:
Dynamics | Framework | Parameters | Score: Mean | Score: Std. Dev. | Score: 95% CI | Own coin rate: Mean | Own coin rate: Std. Dev. | Own coin rate: 95% CI |
---|---|---|---|---|---|---|---|---|
MDP | | 223 | 2.42 | 2.35 | (2.35, 2.49) | 0.6720 | 0.2024 | (0.6685, 0.6769) |
MDP | | 445 | 7.41 | 3.46 | (7.19, 7.65) | 0.7658 | 0.1414 | (0.7610, 0.7712) |
MDP | | 889 | 12.36 | 4.41 | (12.09, 12.67) | 0.8202 | 0.1379 | (0.8139, 0.8262) |
MDP | | 1777 | 17.63 | 2.58 | (17.25, 17.91) | 0.8823 | 0.0751 | (0.8770, 0.8875) |
MDP | | 229 | 3.24 | 3.09 | (3.16, 3.33) | 0.6852 | 0.1991 | (0.6821, 0.6897) |
MDP | | 457 | 8.54 | 3.67 | (8.29, 8.78) | 0.7857 | 0.1327 | (0.7804, 0.7924) |
MDP | | 913 | 14.18 | 2.69 | (13.90, 14.60) | 0.8504 | 0.0928 | (0.8454, 0.8553) |
MDP | | 1825 | 18.18 | 2.41 | (17.84, 18.53) | 0.8936 | 0.0673 | (0.8896, 0.8979) |
MDP | | 121 | 6.58 | 3.92 | (6.47, 6.66) | 0.8482 | 0.1921 | (0.8435, 0.8518) |
MDP | | 265 | 19.41 | 6.23 | (19.23, 19.59) | 0.9398 | 0.1020 | (0.9366, 0.9426) |
MDP | | 505 | 22.08 | 2.22 | (21.91, 22.26) | 0.9691 | 0.0247 | (0.9665, 0.9723) |
MDP | | 121 | 5.38 | 3.74 | (5.30, 5.46) | 0.8271 | 0.2213 | (0.8234, 0.8300) |
MDP | | 265 | 21.11 | 2.65 | (20.92, 21.35) | 0.9640 | 0.0347 | (0.9601, 0.9667) |
MDP | | 505 | 22.45 | 2.23 | (22.28, 22.62) | 0.9719 | 0.0219 | (0.9685, 0.9745) |
POMDP | | 169 | 2.98 | 2.47 | (2.91, 3.05) | 0.7082 | 0.1890 | (0.7039, 0.7123) |
POMDP | | 337 | 7.15 | 3.06 | (6.95, 7.37) | 0.7711 | 0.1388 | (0.7658, 0.7781) |
POMDP | | 673 | 13.46 | 3.24 | (13.09, 13.76) | 0.8443 | 0.1026 | (0.8396, 0.8506) |
POMDP | | 1345 | 17.38 | 2.65 | (17.06, 17.73) | 0.8889 | 0.0752 | (0.8840, 0.8945) |
POMDP | | 175 | 2.68 | 2.60 | (2.61, 2.74) | 0.6834 | 0.1942 | (0.6792, 0.6866) |
POMDP | | 349 | 6.35 | 3.53 | (6.18, 6.54) | 0.7677 | 0.1488 | (0.7633, 0.7725) |
POMDP | | 697 | 13.70 | 2.79 | (13.44, 13.99) | 0.8466 | 0.0985 | (0.8411, 0.8515) |
POMDP | | 1393 | 17.97 | 2.60 | (17.67, 18.25) | 0.8948 | 0.0723 | (0.8898, 0.9004) |
POMDP | | 745 | 12.34 | 7.56 | (12.09, 12.60) | 0.8335 | 0.2058 | (0.8277, 0.8386) |
POMDP | | 817 | 16.79 | 4.66 | (16.45, 17.04) | 0.9040 | 0.1135 | (0.8994, 0.9091) |
POMDP | | 937 | 18.14 | 4.28 | (17.83, 18.31) | 0.9476 | 0.0660 | (0.9443, 0.9508) |
POMDP | | 745 | 17.14 | 3.98 | (16.77, 17.47) | 0.8834 | 0.1106 | (0.8769, 0.8896) |
POMDP | | 817 | 18.49 | 3.91 | (18.23, 18.80) | 0.9226 | 0.0831 | (0.9172, 0.9272) |
POMDP | | 937 | 19.09 | 3.44 | (18.86, 19.46) | 0.9485 | 0.0603 | (0.9458, 0.9523) |
The trainable actor and critic parameter counts for the ablation experiments are given in the table below:

Framework | Ablation Selection | Model | MDP dynamics | POMDP dynamics |
---|---|---|---|---|
 | | Actor | 136 | 412 |
 | | Critic | 265 (132 per agent, 1 central) | 817 (408 per agent, 1 central) |
 | | Actor | 136 | 412 |
 | | Critic | 265 | 817 |
 | | Actor | 496 | 388 |
 | | Critic | 889 | 673 |
 | | Actor | 496 | 388 |
 | | Critic | 913 (444 per agent, 25 central) | 697 (336 per agent, 25 central) |
If you use the code in this repository for your research or publication, please cite our paper published in ICLR 2025 using the following BibTeX entry (also available in CITATION.bib):
@inproceedings{derieux2025eqmarl,
title={e{QMARL}: Entangled Quantum Multi-Agent Reinforcement Learning for Distributed Cooperation over Quantum Channels},
author={Alexander DeRieux and Walid Saad},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=cR5GTis5II},
doi={10.48550/arXiv.2405.17486}
}