PMBio/Health-Privacy-Challenge

CAMDA 2025 - ELSA Health Privacy Challenge

This repository is a "starter package" for the Health Privacy Challenge, a competition that runs as part of the CAMDA 2025 conference. The challenge is organized in the context of the European Lighthouse on Safe and Secure AI (ELSA, https://elsa-ai.eu).

The Health Privacy Challenge consists of two tracks:

Track I: Featuring Bulk RNA-seq

Track I runs in a “Blue Team (🫐) vs Red Team (🍅)” scheme.

  • The blue teams develop novel privacy-preserving generative methods for bulk gene expression datasets that mitigate privacy risks while preserving biological insights.
  • The red teams launch trustworthy and realistic membership inference attacks (MIA) against the blue teams’ solutions to assess whether these generative methods can withstand privacy attacks.
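To make the red-team task concrete, here is a minimal sketch of one simple membership inference heuristic: score each candidate sample by its distance to the closest synthetic sample, on the assumption that a leaky generator places synthetic points unusually close to its training records. This is an illustration only, not the baseline attack shipped in this repository; the function names, the toy data, and the "leaky generator" are all hypothetical.

```python
import numpy as np


def distance_to_closest_synthetic(candidates, synthetic):
    """Euclidean distance from each candidate row to its nearest synthetic row."""
    # Broadcast to shape (n_candidates, n_synthetic, n_features).
    diffs = candidates[:, None, :] - synthetic[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    return np.sqrt(sq_dists.min(axis=1))


def membership_scores(candidates, synthetic):
    """Higher score means 'more likely a training member' under this heuristic."""
    return -distance_to_closest_synthetic(candidates, synthetic)


# Toy data (hypothetical): 20 members and 20 non-members with 5 features each.
rng = np.random.default_rng(0)
members = rng.normal(size=(20, 5))
non_members = rng.normal(size=(20, 5))

# A deliberately leaky toy "generator": training rows plus a little noise.
synthetic = members + 0.01 * rng.normal(size=members.shape)

scores_members = membership_scores(members, synthetic)
scores_non_members = membership_scores(non_members, synthetic)

print("mean score (members):    ", scores_members.mean())
print("mean score (non-members):", scores_non_members.mean())
```

In this toy setting, members receive higher scores than non-members because the synthetic rows sit almost on top of them. A privacy-preserving generator should make the two score distributions hard to tell apart.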

Track II: Featuring Single-cell RNA-seq

Track II invites participants to explore the privacy and utility of synthetic single-cell gene expression (scRNA-seq) data. Participants are encouraged to:

  • investigate and reveal potential privacy risks linked to generating synthetic scRNA-seq datasets,
  • develop privacy-preserving generative methods that balance data privacy and utility,
  • propose novel evaluation metrics and strategies to assess both utility and privacy preservation in a multi-sample donor setting.

We are looking forward to engaging with you and working together to deepen our understanding of privacy in healthcare. 🤗

Introduction

This repository contains:

  • 👩‍💻 Baseline code for generative methods (Blue Teams) and membership inference attack algorithms (Red Teams).
  • 📝 Documentation that details setup and submission instructions for the competition.
  • 📎 Submission templates to base your submissions on.

🎢 Get started!

Both teams, please check out Getting Started to set up and use the starter package!

Datasets

Datasets are available for download on the ELSA Benchmarks Competition platform after you register and sign the data download agreement.

Track I: Featuring bulk RNA-seq

We re-distribute pre-processed versions of two open-access TCGA RNA-seq datasets, available through the GDC portal:

  • TCGA-BRCA RNASeq

    Dimensions: <1089 x 978>
    Details: Suitable for cancer subtype prediction (5 subtypes)

  • TCGA COMBINED RNASeq (with 10 different cancer tissues)

    Dimensions: <4323 x 978>
    Details: Suitable for cancer tissue of origin prediction (10 tissues)

Navigate here for details about the pre-processing steps.

Track II: Featuring single-cell RNA-seq

We re-distribute raw counts of the OneK1K single-cell RNA-seq dataset (https://onek1k.org/), a cohort containing 1.26 million peripheral blood mononuclear cells (PBMCs) from 981 donors, generously provided by Joseph Powell and the authors (Yazar et al., 2022) at the Garvan Institute of Medical Research.

  • Train dataset: <633711 cells from 490 donors x 25834 genes>
  • Test dataset: <634022 cells from 491 donors x 25834 genes>
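The donor counts above add up to the 981 donors of the full cohort, which is consistent with a split made per donor rather than per cell. If you hold out part of the train dataset for your own validation, you may want to do the same so that no donor contributes cells to both sides. Below is a minimal sketch of such a donor-level split; the function name, the `donor_ids` array, and the toy labels are hypothetical and not part of the starter package.

```python
import numpy as np


def split_by_donor(donor_ids, holdout_fraction=0.5, seed=0):
    """Return boolean masks (keep, holdout) over cells, split at the donor level."""
    donor_ids = np.asarray(donor_ids)
    unique_donors = np.unique(donor_ids)

    # Shuffle donors, not cells, so all cells of a donor stay together.
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_donors)
    n_holdout = int(round(len(shuffled) * holdout_fraction))

    holdout_mask = np.isin(donor_ids, shuffled[:n_holdout])
    return ~holdout_mask, holdout_mask


# Toy example (hypothetical labels): 7 cells from 4 donors.
donor_ids = np.array(["d1", "d1", "d2", "d3", "d3", "d3", "d4"])
keep_mask, holdout_mask = split_by_donor(donor_ids)

print("kept donors:    ", sorted(set(donor_ids[keep_mask])))
print("held-out donors:", sorted(set(donor_ids[holdout_mask])))
```

Splitting at the donor level matters for privacy evaluation in particular: if cells from one donor appear on both sides, a membership inference result on the held-out side no longer reflects unseen individuals.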

Navigate to the Track II homepage for details about the pre-processing steps.

📅 Schedule

Timeline

👥 Organization Team

This competition is designed as a collaborative effort between the European Molecular Biology Laboratory (EMBL), the CISPA Helmholtz Center for Information Security, and the University of Helsinki, with the support of the Barcelona Computer Vision Center (CVC), within the context of the ELSA project.

Track II and the review process are run in collaboration with the Saez-Rodriguez group.

We also thank Katharina Mikulik (DKFZ), Kevin Domanegg (DKFZ), and Danai Vaigaki (EMBL) for helpful feedback.