Open Pathology Dataset (OPaD) #3410
-
TCGA please
-
We would need to consider how to handle the size of some of these datasets. Whilst it's feasible to download some datasets within a notebook session, this is unlikely for something like Camelyon. One option would be to stream the data during the first training epoch and then use the locally cached images thereafter.
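To make the stream-then-cache idea concrete, here is a minimal sketch using only the Python standard library. It is not an existing MONAI API: the `CachedRemoteDataset` class and the `fetch` callable are hypothetical names chosen for illustration, and the "remote" source is faked so the example runs offline.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path
from typing import Callable


class CachedRemoteDataset:
    """Fetch each item remotely on first access, then serve it from a local cache."""

    def __init__(self, keys: list[str], fetch: Callable[[str], bytes], cache_dir: str):
        self.keys = keys
        self.fetch = fetch  # hypothetical callable: returns the raw bytes for a key
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _cache_path(self, key: str) -> Path:
        # Hash the key so that URLs or paths become safe file names.
        return self.cache_dir / hashlib.sha256(key.encode()).hexdigest()

    def __len__(self) -> int:
        return len(self.keys)

    def __getitem__(self, index: int) -> bytes:
        key = self.keys[index]
        path = self._cache_path(key)
        if path.exists():
            return path.read_bytes()  # later epochs: local read only
        data = self.fetch(key)  # first epoch: stream from the remote source
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(data)
        tmp.replace(path)  # rename so a partial download is never read as complete
        return data


# Demo with a stand-in for the network: three "epochs" over two items.
fetch_log: list[str] = []


def fake_fetch(key: str) -> bytes:
    fetch_log.append(key)
    return key.encode()


with tempfile.TemporaryDirectory() as cache_dir:
    dataset = CachedRemoteDataset(["slide_a", "slide_b"], fake_fetch, cache_dir)
    for epoch in range(3):
        for index in range(len(dataset)):
            dataset[index]

print(fetch_log)
```

In the demo each key is fetched only once, during the first pass; the second and third passes read from the cache directory. A real implementation would also need to handle cache eviction and whole-slide images that are too large to hold in memory as a single byte string.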
-
What are you considering for hosting? Even ROI-based datasets can be several GB. I think the top priority should be standardizing formats for things like label or instance images, and JSON or another format for class names, etc. If you provided the standards, we would happily transform our data to comply and also use these standards in the future.
-
There is a parallel initiative to have MONAI become a "portal" to medical image datasets. The goal is to make it easy to access data from multiple sources, optionally with pre-defined training, validation, and testing splits to promote reproducibility. As a portal, we would provide a collection of "access files". A user would browse and then download a specific access file when they wanted to access a particular set of data from the web. That access file would then be passed to a special MONAI reader, which would automatically handle the downloading, caching, and train/validation/test split of the data for deep learning research. In this way, we don't have to host or curate the data, we aren't responsible for anonymization or other risky acts, we don't have to fund the potentially massive storage and download costs, and we can instead focus our time and resources on making existing and future data repos easy to integrate into MONAI. A very rough, incomplete, preliminary example was demonstrated in this GitHub issue/PR, which focused on getting data from NCI's "The Cancer Imaging Archive": #2212. Would such a "portal" work for pathology data?
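As a rough illustration of what an "access file" and the first step of such a reader could look like, here is a sketch in plain Python. Everything in it is an assumption made for illustration: the JSON layout, the field names (`name`, `license`, `items`, `url`, `split`), the `example.org` URLs, and the `parse_access_file` helper are hypothetical and are not taken from #2212 or from MONAI.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field

ALLOWED_SPLITS = ("training", "validation", "testing")


@dataclass
class AccessFile:
    """Parsed access file: where the data lives, its license, and the splits."""

    name: str
    license: str
    splits: dict[str, list[str]] = field(default_factory=dict)


def parse_access_file(text: str) -> AccessFile:
    """Group the item URLs of a (hypothetical) JSON access file by split."""
    spec = json.loads(text)
    splits: dict[str, list[str]] = {name: [] for name in ALLOWED_SPLITS}
    for item in spec["items"]:
        if item["split"] not in splits:
            raise ValueError(f"unknown split: {item['split']!r}")
        splits[item["split"]].append(item["url"])
    return AccessFile(name=spec["name"], license=spec["license"], splits=splits)


# Hypothetical access file; the URLs are placeholders, not real data locations.
EXAMPLE = """
{
  "name": "example-pathology-set",
  "license": "CC-BY-4.0",
  "items": [
    {"url": "https://example.org/slides/001.tiff", "split": "training"},
    {"url": "https://example.org/slides/002.tiff", "split": "training"},
    {"url": "https://example.org/slides/003.tiff", "split": "validation"},
    {"url": "https://example.org/slides/004.tiff", "split": "testing"}
  ]
}
"""

access = parse_access_file(EXAMPLE)
print(access.license, {name: len(urls) for name, urls in access.splits.items()})
```

The point of the design is that the access file only describes where the data lives and how it is split; the data itself stays with the original host. Downloading and caching would happen in a later step of the reader and are deliberately left out of this sketch.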
-
I would suggest data streaming backed by an object store, for example with support for the zarr format, so that users do not have to download whole datasets and can instead download only the part of the data they want to process. TCGA is one example: sometimes you do not need to download the whole dataset.
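The reason a chunked format helps here is that a region read only needs the chunks that overlap the region. The sketch below illustrates that calculation in plain Python; it is not zarr's implementation or API, and the `chunks_for_region` helper, the chunk size, and the patch coordinates are all made up for the example.

```python
from __future__ import annotations

from itertools import product


def chunks_for_region(
    start: tuple[int, ...],
    stop: tuple[int, ...],
    chunk_shape: tuple[int, ...],
) -> list[tuple[int, ...]]:
    """Return grid coordinates of the chunks overlapping the half-open region [start, stop)."""
    if not (len(start) == len(stop) == len(chunk_shape)):
        raise ValueError("start, stop and chunk_shape must have the same length")
    axis_ranges = []
    for lo, hi, size in zip(start, stop, chunk_shape):
        if lo < 0 or hi <= lo or size <= 0:
            raise ValueError("need 0 <= start < stop and a positive chunk size")
        # The first and last chunk touched along this axis (the last is inclusive).
        axis_ranges.append(range(lo // size, (hi - 1) // size + 1))
    return list(product(*axis_ranges))


# A 512 x 512 patch whose top-left corner is at (row 2048, column 3000),
# read from an image stored in 1024 x 1024 chunks.
needed = chunks_for_region((2048, 3000), (2560, 3512), (1024, 1024))
print(needed)
```

Only the chunks in `needed` would have to be fetched from the object store to serve that patch, however large the full slide is.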
-
Just wanted to add here that we should think about licenses. If we are offering easy-to-use download tools, for example, are we under some legal or at least moral obligation to remind people of the licenses of the datasets? It's possible that it's a no-op, but we should at least have thought about it.
-
I think the licensing and streaming are related. Currently, a single slide is the most granular element provided by the data hosting sites. If you want finer granularity, like patch/ROI, then I think this implies re-hosting the data. It could become messy to deal with licensing, but I think there is a lot of value in providing finer-grained access.
-
Are you also considering seeking collaboration with existing standardisation organisations that are, in some ways, working on the same subjects? In some areas healthcare IT struggles with a lack of standardised ways of interfacing between applications, and digital pathology is one of those areas. Since most of these datasets arise from clinical data sources, and DICOM, for example, is now gaining traction in clinical implementations, it would make sense to stay close to DICOMweb (JSON) APIs where possible. IHE (Integrating the Healthcare Enterprise) is also working on describing AI use cases and translating those to existing (DICOM and HL7) standards; although that work is in the radiology domain, many of the concepts are of course reusable for pathology imaging.
-
I've been following the MONAI project since its early days, attracted by its user-friendly design that simplifies understanding of various medical domain processes. I'm curious about any updates related to the project, particularly progress on streaming with zarr (for data such as TCGA), and any advancements regarding formats that encompass medical imaging data such as DICOM, HL7, and NIfTI. It would be exciting to have a comprehensive format for medical imaging data. This future-oriented work is crucial, as lowering barriers to access and entry in the medical domain will enable more forward-thinking development. Can you share whether there has been any progress in this area and where I can find more information?
-
Motivation
There are many publicly available digital pathology datasets, and they have been widely used in numerous research projects and challenges. However, there is no universal, easy-to-use API for these datasets that lets users start an AI project without the additional work of manually downloading and preparing the data for the problem at hand.
So far, in MONAI, we have focused on providing users with the capabilities to build, train, and test their models, but the starting point, which is accessing datasets through a simple API in a normalized data format, has not been the focal point.
Although we have touched on this topic for a couple of datasets, like MedNISTDataset and DecathlonDataset, there is no pathology dataset support apart from some generic classes, like PatchWSIDataset and MaskedInferenceWSIDataset.
MONAI Pathology has the potential to become the go-to place to start any AI-in-pathology challenge, and this requires reliable data hosting, flexible downloading, special data preparation, and, more importantly, a normalized data format for different pathology tasks. We can achieve these goals through a self-sufficient, special-purpose collection of open pathology datasets (OPaD) covering any publicly available and well-grounded histopathology data. TorchVision datasets are a good example of how a simple and nearly uniform API can be used across such datasets.
Overall, OPaD aims to lower the barrier for users to experiment with AI models in the pathology domain, to get their hands on real histopathology datasets, and to create a common platform for any development in this area, with direct access to a variety of prepared histopathology datasets.
Requirements
Questions