
Challenge CURVAS:

Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation

6-10 October 2024: meet us in Marrakesh for MICCAI 2024!


Interrater variability poses significant challenges for deep learning (DL) in medical image segmentation. DL models are often tasked with delineating structures or abnormalities, such as tumors, blood vessels, or organs, within complex anatomy. Uncertainty arises from the inherent complexity and variability of these structures, which makes their boundaries difficult to define precisely. This uncertainty is compounded by interrater variability, as different medical experts may disagree on where the true boundaries lie. DL models must grapple with these discrepancies, which lead to inconsistent segmentation results across annotators and can affect diagnosis and treatment decisions. Addressing interrater variability involves developing robust algorithms that capture and quantify uncertainty, standardizing annotation practices, and promoting collaboration among medical experts to reduce variability and improve the reliability of DL-based medical image analysis.

Furthermore, achieving model calibration, a fundamental aspect of reliable predictions, becomes notably challenging when dealing with multiple classes and raters. Calibration ensures that predicted probabilities align with the true likelihood of events, which makes the model's outputs more reliable. Even if it is not obvious, having multiple classes introduces additional uncertainty arising from the interactions between them. Moreover, incorporating annotations from multiple raters adds another layer of complexity, as differing expert opinions contribute to a broader spectrum of variability and to higher computational cost.
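To make the notion of calibration concrete, here is a minimal sketch of one common calibration measure, the binned expected calibration error (ECE). It is an illustration only and not necessarily the metric used in this challenge; the function name, the equal-width binning and the default of 10 bins are assumptions made for this example.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| over equal-width bins.

    confidences: predicted probability of the chosen class, one value per voxel.
    correct:     1 if the chosen class matches the reference, else 0.
    (Hypothetical helper for illustration; not the official challenge metric.)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so a confidence of exactly 1.0 lands in the last bin.
    bin_ids = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

# A model that says "90% sure" but is right only 1 time in 4 is over-confident.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0]))
```

A value near zero means that confidence and observed accuracy agree within every bin; larger values indicate over- or under-confidence.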

Consequently, the development of robust algorithms capable of effectively capturing and quantifying variability and uncertainty, while also accommodating the nuances of multi-class and multi-rater scenarios, becomes imperative. Striking a balance between model calibration, accurate segmentation and handling variability in medical annotations is crucial for the success and reliability of DL-based medical image analysis.

For these reasons, we have created a challenge that addresses all of the above. In this challenge, we will work with abdominal CT scans. Each scan will have three annotations, each obtained from a different expert, and each annotation will contain three classes: pancreas, kidney and liver.

The main idea is to evaluate the results using multi-rater information. There will be two parts. The first part will be a classical Dice score evaluation together with a volume assessment, to provide clinically relevant information as well. The second part will study whether the model is calibrated. All of these evaluations will be performed considering all three annotations.
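As a rough illustration of what a Dice evaluation against several raters can look like, the sketch below computes the per-class Dice score of one prediction against each rater's annotation and averages the result. Averaging across raters is an assumption made for this example and may differ from the official evaluation; the function names are hypothetical.

```python
import numpy as np

def dice(pred, ref, label):
    """Dice overlap between prediction and reference for a single class label."""
    p = pred == label
    r = ref == label
    denom = p.sum() + r.sum()
    if denom == 0:
        return 1.0  # convention chosen here: class absent in both counts as agreement
    return 2.0 * np.logical_and(p, r).sum() / denom

def multi_rater_dice(pred, rater_masks, labels=(1, 2, 3)):
    """Mean Dice per class over all rater annotations (an assumed aggregation)."""
    return {
        label: float(np.mean([dice(pred, ref, label) for ref in rater_masks]))
        for label in labels
    }

# Toy 2x2 example: the second rater disagrees on one pancreas (label 1) pixel.
pred = np.array([[1, 1], [2, 3]])
raters = [pred.copy(), np.array([[1, 0], [2, 3]]), pred.copy()]
print(multi_rater_dice(pred, raters))
```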



The challenge cohort consists of 90 CT images prospectively gathered at the University Hospital Erlangen between August 2023 and October 2023. Each CT will have multiple classes: background (0), pancreas (1), kidney (2) and liver (3). In addition, each CT will have three annotations, one from each of three different experts, and each annotation will contain the four classes listed above.
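Given the label encoding above, organ volumes can be derived from an annotation by counting voxels per class and multiplying by the physical voxel size. The sketch below assumes the label map is already loaded as a NumPy array and that the voxel spacing is known in millimetres; the function name and the toy spacing are hypothetical, and only the label values come from the description above.

```python
import numpy as np

# Label values as stated in the dataset description (background is 0).
LABELS = {"pancreas": 1, "kidney": 2, "liver": 3}

def organ_volumes_ml(label_map, spacing_mm):
    """Volume per organ in millilitres: voxel count times voxel volume.

    spacing_mm: voxel size along each axis in mm (assumed known from the image header).
    """
    voxel_ml = float(np.prod(spacing_mm)) / 1000.0  # 1 mL = 1000 mm^3
    return {
        name: float((label_map == value).sum()) * voxel_ml
        for name, value in LABELS.items()
    }

# Toy volume: 500 liver voxels and 100 pancreas voxels of 1 x 1 x 2 mm each.
toy = np.zeros((10, 10, 10), dtype=np.uint8)
toy[:5] = 3
toy[5:7, :5] = 1
print(organ_volumes_ml(toy, (1.0, 1.0, 2.0)))
```

Applying the same function to each of the three annotations of a scan gives a simple view of how much the raters differ in volume for each organ.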

Training Phase cohort:
20 CT scans belonging to group A will be provided with their respective annotations. Participants are encouraged to leverage publicly available external data annotated by multiple raters. Providing a small training set while allowing public datasets is meant to make the challenge more inclusive, since methods can be developed with data that is available to anyone. Furthermore, training on one source of data and evaluating on another should make the resulting methods more robust to shifts and other sources of variability between datasets.

You can find the training set here:

Validation Phase cohort:
5 CT scans belonging to group A will be used for this phase.

Test Phase cohort:
65 CT scans will be used for evaluation. 20 CTs belonging to group A, 22 CTs belonging to group B and 23 CTs belonging to group C.

Neither the validation nor the test cohort will be published until the end of the challenge. Furthermore, the group to which each CT scan belongs will not be revealed until after the challenge.

Ranking and Prizes

The top five performing methods will be announced publicly. Winners will be invited to present their methods and results at the challenge event hosted at MICCAI 2024.

Two members of each participating team may qualify as authors (one must be the person who submits the results). Participating teams may publish their own results separately only after the organizer has published a challenge paper, and must always cite the organizer's challenge paper.

Training Set Release
Open Development
  • Validation submission open
Closed Testing Phase
Preliminary Results
  • Final Algorithms 
  • Release of the results
  • Replication of results
  • Challenge winners announced
  • Contact winners to invite them to MICCAI 2024
  • Writing paper with results

This challenge is hosted on the Grand Challenge platform.

In collaboration with

The challenge has been co-funded by Proyectos de Colaboración Público-Privada (CPP2021-008364), funded by MCIN/AEI, and by the European Union through NextGenerationEU/PRTR.
