> Paper Repository — Preprocessing pipeline for converting raw 3D CT volumes into a slice-based WebDataset format for scalable distributed training.
---
This repository implements the dataset preparation pipeline described in the paper. Raw 3D CT volumes and annotations from the CT-RATE dataset are converted into a slice-based WebDataset format optimized for high-throughput, multi-GPU training. The dataset is structured to natively support 2D, 2.5D (slab), and 3D training paradigms. The sampling logic for each paradigm (e.g., 2.5D slab extraction, 3D patch sampling) lives in the dataloader of the training codebase, but the shards produced here are pre-organized so that those lookups are fast. The pipeline handles:
---
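As a rough illustration of the slice-based layout described above, the sketch below writes each 2D slice of a volume as one sample in a tar shard, following the general WebDataset convention that files sharing a basename form one sample. It is not the repository's implementation: `slice_key`, `write_shard`, `slab_indices`, the `.bin`/`.json` payloads, and the key scheme are all hypothetical and chosen only to keep the example dependency-free. The slab helper mirrors logic that, per the description above, belongs in the training dataloader; it is included only to show why contiguous, zero-padded slice keys make neighbor lookups simple.

```python
import io
import json
import tarfile
from array import array


def slice_key(volume_id: str, index: int) -> str:
    """Hypothetical key scheme: zero-padding keeps lexicographic order equal to slice order."""
    return f"{volume_id}_{index:04d}"


def write_shard(path: str, volume_id: str, slices: list) -> None:
    """Write every slice as one sample: raw int16 pixel bytes plus JSON metadata.

    Assumption: each slice is a flat list of integer HU values. A real pipeline
    would more likely store an array format such as .npy.
    """
    with tarfile.open(path, "w") as tar:
        for i, pixels in enumerate(slices):
            key = slice_key(volume_id, i)
            meta = {
                "volume_id": volume_id,
                "slice_index": i,
                "num_slices": len(slices),
            }
            payloads = (
                ("bin", array("h", pixels).tobytes()),
                ("json", json.dumps(meta).encode("utf-8")),
            )
            # Files with the same basename are grouped into a single sample.
            for ext, payload in payloads:
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def slab_indices(center: int, depth: int, num_slices: int) -> list:
    """Slice indices of a 2.5D slab around `center`, clamped at the volume edges."""
    half = depth // 2
    return [
        min(max(center + offset, 0), num_slices - 1)
        for offset in range(-half, half + 1)
    ]
```

Storing `num_slices` alongside each slice lets a loader compute the clamped neighbor indices of a slab from a single sample's metadata, without first scanning the whole volume.
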
``
.
├── README.md # This file
│
├── 1_preprocessing/ # Stage 1: Raw NIfTI → NPY conversion
│ ├── preprocess_val.py # Validation set: NIfTI → NPY (HU + transpose)
│ ├── preprocess_val_resized.py # Validation set: NIfTI → NPY (HU + 1mm isotropic resample)
│ └── preprocess_rad.py # RAD-ChestCT: NPZ → NPY (1mm isotropic resample)
│
├── 2_label_generation/ # Stage 2: Label creation & metadata extraction
│ ├── create_rad_labels.py # Collapse RAD-ChestCT location columns → 16 ev...