Process-CT-Data

Source code on GitHub.

Repository Info

Language: Python
License: Apache-2.0
Last Updated: 2026-05-07

README

> Paper Repository — Preprocessing pipeline for converting raw 3D CT volumes into a slice-based WebDataset format for scalable distributed training.

---

Overview

This repository implements the dataset preparation pipeline described in the paper. Raw 3D CT volumes and annotations from the CT-RATE dataset are converted into a slice-based WebDataset format optimized for high-throughput, multi-GPU training. The dataset is structured to natively support 2D, 2.5D (slab), and 3D training paradigms. The sampling logic for each paradigm (e.g., 2.5D slab extraction, 3D patch sampling) lives in the dataloader of the training codebase, but the shards produced here are pre-organized to make those lookups fast. The pipeline handles:

  • HU Conversion & Float16 Casting: Raw NIfTI volumes are converted to Hounsfield Units (HU) using per-volume Rescale Slope/Intercept metadata and cast to 16-bit floating-point precision.
  • Dual Annotation Synchronization: Two annotation sources are spatially aligned.
      - TotalSegmentator (TS): 118 anatomical classes, automatically generated for all images, resized to 128×128.
      - ReXGroundingCT (ReX): Professional annotations for multi-label radiological findings (14 classes), binarized and resized to 128×128.
  • Slab-Based Grouping: Incoming slices are grouped into continuous 12 mm slabs, dynamically calculated using Z-spacing metadata, enabling efficient 2.5D sampling in downstream dataloaders.
  • Intensity Normalization: HU values are clipped to [−997, 888] and z-score normalized (μ = −142, σ = 361), derived from the 0.5% and 99.5% percentiles of foreground voxels across a random sample of 1,000 CT-RATE scans.
  • Multi-Crop Strategy: Two global crops (size 256) from the center slice and eight local crops (size 144) sampled throughout the 12 mm window, with 80% probability of centering on RAD-ChestCT labels (falling back to TotalSegmentator masks when absent).
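
As a rough illustration of the HU conversion, intensity normalization, and slab grouping steps listed above, the following Python sketch uses NumPy. The function names, the rounding rule for the number of slices per slab, and the handling of a shorter trailing slab are assumptions made for illustration and are not taken from this repository; only the numeric constants come from the list above.

```python
import numpy as np

# Constants quoted in the pipeline description above.
HU_CLIP_MIN, HU_CLIP_MAX = -997.0, 888.0
HU_MEAN, HU_STD = -142.0, 361.0
SLAB_THICKNESS_MM = 12.0


def to_hu_float16(raw, slope, intercept):
    """Apply the linear rescale raw * slope + intercept, then cast to float16."""
    hu = raw.astype(np.float32) * slope + intercept
    return hu.astype(np.float16)


def normalize_hu(hu):
    """Clip to the stated HU window, then z-score with the stated mean and std."""
    clipped = np.clip(hu.astype(np.float32), HU_CLIP_MIN, HU_CLIP_MAX)
    return (clipped - HU_MEAN) / HU_STD


def slices_per_slab(z_spacing_mm, slab_mm=SLAB_THICKNESS_MM):
    """Slices needed to span roughly slab_mm (assumed rule: round, minimum 1)."""
    return max(1, int(round(slab_mm / z_spacing_mm)))


def group_into_slabs(num_slices, z_spacing_mm):
    """Return (start, stop) index ranges of consecutive, non-overlapping slabs.

    Assumption: a shorter trailing slab is kept rather than dropped.
    """
    step = slices_per_slab(z_spacing_mm)
    return [(start, min(start + step, num_slices))
            for start in range(0, num_slices, step)]
```

Under these assumptions, a 20-slice volume with 1.5 mm Z-spacing yields 8 slices per slab, so `group_into_slabs(20, 1.5)` returns `[(0, 8), (8, 16), (16, 20)]`. The actual pipeline may round differently or discard incomplete slabs.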

---

Repository Structure

``
.
├── README.md                       # This file
│
├── 1_preprocessing/                # Stage 1: Raw NIfTI → NPY conversion
│   ├── preprocess_val.py           # Validation set: NIfTI → NPY (HU + transpose)
│   ├── preprocess_val_resized.py   # Validation set: NIfTI → NPY (HU + 1mm isotropic resample)
│   └── preprocess_rad.py           # RAD-ChestCT: NPZ → NPY (1mm isotropic resample)
│
├── 2_label_generation/             # Stage 2: Label creation & metadata extraction
│   ├── create_rad_labels.py        # Collapse RAD-ChestCT location columns → 16 ev...