Multi-modal vision-language model that generates free-text radiology reports from CT slice embeddings. It achieves a new state of the art on the CT-RATE benchmark.
| Metric | Value | Previous SOTA |
|---|---|---|
| Macro F1 | 0.429 | 0.414 (U-VLM) |
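
For readers unfamiliar with the metric, the following is a minimal sketch of how a macro-averaged F1 score is computed in general. It assumes each report has already been reduced to a row of binary abnormality labels (one 0/1 entry per finding). That extraction step, the function names, and the toy data layout are illustrative assumptions, not the evaluation code used for the numbers above.

```python
from typing import Sequence


def f1_for_label(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for a single binary label: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Convention assumed here: a label with no positives at all scores 0.0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-label F1.

    Each row is one report; each column is one (hypothetical) abnormality label.
    """
    n_labels = len(y_true[0])
    scores = [
        f1_for_label([row[j] for row in y_true], [row[j] for row in y_pred])
        for j in range(n_labels)
    ]
    return sum(scores) / n_labels
```

Because every label contributes equally to the mean, rare findings weigh as much as common ones, which is why macro F1 is a common choice for imbalanced multi-label settings.
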
Training: 4-phase curriculum on CT-RATE (~46,400 volumes):
Hardware: 8x NVIDIA H200 GPUs, distributed data parallel (DDP) training in bf16 precision.
Available on Hugging Face. Paper: arXiv:2603.23308