Vision Foundry is a self-service platform for AI-powered image analysis. It is designed for researchers working with unlabeled or hard-to-label image datasets, especially in the biomedical domain, and helps them extract high-dimensional feature representations from that data.
At the core of Vision Foundry is DinoMX, a modular PyTorch-based training framework that facilitates self-supervised representation learning using Vision Transformers (ViTs). The pipeline builds upon the DINO (self-distillation with no labels) and DINOv2 frameworks, introduced by Meta in 2021 and 2023, respectively.
The initial use case focused on neuropathology. As part of the Federated Brain Digital Slide Archive project, NP-TEST-0 was developed — a Vision Transformer pretrained using DinoMX on real-world neuropathology data. It supports transfer learning, tissue segmentation, patch-level classification, and similarity search.
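As one illustration of similarity search, once patch embeddings have been extracted with a pretrained ViT, the search itself reduces to a nearest-neighbor lookup in feature space. The sketch below is a minimal, hypothetical example (the function name and toy data are ours, not NP-TEST-0's actual API):

```python
import torch
import torch.nn.functional as F

def nearest_patches(query, bank, k=3):
    """Return indices of the k embeddings in `bank` most similar to `query`.

    query: (d,) embedding of one image patch.
    bank:  (n, d) embeddings of candidate patches (e.g., from a ViT).
    """
    sims = F.cosine_similarity(bank, query.unsqueeze(0), dim=1)  # (n,)
    return torch.topk(sims, k).indices

# Toy embedding bank standing in for real ViT features.
bank = F.normalize(torch.randn(100, 16), dim=1)
idx = nearest_patches(bank[7], bank)
print(idx[0].item())  # the query is its own nearest neighbor -> 7
```

In practice the bank would hold embeddings for every tile of a digital slide, so a pathologist can retrieve tissue regions that look like a selected patch.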
DinoMX also draws on iBOT, a self-supervised approach that combines self-distillation with masked image modeling and which DINOv2 itself builds upon.
DinoMX replaces traditional convolutional segmentation architectures such as U-Net with an attention-map-based segmentation strategy. Instead of relying on decoder-specific layers, the model uses its native transformer attention maps to localize and interpret image regions.
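To make the idea concrete, here is a minimal sketch of attention-based localization, assuming access to the last block's attention tensor (e.g., captured via a forward hook); the function name and threshold are illustrative, not DinoMX's actual implementation:

```python
import torch

def cls_attention_mask(attn, grid=14, keep=0.6):
    """Binary foreground mask from a ViT's [CLS]-to-patch attention.

    attn: (heads, tokens, tokens) attention weights from the last block,
    where token 0 is [CLS] and the remaining grid*grid tokens are patches.
    """
    cls_to_patches = attn[:, 0, 1:].mean(dim=0)   # average over heads
    k = int(keep * cls_to_patches.numel())        # keep the top 60% of patches
    thresh = torch.topk(cls_to_patches, k).values.min()
    return (cls_to_patches >= thresh).reshape(grid, grid)

# Toy attention standing in for a real forward-hook capture
# (6 heads, 196 patch tokens + 1 [CLS] token: a 224x224 input, patch size 16).
attn = torch.rand(6, 197, 197).softmax(dim=-1)
mask = cls_attention_mask(attn)
print(mask.shape)  # torch.Size([14, 14])
```

Because the mask comes directly from the attention weights, no extra decoder has to be trained before the model can highlight salient tissue regions.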
The training pipeline is optimized for distributed training. DinoMX uses two types of configuration files; one of them specifies accelerator attributes, such as the choice between the FSDP and DDP distributed training strategies. All experiments are tracked via ClearML.
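Assuming the accelerator configuration follows the Hugging Face Accelerate schema (an assumption on our part; DinoMX's actual file format may differ), a single-node DDP setup might look like:

```yaml
# Illustrative Accelerate-style accelerator config; field names follow
# the Hugging Face Accelerate schema, not DinoMX's actual files.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU   # DDP; use FSDP for fully sharded training
num_machines: 1
num_processes: 8              # one process per GPU
mixed_precision: bf16
```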
Outputs follow the standard Hugging Face model format for streamlined sharing.
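A checkpoint in this format round-trips through `save_pretrained` / `from_pretrained`. The tiny randomly initialized ViT below is only a stand-in for demonstration (its dimensions are arbitrary, not those of a DinoMX checkpoint):

```python
import tempfile
from transformers import ViTConfig, ViTModel

# Tiny ViT used only to demonstrate the save/load roundtrip;
# a real DinoMX export would carry trained weights.
config = ViTConfig(hidden_size=32, num_hidden_layers=2,
                   num_attention_heads=2, intermediate_size=64,
                   image_size=32, patch_size=8)
model = ViTModel(config)

with tempfile.TemporaryDirectory() as outdir:
    model.save_pretrained(outdir)             # writes config.json + weights
    reloaded = ViTModel.from_pretrained(outdir)

print(reloaded.config.hidden_size)  # 32
```

Sharing a directory in this layout lets collaborators load the model with the standard `transformers` tooling, with no DinoMX-specific loader required.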
Although the DinoMX tool is available, the self-service platform Vision Foundry is still under development.
DinoMX expands on Meta's DINO; for background, see Meta's DINOv2 blog post.
DinoMX is being leveraged in several ongoing efforts.
Resources include the NVIDIA DGX computing cluster (5x DGX H100, 40 GPUs, 3.2 TB VRAM).
Read the paper: Vision Foundry: A System for Training Foundational Vision AI Models (arXiv)