fastrad GPU Radiomics: IBSI-Validated, PyRadiomics-Compatible and 25x Faster

fastrad, a new PyTorch-native library, rethinks radiomics extraction by moving the full PyRadiomics feature set onto GPU-friendly torch.Tensor objects, delivering speedups of up to 25× without sacrificing numerical fidelity. Radiomics, the practice of computing quantitative imaging biomarkers from CT and MRI, underpins many oncology studies and machine-learning pipelines, but extraction has long been a processing bottleneck with the widely adopted, CPU-bound PyRadiomics toolkit. fastrad targets that bottleneck directly: it implements the complete, IBSI-standardized feature collection as native tensor operations, supports automatic device routing to CUDA or CPU, and is designed to slot into existing PyRadiomics workflows with minimal code changes.

Why radiomics extraction slows research and ML development

Radiomic features feed prognostic models, treatment-response classifiers, and phenotype discovery in oncology. In practice, however, generating those features across hundreds or thousands of scans becomes a time and cost constraint. PyRadiomics — the de facto standard maintained by Dana-Farber / Brigham and Women’s Hospital — is robust and validated, but it runs entirely on CPU. Measured on a modern workstation, PyRadiomics consumes roughly 2.9 seconds per CT scan for a typical lung nodule, a rate that quickly accumulates when processing multi-site clinical cohorts or iterating through feature-space experiments.

That per-scan latency matters: long extraction times slow model development cycles, increase compute costs for large studies, and make real-time or near-real-time analysis impractical. The problem is not merely throughput: computing features on the CPU forces extra host-to-device copies in GPU training loops and complicates integration with PyTorch-based pipelines.

What fastrad changes in the radiomics workflow

fastrad reconceives the extraction pipeline as a PyTorch-native operation. Everything, from DICOM ingestion and ROI handling to intermediate transforms and final feature computation, operates on torch.Tensor objects. The library exposes a familiar API that mirrors PyRadiomics, lowering the barrier to adoption: users can instantiate a RadiomicsFeatureExtractor with device='auto' and call execute(image, mask) to run extraction on GPU when available or fall back to CPU.

Because features remain as tensors, outputs can be passed directly into downstream PyTorch models without NumPy round-trips. That design reduces overhead in end-to-end ML workflows and enables batched GPU processing of many ROIs where appropriate.

Feature coverage and standards compliance

fastrad implements the complete IBSI-standardized feature set corresponding to the full PyRadiomics offering — not a subset. The library computes features across eight IBSI classes:

First‑order statistics (18 features) — intensity descriptors such as mean, entropy, kurtosis.
Shape (3D) (14 features) — volume, surface area, sphericity, compactness.
Per-slice axial shape descriptors (2D) — slice-wise shape metrics.
GLCM (24 features) — grey-level co-occurrence matrix measures.
GLRLM (16 features) — grey-level run-length matrix statistics.
GLSZM (16 features) — grey-level size-zone matrix features.
GLDM (14 features) — grey-level dependence metrics.
NGTDM (5 features) — neighbourhood grey-tone difference measures.

Total feature count aligns with the 105 features used in IBSI Phase 1 validation suites. This parity is important for research reproducibility: investigators can replace PyRadiomics with fastrad without changing the downstream models’ inputs.

Measured performance: where the speed comes from

Benchmarks run on an NVIDIA RTX 4070 Ti show dramatic end-to-end improvements. Using a real non–small cell lung cancer (NSCLC) CT from TCIA, PyRadiomics required about 2.90 seconds per scan. fastrad reduces that to approximately 0.116 seconds per scan on the same workload — a roughly 25× speedup. Even on CPU-only execution, fastrad is faster than PyRadiomics: a single-thread fastrad CPU run completed in 1.10 seconds, about 2.6× faster than PyRadiomics on the same hardware.

Per-class GPU speedups vary by algorithmic complexity and data-parallel opportunity, with measured acceleration factors including:

First‑order statistics: ~49.3×
Shape (3D): ~35.0×
GLCM: ~19.9×
GLRLM: ~12.9×
GLSZM: ~22.5×
GLDM: ~37.2×
NGTDM: ~31.7×

At the reported 0.116 seconds per scan, a single RTX 4070 Ti would theoretically process about 8.6 scans per second, or roughly 517 scans per minute, converting what used to be hours-long batch jobs into minute-scale runs for medium-sized cohorts.
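The throughput arithmetic can be checked directly from the per-scan timings quoted above. The sketch below is only a back-of-the-envelope calculation: the helper names and the 1,000-scan cohort size are illustrative assumptions, and the estimate ignores I/O and scheduling overhead.

```python
# Per-scan timings quoted in the benchmark above (seconds).
PYRADIOMICS_S = 2.90
FASTRAD_GPU_S = 0.116
FASTRAD_CPU_S = 1.10


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


def scans_per_minute(seconds_per_scan: float) -> float:
    """Sustained throughput, ignoring I/O and scheduling overhead."""
    return 60.0 / seconds_per_scan


def cohort_minutes(n_scans: int, seconds_per_scan: float) -> float:
    """Wall-clock minutes to extract features for a whole cohort."""
    return n_scans * seconds_per_scan / 60.0


print(f"GPU speedup: {speedup(PYRADIOMICS_S, FASTRAD_GPU_S):.1f}x")      # 25.0x
print(f"CPU speedup: {speedup(PYRADIOMICS_S, FASTRAD_CPU_S):.1f}x")      # 2.6x
print(f"GPU throughput: {scans_per_minute(FASTRAD_GPU_S):.0f} scans/min")  # 517

# Hypothetical 1,000-scan cohort (the cohort size is an assumption).
print(f"PyRadiomics: {cohort_minutes(1000, PYRADIOMICS_S):.1f} min")     # 48.3
print(f"fastrad GPU: {cohort_minutes(1000, FASTRAD_GPU_S):.1f} min")     # 1.9
```

For a cohort of that size, the same timings imply a drop from about 48 minutes to about 2 minutes of pure extraction time.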

The speedup persists across clinically relevant ROI sizes, although it narrows as the ROI grows. Measured speedups were approximately 25.9× for a 5 mm radius (≈199 voxels), 18.9× for 15 mm (≈8,263 voxels), and 9.7× for 30 mm (≈67,461 voxels). Even large pulmonary nodules retain a near‑tenfold advantage, reflecting practical gains for a wide range of lesion sizes.

Apple Silicon users also see benefits: on an M3 MacBook Air run in CPU-only mode, fastrad outperforms PyRadiomics (8-thread) by about 3.56×, leveraging PyTorch’s ARM NEON vectorization.

Algorithmic improvements: more than just parallelism

Not all speedups are pure parallel scaling. Some classes benefit from algorithmic redesigns. The GLSZM (grey-level size-zone matrix) computation is a good example: PyRadiomics submits the whole image volume to connected-component labeling via scipy.ndimage.label and then discards background labels, incurring substantial unnecessary work for small ROIs. fastrad instead performs connected-component labeling on a tight bounding-box crop around the ROI, reducing the labelled volume by roughly three orders of magnitude for typical nodules. That optimization alone delivers a measured ~23.3× CPU speedup for GLSZM over the prior approach, demonstrating that rethinking serial algorithms for ROI locality can yield gains beyond raw GPU parallelism.
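The idea of labeling only a tight crop can be illustrated with NumPy and SciPy. This is a minimal sketch on synthetic data, not fastrad's actual implementation: the helper names roi_bounding_box and zone_sizes are hypothetical, and the 26-connectivity structure is an assumption made for the example. It shows that labeling the cropped ROI yields the same zone sizes as labeling the whole volume while touching far fewer voxels.

```python
import numpy as np
from scipy import ndimage

# 26-connectivity in 3D (an assumption made for this sketch).
STRUCTURE = np.ones((3, 3, 3), dtype=bool)


def roi_bounding_box(mask: np.ndarray) -> tuple:
    """Return slices that tightly enclose the nonzero voxels of mask."""
    coords = np.nonzero(mask)
    return tuple(slice(int(c.min()), int(c.max()) + 1) for c in coords)


def zone_sizes(image: np.ndarray, mask: np.ndarray) -> list:
    """Sorted sizes of connected same-grey-level zones inside the mask."""
    sizes = []
    for level in np.unique(image[mask]):
        labels, _ = ndimage.label((image == level) & mask, structure=STRUCTURE)
        # bincount index 0 is background, so drop it.
        sizes.extend(np.bincount(labels.ravel())[1:].tolist())
    return sorted(sizes)


# Synthetic 64^3 volume with four grey levels and a small cubic ROI (toy data).
rng = np.random.default_rng(0)
image = rng.integers(1, 5, size=(64, 64, 64))
mask = np.zeros(image.shape, dtype=bool)
mask[30:36, 28:34, 31:37] = True

bbox = roi_bounding_box(mask)
full = zone_sizes(image, mask)                  # labels the whole volume
cropped = zone_sizes(image[bbox], mask[bbox])   # labels only the ROI crop

reduction = image.size / image[bbox].size
print(full == cropped)                          # True: identical zones
print(f"labelled volume reduced {reduction:.0f}x")
```

Because every ROI voxel lies inside the bounding box, connectivity between ROI voxels is unchanged by the crop, which is why the two results match.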

Numerical validation and parity with PyRadiomics

Performance improvements matter only if computed features are correct and reproducible. fastrad was validated rigorously:

IBSI Phase 1 digital phantom compliance: across 105 reference features, the maximum absolute relative deviation versus the IBSI reference was on the order of machine epsilon (3.20 × 10⁻¹⁴%), with zero features outside the formal 1% compliance threshold.
Feature-by-feature parity with PyRadiomics on a real NSCLC CT: all 105 features agreed within approximately 1×10⁻¹¹, far tighter than the common 10⁻⁴ tolerance threshold — effectively seven orders of magnitude better. This tight agreement means models trained using PyRadiomics outputs can consume fastrad features without recalibration or retraining.
Scan–rescan reproducibility: using the RIDER Lung CT scan‑rescan dataset (n=32), intraclass correlation coefficient distributions were compared between fastrad and PyRadiomics via a paired Wilcoxon signed-rank test (W = 647, p = 0.411), indicating no statistically significant difference in scan–rescan variability introduced by fastrad.

These results position fastrad not as a heuristic speed hack, but as a standards‑conformant, production‑grade alternative.
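A parity check of this kind reduces to comparing two feature dictionaries under a relative tolerance. The following is a hedged sketch in plain Python: the function names are hypothetical, and the feature names and values are made-up toy numbers, not results from the validation described above.

```python
def max_relative_deviation(reference: dict, candidate: dict) -> float:
    """Largest |candidate - reference| / |reference| over the reference features."""
    worst = 0.0
    for name, ref in reference.items():
        value = candidate[name]
        # Fall back to absolute error when the reference value is zero.
        denom = abs(ref) if ref != 0 else 1.0
        worst = max(worst, abs(value - ref) / denom)
    return worst


def within_tolerance(reference: dict, candidate: dict, rel_tol: float = 1e-4) -> bool:
    """True if every feature agrees with the reference within rel_tol."""
    return max_relative_deviation(reference, candidate) <= rel_tol


# Toy values only: placeholder names and numbers, not measured features.
reference = {"firstorder_Mean": -312.5, "shape_Sphericity": 0.8125, "glcm_Contrast": 4.75}
candidate = {name: value * (1 + 1e-12) for name, value in reference.items()}

print(within_tolerance(reference, candidate))                 # True
print(max_relative_deviation(reference, candidate) < 1e-11)   # True
```

The default rel_tol of 1e-4 mirrors the common tolerance mentioned above; a stricter threshold can be passed to reproduce the tighter comparison.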

Architecture highlights and memory behavior

Design principles for fastrad emphasize tensor-native computation, device-agnostic feature modules, and predictable device routing:

Everything is a torch.Tensor: intermediate representations stay on the device, enabling zero-copy transfers into PyTorch models and simplifying batched workflows.
Device routing: the extractor supports automatic detection ('auto'), explicit GPU selection ('cuda', which raises an error if CUDA is unavailable), and CPU-only operation, resolving device choice at initialization while keeping individual feature modules device‑agnostic.
Memory footprint: the full GPU pipeline exhibits a modest peak VRAM of roughly 655 MB, which is within reach of consumer GPUs with even 1 GB of VRAM. On CPU, fastrad materializes intermediate tensors and shows higher peak RAM usage for large ROIs (up to ~11.4× higher than PyRadiomics at a 30 mm ROI). This tradeoff reflects the library’s design choice to favor immediate tensor availability for downstream operations; a lazy-evaluation mode intended to reduce CPU RAM pressure is planned for future releases.
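The device-routing behavior described above can be sketched in a few lines. This is an illustration of the logic only, not fastrad's code: resolve_device and cuda_is_available are hypothetical helpers, and the only real library call used is torch.cuda.is_available(), probed lazily so the sketch also runs where PyTorch is not installed.

```python
def cuda_is_available() -> bool:
    """True if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def resolve_device(requested: str = "auto", cuda_available=None) -> str:
    """Resolve 'auto', 'cuda', or 'cpu' to a concrete device string.

    cuda_available may be passed explicitly (handy for testing); when it is
    None, availability is probed through PyTorch.
    """
    if requested not in ("auto", "cuda", "cpu"):
        raise ValueError(f"unknown device {requested!r}")
    if requested == "cpu":
        return "cpu"
    if cuda_available is None:
        cuda_available = cuda_is_available()
    if requested == "cuda":
        if not cuda_available:
            # An explicit GPU request fails loudly instead of falling back.
            raise RuntimeError("device='cuda' requested but CUDA is unavailable")
        return "cuda"
    # 'auto': prefer the GPU, otherwise fall back to the CPU.
    return "cuda" if cuda_available else "cpu"


print(resolve_device("auto", cuda_available=True))    # cuda
print(resolve_device("auto", cuda_available=False))   # cpu
print(resolve_device("cpu"))                          # cpu
```

Resolving the device once, at initialization, keeps the feature modules themselves free of any device-specific branching.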

Platform support, installation, and ecosystem integration

fastrad is distributed on PyPI and released under the Apache 2.0 license. Installation modes include a CUDA-enabled variant that pins compatible PyTorch builds to the CUDA 12.x index and installs cucim for GPU-accelerated connected-component labeling:

CPU + GPU: pip install fastrad[cuda]
CPU only: pip install fastrad

Minimum Python requirement is 3.11. The project maintains a GitHub repository and archives reproducibility artifacts (benchmark scripts, environment specifications, and retrieval instructions) in a Zenodo package accompanying the primary preprint.

Because fastrad preserves feature definitions and numeric parity with PyRadiomics, it can act as a drop‑in replacement in many codebases.

Current limitations and planned work

The maintainers are explicit about gaps in functionality:

File formats: fastrad currently supports DICOM only; NIfTI and MetaImage formats are not yet available. Integration with nibabel is planned.
Convolutional filter features: IBSI Phase 2 filter-based features (wavelets, Laplacian-of-Gaussian) are not yet implemented.
CPU memory behavior: high peak RAM for very large ROIs under CPU-only runs is a limitation; a lazy-evaluation mode is planned to mitigate this.

Contributions are encouraged — especially for NIfTI support, lazy evaluation, and Phase 2 filter features — and the project accepts issues and pull requests on GitHub.

Implications for researchers, clinical trials, and developers

fastrad’s combination of validated outputs and high throughput has practical implications across multiple stakeholders:

Researchers can iterate faster. Reduced extraction time shortens the loop for feature selection, model training, and cross-validation, enabling more experimental bandwidth within fixed grant cycles.
Clinical studies and multi‑site trials can scale without proportionally larger compute budgets. A single consumer GPU processing hundreds of scans per minute moves cohort-wide feature extraction from a cluster-bound job to a workstation or single GPU instance operation.
Developers building end-to-end PyTorch pipelines gain a cleaner data path: keeping features as tensors avoids host-device shuffling and simplifies batching and augmentation pipelines. This is particularly relevant where radiomics feeds directly into deep learning models for downstream tasks like outcome prediction or lesion classification.
For institutions where reproducibility and regulatory traceability matter, the IBSI validation and near-exact numerical parity with PyRadiomics (agreement within approximately 1×10⁻¹¹) reduce barriers to adopting the optimized tool: models trained historically on PyRadiomics features should not require retraining when switching to fastrad.

Beyond immediate productivity gains, fastrad demonstrates how reimplementing domain algorithms in modern tensor frameworks can unlock both algorithmic and systems-level improvements. The GLSZM example shows that reexamining preexisting library calls (e.g., whole-volume labeling) and applying ROI-local algorithms can produce large wins even before considering hardware acceleration.

Reproducibility, benchmarking, and community practices

All reported benchmarks are reproducible: the project provides an archived package (Zenodo) that contains environment specs, scripts, and instructions for retrieving the data used in evaluations. Continuous integration runs the full validation suite on CPU for every pull request via GitHub Actions, ensuring that changes preserve numeric correctness and test coverage. The project also supplies a canonical citation for academic use and an open-source license to facilitate adoption.

These practices — reproducibility archives, automated CI validation, and explicitly provided citation information — make fastrad amenable to both academic research and industrial prototyping and set a useful example for other scientific software projects.

fastrad’s emergence signals a pragmatic shift in medical‑imaging tooling: instead of monolithic CPU-bound libraries, modular implementations that exploit tensor frameworks can improve interoperability and throughput while preserving numerical correctness. For teams managing large imaging cohorts, the benefits are already compelling: faster extraction, lower data-movement overhead, and a path to GPU-native feature pipelines.

Looking ahead, expect fastrad to focus on completing IBSI Phase 2 filters, adding NIfTI support, and introducing a lazy-evaluation mode to reduce CPU memory footprint for very large ROIs; community contributions will likely accelerate those items. As more ML and imaging teams centralize on PyTorch for both training and inference, tools that natively live in that ecosystem — validated, fast, and interoperable — will play a central role in translating radiomics research into deployable models and trial-ready analytics.