Can a Deep-learning Model for the Automated Detection of Vertebral Fractures Approach the Performance Level of Human Subspecialists?

Yi Chu Li, Hung Hsun Chen, Henry Horng-Shing Lu, Hung Ta Hondar Wu, Ming Chau Chang, Po Hsin Chou

Research output: Contribution to journal › Article › peer-review

34 Scopus citations


BACKGROUND: Vertebral fractures are the most common osteoporotic fractures in older individuals. Recent studies suggest that the performance of artificial intelligence is equal to that of humans in detecting osteoporotic fractures, such as fractures of the hip, distal radius, and proximal humerus. However, whether artificial intelligence performs as well in the detection of vertebral fractures on plain lateral spine radiographs has not yet been reported.

QUESTIONS/PURPOSES: (1) What are the accuracy, sensitivity, specificity, and interobserver reliability (kappa value) of an artificial intelligence model in detecting vertebral fractures, based on Genant fracture grades, using plain lateral spine radiographs, compared with values obtained by human observers? (2) Do patients' clinical data, including the anatomic location of the fracture (thoracic or lumbar spine), T-score on dual-energy x-ray absorptiometry, or fracture grade severity, affect the performance of an artificial intelligence model? (3) How does the artificial intelligence model perform on external validation?

METHODS: Between 2016 and 2018, 1019 patients older than 60 years were treated for vertebral fractures in our institution. Seventy-eight patients were excluded because of missing CT or MRI scans (24% [19]), poor image quality in plain lateral radiographs of the spine (54% [42]), multiple myeloma (5% [4]), and prior spine instrumentation (17% [13]). The plain lateral radiographs of 941 patients (one radiograph per person), with a mean age of 76 ± 12 years, and 1101 vertebral fractures between T7 and L5 were retrospectively evaluated for training (n = 565), validating (n = 188), and testing (n = 188) of an artificial intelligence deep-learning model. The gold standard for diagnosis (ground truth) of a vertebral fracture was the independent interpretation of the CT or MRI reports by a spine surgeon and a radiologist.
If the two observers disagreed, they rechecked the corresponding CT or MRI images together to reach a consensus. For the Genant classification, the injured vertebral body height was measured in the anterior, middle, and posterior thirds. Fractures were classified as Grade 1 (< 25%), Grade 2 (26% to 40%), or Grade 3 (> 40%). The framework of the artificial intelligence deep-learning model included object detection, data preprocessing of radiographs, and classification to detect vertebral fractures. Approximately 90 seconds was needed to complete the procedure and obtain the artificial intelligence model results when applied clinically. The accuracy, sensitivity, specificity, interobserver reliability (kappa value), receiver operating characteristic curve, and area under the curve (AUC) were analyzed. The bootstrapping method was applied to our testing dataset and external validation dataset. The accuracy, sensitivity, and specificity were used to investigate whether fracture anatomic location or the T-score on the dual-energy x-ray absorptiometry report affected the performance of the artificial intelligence model. The receiver operating characteristic curve and AUC were used to investigate the relationship between the performance of the artificial intelligence model and fracture grade. External validation with a population of similar age and plain lateral radiographs from another medical institute was also performed to investigate the performance of the artificial intelligence model.

RESULTS: The artificial intelligence model with ensemble method demonstrated excellent accuracy (93% [773 of 830] of vertebrae), sensitivity (91% [129 of 141]), and specificity (93% [644 of 689]) for detecting vertebral fractures of the lumbar spine.
The interobserver reliability (kappa value) between the artificial intelligence model and human observers for thoracic and lumbar vertebrae was 0.72 (95% CI 0.65 to 0.80; p < 0.001) and 0.77 (95% CI 0.72 to 0.83; p < 0.001), respectively. The AUCs for Grades 1, 2, and 3 vertebral fractures were 0.919, 0.989, and 0.990, respectively. The artificial intelligence model with ensemble method demonstrated poorer performance in discriminating normal osteoporotic lumbar vertebrae (specificity 91% [260 of 285]) than nonosteoporotic lumbar vertebrae (specificity 95% [222 of 234]). Sensitivity was higher for detecting osteoporotic (dual-energy x-ray absorptiometry T-score ≤ -2.5) lumbar vertebral fractures (97% [60 of 62]), implying easier detection, than for nonosteoporotic vertebral fractures (83% [39 of 47]). The artificial intelligence model also demonstrated better detection of lumbar vertebral fractures than of thoracic vertebral fractures in the external dataset, which used various radiographic techniques. Based on the dataset for external validation, the overall accuracy, sensitivity, and specificity with the bootstrapping method were 89%, 83%, and 95%, respectively.

CONCLUSION: The artificial intelligence model detected vertebral fractures on plain lateral radiographs with high accuracy, sensitivity, and specificity, especially for osteoporotic lumbar vertebral fractures (Genant Grades 2 and 3). The rapid reporting of results using this artificial intelligence model may improve the efficiency of diagnosing vertebral fractures. The testing model is available at One or multiple plain lateral radiographs of the spine in the Digital Imaging and Communications in Medicine format can be uploaded to see the performance of the artificial intelligence model.

LEVEL OF EVIDENCE: Level II, diagnostic study.
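The lumbar-spine figures in the abstract can be reproduced from the reported counts. The sketch below is illustrative only and is not the authors' code. It rebuilds a confusion matrix from the stated counts (129 of 141 fractured vertebrae detected, 644 of 689 unfractured vertebrae correctly cleared) and adds a percentile bootstrap interval. The abstract does not describe the authors' bootstrapping procedure, so the resampling scheme, the function names, and the parameters `n_resamples`, `alpha`, and `seed` are all assumptions. In particular, resampling individual vertebrae ignores the fact that several vertebrae come from the same radiograph.

```python
import random
from collections import Counter


def diagnostic_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def bootstrap_ci(tp, fn, tn, fp, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for one metric (hypothetical helper).

    Vertebrae are resampled with replacement as independent units,
    which is an assumption and not a detail taken from the abstract.
    """
    rng = random.Random(seed)
    # One label per vertebra, rebuilt from the four counts.
    cases = ["tp"] * tp + ["fn"] * fn + ["tn"] * tn + ["fp"] * fp
    stats = []
    for _ in range(n_resamples):
        c = Counter(rng.choices(cases, k=len(cases)))
        stats.append(diagnostic_metrics(c["tp"], c["fn"], c["tn"], c["fp"])[metric])
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Counts derived from the abstract: 141 fractured and 689 unfractured vertebrae.
TP, FN = 129, 141 - 129
TN, FP = 644, 689 - 644

metrics = diagnostic_metrics(TP, FN, TN, FP)
sens_ci = bootstrap_ci(TP, FN, TN, FP, "sensitivity")

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
print("sensitivity 95% bootstrap interval:", sens_ci)
```

Rounded to two decimals, the point estimates match the reported 93%, 91%, and 93%. The bootstrap interval depends on the seed and on the resampling assumptions above, so it should not be read as the interval the study would report.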

Original language: English
Pages (from-to): 1598-1612
Number of pages: 15
Journal: Clinical Orthopaedics and Related Research
Issue number: 7
State: Published - 1 Jul 2021

