TY - JOUR
T1 - Using Hypothesis-Led Machine Learning and Hierarchical Cluster Analysis to Identify Disease Pathways Prior to Dementia
T2 - Longitudinal Cohort Study
AU - Huang, Shih-Tsung
AU - Hsiao, Fei-Yuan
AU - Tsai, Tsung-Hsien
AU - Chen, Pei-Jung
AU - Peng, Li-Ning
AU - Chen, Liang-Kung
N1 - Publisher Copyright: © Shih-Tsung Huang, Fei-Yuan Hsiao, Tsung-Hsien Tsai, Pei-Jung Chen, Li-Ning Peng, Liang-Kung Chen.
PY - 2023
Y1 - 2023
N2 - Background: Dementia development is a complex process in which the occurrence and sequential relationships of different diseases or conditions may construct specific patterns leading to incident dementia. Objective: This study aimed to identify patterns of disease or symptom clusters and their sequences prior to incident dementia using a novel approach incorporating machine learning methods. Methods: Using Taiwan’s National Health Insurance Research Database, data from 15,700 older people with dementia and 15,700 nondementia controls matched on age, sex, and index year (n=10,466, 67% for the training data set and n=5234, 33% for the testing data set) were retrieved for analysis. Using machine learning methods to capture specific hierarchical disease triplet clusters prior to dementia, we designed a study algorithm with four steps: (1) data preprocessing, (2) disease or symptom pathway selection, (3) model construction and optimization, and (4) data visualization. Results: Among 15,700 identified older people with dementia, 10,466 and 5234 subjects were randomly assigned to the training and testing data sets, and 6215 hierarchical disease triplet clusters with positive correlations with dementia onset were identified. We subsequently generated 19,438 features to construct prediction models, and the model with the best performance was support vector machine (SVM) with the by-group LASSO (least absolute shrinkage and selection operator) regression method (total corresponding features=2513; accuracy=0.615; sensitivity=0.607; specificity=0.622; positive predictive value=0.612; negative predictive value=0.619; area under the curve=0.639). In total, this study captured 49 hierarchical disease triplet clusters related to dementia development, and the most characteristic patterns leading to incident dementia started with cardiovascular conditions (mainly hypertension), cerebrovascular disease, mobility disorders, or infections, followed by neuropsychiatric conditions. Conclusions: Dementia development in the real world is an intricate process involving various diseases or conditions, their co-occurrence, and sequential relationships. Using a machine learning approach, we identified 49 hierarchical disease triplet clusters with leading roles (cardio- or cerebrovascular disease) and supporting roles (mental conditions, locomotion difficulties, infections, and nonspecific neurological conditions) in dementia development. Further studies using data from other countries are needed to validate the prediction algorithms for dementia development, allowing the development of comprehensive strategies to prevent or care for dementia in the real world.
KW - cardiovascular
KW - cluster analysis
KW - condition
KW - data
KW - data set
KW - dementia
KW - development
KW - disease
KW - infection
KW - machine learning
KW - mental conditions
KW - mobility
KW - neuropsychiatric
KW - symptoms
UR - http://www.scopus.com/inward/record.url?scp=85165717412&partnerID=8YFLogxK
U2 - 10.2196/41858
DO - 10.2196/41858
M3 - Article
C2 - 37494081
AN - SCOPUS:85165717412
SN - 1438-8871
VL - 25
JO - Journal of Medical Internet Research
JF - Journal of Medical Internet Research
M1 - e41858
ER -