Liu, Z., Fauvel, C., Lin, S., Correa-Jaque, P., Webb, A., Vanderpool, R., Kanwar, M., Kraisangka, J., Mathur, P., Perer, A., Everett, A., & Benza, R. L. “Pulmonary Arterial Hypertension Risk Assessment model using Random Forrest and Bayesian Network”.

Background and rationale: Existing risk assessment models in pulmonary arterial hypertension (PAH) predict survival well. However, they typically assume that variables are independent of one another and linearly associated with outcomes, which limits their ability to account for relationships among variables. To overcome these limitations, we built a clinical risk assessment model using machine learning methods.
Methods and results: We harmonized the clinical data measured at baseline from seven adult PAH trials: GRIPHON, SERAPHIN, EARLY, COMPASS-2, COMPASS-3, MAESTRO and TRANSIT-1. The harmonized data comprised 2,870 subjects (mean age 43 years, 77% female, 50% idiopathic or familial PAH) and 125 clinical variables, with a mortality rate of 14%. We split the full data set into 80% training and 20% test data. Using the training data, we assessed the importance of each variable in predicting time-to-mortality using Random Forest. Sixteen variables with an importance value greater than 0.0015 were selected to construct a Bayesian network (BN) (Figure) for predicting 1-year survival status (the primary outcome). The thresholds used to discretize continuous variables were determined based on clinical knowledge. The BN achieved an AUC of 0.85 on the test data set. In 5-fold cross-validation, the average AUC was 0.77.
Conclusion: Machine learning provides powerful new methods for building PAH risk assessment models that take the interdependence among variables into account.
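
Illustrative sketch: the data-handling steps named in the Methods (80/20 split, selection of variables above an importance cut-off, threshold-based discretization, and AUC evaluation) could look roughly like the following in plain Python. This is not the authors' code. The function names, the random seed, the rank-based AUC formulation, and every variable name and cut-off value in the usage note are assumptions made for illustration; the Random Forest and Bayesian network fitting themselves are not shown.

```python
import random
from bisect import bisect_right


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as test data."""
    rng = random.Random(seed)  # the seed is an assumption, for reproducibility
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def select_variables(importance, cutoff=0.0015):
    """Keep variables whose importance score is strictly above the cut-off."""
    return sorted(name for name, score in importance.items() if score > cutoff)


def discretize(value, thresholds):
    """Map a continuous value to a bin index given ascending thresholds.

    A value equal to a threshold falls into the upper bin (an assumption).
    """
    return bisect_right(thresholds, value)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5).

    Equals the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `train_test_split(list(range(2870)))` yields 2,296 training and 574 test rows, and `select_variables({"var_a": 0.002, "var_b": 0.0015})` keeps only the hypothetical `var_a`, because the cut-off is applied as a strict inequality. The pairwise AUC is quadratic in the number of cases, which is acceptable for a test set of this size; a sort-based version would be preferable for larger data.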
