Unforeseen climate change, complicated chronic diseases, and virus mutations have made knowing the length of stay (LOS) of patients with different diseases a top challenge for hospital administration. Hospital management does not know exactly when a current patient will leave the hospital, although this information is crucial because it would allow them to admit more patients. As a result, hospitals struggle to manage their available resources, and new patients struggle to gain admission for prompt treatment. A robust model is therefore needed to help hospital administration predict patients’ LOS. For this purpose, a very large dataset (more than 2.3 million patient records) from New York hospitals, covering a wide range of diseases including Bone-Marrow, Tuberculosis, Intestinal Transplant, Mental illness, Leukaemia, Spinal cord injury, Trauma, Rehabilitation, Kidney and Alcoholic Patients, HIV Patients, Malignant Breast disorder, Asthma, Respiratory distress syndrome, etc., was analyzed to predict the LOS. We selected six machine learning (ML) models: Multiple linear regression (MLR), Lasso regression (LR), Ridge regression (RR), Decision tree regression (DTR), Extreme gradient boosting regression (XGBR), and Random forest regression (RFR). Their predictive performance was evaluated using the R-square (RS) score and the mean square error (MSE). The RFR model showed the best predictive performance among all selected models, in terms of both RS score (92%) and MSE (5). From exploratory data analysis (EDA), we conclude that most stays lasted between 0 and 5 days, that the mean stay per patient was 5.3 days, and that patients older than 50 years spent more days in the hospital. Based on the average LOS, patients with diagnoses related to birth complications spent more days in the hospital than patients with other diseases. These findings could help predict the length of hospital stay of new patients, which will help the hospital administration estimate and manage their resources efficiently.

Just as any organization depends on up-to-date information to function smoothly, hospital administration needs up-to-date data about admitted patients and their stay in the hospital. Since emergency cases are increasing day by day worldwide due to climate change as of COVID-19 [

Machine learning (ML) has been widely used to predict the future based on the past behavior of data. A variety of ML models have been used to predict the LOS of the patients, including unsupervised and supervised ML models [

In the past, different ML techniques have been used to predict hospital LOS. Patients' stays in hospitals are expected to lengthen as cardiovascular diseases become more common and the population ages. This problem affects the healthcare system: hospitals face reduced bed capacity and, as a result, higher overall costs. To address this issue, in [

Walczak et al. used ANN techniques (i.e., Backpropagation (BP), Radial-basis-function (RBF), and Fuzzy ARTMAP) for predicting illness level and hospital LOS of trauma patients. They found out that combinations of BP and fuzzy ARTMAP produced optimal results [

A comparative analysis of existing techniques to predict the LOS has been shown in

Reference | ML models | Methodology | Results | Significance / Limitations / Suggestions |
---|---|---|---|---|
Bacchi et al. (2020), [ | SVM, LR, MTL, and RFR | Data collection, data preprocessing, train-test split in an 85:15 ratio, ML model implementation. | Highest accuracy achieved by SVM: 74 | Increase the size of the dataset. Build a more accurate ML model that could predict LOS and discharge destination. |
Nadeem et al. (2020), [ | BP, RBF | Data collection, data preprocessing, LOS prediction by different NN models. | A combination of BP and fuzzy ARTMAP produced optimal results. | A combination of BP and fuzzy ARTMAP is recommended for optimal results. |
Daghistani et al. (2019), [ | RFR, SVM, ANN, BN | Data collection, data preprocessing, feature selection by IG model, ML model implementation. | Highest accuracy achieved by RFR model: 80%. | Small number of features (20). This methodology can be used for hospital bed management and funds distribution. |
Chuang et al. (2018), [ | RFR, SVM, LGR | Data collection, data preprocessing, feature selection, LOS prediction by different ML techniques. | Effective variables | Data were collected from a single medical organization and focus on only one disease. |
Morton et al. (2014), [ | SVM, MTL, MLR, RF | Data collection, feature engineering, implementation of ML models, selection of a robust model. | Highest accuracy achieved by SVM: 68% | Use a more accurate feature selection algorithm, increase the dataset’s size, and investigate other ML models such as ANN, LR, etc. |
Patel et al. (2013), [ | LR | Data collection; LOS and inpatient mortality predicted by LR. | Mortality rate: 2.09%, median LOS: 2.77 days | The model could perform more accurately if data from different datasets were included. |
Yang et al. (2010), [ | SVM, LR | Data collection, feature selection, LOS prediction at different stages by ML models. | SVM model outperformed only in specific scenarios ( | Only the dataset of one burn centre in Taiwan was used. The model's performance could be strengthened with data collected from different burn centres. |

To make general recommendations to the hospital administration, we selected a large dataset, i.e., more than 2.3 million patients, covering a range of diseases including Heart Transplant, Lungs Transplant, Burn Patients, Bone Marrow Transplant, Mental illness diagnoses, Liver Transplant, Intestinal Transplant, Schizophrenia, Respiratory System Diagnosis, Acute Leukemia, Eating disorder, Bipolar disorder, Trauma, Spinal disorder & injuries, Rehabilitation, Kidney Patients, Alcoholic Patients, Dialysis Patients, Skin Patients, HIV Patients, Malignant Breast disorder, Asthma, Cardiac/Heart-Patient, Cancer, Illness Severity, Surgery, Accident Patients, Respiratory distress syndrome, Abnormal Patients, etc. The data come from New York hospitals and contain patient information such as duration of stay, gender, age, race, ethnicity, type of admission, discharge year, and some other essential variables. The main objectives of this study are to explore the dataset to find hidden patterns in the variables and to apply different supervised ML models in order to identify a robust model for predicting the hospital LOS of patients with different diseases. We also calculate feature importance scores with the RFR model to identify which features are relevant to the hospital length of stay.

The framework of the proposed study to predict the LOS of the patients is presented in

In this study, we have used Inpatient De-identified data from

Data analysis requires correct and complete data, because missing values negatively affect a model’s performance. For this purpose, the dataset used for this study was checked, and missing values were identified. It was noticed that among all the variables listed in

Variable | Variable type | Description | Correlation with LOS | Missing values (count) |
---|---|---|---|---|
Hospital Service Area | String | It means where the patient has been kept for care. | 0.008526 | 5155 |
Hospital County | String | It means a hospital located in which County. | 0.004111 | 5155 |
Operating Certificate Number | Integer | The authorized number for operation | 0.026698 | 5155 |
Permanent Facility ID | String | Facility ID assigned to a patient | 0.020610 | 5155 |
Facility Name | String | Assigned Centre | -0.006707 | 0 |
Age Group | Float64 | Distribution of patients in groups as per their ages, | 0.093445 | 0 |
Zipcode | Float64 | Zip code-3digit | -0.019186 | 39019 |
Gender | String | Male or female patient | 0.051873 | 0 |
Race | String | identification by color, | -0.039325 | 0 |
Ethnicity | String | The ethnicity of a patient: Spanish/Hispanic | -0.009946 | 0 |
Length of Stay | String | Count of days a patient stays in hospital | 1.000000 | 0 |
Type of Admission | String | Elective or emergency admission of a patient | 0.017997 | 0 |
Patient Disposition | String | Home of self-care/ skilled nursing home | 0.159197 | 0 |
Discharged Year | String | Year in which the patient was discharged | 0.00000 | 0 |
CCS Diagnosis Code | String | Diagnosis code assigned to a patient | -0.012021 | 0 |
CCS Diagnostic Description | String | Description of diagnosis of each patient | 0.036797 | 0 |
CCS Procedure Code | Integer | CCS procedure code of each patient | 0.058564 | 0 |
CCS Procedure Description | Integer | Description of CCS procedure of each patient | 0.071875 | 0 |
APR DRG Code | String | DRG code assigned to each patient | 0.043900 | 0 |
APR DRG Description | Integer | Description of APR DRG of each patient | 0.005905 | 0 |
APR MDC Code | String | MDC code assigned to each patient | 0.082670 | 0 |
APR MDC Description | Integer | Description of APR MDC of each patient | 0.002133 | 0 |
APR Severity of Illness Code | String | Illness code assigned to each patient | 0.326485 | 0 |
APR Severity of Illness Description | Integer | Description of APR illness | -0.239981 | 240 |
APR Risk of Mortality | String | Level of mortality risk: minor/moderate | -0.191293 | 240 |
APR Medical Surgical Description | Integer | The medical description of APR | 0.044086 | 0 |
Payment Typology 1 | String | Typology 1 payment method | 0.027721 | 0 |
Payment Typology 2 | String | Typology 2 payment method | 0.00000 | 878722 |
Payment Typology 3 | String | Typology 3 payment method | 0.00000 | 1737244 |
Birth Weight | String | Weight at the time of birth | 0.00000 | 2115685 |
Abortion Edit Indicator | String | Abortion edit indicator exists in case of each patient or not | 0.00000 | 0 |
Emergency Department Indicator | String | Emergency department indicator: yes/no | 0.052074 | 0 |
Total Charges | String | Total fee paid by a patient at the time of discharge | 0.466402 | 0 |
Total Costs | String | The total cost that hospital has to bear | 0.517272 | 0 |
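
The two numeric columns of the table above (missing-value counts and correlation with LOS) could be produced along the following lines. This is only a minimal sketch on a tiny made-up extract: the column names follow the table, but the values are invented, and the integer encoding of string variables via `pd.factorize` is an assumption, since the encoding used in the study is not stated.

```python
import pandas as pd

# Tiny hypothetical extract; column names follow the table, values are made up.
df = pd.DataFrame({
    "Length of Stay": [2, 5, 9, 3, 14, 1],
    "Total Costs": [1800.0, 4200.0, 9100.0, 2500.0, 15000.0, 900.0],
    "Gender": ["F", "M", "F", "M", "M", "F"],
    "Payment Typology 2": ["Medicare", None, None, "Private", None, None],
})

# Count of missing values per variable.
missing = df.isna().sum()

# Integer-encode the non-numeric columns (an assumption, see above) so that a
# Pearson correlation with LOS can be computed for every variable.
encoded = df.copy()
for col in encoded.select_dtypes(exclude="number"):
    encoded[col] = pd.factorize(encoded[col])[0]

corr_with_los = encoded.corr()["Length of Stay"]

print(missing)
print(corr_with_los.round(3))
```

Variables with a very high missing count, such as Birth Weight in the table, are natural candidates for removal before modelling.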

Exploratory data analysis (EDA) was used to analyze the dataset and summarize the dataset’s main variables [

As we can see from

Since LOS is the output variable, we kept it along the y-axis of the plots created for data visualization. In the dataset, the LOS of a patient who stayed more than four months is recorded as 120+. Because the exact number of days is not given, we replaced 120+ with 130 to avoid errors when handling LOS as a number.
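
The replacement described above can be sketched in pandas as follows. The toy column is hypothetical; only the label 120+ and the placeholder value 130 come from the text.

```python
import pandas as pd

# Toy stand-in for the "Length of Stay" column, which is stored as strings.
df = pd.DataFrame({"Length of Stay": ["3", "12", "120+", "1"]})

# Replace the censored label with the placeholder value 130,
# then convert the whole column to numbers.
los = df["Length of Stay"].replace({"120+": "130"})
df["Length of Stay"] = pd.to_numeric(los)

print(df["Length of Stay"].tolist())
```

Note that 130 is an arbitrary stand-in for "more than 120 days", so the longest stays are understated by an unknown amount.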

Univariate analysis (UA) was used to explore the variables of the dataset (Park, 2015). UA summarizes each variable of the dataset and identifies its hidden patterns. In this study, as we can see in

Next, we performed the bivariate analyses to check the relationship between independent and output variables (LOS) using bar graphs. We have displayed the bar graphs in

In contrast, the patients who belong to the “not available” category of admission spent the minimum number of days in the hospital on average. The average LOS based on APR risk of mortality is shown in
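
The bivariate summaries discussed here amount to the average LOS per category of an input variable. A minimal sketch, using invented values for one variable named in the text:

```python
import pandas as pd

# Hypothetical records; only the column names come from the dataset.
df = pd.DataFrame({
    "APR Risk of Mortality": ["Minor", "Extreme", "Minor", "Major", "Extreme", "Moderate"],
    "Length of Stay": [2, 15, 3, 9, 21, 5],
})

# Average LOS per category, sorted so the longest-staying group comes first.
avg_los = (
    df.groupby("APR Risk of Mortality")["Length of Stay"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_los)

# avg_los.plot(kind="bar") would draw the corresponding bar graph (needs matplotlib).
```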

Feature selection is an essential part of building a good model, because an ML model should be trained on the variables that carry information about the target. There were a total of 34 variables in the patient dataset. After cleaning the dataset and performing EDA, some variables were removed due to a high count of missing values. The EDA helped gain further insights into the data. We used the Mutual Information (MI) regression technique to check the mutual dependence between each input variable and the dependent variable (LOS). Information gain of all independent variables is shown in
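
As an illustration of MI-based scoring, the sketch below uses scikit-learn's `mutual_info_regression` on synthetic data. Whether the study used this exact function is not stated, and the two feature names are hypothetical. The example is built so that one feature influences the target non-linearly and the other not at all.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 2000
informative = rng.normal(size=n)  # drives the target
noise = rng.normal(size=n)        # unrelated to the target
X = np.column_stack([informative, noise])

# Non-linear dependence: a plain correlation coefficient would be close to
# zero here, while mutual information still detects the relationship.
y = 3.0 * informative**2 + rng.normal(scale=0.1, size=n)

mi = mutual_info_regression(X, y, random_state=0)
for name, score in zip(["informative", "noise"], mi):
    print(f"{name}: {score:.3f}")
```

This is one reason MI can complement the correlation values: it also captures dependence that is not linear.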

Since the output variable of the hospital dataset used in this study is a continuous numerical value, supervised ML regression algorithms were used to predict patients’ LOS. The chosen ML algorithms are MLR, LR, RR, DTR, XGBR, and RFR.

The Multiple linear regression (MLR) model is an extension of simple linear regression that predicts a numeric value using more than one independent variable [

where “

The Lasso regression (LR) model is a regularized form of the linear regression model that shrinks the regression coefficients. It sets to zero the coefficients of features that contribute little to the predictions, which results in a sparse model with fewer non-zero coefficients. As a result, the model becomes simpler and can perform better than the unregularized MLR model [

Ridge regression (RR) is another regularized form of the linear regression model that helps shrink the coefficients and reduce the model’s complexity. It also helps reduce multicollinearity. Unlike the LR model, the RR model does not shrink coefficients exactly to zero. Instead, it makes some coefficient values very low or close to zero, so the features that contribute little to the model end up with very small coefficients. As a result, the RR model helps reduce the overfitting that can arise with the MLR model [
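
The difference between the two penalties can be seen on a small synthetic problem. The sketch below is illustrative only: the data are generated, and the `alpha` values are arbitrary assumptions, not the settings used in the study.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))

# Only the first three features carry signal; the remaining seven are irrelevant.
true_coef = np.array([4.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=1.0, size=500)

models = {
    "MLR": LinearRegression(),
    "Lasso": Lasso(alpha=0.5),   # alpha chosen arbitrarily for illustration
    "Ridge": Ridge(alpha=10.0),  # alpha chosen arbitrarily for illustration
}

zero_counts = {}
for name, model in models.items():
    model.fit(X, y)
    # Count coefficients that are exactly zero after fitting.
    zero_counts[name] = int(np.sum(model.coef_ == 0.0))
    print(name, np.round(model.coef_, 2), "zeros:", zero_counts[name])
```

Lasso drops the irrelevant features entirely, whereas Ridge keeps all of them with small weights.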

Decision tree regression (DTR) is a widely used ML model for classification and regression problems. DTR builds a tree-shaped structure of variables. The DTR model breaks the data into smaller and smaller subsets while the associated decision tree is developed incrementally [

The Random forest regression (RFR) model is a collection of multiple decision trees. It is an estimator that fits several decision trees on subsamples of the data and averages their predictions to improve accuracy and control overfitting [
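
The effect of averaging many trees can be sketched by comparing a single fully grown tree with a forest on synthetic data. The data-generating function and the hyperparameters below are assumptions made for illustration and are unrelated to the hospital dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1500, 4))
y = 5 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=1.0, size=1500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

mse = {}
for name, model in [("DTR", tree), ("RFR", forest)]:
    mse[name] = (
        mean_squared_error(y_tr, model.predict(X_tr)),  # training error
        mean_squared_error(y_te, model.predict(X_te)),  # test error
    )
    print(f"{name}: train MSE={mse[name][0]:.3f}, test MSE={mse[name][1]:.3f}")
```

The single tree memorizes the training data (training error near zero) but generalizes worse than the forest, which is the overfitting that averaging is meant to control.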

For parameter tuning, cross-validation (CV) is a very useful technique in ML modeling, and it usually gives a more reliable performance estimate than the standard single validation set approach. It divides the data into k folds,
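
In standard k-fold CV, each fold serves once as the validation set while the remaining k-1 folds are used for training. The mechanics can be sketched with scikit-learn's `KFold`; ten toy samples and k = 5 are used here purely for readability.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # ten toy samples with two features each

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
seen_in_validation = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    fold_sizes.append((len(train_idx), len(val_idx)))
    seen_in_validation.extend(val_idx.tolist())
    print(f"fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```

Every sample appears in a validation fold exactly once, so the averaged score uses all of the data for both training and validation.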

After fitting the models, the next step is to measure the performances of the models. Two important performance measuring techniques,

Here n denotes the number of training points,

The second performance measurement technique is the R-square (RS) score, also known as the coefficient of determination. R-square typically has a value between 0 and 1, although it can be negative when a model fits worse than simply predicting the mean. RS tells us how well a line fits the data or how well a line follows the variations within a set of data [

The R-square score is computed as $R^2 = 1 - SS_{RES}/SS_{TOT}$, where $SS_{RES}$ denotes the sum of squares of residuals and $SS_{TOT}$ denotes the total sum of squares. An R-square value of 1 indicates a perfect fit, while a score of 0 indicates that the model was unable to fit the data, i.e., a poorly fitted model.
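
Both scores can be computed directly from their definitions: MSE is the mean of the squared residuals, and R-square is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up observed and predicted values; the scikit-learn functions are imported only so the hand computation can be checked against them.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up observed and predicted LOS values (in days).
y_true = np.array([3.0, 5.0, 2.0, 8.0, 7.0])
y_pred = np.array([2.5, 5.5, 2.0, 7.0, 8.0])

n = len(y_true)
mse = np.sum((y_true - y_pred) ** 2) / n          # mean of squared residuals

ss_res = np.sum((y_true - y_pred) ** 2)           # sum of squares of residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(f"MSE = {mse:.3f}, R-square = {r2:.3f}")
```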

After selecting essential variables of the dataset, six selected models,

Multiple linear regression (MLR), as mentioned before, was trained and validated using a 10-fold CV for the prediction of LOS. Then MLR model was used to predict the

The Lasso regression (LR) model was applied in a way very similar to the MLR model. It showed an MSE of 42.58 and an R-square score of 0.31 for the training data, and an MSE of 42.19 and an R-square score of 0.310 for the test data. Thus, for both training and testing, the MSE was even higher than for MLR and the R-square score was very low, indicating low model performance.

The Ridge regression (RR) model showed an MSE of 39 and an R-square score of 0.37 for the training data, and an MSE of 38.49 and an R-square score of 0.3711 for the test data. Since these results were also far from ideal, RR model performance was also low.

The Decision tree regression (DTR) model showed an MSE of 0.002 and an R-square score of 0.999 for the training data. For the test data, however, it showed an MSE of 5.93 and an R-square score of 0.903. Since these results were relatively close to the ideal, this model's performance was much better than that of MLR, LR, and RR.

The Extreme gradient boosting regression (XGBR) model showed an MSE of 5.32 and an R-square score of 0.914 for the training data, and an MSE of 5.62 and an R-square score of 0.908 for the test data. As these values indicate, XGBR performed better than all the previous models.

The Random forest regression (RFR) model was applied in the same way as the other models. It showed an MSE of 0.76 and an R-square score of 0.987 for the training data, and an MSE of 5 and an R-square score of 0.92 for the test data. These results indicate the superior predictive performance of the RFR model compared to the other models.
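
The evaluation procedure applied to the models above (10-fold CV, scored by MSE and R-square) can be sketched with scikit-learn as follows. Several things here are assumptions: the data are synthetic, all hyperparameters are arbitrary, and scikit-learn's `GradientBoostingRegressor` stands in for XGBR so that the sketch needs no extra dependency. The printed numbers therefore have no relation to the results reported in this study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(600, 5))
y = 4 * np.sin(2 * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=600)

models = {
    "MLR": LinearRegression(),
    "LR (Lasso)": Lasso(alpha=0.1),
    "RR": Ridge(alpha=1.0),
    "DTR": DecisionTreeRegressor(random_state=0),
    # Stand-in for XGBR (assumption): gradient boosting from scikit-learn.
    "GBR": GradientBoostingRegressor(random_state=0),
    "RFR": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv = KFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    scores = cross_validate(
        model, X, y, cv=cv, scoring=("neg_mean_squared_error", "r2")
    )
    # scikit-learn reports negated MSE, so flip the sign back.
    results[name] = (
        -scores["test_neg_mean_squared_error"].mean(),
        scores["test_r2"].mean(),
    )
    print(f"{name:12s} MSE={results[name][0]:6.2f}  R2={results[name][1]:.3f}")
```

On this non-linear synthetic target the tree-based models clearly beat the linear ones, which mirrors the qualitative pattern reported here without reproducing the actual figures.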

We have seen that MLR, LR, and RR models could not perform well, as indicated by large MSE and small R-square scores. However, the other two models, i.e., DTR and XGBR, were better in terms of these performance measures, as presented in

From

Performance scores of supervised machine learning regression models

Supervised regression model | Training MSE | Training RS | Test MSE | Test RS |
---|---|---|---|---|
MLR | 39 | 0.37 | 38.49 | 0.371 |
LR | 42.58 | 0.31 | 42.19 | 0.310 |
RR | 39 | 0.37 | 38.49 | 0.3711 |
DTR | 0.002 | 0.999 | 5.93 | 0.903 |
XGBR | 5.30 | 0.914 | 5.62 | 0.908 |
RFR | 0.76 | 0.987 | 5 | 0.92 |

Feature importance is a technique used to identify which features (variables) are relevant in making predictions. Feature importance scores were calculated using the RFR model [
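
A minimal sketch of obtaining such scores from a fitted random forest is given below, using the `feature_importances_` attribute of scikit-learn's `RandomForestRegressor`. The feature names echo the dataset, but the values and the synthetic target are invented, so the resulting ranking is illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 1000

# Hypothetical features: only the names echo the dataset, the values are synthetic.
X = pd.DataFrame({
    "Total Costs": rng.gamma(shape=2.0, scale=5000.0, size=n),
    "Age Group": rng.integers(0, 5, size=n),
    "Gender": rng.integers(0, 2, size=n),
})

# Synthetic LOS driven mostly by cost and slightly by age group.
y = X["Total Costs"] / 2000.0 + 0.5 * X["Age Group"] + rng.normal(scale=0.5, size=n)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importance = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importance.round(3))
```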

The main objectives of this study were to explore the Inpatient De-identified data and to build a robust model that could predict the hospital LOS of future patients. Predicting hospital length of stay will help hospitals estimate the resources available for patients and manage them efficiently. EDA with the help of graphs was performed to develop essential insights from the data. From the EDA, we conclude that most stays lasted between 0 and 5 days, that the mean stay per patient was 5.3 days, and that patients older than 50 years spent more days in the hospital. Based on the average LOS, it was also observed that patients with diagnoses related to birth complications spent more days in the hospital than patients with other diseases. Six ML models were employed and evaluated using the 10-fold CV approach: Multiple linear regression (MLR), Lasso regression (LR), Ridge regression (RR), Decision tree regression (DTR), Extreme gradient boosting regression (XGBR), and Random forest regression (RFR). The results showed that RFR was the best model in terms of both R-square and MSE, followed by XGBR. The feature importance scores revealed the relevance of three primary variables, Total Costs, CCS Diagnosis Code, and Total Charges, for predicting the LOS. Based on this study, we recommend that future work involve more variables from the given dataset to build a model that predicts hospital LOS more accurately.

Thanks to the supervisor and co-authors for their valuable guidance and support.