Building a new hybrid machine learning model for improvement insurance cross-sell prediction
Authors
-
Doan Gia Bao Ngoc
University of Economics and Law, Ho Chi Minh City; Vietnam National University, Ho Chi Minh City, VN
-
Luu Minh Quan
University of Economics and Law, Ho Chi Minh City; Vietnam National University, Ho Chi Minh City, VN
-
Truong Thi Thanh Ha
University of Economics and Law, Ho Chi Minh City; Vietnam National University, Ho Chi Minh City, VN
-
Nguyen Duc Minh Tan
University of Economics and Law, Ho Chi Minh City; Vietnam National University, Ho Chi Minh City, VN
-
Phan Thi Minh Huyen
University of Economics and Law, Ho Chi Minh City; Vietnam National University, Ho Chi Minh City, VN
-
Duy Thanh Tran
thanhtd@uel.edu.vn
University of Economics and Law, Ho Chi Minh City; Vietnam National University, Ho Chi Minh City, VN
https://orcid.org/0000-0003-0680-9452
DOI:
10.46223/HCMCOUJS.econ.en.16.1.4306.2026
Keywords:
Borderline-SMOTE; cross-sell prediction; decision tree; hybrid model; logistic regression; random forest; ROC-AUC; XGBoost
JEL Classification:
C53; E27; E37
Abstract
Amid rising competition in the insurance sector, optimizing cross-selling strategies is crucial for sustainable growth and requires a deep understanding of customer behavior. This study proposes a machine learning-driven framework for cross-sell prediction to enhance personalization, increase conversion rates, and maximize return on investment. Using 381,109 customer records from an insurance company, the data undergoes preprocessing steps including outlier treatment for Annual Premium, encoding categorical variables such as Gender and Vehicle Age, and standardizing numerical features like Age, Annual Premium, and Vintage. To address class imbalance in the Response variable, where only 12.26 percent of customers responded positively, Borderline-Synthetic Minority Over-sampling Technique (Borderline-SMOTE) is applied to generate synthetic samples and improve prediction accuracy. Four machine learning models, including Logistic Regression, Decision Tree, Random Forest, and XGBoost, are trained and evaluated using Accuracy, Receiver Operating Characteristic - Area Under the Curve (ROC-AUC), Mean Absolute Error, Mean Squared Error, and Root Mean Squared Error. Among these, XGBoost with Borderline-SMOTE achieves the best performance, with an accuracy of 0.84 and a ROC-AUC score of 0.8436, representing a significant improvement over the baseline XGBoost model with a ROC-AUC of 0.7768. Logistic Regression also improves, with its ROC-AUC increasing from 0.8250 to 0.8451. Visual analysis reveals behavioral patterns, such as a 25 percent purchase rate among customers with vehicles older than two years and a 20 percent rate among male customers with prior vehicle damage. The study delivers a high-performing predictive model to support targeted marketing efforts, potentially increasing cross-sell conversion rates by 5 to 10 percent. Future work will explore deep learning techniques and larger datasets to further enhance prediction capabilities.
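The oversampling step named in the abstract can be illustrated with a minimal pure-Python sketch of a Borderline-SMOTE1-style procedure. It is not the authors' implementation: the abstract does not state which variant, neighbor counts, or library were used, so the function names (`nearest`, `borderline_smote`), the parameters `m`, `k`, `n_new`, and `seed`, and the toy two-feature data below are all assumptions made for illustration. The idea shown is that only minority samples near the class boundary (at least half, but not all, of their nearest neighbors belong to the majority class) are used as seeds, and each synthetic sample is an interpolation between a seed and one of its minority-class neighbors.

```python
import math
import random


def nearest(point, candidates, k):
    """Return up to k candidates closest to `point` (Euclidean), excluding `point` itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point[0], c[0]))[:k]


def borderline_smote(samples, minority=1, m=5, k=5, n_new=1, seed=0):
    """Generate synthetic minority samples from borderline ("danger") seeds.

    samples: list of (features_tuple, label) pairs.
    m: neighbors (any class) used to decide whether a minority point is borderline.
    k: minority-class neighbors used as interpolation partners.
    n_new: synthetic samples created per borderline point.
    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    minority_pts = [s for s in samples if s[1] == minority]
    synthetic = []
    for p in minority_pts:
        neigh = nearest(p, samples, m)
        n_major = sum(1 for q in neigh if q[1] != minority)
        # Borderline: at least half, but not all, neighbors are majority class.
        # All-majority neighborhoods are treated as noise and skipped.
        if not (len(neigh) / 2 <= n_major < len(neigh)):
            continue
        mates = nearest(p, minority_pts, k)
        if not mates:
            continue
        for _ in range(n_new):
            q = rng.choice(mates)
            gap = rng.random()  # position along the segment from p to q
            new_x = tuple(a + gap * (b - a) for a, b in zip(p[0], q[0]))
            synthetic.append((new_x, minority))
    return synthetic


# Hypothetical toy data: label 0 = no response, label 1 = positive response.
majority = [((x, y), 0) for x in (0.0, 1.0, 2.0) for y in (0.0, 1.0)]
minority_pts = [((2.4, 0.5), 1), ((3.0, 0.0), 1), ((3.0, 1.0), 1),
                ((4.0, 0.0), 1), ((4.0, 1.0), 1)]
samples = majority + minority_pts

synthetic = borderline_smote(samples, minority=1, m=3, k=3, n_new=2, seed=0)
for features, label in synthetic:
    print(features, label)
```

In practice a maintained implementation such as the `BorderlineSMOTE` class in the imbalanced-learn package would normally be preferred over a hand-written loop; the sketch is only meant to make the selection and interpolation steps concrete.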
Received: 13-04-2025
Accepted: 02-06-2025
Published: 07-09-2025
How to Cite
Doan, N. G. B., Luu, Q. M., Truong, H. T. T., Nguyen, T. D. M., Phan, H. T. M., & Tran, D. T. (2025). Building a new hybrid machine learning model for improvement insurance cross-sell prediction. Ho Chi Minh City Open University Journal of Science - Economics and Business Administration, 16(1), 92–110. https://doi.org/10.46223/HCMCOUJS.econ.en.16.1.4306.2026
License
Copyright (c) 2025 Ngoc Gia Bao Doan; Quan Minh Luu; Ha Thi Thanh Truong; Tan Duc Minh Nguyen; Huyen Thi Minh Phan; Thanh Duy Tran

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
