Dissertations

On Improving Classification and Conditional Probability Estimation (CPE) for Imbalanced Data: Thresholding, Calibration, Performance Evaluation and Tree-Based Enhancement Methods

Author: Jiangnan Lyu

Date: 11/01/2022

Executive Summary:
Class imbalance is one of the biggest obstacles in supervised machine learning. It poses a dilemma: rare events are usually of greater interest and costlier to misclassify, yet they are hard to learn from because of their low representation among the observations. Besides classification, conditional probability estimation (CPE) is another essential task in supervised learning. Accurate probability estimates are vital in decision-making because of their interpretability and informativeness. A primary concern with most current methods for imbalanced data is their over-attention to classification and neglect of CPE. This thesis aims to show that CPE is a more demanding, yet more inclusive and essential, task than classification.

The thesis is organized as follows. We first provide a literature review of the imbalanced-class problem and a thorough reference guide for choosing appropriate performance evaluation metrics. We then formulate a general calibration procedure for response-based sampling and propose two adaptive nonparametric calibration methods: AOL-Platt (Platt scaling with the Aranda-Ordaz link) and MSP-Iso (isotonic regression with monotone smoothing P-splines). We also propose a tree-based synthetic minority oversampling technique (treeSMOTE) and an enhanced random forest algorithm that exploits tree split depth to improve both classification and CPE performance. Further discussion of, and thoughts on, the imbalanced data problem are also provided.
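For context on the calibration step, a standard prior-correction identity (a general fact about response-based sampling, not necessarily the exact procedure formulated in the thesis) relates probabilities estimated on an undersampled training set back to the original population. Assuming the minority class (y = 1) is kept in full and each majority-class observation is retained independently with probability β:

```latex
% Relationship between the posterior p_s(x) learned on the undersampled data
% and the population posterior p(x), for majority-class keep rate \beta.
% Standard result; shown only to illustrate why calibration is needed.
p_s(x) = \frac{p(x)}{p(x) + \beta\,\bigl(1 - p(x)\bigr)}
\qquad\Longleftrightarrow\qquad
p(x) = \frac{\beta\, p_s(x)}{\beta\, p_s(x) - p_s(x) + 1}.
```

When β = 1 (no sampling) the correction reduces to the identity; the more aggressively the majority class is thinned (β → 0), the more the raw estimates p_s(x) overstate the minority-class probability, which is exactly the gap a calibration procedure must close.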
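As a point of reference for the two proposed calibrators, the classical methods they generalize can be sketched in a few lines. The snippet below is only a baseline illustration using scikit-learn (the data and parameters are placeholders, and this is not the thesis's adaptive methodology): Platt scaling is a logistic fit on held-out scores, and isotonic regression is a monotone nonparametric fit.

```python
# Classical calibration baselines: Platt scaling and isotonic regression.
# Sketch only; AOL-Platt and MSP-Iso extend these, but are not shown here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 5% positives.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Base classifier whose raw scores need calibration.
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
scores_cal = clf.predict_proba(X_cal)[:, 1]

# Platt scaling: logistic regression fitted to held-out scores.
platt = LogisticRegression().fit(scores_cal.reshape(-1, 1), y_cal)

# Isotonic regression: monotone, nonparametric mapping from scores to probabilities.
iso = IsotonicRegression(out_of_bounds="clip").fit(scores_cal, y_cal)

# Calibrated probabilities for a few new scores.
new_scores = clf.predict_proba(X_cal[:5])[:, 1]
print(platt.predict_proba(new_scores.reshape(-1, 1))[:, 1])
print(iso.predict(new_scores))
```

AOL-Platt is described as replacing the fixed logistic link with the Aranda-Ordaz link family, and MSP-Iso as smoothing the isotonic step function with monotone P-splines; those extensions are beyond this baseline sketch.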
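Similarly, the oversampling baseline that treeSMOTE and the depth-aware random forest enhancement are positioned against can be sketched with ordinary SMOTE followed by a standard random forest. This uses the imbalanced-learn package on synthetic toy data and is not the proposed treeSMOTE.

```python
# Baseline pipeline: ordinary SMOTE oversampling + standard random forest.
# Illustration only; the thesis's treeSMOTE and depth-aware forest differ.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# SMOTE interpolates synthetic minority samples between nearest minority neighbors.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_res, y_res)
prob_minority = rf.predict_proba(X)[:, 1]  # conditional probability estimates
```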