Dissertations

Information-Based Subsampling

Author: Xing Wei

Date: 3/08/2022

Executive Summary:
As massive quantities of data are produced across many fields, subsampling has attracted considerable research attention because it lets practitioners extract useful information and knowledge more efficiently. Subsampling saves computational resources without losing important information, and it is essential for extraordinarily large datasets that run into computational limitations. Various methods have been developed in this area, including both supervised and unsupervised techniques. Supervised methods integrate information by sampling data from the regions most crucial for modeling the desired input-output relationship, while unsupervised methods make the best use of the design matrix. This work focuses on unsupervised subsampling techniques to tackle the more general case in which the response variable is not available. We develop a novel information-based subsampling method built on the D-optimality criterion. This method can minimize the estimation error and prediction error in the linear regression setting, and the D-optimality criterion has various applications in the design of experiments (DOE). Our starting point is information-based optimal subdata selection (IBOSS) [Wang et al., 2018], which is superior to subsampling-based methods, sometimes by orders of magnitude. We first propose principal component IBOSS (PCIBOSS), which improves on the IBOSS method by handling the correlation structures found in real data applications. We then focus on the D-optimality maximization problem and find multiple ways to solve it. First, we observe that the D-optimality subsampling problem can be cast as a determinant maximization problem that can be solved using Lagrange multiplier and interior point methods. Second, we find the upper bound of the information and propose several algorithms to achieve D-optimality. Further, we show the benefit of a prescreening procedure before subsampling when the data contain noise.
We compare the performance of the proposed algorithms with simple random sampling and the IBOSS method under various settings via extensive simulation studies. In a real data example, we also demonstrate the superior predictive performance of our proposed algorithm.
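To make the D-optimality idea concrete, the following is a minimal sketch, not the dissertation's own code, of an IBOSS-style selection rule in the spirit of Wang et al.: for each covariate in turn, take the rows with the most extreme values among those not yet selected. It then compares the log-determinant of the information matrix, log det(Z'Z) with an intercept column in Z, against a simple random sample of the same size. The function names `iboss_select` and `log_det_information`, the simulated heavy-tailed data, and the sizes `n`, `p`, and `k` are all assumptions chosen for illustration; the proposed PCIBOSS and determinant-maximization algorithms are not reproduced here.

```python
import numpy as np


def iboss_select(X, k):
    """Return row indices chosen by an IBOSS-style rule (illustrative sketch).

    For each covariate, pick the r smallest and r largest values among the
    rows that have not been selected yet, where r = k // (2 * p).
    """
    n, p = X.shape
    r = k // (2 * p)
    if r < 1 or 2 * r * p > n:
        raise ValueError("k must satisfy 2 * p <= k and the subsample must fit in n")
    available = np.ones(n, dtype=bool)
    chosen = []
    for j in range(p):
        idx = np.flatnonzero(available)
        # A full sort keeps the sketch simple; a partial sort would be cheaper.
        order = idx[np.argsort(X[idx, j])]
        picks = np.concatenate([order[:r], order[-r:]])
        chosen.append(picks)
        available[picks] = False
    return np.concatenate(chosen)


def log_det_information(X):
    """log det(Z'Z) for a linear model with intercept, Z = [1, X]."""
    Z = np.column_stack([np.ones(len(X)), X])
    _, logdet = np.linalg.slogdet(Z.T @ Z)
    return logdet


# Simulated heavy-tailed covariates (an assumption made for this demo).
rng = np.random.default_rng(0)
n, p, k = 100_000, 5, 200
X = rng.standard_t(df=3, size=(n, p))

iboss_idx = iboss_select(X, k)
srs_idx = rng.choice(n, size=k, replace=False)

iboss_logdet = log_det_information(X[iboss_idx])
srs_logdet = log_det_information(X[srs_idx])
print(f"log det, IBOSS-style subsample: {iboss_logdet:.2f}")
print(f"log det, simple random sample:  {srs_logdet:.2f}")
```

A larger log-determinant corresponds to a smaller generalized variance of the least-squares estimator, which is the sense in which a D-optimal subsample is more informative. Note that the selection uses only the design matrix and never the response, matching the unsupervised setting described above.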