Nowadays, in many different fields, massive data are available and for several reasons, it might be convenient to analyze just a subset of the data. The application of the D-optimality criterion can be helpful to optimally select a subsample of observations. However, it is well known that D-optimal support points lie on the boundary of the design space and if they go hand in hand with extreme response values, they can have a severe influence on the estimated linear model (leverage points with high influence). To overcome this problem, firstly, we propose a non-informative “exchange” procedure that enables us to select a “nearly” D-optimal subset of observations without high leverage values. Then, we provide an informative version of this exchange procedure, where besides high leverage points also the outliers in the responses (that are not necessarily associated to high leverage points) are avoided. This is possible because, unlike other design situations, in subsampling from big datasets the response values may be available. Finally, both the non-informative and informative selection procedures are adapted to I-optimality, with the goal of getting accurate predictions.
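The idea of a leverage-constrained exchange can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes a simple linear model with intercept and one covariate, computes leverages on the full dataset, and uses a plain best-improvement swap. The function names and the `max_leverage` threshold are hypothetical and chosen for illustration only.

```python
import random


def det_information(xs):
    """Determinant of X'X for the model y = b0 + b1*x (rows of X are (1, x))."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    return n * sxx - sx * sx


def leverages(xs):
    """Leverage h_i = 1/N + (x_i - mean)^2 / sum_j (x_j - mean)^2 in the full data."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    return [1.0 / n + (x - mean) ** 2 / ss for x in xs]


def exchange_subsample(xs, n, max_leverage, seed=0):
    """Pick n indices with leverage <= max_leverage, improving det(X'X) by swaps.

    Assumption: the leverage cap is applied once, on the full dataset, and the
    search stops at the first subset that no single swap can improve.
    """
    rng = random.Random(seed)
    h = leverages(xs)
    allowed = [i for i in range(len(xs)) if h[i] <= max_leverage]
    if len(allowed) < n:
        raise ValueError("not enough low-leverage points for the requested size")
    chosen = rng.sample(allowed, n)  # random feasible starting subsample
    while True:
        current = det_information([xs[i] for i in chosen])
        tol = 1e-9 * max(1.0, abs(current))
        best_swap = None  # (determinant, position in chosen, incoming index)
        for pos in range(n):
            for j in allowed:
                if j in chosen:
                    continue
                trial = chosen[:pos] + [j] + chosen[pos + 1:]
                d = det_information([xs[i] for i in trial])
                if d > current + tol and (best_swap is None or d > best_swap[0]):
                    best_swap = (d, pos, j)
        if best_swap is None:
            return sorted(chosen)  # no single exchange improves the criterion
        chosen[best_swap[1]] = best_swap[2]
```

An informative variant in the spirit of the abstract would additionally drop candidates whose responses look like outliers before the exchange starts; how that screening is done in the paper is not stated here, so it is left out of the sketch.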

Deldossi, L., Pesce, E., Tommasi, C., Accounting for outliers in optimal subsampling methods, Statistical Papers, 2023; (64): 1119-1135. [doi:10.1007/s00362-023-01422-3] [https://hdl.handle.net/10807/233890]

Accounting for outliers in optimal subsampling methods

Deldossi, Laura (first author)
2023

English

Use this identifier to cite or link to this document: https://hdl.handle.net/10807/233890