robpy.covariance
Base
- class robpy.covariance.base.RobustCovariance(*, store_precision: bool = True, assume_centered: bool = False, nans_allowed: bool = False)[source]
Bases: EmpiricalCovariance
Base class for robust covariance estimators.
- Parameters:
store_precision (boolean, optional) – Whether to store the precision matrix. Defaults to True.
assume_centered (boolean, optional) – Whether the data is already centered. Defaults to False.
nans_allowed (boolean, optional) – Whether NaNs are allowed in the data. Defaults to False.
- property correlation: ndarray
- property covariance: ndarray
- fit(X: ndarray | DataFrame) → RobustCovariance[source]
Fit the covariance estimator.
- Parameters:
X (np.ndarray | pd.DataFrame) – Data matrix.
Cell MCD
- class robpy.covariance.cellmcd.CellMCD(*, alpha: float = 0.75, quantile: float = 0.99, crit: float = 0.0001, max_c_steps: int = 100, min_eigenvalue: float = 0.0001, verbosity: int = 30)[source]
Bases: RobustCovariance
Cell MCD estimator based on the algorithm proposed in Raymaekers, J., & Rousseeuw, P. J. (2024).
- Parameters:
alpha (float, optional) – Proportion of cells that must remain unflagged in each column. Must lie between 0.5 and 1.0. Defaults to 0.75.
quantile (float, optional) – Cutoff value to flag cells. Defaults to 0.99.
crit (float, optional) – Stop iterating when successive covariance matrices of the standardized data differ by less than crit. Defaults to 1e-4.
max_c_steps (int, optional) – Maximum number of C-steps used in the algorithm. Defaults to 100.
min_eigenvalue (float, optional) – Lower bound on the minimum eigenvalue of the covariance estimator on the standardized data. Should be at least 1e-6. Defaults to 1e-4.
References
Raymaekers, J., & Rousseeuw, P. J. (2024). The cellwise minimum covariance determinant estimator. Journal of the American Statistical Association, 119(548), 2610-2621.
- cell_MCD_plot(variable: int, variable_name: str = 'variable', row_names: list | None = None, second_variable: int | None = None, second_variable_name: str = 'second variable', plottype: Literal['indexplot', 'residuals_vs_variable', 'residuals_vs_predictions', 'variable_vs_predictions', 'bivariate'] = 'indexplot', figsize: tuple[int, int] = (8, 8), annotation_quantile: float | None = None)[source]
Plot the results of a cell MCD analysis. Five types of diagnostic plots are available.
- Parameters:
variable (int) – Index of the variable under consideration.
variable_name (str, optional) – Name of the variable of interest for the axis label. Defaults to “variable”.
row_names (list of strings, optional) – Row names of the observations, used to annotate the outliers with their names.
second_variable (int, optional) – Index of the second variable under consideration, only needed for plottype “bivariate”.
second_variable_name (str, optional) – Name of the second variable for the axis label, only relevant for plottype “bivariate”. Defaults to “second variable”.
plottype (Literal string, optional) – One of:
“indexplot”: plots the residuals of a variable versus the case numbers.
“residuals_vs_variable”: plots the residuals of a variable versus the variable itself.
“residuals_vs_predictions”: plots the residuals of a variable versus the predictions of that variable.
“variable_vs_predictions”: plots a variable against its predictions.
“bivariate”: plots two variables against each other.
Defaults to “indexplot”.
figsize (tuple[int,int], optional) – Size of the figure. Defaults to (8,8).
annotation_quantile (float | None, optional) – The quantile used to draw an imaginary threshold around the data. Only points outside these thresholds will be annotated. If None, use self.quantile.
- class robpy.covariance.initial_ddcw.InitialDDCW(*, alpha: float = 0.75, min_eigenvalue: float = 0.0001, verbosity: int = 30)[source]
Bases: RobustCovariance
Calculates the initial robust scatter and location estimates for the CellMCD, described in the Supplementary Material to Raymaekers, J., & Rousseeuw, P. J. (2024). The code is based on the function cellWise:::DDCWcov in R.
- Parameters:
alpha (float, optional) – Proportion of cells that must remain unflagged in each column. Must lie between 0.5 and 1.0. Defaults to 0.75.
min_eigenvalue (float, optional) – Lower bound on the minimum eigenvalue of the covariance estimator on the standardized data. Should be at least 1e-6. Defaults to 1e-4.
References
Raymaekers, J., & Rousseeuw, P. J. (2024). The cellwise minimum covariance determinant estimator. Journal of the American Statistical Association, 119(548), 2610-2621.
Kendall’s Tau
- class robpy.covariance.kendall.KendallTau[source]
Bases: RobustCovariance
Estimate a covariance matrix using Kendall’s tau pairwise correlation.
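As a rough illustration of the pairwise statistic involved, the sketch below computes Kendall’s tau for two variables in plain Python. It uses the tau-a variant, which applies no correction for ties. The function name kendall_tau_a is hypothetical and is not part of robpy; how KendallTau turns the pairwise correlations into a covariance matrix is not shown here.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Hypothetical helper for illustration only; not a robpy function.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    score = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            score += 1  # concordant pair
        elif s < 0:
            score -= 1  # discordant pair (ties contribute 0)
    n_pairs = len(x) * (len(x) - 1) // 2
    return score / n_pairs


# The extreme value 100.0 does not change the ranks, so tau is unaffected.
x = [1.0, 2.0, 3.0, 4.0, 100.0]
y = [2.0, 4.0, 5.0, 9.0, 10.0]
tau = kendall_tau_a(x, y)
```

Because the statistic depends only on the ordering of the observations, a single extreme value has limited influence on it.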
Minimum Covariance Determinant
- class robpy.covariance.mcd.DetMCD(*, alpha: float | int | None = None, tolerance: float = 1e-08, correct_covariance: bool = True, reweighting: bool = True, verbosity: int = 30)[source]
Bases: RobustCovariance
Deterministic MCD estimator (DetMCD) based on the algorithm proposed in Hubert, M., Rousseeuw, P. J., & Verdonck, T. (2012).
- Parameters:
alpha (float | int | None, optional) – Size of the h subset. If an integer between n/2 and n is passed, it is interpreted as h. If a float between 0.5 and 1 is passed, it is interpreted as a proportion of n (the training set size). If None or an integer below [(n+p+1)/2], h is set to [(n+p+1)/2]. Defaults to None.
tolerance (float, optional) – Minimum difference in determinant between two iterations to stop the C-step. Defaults to 1e-8.
correct_covariance (bool, optional) – Whether to apply a consistency correction to the raw covariance estimate. Defaults to True.
reweighting (bool, optional) – Whether to apply reweighting to the raw covariance estimate. Defaults to True.
References
Hubert, M., Rousseeuw, P. J., & Verdonck, T. (2012). A deterministic algorithm for robust location and scatter. Journal of Computational and Graphical Statistics, 21(3), 618-637.
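The rules for alpha described above can be sketched as a small helper. This is an illustration under stated assumptions, not the library’s code: the name resolve_h is hypothetical, and rounding a float proportion down and then clamping it to [(n+p+1)/2] are assumptions, since the rounding convention is not stated above.

```python
import math


def resolve_h(alpha, n, p):
    """Map `alpha` to an h-subset size for n observations and p variables.

    Hypothetical helper; the float rounding and clamping are assumptions.
    """
    h_min = (n + p + 1) // 2  # [(n+p+1)/2]
    if alpha is None:
        return h_min
    if isinstance(alpha, bool):
        raise TypeError("alpha must be a float, an int or None")
    if isinstance(alpha, int):
        if alpha > n:
            raise ValueError("an integer alpha cannot exceed n")
        # Integers below [(n+p+1)/2] fall back to [(n+p+1)/2].
        return max(alpha, h_min)
    if 0.5 <= alpha <= 1.0:
        # Assumption: interpret as a proportion of n, rounded down.
        return max(math.floor(alpha * n), h_min)
    raise ValueError("a float alpha must lie between 0.5 and 1")


h_default = resolve_h(None, n=100, p=3)
h_fraction = resolve_h(0.75, n=100, p=3)
h_integer = resolve_h(80, n=100, p=3)
```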
- class robpy.covariance.mcd.FastMCD(*, alpha: float | int | None = None, n_initial_subsets: int = 500, n_initial_c_steps: int = 2, n_best_subsets: int = 10, n_partitions: int | None = None, tolerance: float = 1e-08, correct_covariance: bool = True, reweighting: bool = True, verbosity: int = 30, store_precision=True, assume_centered=False, random_seed: int | None = None)[source]
Bases: RobustCovariance
Fast MCD estimator based on the algorithm proposed in Rousseeuw, P. J., & Van Driessen, K. (1999).
- Parameters:
alpha (float | int | None, optional) – Size of the h subset. If an integer between n/2 and n is passed, it is interpreted as h. If a float between 0.5 and 1 is passed, it is interpreted as a proportion of n (the training set size). If None or an integer below [(n+p+1)/2], h is set to [(n+p+1)/2]. Defaults to None.
n_initial_subsets (int, optional) – Number of initial random subsets of size p+1. Defaults to 500.
n_initial_c_steps (int, optional) – Number of initial C-steps to perform on all initial subsets. Defaults to 2.
n_best_subsets (int, optional) – Number of best subsets to keep and perform C-steps on until convergence. Defaults to 10.
n_partitions (int | None, optional) – Number of partitions to split the data into. This can speed up the algorithm for large datasets (the paper suggests n > 600). If None, 5 partitions are used if n > 600, otherwise 1 partition is used. Defaults to None.
tolerance (float, optional) – Minimum difference in determinant between two iterations to stop the C-step. Defaults to 1e-8.
correct_covariance (bool, optional) – Whether to apply a consistency correction to the raw covariance estimate. Defaults to True.
reweighting (bool, optional) – Whether to apply reweighting to the raw covariance estimate. Defaults to True.
random_seed (int | None, optional) – Can be used to provide a random seed. Defaults to None.
References
Rousseeuw, P. J., & Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3), 212-223.
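To make the role of the C-step and the tolerance parameter more concrete, here is a minimal sketch for univariate data, where the covariance determinant reduces to the variance of the subset. It is a simplification for illustration, not the FastMCD implementation: the function names and the max_steps argument are hypothetical, and the random subsets, partitioning, consistency correction and reweighting are all left out.

```python
import statistics


def subset_stats(values, subset):
    """Mean and (population) variance of the points indexed by `subset`."""
    pts = [values[i] for i in subset]
    center = statistics.fmean(pts)
    return center, statistics.pvariance(pts, mu=center)


def c_step(values, subset, h):
    """One C-step: refit on `subset`, then keep the h points closest to the fit."""
    center, _ = subset_stats(values, subset)
    order = sorted(range(len(values)), key=lambda i: abs(values[i] - center))
    return sorted(order[:h])


def run_c_steps(values, subset, h, tolerance=1e-8, max_steps=100):
    """Iterate C-steps until the variance decreases by less than `tolerance`."""
    _, det = subset_stats(values, subset)
    for _ in range(max_steps):
        new_subset = c_step(values, subset, h)
        _, new_det = subset_stats(values, new_subset)
        if det - new_det < tolerance:
            return new_subset, new_det
        subset, det = new_subset, new_det
    return subset, det


values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 50.0, 55.0]
# Deliberately poor start: the initial subset contains both outliers.
final_subset, final_det = run_c_steps(values, subset=[3, 4, 5, 6, 7], h=5)
```

Starting from a subset that includes both outlying values, the iteration settles on a subset of regular observations after two steps.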
Orthogonalized Gnanadesikan-Kettenring
- class robpy.covariance.ogk.OGK(*, store_precision=True, assume_centered=False, location_estimator: ~robpy.univariate.base.LocationOrScaleEstimator = <function median>, scale_estimator: ~robpy.univariate.base.LocationOrScaleEstimator = <function median_abs_deviation>, n_iterations: int = 2, reweighting: bool = False, reweighting_beta: float = 0.9)[source]
Bases: RobustCovariance
Implementation of the Orthogonalized Gnanadesikan-Kettenring estimator for location and dispersion proposed in Maronna, R. A., & Zamar, R. H. (2002).
- Parameters:
store_precision (boolean, optional) – Whether to store the precision matrix. Defaults to True.
assume_centered (boolean, optional) – Whether the data is already centered. Defaults to False.
location_estimator (LocationOrScaleEstimator, optional) – Function to estimate the location of the data. It should accept an array-like input as its first argument and a keyword argument axis. Defaults to np.median.
scale_estimator (LocationOrScaleEstimator, optional) – Function to estimate the scale of the data. It should accept an array-like input as its first argument and a keyword argument axis. Defaults to median_abs_deviation.
n_iterations (int, optional) – Number of iterations for the orthogonalization step. Defaults to 2.
reweighting (boolean, optional) – Whether to apply reweighting at the end (i.e. calculating regular location and covariance after filtering outliers based on Mahalanobis distance using OGK estimates). Defaults to False.
reweighting_beta (float, optional) – Quantile of chi-squared distribution to use as cutoff for the reweighting. Defaults to 0.9.
References
Maronna, R. A., & Zamar, R. H. (2002). Robust estimates of location and dispersion for high-dimensional datasets. Technometrics, 44(4), 307-317.
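The pairwise building block behind the estimator is the Gnanadesikan-Kettenring identity cov(X, Y) = (σ(X + Y)² − σ(X − Y)²) / 4, with a robust scale in place of the standard deviation. The sketch below shows only that step in plain Python, assuming the raw (unscaled) MAD as the scale. The names mad and gk_covariance are hypothetical, and the orthogonalization, iteration and reweighting steps of OGK are not shown.

```python
import statistics


def mad(values):
    """Raw median absolute deviation (no consistency factor applied)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def gk_covariance(x, y, scale=mad):
    """Pairwise covariance via the Gnanadesikan-Kettenring identity.

    Hypothetical helper for illustration; `scale` is any univariate scale
    estimator taking a list of floats.
    """
    s_plus = scale([a + b for a, b in zip(x, y)])
    s_minus = scale([a - b for a, b in zip(x, y)])
    return (s_plus ** 2 - s_minus ** 2) / 4.0


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
robust_cov = gk_covariance(x, y)
# With the population standard deviation as the scale, the identity
# reproduces the classical (population) covariance.
classical_cov = gk_covariance(x, y, scale=statistics.pstdev)
```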
Wrapping Covariance
- class robpy.covariance.wrapping.WrappingCovariance(b: float = 1.5, c: float = 4.0, q1: float = 1.540793, q2: float = 0.8622731, store_precision: bool = True, assume_centered: bool = False)[source]
Bases: RobustCovariance
Covariance estimator based on the wrapping function proposed in Raymaekers, J., & Rousseeuw, P. J. (2021).
The wrapping transformation is defined as follows:
\[\Psi_{b, c}(z) = \begin{cases} z & \text{if } \ 0 \leq |z| < b, \\ q_1 \tanh\left(q_2 (c - |z|)\right) \mathrm{sign}(z) & \text{if } \ b \leq |z| \leq c,\\ 0 & \text{if } \ c < |z|. \end{cases}\]
The data is first scaled using the median and the MAD before applying the transformation.
Next, the robust covariance of X is computed from the classical covariance on the transformed data:
\[\mathrm{Cov}\left(\mathrm{Median}(X) + \mathrm{MAD}(X) \cdot \Psi_{b, c}\left(\frac{X - \mathrm{Median}(X)}{\mathrm{MAD}(X)}\right)\right)\]
- Parameters:
b (float, optional) – Lower cutoff. Defaults to 1.5.
c (float, optional) – Upper cutoff. Defaults to 4.0.
q1 (float, optional) – Transformation parameter (see formula). Defaults to 1.540793.
q2 (float, optional) – Transformation parameter (see formula). Defaults to 0.8622731.
store_precision (bool, optional) – Whether to store the precision matrix. Defaults to True.
assume_centered (bool, optional) – Whether the data is already centered. Defaults to False.
References
Raymaekers, J., & Rousseeuw, P. J. (2021). Fast robust correlation for high-dimensional data. Technometrics, 63(2), 184-198.
- calculate_covariance(X: ndarray) → ndarray[source]
Calculate the covariance matrix of the data X after applying the wrapping transformation.
- Parameters:
X (np.ndarray) – Data for which to calculate the covariance matrix. Must have shape (n, p).
- Returns:
Robust covariance matrix of X.
- Return type:
np.ndarray
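The wrapping transformation and the rescaling step described for WrappingCovariance can be sketched for a single column in plain Python. This is an illustration under stated assumptions rather than the library’s code: the names psi and wrap_column are hypothetical, and the MAD consistency factor 1.4826 is an assumption that is not stated above.

```python
import math
import statistics


def psi(z, b=1.5, c=4.0, q1=1.540793, q2=0.8622731):
    """Wrapping function: identity near 0, smoothly redescending to 0 beyond c."""
    a = abs(z)
    if a < b:
        return z
    if a <= c:
        return math.copysign(q1 * math.tanh(q2 * (c - a)), z)
    return 0.0


def wrap_column(values, consistency=1.4826):
    """Standardize with median/MAD, wrap, and map back to the original scale.

    Hypothetical helper; the consistency factor is an assumption.
    """
    med = statistics.median(values)
    mad = consistency * statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the column cannot be standardized")
    return [med + mad * psi((v - med) / mad) for v in values]


column = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
wrapped = wrap_column(column)
```

With the default parameters the two branches of psi meet at |z| = b, so the transformation is continuous there. In this example the value 100.0 lies far beyond the upper cutoff and is mapped to the column median, while the remaining values are left unchanged up to floating-point rounding. The classical covariance would then be computed on columns transformed this way.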