robpy.covariance

Base

class robpy.covariance.base.RobustCovariance(*, store_precision: bool = True, assume_centered: bool = False, nans_allowed: bool = False)[source]

Bases: EmpiricalCovariance

Base class for robust covariance estimators.

Parameters:
  • store_precision (boolean, optional) – Whether to store the precision matrix. Defaults to True.

  • assume_centered (boolean, optional) – Whether the data is already centered. Defaults to False.

  • nans_allowed (boolean, optional) – Whether NaN values are allowed. Defaults to False.

calculate_covariance(X) ndarray[source]
property correlation: ndarray
property covariance: ndarray
distance_distance_plot(chi2_percentile: float = 0.975, figsize: tuple[int, int] = (4, 4))[source]
fit(X: ndarray | DataFrame) RobustCovariance[source]

Fit the covariance estimator.

Parameters:

X (np.ndarray | pd.DataFrame) – Data matrix.

Cell MCD

class robpy.covariance.cellmcd.CellMCD(*, alpha: float = 0.75, quantile: float = 0.99, crit: float = 0.0001, max_c_steps: int = 100, min_eigenvalue: float = 0.0001, verbosity: int = 30)[source]

Bases: RobustCovariance

Cell MCD estimator based on the algorithm proposed in Raymaekers, J., & Rousseeuw, P. J. (2024).

Parameters:
  • alpha (float, optional) – Proportion of cells that must remain unflagged in each column. Must lie between 0.5 and 1.0. Defaults to 0.75.

  • quantile (float, optional) – Cutoff value to flag cells. Defaults to 0.99.

  • crit (float, optional) – Stop iterating when successive covariance matrices of the standardized data differ by less than crit. Defaults to 1e-4.

  • max_c_steps (int, optional) – Maximum number of C-steps used in the algorithm. Defaults to 100.

  • min_eigenvalue (float, optional) – Lower bound on the minimum eigenvalue of the covariance estimator on the standardized data. Should be at least 1e-6. Defaults to 1e-4.

References

  • Raymaekers, J., & Rousseeuw, P. J. (2024). The cellwise minimum covariance determinant estimator. Journal of the American Statistical Association, 119(548), 2610-2621.

calculate_covariance(X: ndarray) ndarray[source]
cell_MCD_plot(variable: int, variable_name: str = 'variable', row_names: list | None = None, second_variable: int | None = None, second_variable_name: str = 'second variable', plottype: Literal['indexplot', 'residuals_vs_variable', 'residuals_vs_predictions', 'variable_vs_predictions', 'bivariate'] = 'indexplot', figsize: tuple[int, int] = (8, 8), annotation_quantile: float | None = None)[source]

Plot the results of a cell MCD analysis using one of five types of diagnostic plots.

Parameters:
  • variable (int) – Index of the variable under consideration.

  • variable_name (str, optional) – Name of the variable of interest for the axis label. Defaults to “variable”.

  • row_names (list of strings, optional) – Row names of the observations, used to annotate the outliers with their names.

  • second_variable (int, optional) – Index of the second variable under consideration, only needed for plottype “bivariate”.

  • second_variable_name (str, optional) – Name of the second variable for the axis label, only relevant for plottype “bivariate”. Defaults to “second variable”.

  • plottype (Literal string, optional) –

    • “indexplot”: plots the residuals of a variable versus the case numbers.

    • “residuals_vs_variable”: plots the residuals of a variable versus the variable itself.

    • “residuals_vs_predictions”: plots the residuals of a variable versus the predictions of that variable.

    • “variable_vs_predictions”: plots a variable against its predictions.

    • “bivariate”: plots two variables against each other.

    Defaults to “indexplot”.

  • figsize (tuple[int,int], optional) – Size of the figure. Defaults to (8,8).

  • annotation_quantile (float | None, optional) – The quantile used to draw an imaginary threshold around the data. Only points outside these thresholds will be annotated. If None, use self.quantile.

class robpy.covariance.initial_ddcw.InitialDDCW(*, alpha: float = 0.75, min_eigenvalue: float = 0.0001, verbosity: int = 30)[source]

Bases: RobustCovariance

Calculates the initial robust scatter and location estimates for the CellMCD, described in the Supplementary Material to Raymaekers, J., & Rousseeuw, P. J. (2024). The code is based on the function cellWise:::DDCWcov in R.

Parameters:
  • alpha (float, optional) – Proportion of cells that must remain unflagged in each column. Must lie between 0.5 and 1.0. Defaults to 0.75.

  • min_eigenvalue (float, optional) – Lower bound on the minimum eigenvalue of the covariance estimator on the standardized data. Should be at least 1e-6. Defaults to 1e-4.

References

  • Raymaekers, J., & Rousseeuw, P. J. (2024). The cellwise minimum covariance determinant estimator. Journal of the American Statistical Association, 119(548), 2610-2621.

calculate_covariance(X: ndarray)[source]

Kendall’s Tau

class robpy.covariance.kendall.KendallTau[source]

Bases: RobustCovariance

Estimate a covariance matrix using Kendall’s tau pairwise correlation.

calculate_covariance(X) ndarray[source]
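
As an illustration of the pairwise Kendall's tau correlation mentioned above, here is a minimal NumPy sketch. It is not robpy's implementation: the helper name `kendall_tau_matrix` is hypothetical, it computes tau-a without tie correction, and it stops at the correlation matrix because the text above does not say how the estimator rescales correlations into covariances.

```python
import numpy as np


def kendall_tau_matrix(X):
    """Pairwise Kendall tau-a correlation of the columns of X (hypothetical helper).

    No tie correction is applied; memory use grows as n * n * p.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # signs[i, j, k] = sign(X[i, k] - X[j, k]) for every ordered pair of rows (i, j)
    signs = np.sign(X[:, None, :] - X[None, :, :])
    # Summing sign products over ordered pairs gives 2 * (concordant - discordant)
    concordance = np.einsum("ijk,ijl->kl", signs, signs)
    return concordance / (n * (n - 1))


rng = np.random.default_rng(0)
x = rng.normal(size=20)
# Column 1 is a monotone increasing function of column 0, column 2 a decreasing one
X_demo = np.column_stack([x, 2 * x + 1, -x])
tau = kendall_tau_matrix(X_demo)
```

Because Kendall's tau depends only on the ordering of the values, a single extreme observation can change each entry by only a bounded amount.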

Minimum Covariance Determinant

class robpy.covariance.mcd.DetMCD(*, alpha: float | int | None = None, tolerance: float = 1e-08, correct_covariance: bool = True, reweighting: bool = True, verbosity: int = 30)[source]

Bases: RobustCovariance

Deterministic MCD estimator (DetMCD) based on the algorithm proposed in Hubert, M., Rousseeuw, P. J., & Verdonck, T. (2012).

Parameters:
  • alpha (float | int | None, optional) – Size of the h subset. If an integer between n/2 and n is passed, it is interpreted as h. If a float between 0.5 and 1 is passed, it is interpreted as a proportion of n (the training set size). If None or an integer below [(n+p+1)/2], h is set to [(n+p+1)/2]. Defaults to None.

  • tolerance (float, optional) – Convergence tolerance: the C-steps stop once the determinant changes by less than this value between two iterations. Defaults to 1e-8.

  • correct_covariance (bool, optional) – Whether to apply a consistency correction to the raw covariance estimate. Defaults to True.

  • reweighting (bool, optional) – Whether to apply reweighting to the raw covariance estimate. Defaults to True.

References

  • Hubert, M., Rousseeuw, P. J., & Verdonck, T. (2012). A deterministic algorithm for robust location and scatter. Journal of Computational and Graphical Statistics, 21(3), 618-637.

calculate_covariance(X: ndarray) ndarray[source]
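
The rules for turning alpha into the subset size h can be sketched as a small function. This is only a reading of the parameter description above, not the library's code: the name `resolve_h` is hypothetical, rounding a proportion down to an integer is an assumption, and so is raising an error for values outside the documented ranges.

```python
import math


def resolve_h(alpha, n, p):
    """Hypothetical helper: derive the h-subset size from alpha, n and p."""
    h_min = (n + p + 1) // 2  # [(n + p + 1) / 2], the smallest subset size used
    if alpha is None:
        return h_min
    if isinstance(alpha, int):
        if alpha > n:
            # Assumption: the documentation does not say what happens here
            raise ValueError("an integer alpha cannot exceed n")
        # Integers below h_min fall back to h_min
        return max(alpha, h_min)
    if 0.5 <= alpha <= 1.0:
        # Assumption: the proportion of n is rounded down
        return math.floor(alpha * n)
    raise ValueError("a float alpha must lie between 0.5 and 1")


h_default = resolve_h(None, n=100, p=4)
h_integer = resolve_h(80, n=100, p=4)
h_too_small = resolve_h(40, n=100, p=4)
h_proportion = resolve_h(0.75, n=100, p=4)
```
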
class robpy.covariance.mcd.FastMCD(*, alpha: float | int | None = None, n_initial_subsets: int = 500, n_initial_c_steps: int = 2, n_best_subsets: int = 10, n_partitions: int | None = None, tolerance: float = 1e-08, correct_covariance: bool = True, reweighting: bool = True, verbosity: int = 30, store_precision=True, assume_centered=False, random_seed: int | None = None)[source]

Bases: RobustCovariance

Fast MCD estimator based on the algorithm proposed in Rousseeuw, P. J., & Van Driessen, K. (1999).

Parameters:
  • alpha (float | int | None, optional) – Size of the h subset. If an integer between n/2 and n is passed, it is interpreted as h. If a float between 0.5 and 1 is passed, it is interpreted as a proportion of n (the training set size). If None or an integer below [(n+p+1)/2], h is set to [(n+p+1)/2]. Defaults to None.

  • n_initial_subsets (int, optional) – Number of initial random subsets of size p+1. Defaults to 500.

  • n_initial_c_steps (int, optional) – Number of initial c steps to perform on all initial subsets. Defaults to 2.

  • n_best_subsets (int, optional) – Number of best subsets to keep and perform c steps on until convergence. Defaults to 10.

  • n_partitions (int, optional) – Number of partitions to split the data into. This can speed up the algorithm for large datasets (n > 600 suggested in paper). If None, 5 partitions are used if n > 600, otherwise 1 partition is used.

  • tolerance (float, optional) – Convergence tolerance: the C-steps stop once the determinant changes by less than this value between two iterations. Defaults to 1e-8.

  • correct_covariance (bool, optional) – Whether to apply a consistency correction to the raw covariance estimate. Defaults to True.

  • reweighting (bool, optional) – Whether to apply reweighting to the raw covariance estimate. Defaults to True.

  • random_seed (int | None, optional) – Can be used to provide a random seed. Defaults to None.

References

  • Rousseeuw, P. J., & Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3), 212-223.

calculate_covariance(X) ndarray[source]
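
To make the C-step terminology in the parameters above concrete, here is a minimal NumPy sketch of the concentration step from Rousseeuw & Van Driessen (1999), started from one random subset of size p+1. It is an illustration under simplifying assumptions, not robpy's implementation: the helper names, the simulated data and the single starting subset are all hypothetical, and partitioning, consistency correction and reweighting are left out.

```python
import numpy as np


def c_step(X, subset, h):
    """One concentration step: fit on `subset`, keep the h closest observations."""
    location = X[subset].mean(axis=0)
    scatter = np.cov(X[subset], rowvar=False)
    centered = X - location
    # Squared Mahalanobis distance of every observation to the current fit
    d2 = np.einsum("ij,ij->i", centered @ np.linalg.inv(scatter), centered)
    return np.sort(np.argsort(d2)[:h])


def subset_determinant(X, subset):
    """Determinant of the sample covariance of the selected rows."""
    return float(np.linalg.det(np.cov(X[subset], rowvar=False)))


rng = np.random.default_rng(0)
n, p, h = 200, 3, 150
X = rng.normal(size=(n, p))
X[:20] += 8.0  # shift 10% of the rows to act as outliers
tolerance = 1e-8

# Start from p + 1 random points, then concentrate to an h-subset
start = rng.choice(n, size=p + 1, replace=False)
subset = c_step(X, start, h)
dets = [subset_determinant(X, subset)]
for _ in range(100):
    subset = c_step(X, subset, h)
    dets.append(subset_determinant(X, subset))
    if dets[-2] - dets[-1] < tolerance:
        break  # the determinant no longer decreases noticeably
```

Each C-step can only lower the determinant or leave it unchanged, which is why the loop can stop as soon as the decrease falls below the tolerance.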
class robpy.covariance.mcd.HSubset(indices: numpy.ndarray, location: numpy.ndarray, scale: numpy.ndarray, determinant: float, n_c_steps: int = 0)[source]

Bases: object

determinant: float
indices: ndarray
location: ndarray
n_c_steps: int = 0
scale: ndarray

Orthogonalized Gnanadesikan-Kettenring

class robpy.covariance.ogk.OGK(*, store_precision=True, assume_centered=False, location_estimator: ~robpy.univariate.base.LocationOrScaleEstimator = <function median>, scale_estimator: ~robpy.univariate.base.LocationOrScaleEstimator = <function median_abs_deviation>, n_iterations: int = 2, reweighting: bool = False, reweighting_beta: float = 0.9)[source]

Bases: RobustCovariance

Implementation of the Orthogonalized Gnanadesikan-Kettenring estimator for location and dispersion proposed in Maronna, R. A., & Zamar, R. H. (2002).

Parameters:
  • store_precision (boolean, optional) – Whether to store the precision matrix. Defaults to True.

  • assume_centered (boolean, optional) – Whether the data is already centered. Defaults to False.

  • location_estimator (LocationOrScaleEstimator, optional) – Function that estimates the location of the data. It should accept an array-like input as its first argument and a keyword argument axis. Defaults to np.median.

  • scale_estimator (LocationOrScaleEstimator, optional) – Function that estimates the scale of the data. It should accept an array-like input as its first argument and a keyword argument axis. Defaults to median_abs_deviation.

  • n_iterations (int, optional) – Number of iterations for the orthogonalization step. Defaults to 2.

  • reweighting (boolean, optional) – Whether to apply reweighting at the end (i.e. calculating regular location and covariance after filtering outliers based on Mahalanobis distance using OGK estimates). Defaults to False.

  • reweighting_beta (float, optional) – Quantile of chi-squared distribution to use as cutoff for the reweighting. Defaults to 0.9.

References

  • Maronna, R. A., & Zamar, R. H. (2002). Robust estimates of location and dispersion for high-dimensional datasets. Technometrics, 44(4), 307-317.

calculate_covariance(X) ndarray[source]
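
The calling convention required of location_estimator and scale_estimator (array-like first argument, keyword argument axis) can be illustrated with the pairwise Gnanadesikan-Kettenring identity that OGK builds on. This is a sketch only: `mad_scale` and `gk_covariance` are hypothetical helpers, the 1.4826 consistency factor is an assumption, and the orthogonalization and reweighting steps of the full estimator are not shown.

```python
import numpy as np


def mad_scale(x, axis=0):
    """Hypothetical scale estimator following the (array, axis=...) convention."""
    x = np.asarray(x, dtype=float)
    med = np.median(x, axis=axis, keepdims=True)
    # Assumption: 1.4826 makes the MAD consistent for normally distributed data
    return 1.4826 * np.median(np.abs(x - med), axis=axis)


def gk_covariance(x, y, scale_estimator=mad_scale):
    """Pairwise covariance from scales: (s(x + y)^2 - s(x - y)^2) / 4."""
    s_plus = scale_estimator(x + y, axis=0)
    s_minus = scale_estimator(x - y, axis=0)
    return (s_plus**2 - s_minus**2) / 4.0


rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(size=500)

# With the standard deviation as scale, the identity gives the classical covariance
classical = gk_covariance(x, y, scale_estimator=np.std)
# With a robust scale, the same identity gives a robust pairwise covariance
robust = gk_covariance(x, y)
```

Any function with this signature, such as np.std or a trimmed scale, can be swapped in without changing the rest of the computation.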

Wrapping Covariance

class robpy.covariance.wrapping.WrappingCovariance(b: float = 1.5, c: float = 4.0, q1: float = 1.540793, q2: float = 0.8622731, store_precision: bool = True, assume_centered: bool = False)[source]

Bases: RobustCovariance

Covariance estimator based on the wrapping function proposed in Raymaekers, J., & Rousseeuw, P. J. (2021).

The wrapping transformation is defined as follows:

\[\begin{split}\Psi_{b, c}(z) = \begin{cases} z & \text{if } \ 0 \leq |z| < b, \\ q_1 \tanh\left(q_2 (c - |z|)\right) \mathrm{sign}(z) & \text{if } \ b \leq |z| \leq c,\\ 0 & \text{if } \ c < |z|. \end{cases}\end{split}\]

The data is first scaled using the median and the MAD before applying the transformation.

Next, the robust covariance of X is computed from the classical covariance on the transformed data:

\[\mathrm{Cov}\left(\mathrm{Median}(X) + \mathrm{MAD}(X) \cdot \Psi_{b, c}\left(\frac{X - \mathrm{Median}(X)}{\mathrm{MAD}(X)}\right)\right)\]

Parameters:
  • b (float, optional) – Lower cutoff. Defaults to 1.5.

  • c (float, optional) – Upper cutoff. Defaults to 4.0.

  • q1 (float, optional) – Transformation parameter (see formula). Defaults to 1.540793.

  • q2 (float, optional) – Transformation parameter (see formula). Defaults to 0.8622731.

  • store_precision (bool, optional) – Whether to store the precision matrix. Defaults to True.

  • assume_centered (bool, optional) – Whether the data is already centered. Defaults to False.

References

  • Raymaekers, J., & Rousseeuw, P. J. (2021). Fast robust correlation for high-dimensional data. Technometrics, 63(2), 184-198.

calculate_covariance(X: ndarray) ndarray[source]

Calculate the covariance matrix of the data X after applying the wrapping transformation.

Parameters:

X (np.ndarray) – Data to calculate the covariance matrix of, must have shape (n, p).

Returns:

Robust covariance matrix of X.

Return type:

np.ndarray
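
The wrapping transformation and the covariance formula given for WrappingCovariance can be sketched in NumPy as follows. This is a hedged illustration rather than the library's code: the function names are hypothetical, and the 1.4826 consistency factor on the MAD is an assumption, since the text only says that the data is scaled using the median and the MAD.

```python
import numpy as np


def wrap(z, b=1.5, c=4.0, q1=1.540793, q2=0.8622731):
    """Wrapping function: identity below b, tanh taper on [b, c], zero beyond c."""
    z = np.asarray(z, dtype=float)
    abs_z = np.abs(z)
    tail = q1 * np.tanh(q2 * (c - abs_z)) * np.sign(z)
    return np.where(abs_z < b, z, np.where(abs_z <= c, tail, 0.0))


def wrapping_covariance_sketch(X):
    """Classical covariance of the wrapped data, mapped back to the original scale."""
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    # Assumption: the MAD is multiplied by 1.4826 for consistency at the normal
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    wrapped = med + mad * wrap((X - med) / mad)
    return np.cov(wrapped, rowvar=False)


rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 4))
X_demo[:10, 0] = 50.0  # a few gross outliers in the first column
cov_wrapped = wrapping_covariance_sketch(X_demo)
grid_values = wrap(np.linspace(-10.0, 10.0, 1001))
```

With the default constants the two branches meet at |z| = b, so the transformation is continuous, and values beyond c contribute nothing to the covariance.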