robpy.covariance

Base

class robpy.covariance.base.RobustCovariance(*, store_precision=True, assume_centered=False, nans_allowed=False)[source]

Bases: EmpiricalCovariance

calculate_covariance(X) → ndarray[source]

property correlation: ndarray

property covariance: ndarray

distance_distance_plot(chi2_percentile: float = 0.975, figsize: tuple[int, int] = (4, 4))[source]

fit(X: ndarray | DataFrame) → RobustCovariance[source]: Fit the covariance estimator.

Cell MCD

class robpy.covariance.cellmcd.CellMCD(*, alpha: float = 0.75, quantile: float = 0.99, crit: float = 0.0001, max_c_steps: int = 100, min_eigenvalue: float = 0.0001, verbosity: int = 30)[source]

Bases: RobustCovariance

Cell MCD estimator based on the algorithm proposed in Raymaekers and Rousseeuw (2023).

Parameters:

alpha (float, optional) – Proportion of cells that must remain unflagged in each column. Defaults to 0.75.
quantile (float, optional) – Cutoff value to flag cells. Defaults to 0.99.
crit (float, optional) – Stop iterating when successive covariance matrices of the standardized data differ by less than crit. Defaults to 1e-4.
max_c_steps (int, optional) – Maximum number of C-steps used in the algorithm. Defaults to 100.
min_eigenvalue (float, optional) – Lower bound on the minimum eigenvalue of the covariance estimator on the standardized data. Should be at least 1e-6. Defaults to 1e-4.

References

Raymaekers and Rousseeuw, The Cellwise Minimum Covariance Determinant Estimator, 2023, Journal of the American Statistical Association.

calculate_covariance(X: ndarray) → ndarray[source]

cell_MCD_plot(variable: int, variable_name: str = 'variable', row_names: list | None = None, second_variable: int | None = None, second_variable_name: str = 'second variable', plottype: Literal['indexplot', 'residuals_vs_variable', 'residuals_vs_predictions', 'variable_vs_predictions', 'bivariate'] = 'indexplot', figsize: tuple[int, int] = (8, 8), annotation_quantile: float | None = None)[source]

Function to plot the results of a cellMCD analysis: 5 types of diagnostic plots.

Parameters:

plottype (Literal string, optional) – Type of diagnostic plot. Defaults to “indexplot”.
“indexplot”: plots the residuals of a variable.
“residuals_vs_variable”: plots a variable versus its residuals.
“residuals_vs_predictions”: plots the predictions of a variable versus its residuals.
“variable_vs_predictions”: plots a variable against its predictions.
“bivariate”: plots two variables against each other.
variable (int) – Index of the variable under consideration.
variable_name (str, optional) – Name of the variable of interest for the axis label. Defaults to “variable”.
second_variable (int, optional) – Index of the second variable under consideration, only needed for plottype “bivariate”. Defaults to None.
second_variable_name (str, optional) – Name of the second variable for the axis label, only relevant for plottype “bivariate”. Defaults to “second variable”.
row_names (list of strings, optional) – Row names of the observations, used to annotate the outliers with their names.
figsize (tuple[int,int], optional) – Size of the figure. Defaults to (8,8).
annotation_quantile (float, optional) – Quantile used to draw an imaginary threshold around the data. Only points outside this threshold are annotated. If None, self.quantile is used.

class robpy.covariance.initial_ddcw.InitialDDCW(*, alpha: float = 0.75, min_eigenvalue: float = 0.0001)[source]

Bases: RobustCovariance

Calculates the initial robust scatter and location estimates for the CellMCD. Described in the Supplementary Material to Raymaekers and Rousseeuw 2023.

Code based on cellWise:::DDCWcov in R.

Parameters:

alpha (float, optional) – Proportion of cells that must remain unflagged in each column. Defaults to 0.75.
min_eigenvalue (float, optional) – Lower bound on the minimum eigenvalue of the covariance estimator on the standardized data. Should be at least 1e-6. Defaults to 1e-4.

References

Raymaekers and Rousseeuw, The Cellwise Minimum Covariance Determinant Estimator, 2023, Journal of the American Statistical Association.

calculate_covariance(X: ndarray)[source]

Calculates the initial cellwise robust estimates of location and scatter using an adaptation of DDC.

Parameters: X (np.ndarray) – scaled data set

[based on cellWise:::DDCWcov]

Kendall’s Tau

class robpy.covariance.kendall.KendallTau(*, store_precision=True, assume_centered=False, nans_allowed=False)[source]

Bases: RobustCovariance

Estimate the covariance matrix using Kendall’s tau correlation.

calculate_covariance(X) → ndarray[source]
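As an illustration of the underlying statistic only, the following NumPy sketch computes a pairwise Kendall's tau correlation matrix. It is not robpy's implementation: the helper names kendall_tau and kendall_correlation_matrix are hypothetical, the tau-a variant (no tie correction) is an assumption, and the step that turns these correlations into a covariance matrix is not shown because the text above does not specify it.

```python
import numpy as np


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Sign of every pairwise difference; the product is +1 for a
    # concordant pair, -1 for a discordant pair and 0 for a tie.
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    # Every unordered pair appears twice in the n x n matrix,
    # so dividing by n * (n - 1) normalizes by the number of pairs.
    return (sx * sy).sum() / (n * (n - 1))


def kendall_correlation_matrix(X):
    """Symmetric matrix of pairwise Kendall's tau values for the columns of X."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    R = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            R[j, k] = R[k, j] = kendall_tau(X[:, j], X[:, k])
    return R
```

Because the statistic depends only on the ordering of the observations, a single extreme value can change at most the pairs it takes part in, which is what makes it attractive as a robust building block.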

Minimum Covariance Determinant

class robpy.covariance.mcd.DetMCD(*, alpha: float | int | None = None, tolerance: float = 1e-08, correct_covariance: bool = True, reweighting: bool = True, verbosity: int = 30)[source]

Bases: RobustCovariance

Deterministic MCD estimator (DetMCD) based on the algorithm proposed in Hubert, Rousseeuw and Verdonck (2012)

Parameters:

alpha (float | int | None, optional) – size of the h subset. If an integer between n/2 and n is passed, it is interpreted as an absolute value. If a float between 0.5 and 1 is passed, it is interpreted as a proportion of n (the training set size). If None, it is set to (n+p+1) / 2. Defaults to None.
tolerance (float, optional) – Convergence threshold for the C-steps: iteration stops when the determinant changes by less than this amount between two consecutive steps. Defaults to 1e-8.
correct_covariance (bool, optional) – Whether to apply a consistency correction to the raw covariance estimate
reweighting (bool, optional) – Whether to apply reweighting to the raw covariance estimate
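The three ways of specifying alpha can be sketched as a small helper. This is an illustration of the rule stated above, not robpy's code: the name resolve_h is hypothetical, and rounding down for the float and None cases is an assumption, since the text does not say how non-integer values are rounded.

```python
def resolve_h(alpha, n, p):
    """Translate alpha into an h-subset size for n observations and p variables."""
    if alpha is None:
        # Default subset size (n + p + 1) / 2, rounded down (assumption).
        return (n + p + 1) // 2
    if isinstance(alpha, int):
        # Integers are taken as an absolute subset size.
        if not n / 2 <= alpha <= n:
            raise ValueError("integer alpha must lie between n/2 and n")
        return alpha
    if isinstance(alpha, float) and 0.5 <= alpha <= 1:
        # Floats are taken as a proportion of n, rounded down (assumption).
        return int(alpha * n)
    raise ValueError("alpha must be None, an int in [n/2, n] or a float in [0.5, 1]")
```

Note that under this rule the integer 1 and the float 1.0 mean different things: the former is an absolute size, the latter requests the full data set.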

References

Hubert, Rousseeuw and Verdonck, A deterministic algorithm for robust location and scatter, 2012, Journal of Computational and Graphical Statistics

calculate_covariance(X: ndarray) → ndarray[source]

class robpy.covariance.mcd.FastMCD(*, alpha: float | int | None = None, n_initial_subsets: int = 500, n_initial_c_steps: int = 2, n_best_subsets: int = 10, n_partitions: int | None = None, tolerance: float = 1e-08, correct_covariance: bool = True, reweighting: bool = True, verbosity: int = 30, store_precision=True, assume_centered=False)[source]

Bases: RobustCovariance

Fast MCD estimator based on the algorithm proposed in Rousseeuw and Van Driessen (1999)

Parameters:

alpha (float | int | None, optional) – size of the h subset. If an integer between n/2 and n is passed, it is interpreted as an absolute value. If a float between 0.5 and 1 is passed, it is interpreted as a proportion of n (the training set size). If None, it is set to (n+p+1) / 2. Defaults to None.
n_initial_subsets (int, optional) – number of initial random subsets of size p+1
n_initial_c_steps (int, optional) – number of initial c steps to perform on all initial subsets
n_best_subsets (int, optional) – number of best subsets to keep and perform c steps on until convergence
n_partitions (int, optional) – Number of partitions to split the data into. This can speed up the algorithm for large datasets (n > 600 suggested in the paper). If None, 5 partitions are used if n > 600, otherwise 1 partition is used.
tolerance (float, optional) – Convergence threshold for the C-steps: iteration stops when the determinant changes by less than this amount between two consecutive steps. Defaults to 1e-8.
correct_covariance (bool, optional) – Whether to apply a consistency correction to the raw covariance estimate
reweighting (bool, optional) – Whether to apply reweighting to the raw covariance estimate
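The C-step that the parameters above refer to can be sketched in a few lines of NumPy. This is a simplified illustration, not robpy's implementation: the function names and the max_steps argument are hypothetical, and the sketch starts from a random h-subset rather than from the p+1 point subsets and partitions that FastMCD uses.

```python
import numpy as np


def c_step(X, indices):
    """One concentration step: refit on the subset, keep the h closest points."""
    h = len(indices)
    subset = X[indices]
    location = subset.mean(axis=0)
    scatter = np.cov(subset, rowvar=False)
    centered = X - location
    # Squared Mahalanobis distance of every observation to the subset fit.
    d2 = np.einsum("ij,ij->i", centered @ np.linalg.inv(scatter), centered)
    return np.argsort(d2)[:h]


def subset_determinant(X, indices):
    """Determinant of the sample covariance of the selected rows."""
    return np.linalg.det(np.cov(X[indices], rowvar=False))


def iterate_c_steps(X, indices, tolerance=1e-8, max_steps=100):
    """Repeat C-steps until the determinant improves by less than tolerance."""
    det = subset_determinant(X, indices)
    for _ in range(max_steps):
        new_indices = c_step(X, indices)
        new_det = subset_determinant(X, new_indices)
        improvement = det - new_det
        indices, det = new_indices, new_det
        if improvement < tolerance:
            break
    return indices, det


# Example: 200 points in 3 dimensions, the first 20 shifted far away.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:20] += 10.0
start = rng.choice(200, size=102, replace=False)  # h = (n + p + 1) // 2
final, final_det = iterate_c_steps(X, start)
```

Each C-step cannot increase the determinant of the subset covariance, which is why comparing successive determinants against tolerance is a natural stopping rule.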

References

Rousseeuw and Van Driessen, A Fast Algorithm for the Minimum Covariance Determinant Estimator, 1999, American Statistical Association and the American Society for Quality, TECHNOMETRICS

calculate_covariance(X) → ndarray[source]

class robpy.covariance.mcd.HSubset(indices: numpy.ndarray, location: numpy.ndarray, scale: numpy.ndarray, determinant: float, n_c_steps: int = 0)[source]

Bases: object

determinant: float

indices: ndarray

location: ndarray

n_c_steps: int = 0

scale: ndarray

Orthogonalized Gnanadesikan-Kettenring

class robpy.covariance.ogk.OGK(*, store_precision=True, assume_centered=False, location_estimator: ~robpy.univariate.base.LocationOrScaleEstimator = <function median>, scale_estimator: ~robpy.univariate.base.LocationOrScaleEstimator = <function median_abs_deviation>, n_iterations: int = 2, reweighting: bool = False, reweighting_beta: float = 0.9)[source]

Bases: RobustCovariance

Implementation of the Orthogonalized Gnanadesikan-Kettenring estimator for location and dispersion proposed in Maronna, R. A., & Zamar, R. H. (2002).

Parameters:

store_precision (boolean, optional) – whether to store the precision matrix
assume_centered (boolean, optional) – whether the data is already centered
location_estimator (LocationOrScaleEstimator, optional) – function to estimate the location of the data, should accept an array like input as first value and a named argument axis
scale_estimator (LocationOrScaleEstimator, optional) – function to estimate the scale of the data, should accept an array like input as first value and a named argument axis
n_iterations (int, optional) – number of iterations for the orthogonalization step
reweighting (boolean, optional) – whether to apply reweighting at the end (i.e. calculating regular location and covariance after filtering outliers based on Mahalanobis distance using OGK estimates)
reweighting_beta (float, optional) – quantile of chi2 distribution to use as cutoff for reweighting
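The role of scale_estimator can be illustrated with the classical Gnanadesikan-Kettenring identity, which builds a pairwise covariance from a univariate scale. The sketch below shows only that identity, not the orthogonalization or reweighting steps and not robpy's code: gk_covariance and mad are hypothetical helpers, and the MAD consistency factor 1.4826 is an assumption.

```python
import numpy as np


def mad(x, axis=None):
    """Median absolute deviation, scaled to be consistent at the normal model."""
    med = np.median(x, axis=axis, keepdims=True)
    return 1.4826 * np.median(np.abs(x - med), axis=axis)


def gk_covariance(x, y, scale_estimator=mad):
    """Pairwise covariance from a univariate scale via the GK identity.

    cov(x, y) = (scale(x + y)^2 - scale(x - y)^2) / 4
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    s_plus = scale_estimator(x + y, axis=0)
    s_minus = scale_estimator(x - y, axis=0)
    return (s_plus ** 2 - s_minus ** 2) / 4.0
```

With scale_estimator=np.std the identity reproduces the ordinary covariance exactly; swapping in a robust scale such as the MAD gives a robust pairwise covariance, which is the reason the estimator accepts any callable taking an array and an axis keyword.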

References

Maronna, R. A., & Zamar, R. H. (2002). Robust Estimates of Location and Dispersion for High-Dimensional Datasets. Technometrics, 44(4), 307–317. http://www.jstor.org/stable/1271538

calculate_covariance(X) → ndarray[source]: Calculate location and covariance with the algorithm described in Maronna & Zamar (2002). Covariance is returned, location is overwritten.

Wrapping Covariance

class robpy.covariance.wrapping.WrappingCovariance(b: float = 1.5, c: float = 4.0, q1: float = 1.540793, q2: float = 0.8622731, rescale: bool = True, store_precision: bool = True, assume_centered: bool = False)[source]

Bases: RobustCovariance

Covariance estimator based on the wrapping function proposed in Jakob Raymaekers & Peter J. Rousseeuw (2021)

The wrapping transformation is defined as follows:

\[\begin{split}\Psi_{b, c}(z) = \begin{cases} z & \text{if } 0 \leq |z| < b \\ q_1 \tanh\left(q_2 (c - |z|)\right) \mathrm{sign}(z) & \text{if } b \leq |z| \leq c \\ 0 & \text{if } c < |z| \end{cases}\end{split}\]

Data is first scaled using median and MAD before applying the transformation.

The (standard) covariance is subsequently estimated on the rescaled data: Cov(X) = Cov(Median(X) + MAD(X) * Psi((X - Median(X)) / MAD(X))), where Psi is the wrapping transformation defined above.
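The transformation and the covariance step described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than robpy's implementation: the names wrap and wrapping_covariance_sketch are hypothetical, the MAD consistency factor 1.4826 is an assumption, and the rescale option is left out.

```python
import numpy as np


def wrap(z, b=1.5, c=4.0, q1=1.540793, q2=0.8622731):
    """Wrapping function: identity near zero, tanh taper, then zero."""
    z = np.asarray(z, dtype=float)
    abs_z = np.abs(z)
    # Middle branch, used for b <= |z| <= c.
    tapered = q1 * np.tanh(q2 * (c - abs_z)) * np.sign(z)
    return np.where(abs_z < b, z, np.where(abs_z <= c, tapered, 0.0))


def wrapping_covariance_sketch(X):
    """Covariance of the data after columnwise robust scaling and wrapping."""
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    # MAD with the usual normal consistency factor (assumption).
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    Z = (X - med) / mad
    # Map the wrapped scores back to the original location and scale.
    X_wrapped = med + mad * wrap(Z)
    return np.cov(X_wrapped, rowvar=False)
```

With the default constants the middle branch evaluates to approximately b at |z| = b, so the transformation is continuous there, and it reaches exactly zero at |z| = c.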

References

Jakob Raymaekers & Peter J. Rousseeuw (2021) Fast Robust Correlation for High-Dimensional Data, Technometrics, 63:2, 184-198, DOI: 10.1080/00401706.2019.1677270

Parameters:

X – data to be transformed, must have shape (N, D)
b – lower cutoff
c – upper cutoff
q1 – transformation parameters (see formula)
q2 – transformation parameters (see formula)
rescale – whether to rescale the wrapped data so the robust location and scale of the transformed data are the same as the original data

calculate_covariance(X: ndarray) → ndarray[source]

Calculate the covariance matrix of the data X after applying the wrapping transformation

Parameters: X – data to calculate the covariance matrix of, must have shape (N, D)
Returns: robust covariance matrix of X