robpy.outliers

Module containing all algorithms related to outlier detection.

Detect Deviating Cells

class robpy.outliers.ddc.DDC(chi2_quantile: float = 0.99, min_correlation: float = 0.5, scale_estimator: robpy.univariate.base.RobustScale = <robpy.univariate.onestep_m.CellwiseOneStepM object>)

Bases: OutlierMixin

Implementation of the Detecting Deviating Cells (DDC) algorithm. Based on the R implementation in the package cellWise.

Parameters:

chi2_quantile (float, optional) – Quantile of the chi-squared distribution to use as threshold for univariate outlier detection in step 2. Default is 0.99.
min_correlation (float, optional) – Minimum correlation between two variables for them to be treated as related. Default is 0.5.
scale_estimator (RobustScale, optional) – Robust scale estimator used to scale the initial data. Defaults to CellwiseOneStepM().
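
To make the chi2_quantile threshold concrete: in the DDC paper cited below, a standardized cell value z is flagged in the univariate step when |z| exceeds the square root of the chosen chi-squared quantile with one degree of freedom. The sketch below computes that cutoff with the standard library only, using the identity that the chi-squared(1) quantile at q equals the squared standard normal quantile at (1 + q) / 2. The helper names univariate_cutoff and flag_values are hypothetical and not part of robpy; how robpy applies the threshold internally is an assumption and is not shown here.

```python
from statistics import NormalDist


def univariate_cutoff(chi2_quantile: float = 0.99) -> float:
    """Square root of the chi-squared(1) quantile, via the normal quantile."""
    # If Z ~ N(0, 1) then Z**2 ~ chi-squared(1), so the square root of the
    # q-th chi-squared(1) quantile is the normal quantile at (1 + q) / 2.
    return NormalDist().inv_cdf((1.0 + chi2_quantile) / 2.0)


def flag_values(z_scores, chi2_quantile: float = 0.99):
    """Return True for standardized values whose magnitude exceeds the cutoff."""
    cutoff = univariate_cutoff(chi2_quantile)
    return [abs(z) > cutoff for z in z_scores]


cutoff = univariate_cutoff()  # roughly 2.576 for the default of 0.99
flags = flag_values([0.3, -1.9, 2.7, -4.2])
print(round(cutoff, 3), flags)
```

Raising chi2_quantile raises the cutoff, so fewer cells are flagged in this step.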

References

Rousseeuw, P. J., & Van den Bossche, W. (2018). Detecting deviating data cells. Technometrics, 60(2), 135-145.

cellmap(X: DataFrame, standardized_residuals: ndarray | None = None, annotate: bool = False, fmt: str = '.1f', figsize: tuple[int, int] = (7, 10), row_zoom: tuple[int, int] | Index | None = None, col_zoom: tuple[int, int] | Index | None = None, vmax_clip: float = 3.290526731491895, cmap: str | Colormap = 'custom') → Axes

Visualize the standardized residuals of the DDC model as a heatmap.

Parameters:

X (pd.DataFrame) – The data used to predict the residuals.
standardized_residuals (np.ndarray | None, optional) – If X is not the original data used to fit the model, the standardized residuals of the cells predicted on the new X data should be passed. Defaults to None.
annotate (bool, optional) – Whether to annotate the heatmap cells with the original values. Defaults to False.
fmt (str, optional) – Format to use for annotations. Defaults to “.1f”.
figsize (tuple[int, int], optional) – Figure size. Defaults to (7, 10).
row_zoom (tuple[int, int] | pd.Index | None, optional) – If not None, a subset of the rows is selected for visualization. A tuple is interpreted as a slice, a pd.Index as a selection. Defaults to None.
col_zoom (tuple[int, int] | pd.Index | None, optional) – Similar to row_zoom but for columns. Defaults to None.
vmax_clip (float, optional) – Standardized residuals whose absolute value exceeds vmax_clip are clipped and drawn in the darkest color. Defaults to 3.290526731491895.
cmap (str | matplotlib.colors.Colormap, optional) – Matplotlib colormap, or the name of one, used to map the residuals to colors. Defaults to “custom”.

Returns:

The matplotlib axes with the heatmap.

Return type:

Axes
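
The effect of vmax_clip can be illustrated without plotting. The sketch below uses a hypothetical helper, clip_residuals, which is not part of robpy. It limits standardized residuals to the interval [-vmax_clip, vmax_clip], so every residual beyond the limit maps to the same extreme color. Treating the clipping as symmetric around zero is an assumption about the heatmap. The last line checks that the default value numerically matches the standard normal quantile at 0.9995; whether the default was derived this way is likewise an assumption.

```python
from statistics import NormalDist

DEFAULT_VMAX_CLIP = 3.290526731491895  # default taken from the cellmap signature


def clip_residuals(residuals, vmax_clip: float = DEFAULT_VMAX_CLIP):
    """Limit each residual to [-vmax_clip, vmax_clip] (illustration only)."""
    return [max(-vmax_clip, min(vmax_clip, r)) for r in residuals]


clipped = clip_residuals([-7.5, -1.2, 0.0, 2.9, 12.0])
# -7.5 and 12.0 are pulled back to the limits; the other values are unchanged.
print(clipped)

# The default coincides with the standard normal quantile at 0.9995,
# i.e. the square root of the 0.999 quantile of chi-squared(1).
print(abs(NormalDist().inv_cdf(0.9995) - DEFAULT_VMAX_CLIP) < 1e-9)
```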

fit(X: DataFrame, y=None, verbose: bool = False)

impute(X: DataFrame, impute_outliers: bool = True) → DataFrame

predict(X: DataFrame, rowwise: bool = False) → ndarray

Predict outliers in the data.

Parameters:

X (pd.DataFrame) – New data to predict outliers for.
rowwise (bool, optional) – Whether to predict rowwise instead of cellwise outliers. Defaults to False.

Raises:

ValueError – Model not fitted.
ValueError – Data shape mismatch.

Returns:

If rowwise is True: A 1D array of shape (n_samples,) with rowwise outliers.
If rowwise is False: A matrix of shape (n_samples, n_features) with cellwise outliers and an array containing the standardized residuals of the cells.

Return type:

Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]
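
A minimal usage sketch, based only on the signatures documented above. The wrapper name rowwise_outliers is hypothetical and not part of robpy. The import sits inside the function so the snippet can be loaded even where robpy is not installed, and X is assumed to be a numeric pd.DataFrame. The sketch uses rowwise=True because that return value is documented as a single 1D array.

```python
def rowwise_outliers(X, chi2_quantile=0.99, min_correlation=0.5):
    """Fit DDC on a DataFrame and return its rowwise outlier predictions."""
    # Imported lazily; requires the robpy package at call time.
    from robpy.outliers.ddc import DDC

    model = DDC(chi2_quantile=chi2_quantile, min_correlation=min_correlation)
    model.fit(X)
    # Documented to return a 1D array of shape (n_samples,) when rowwise=True.
    return model.predict(X, rowwise=True)
```

Calling predict before fit, or on data whose shape does not match the fitted data, raises ValueError as listed above, so callers that pass new data may want to catch it.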

robpy.outliers.ddc.get_custom_cmap(vmax_clip: float, neutral_color: str = '#f7f286') → Colormap