robpy.outliers

Module containing all algorithms related to outlier detection.

Detecting Deviating Cells

class robpy.outliers.ddc.DDC(chi2_quantile: float = 0.99, min_correlation: float = 0.5, scale_estimator: robpy.univariate.base.RobustScale = <robpy.univariate.onestep_m.CellwiseOneStepM object>)

Bases: OutlierMixin

Implementation of the Detecting Deviating Cells (DDC) algorithm. Based on the R implementation in the package cellWise.

Parameters:
  • chi2_quantile (float, optional) – Quantile of the chi-squared distribution to use as threshold for univariate outlier detection in step 2. Default is 0.99.

  • min_correlation (float, optional) – Minimum correlation two variables must have for one to be used when predicting the cells of the other. Default is 0.5.

  • scale_estimator (RobustScale, optional) – Robust scale estimator used to scale the initial data. Defaults to CellwiseOneStepM().
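As a rough illustration of how chi2_quantile could act as a univariate threshold, the sketch below follows the rule described in the referenced paper: a standardized cell is flagged when its absolute value exceeds the square root of the chi-squared quantile with one degree of freedom. That robpy applies exactly this rule internally is an assumption, and the helper names chi2_1_cutoff and flag_cells are hypothetical, not part of robpy.

```python
from statistics import NormalDist


def chi2_1_cutoff(quantile: float = 0.99) -> float:
    """Square root of the chi-squared quantile with 1 degree of freedom.

    If Z is standard normal, Z**2 is chi-squared with 1 d.o.f., so the
    square root of its p-quantile equals the normal quantile at (1 + p) / 2.
    """
    return NormalDist().inv_cdf((1.0 + quantile) / 2.0)


def flag_cells(z_values, quantile: float = 0.99):
    """Flag standardized values whose absolute value exceeds the cutoff.

    Hypothetical helper for illustration only; not a robpy function.
    """
    cutoff = chi2_1_cutoff(quantile)
    return [abs(z) > cutoff for z in z_values]


cutoff = chi2_1_cutoff(0.99)                  # roughly 2.576
flags = flag_cells([0.3, -1.8, 2.9, -4.1])    # only the last two exceed the cutoff
```

Raising chi2_quantile raises the cutoff, so fewer cells are flagged in this step.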

References

  • Rousseeuw, P. J., & Van den Bossche, W. (2018). Detecting deviating data cells. Technometrics, 60(2), 135-145.

cellmap(X: DataFrame, standardized_residuals: ndarray | None = None, annotate: bool = False, fmt: str = '.1f', figsize: tuple[int, int] = (7, 10), row_zoom: tuple[int, int] | Index | None = None, col_zoom: tuple[int, int] | Index | None = None, vmax_clip: float = 3.290526731491895, cmap: str | Colormap = 'custom') → Axes

Visualize the standardized residuals of the DDC model as a heatmap.

Parameters:
  • X (pd.DataFrame) – The data used to predict the residuals.

  • standardized_residuals (np.ndarray | None, optional) – Standardized residuals of the cells in X. Pass these if X is not the original data used to fit the model. Defaults to None.

  • annotate (bool, optional) – Whether to annotate the heatmap cells with the original values. Defaults to False.

  • fmt (str, optional) – Format to use for annotations. Defaults to “.1f”.

  • figsize (tuple[int, int], optional) – Figure size. Defaults to (7, 10).

  • row_zoom (tuple[int, int] | pd.Index | None, optional) – If not None, a subset of the rows is selected for visualization. A tuple is interpreted as a slice, a pd.Index as a selection. Defaults to None.

  • col_zoom (tuple[int, int] | pd.Index | None, optional) – Similar to row_zoom but for columns. Defaults to None.

  • vmax_clip (float, optional) – Absolute standardized residuals larger than vmax_clip are clipped and shown in the darkest color. Defaults to 3.290526731491895.

  • cmap (str | matplotlib.colors.Colormap, optional) – Matplotlib colormap, or the name of one, used to map the residuals to colors. Defaults to 'custom'.

Returns:

The matplotlib axes with the heatmap.

Return type:

Axes
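To make the effect of vmax_clip concrete, the sketch below clips a few residuals to the range [-vmax_clip, vmax_clip] before they would be mapped to colors. It assumes the color scale is symmetric around zero, and clip_residuals is a hypothetical helper, not a robpy function. The last line checks an observation rather than a documented fact: the default value appears to coincide with the 0.9995 quantile of the standard normal distribution.

```python
from statistics import NormalDist

# Default taken from the cellmap signature.
DEFAULT_VMAX_CLIP = 3.290526731491895


def clip_residuals(residuals, vmax_clip: float = DEFAULT_VMAX_CLIP):
    """Limit each residual to [-vmax_clip, vmax_clip].

    Hypothetical helper; assumes a color scale symmetric around zero.
    """
    return [max(-vmax_clip, min(vmax_clip, r)) for r in residuals]


# Residuals beyond the limit all receive the same (darkest) color.
clipped = clip_residuals([0.5, -2.0, 4.7, -12.3])

# Observation only: the default is numerically close to this normal quantile.
normal_quantile = NormalDist().inv_cdf(0.9995)
```

With this clipping, a residual of 4.7 and one of 12.3 are drawn identically, so setting annotate=True is one way to still see the underlying cell values.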

fit(X: DataFrame, y=None, verbose: bool = False)

impute(X: DataFrame, impute_outliers: bool = True) → DataFrame

predict(X: DataFrame, rowwise: bool = False) → ndarray

Predict outliers in the data.

Parameters:
  • X (pd.DataFrame) – New data to predict outliers for.

  • rowwise (bool, optional) – Whether to predict rowwise instead of cellwise outliers. Defaults to False.

Raises:
  • ValueError – If the model has not been fitted.

  • ValueError – If the shape of X does not match what the fitted model expects.

Returns:

  • If rowwise is True: A 1D array of shape (n_samples,) with rowwise outliers.

  • If rowwise is False: A matrix of shape (n_samples, n_features) with cellwise outliers and an array containing the standardized residuals of the cells.

Return type:

Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]
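Because the return value of predict depends on rowwise, calling code has to unpack it differently in each mode. The sketch below shows one way to do that, using only the fit and predict signatures documented above. The function detect_outliers is a hypothetical wrapper, not part of robpy, and it works with any object exposing those two methods.

```python
def detect_outliers(model, X):
    """Fit `model` on X and collect cellwise and rowwise predictions.

    Hypothetical wrapper for illustration. `model` is expected to expose
    fit(X) and predict(X, rowwise=...), as documented for DDC.
    """
    model.fit(X)

    # Documented return for rowwise=False: a pair (cell flags, residuals).
    # A bare array is also accepted here, since the signature annotates ndarray.
    cell_result = model.predict(X, rowwise=False)
    if isinstance(cell_result, tuple):
        cell_flags, residuals = cell_result
    else:
        cell_flags, residuals = cell_result, None

    # Documented return for rowwise=True: a 1D array of shape (n_samples,).
    row_flags = model.predict(X, rowwise=True)
    return cell_flags, residuals, row_flags
```

Assuming robpy is installed and df is a pandas DataFrame, a call could look like detect_outliers(DDC(), df), with DDC imported from robpy.outliers.ddc. The returned residuals can then be passed to cellmap as standardized_residuals.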

robpy.outliers.ddc.get_custom_cmap(vmax_clip: float, neutral_color: str = '#f7f286') → Colormap