robpy.outliers
Module containing all algorithms related to outlier detection.
Detect Deviating Cells
- class robpy.outliers.ddc.DDC(chi2_quantile: float = 0.99, min_correlation: float = 0.5, scale_estimator: robpy.univariate.base.RobustScale = <robpy.univariate.onestep_m.CellwiseOneStepM object>)
Bases: OutlierMixin
Implementation of the Detecting Deviating Cells (DDC) algorithm, based on the R implementation in the cellWise package.
- Parameters:
chi2_quantile (float, optional) – Quantile of the chi-squared distribution to use as threshold for univariate outlier detection in step 2. Default is 0.99.
min_correlation (float, optional) – Minimum correlation two variables must have to be considered connected, i.e. for one to be used when predicting cells of the other. Default is 0.5.
scale_estimator (RobustScale, optional) – Robust scale estimator used to scale the initial data. Defaults to CellwiseOneStepM().
References
Rousseeuw, P. J., & Van den Bossche, W. (2018). Detecting deviating data cells. Technometrics, 60(2), 135-145.
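Example
As a rough illustration of what chi2_quantile means as a cutoff: a chi-squared variable with one degree of freedom is the square of a standard normal variable, so its q-quantile equals the squared standard normal quantile at (1 + q) / 2. The sketch below assumes, following Rousseeuw and Van den Bossche (2018), that the univariate step compares the absolute standardized value of a cell with the square root of that quantile. The helper name abs_cutoff is hypothetical and not part of robpy.

```python
from statistics import NormalDist


def abs_cutoff(chi2_quantile: float = 0.99) -> float:
    """Cutoff on |z| implied by a chi-squared(1) quantile (hypothetical helper)."""
    if not 0.0 < chi2_quantile < 1.0:
        raise ValueError("chi2_quantile must lie strictly between 0 and 1")
    # P(Z**2 <= c**2) = q  <=>  P(|Z| <= c) = q  <=>  c = Phi^{-1}((1 + q) / 2)
    return NormalDist().inv_cdf((1.0 + chi2_quantile) / 2.0)


cutoff = abs_cutoff(0.99)
print(f"{cutoff:.3f}")  # 2.576
```

Under this assumption, the default of 0.99 flags cells whose standardized value exceeds roughly 2.58 in absolute value.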
- cellmap(X: DataFrame, standardized_residuals: ndarray | None = None, annotate: bool = False, fmt: str = '.1f', figsize: tuple[int, int] = (7, 10), row_zoom: tuple[int, int] | Index | None = None, col_zoom: tuple[int, int] | Index | None = None, vmax_clip: float = 3.290526731491895, cmap: str | Colormap = 'custom') -> Axes
Visualize the standardized residuals of the DDC model as a heatmap.
- Parameters:
X (pd.DataFrame) – The data used to predict the residuals.
standardized_residuals (np.ndarray | None, optional) – Standardized residuals of the cells, as predicted on X. These should be passed when X is not the original data used to fit the model. Defaults to None.
annotate (bool, optional) – Whether to annotate the heatmap cells with the original values. Defaults to False.
fmt (str, optional) – Format to use for annotations. Defaults to “.1f”.
figsize (tuple[int, int], optional) – Figure size. Defaults to (7, 10).
row_zoom (tuple[int, int] | pd.Index | None, optional) – If not None, a subset of the rows is selected for visualization. A tuple is interpreted as a slice, a pd.Index as a selection. Defaults to None.
col_zoom (tuple[int, int] | pd.Index | None, optional) – Similar to row_zoom but for columns. Defaults to None.
vmax_clip (float, optional) – Absolute standardized residuals larger than vmax_clip are clipped to it and therefore drawn in the darkest color. Defaults to 3.290526731491895.
cmap (str | matplotlib.colors.Colormap, optional) – Matplotlib colormap, or the name of one as a string, used to map the residuals to colors. Defaults to “custom”.
- Returns:
The matplotlib axes with the heatmap.
- Return type:
Axes
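Example
The effect of vmax_clip can be pictured as clipping each standardized residual to the interval [-vmax_clip, vmax_clip] before it is mapped to a color. The following is a minimal sketch of that idea in plain Python, not robpy's actual plotting code; clip_residuals is a hypothetical helper, and nested lists stand in for the residual matrix. The default in the signature is approximately the standard normal quantile at 0.9995.

```python
from statistics import NormalDist

# The default from the signature is approximately the standard normal 0.9995 quantile.
DEFAULT_VMAX_CLIP = NormalDist().inv_cdf(0.9995)


def clip_residuals(residuals, vmax_clip=DEFAULT_VMAX_CLIP):
    """Clip each residual to [-vmax_clip, vmax_clip] (hypothetical helper)."""
    return [
        [max(-vmax_clip, min(vmax_clip, r)) for r in row]
        for row in residuals
    ]


residuals = [[0.4, -5.2], [7.9, 1.3]]
clipped = clip_residuals(residuals, vmax_clip=3.0)
print(clipped)  # [[0.4, -3.0], [3.0, 1.3]]
```

Residuals inside the interval keep their value, so only extreme cells saturate at the ends of the color scale.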
- predict(X: DataFrame, rowwise: bool = False) -> ndarray
Predict outliers in the data.
- Parameters:
X (pd.DataFrame) – New data to predict outliers for.
rowwise (bool, optional) – Whether to predict rowwise instead of cellwise outliers. Defaults to False.
- Raises:
ValueError – If the model has not been fitted.
ValueError – If the shape of the data does not match.
- Returns:
If rowwise is True: A 1D array of shape (n_samples,) with rowwise outliers.
If rowwise is False: A matrix of shape (n_samples, n_features) with cellwise outliers and an array containing the standardized residuals of the cells.
- Return type:
Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]
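Example
Because the return value depends on rowwise, calling code may want to normalize both cases. The following is a minimal sketch, assuming the cellwise result arrives as a 2-tuple of (flags, standardized residuals) as described above. The helper split_prediction is hypothetical, and plain lists stand in for NumPy arrays.

```python
def split_prediction(result):
    """Return (flags, residuals); residuals is None for rowwise output."""
    if isinstance(result, tuple):
        flags, residuals = result  # cellwise: flags plus standardized residuals
        return flags, residuals
    return result, None  # rowwise: a single array of flags


# Stand-in values shaped like the two documented return forms.
rowwise_result = [False, True, False]
cellwise_result = ([[False, True], [False, False]], [[0.3, 4.1], [-0.8, 1.2]])

print(split_prediction(rowwise_result))  # ([False, True, False], None)
flags, residuals = split_prediction(cellwise_result)
print(residuals[0][1])  # 4.1
```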