robpy.regression

Base

class robpy.regression.base.RobustRegression[source]

Bases: RegressorMixin, BaseEstimator

Base class for robust regression estimators.

fit(X, y) → RobustRegression[source]

outlier_map(X, y, robust_scaling: bool = True, robust_distance: bool = True, vertical_outlier_threshold: float = 2.5, leverage_threshold_percentile: float = 0.975, figsize: tuple[int, int] = (4, 4), return_data: bool = False) → None | tuple[ndarray, ndarray, ndarray, float, float][source]

Creates a diagnostic plot where the robust residuals of the target are plotted against the robust Mahalanobis distances of the features.

Parameters:

X (array like of shape (n_samples, n_features)) – Training features.
y (array like of shape (n_samples, )) – Training targets.
robust_scaling (bool, optional) – Whether to scale the residuals using MAD instead of std. Defaults to True.
robust_distance (bool, optional) – Whether to use the MCD as loc/scale estimator instead of mean/cov for calculating the Mahalanobis distances. Defaults to True.
vertical_outlier_threshold (float, optional) – Where to draw the upper (and lower) limit for the standardized residuals to indicate outliers. Defaults to 2.5.
leverage_threshold_percentile (float, optional) – Which percentile of the chi-squared distribution to use as the threshold for leverage points. Defaults to 0.975.
figsize (tuple[int, int], optional) – Size of the plot. Defaults to (4, 4).
return_data (bool, optional) – Whether to return the residuals, the standardized residuals and the distances. Defaults to False.

References

Rousseeuw P.J., Hubert M. (2018). Anomaly detection by robust statistics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(2), 1–14.
Rousseeuw P.J., van Zomeren B.C. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85(411), 633–651.
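
The regions of the outlier map described above can be illustrated with a small standalone sketch. It is not the robpy implementation: the helper names mad_scale and classify_points, the example numbers, and the choice to compare distances against the square root of the chi-squared quantile are all assumptions made for illustration. The closed-form quantile used here is only valid for exactly two features.

```python
import math
from statistics import median


def mad_scale(values):
    """Median absolute deviation, scaled by 1.4826 for consistency at the normal."""
    center = median(values)
    return 1.4826 * median([abs(v - center) for v in values])


def classify_points(std_residuals, distances, vertical_threshold, leverage_threshold):
    """Label each observation by its region in the outlier map (hypothetical helper)."""
    labels = []
    for r, d in zip(std_residuals, distances):
        is_vertical = abs(r) > vertical_threshold
        is_leverage = d > leverage_threshold
        if is_vertical and is_leverage:
            labels.append("bad leverage")
        elif is_vertical:
            labels.append("vertical outlier")
        elif is_leverage:
            labels.append("good leverage")
        else:
            labels.append("regular")
    return labels


# Hypothetical residuals and robust distances for seven observations.
residuals = [0.1, -0.2, 0.05, 0.3, -0.15, 4.0, -3.0]
distances = [1.0, 0.8, 1.2, 5.0, 0.9, 1.1, 6.0]

# Robust standardization of the residuals (assumed: residual / MAD scale).
scale = mad_scale(residuals)
std_residuals = [r / scale for r in residuals]

# With 2 degrees of freedom the chi-squared quantile is -2 * ln(1 - q).
# Taking its square root as the distance cutoff is an assumption of this sketch.
leverage_threshold = math.sqrt(-2.0 * math.log(1.0 - 0.975))

labels = classify_points(std_residuals, distances, 2.5, leverage_threshold)
print(labels)
```

Points flagged only on the vertical axis have large residuals but ordinary feature values, while points flagged on both axes are the ones most likely to distort a non-robust fit.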

predict(X)[source]

property scale: float

Least Trimmed Squares

class robpy.regression.lts.FastLTSRegression(alpha: float = 0.5, n_initial_subsets: int = 500, n_initial_c_steps: int = 2, n_best_models: int = 10, reweighting: bool = True, tolerance: float = 1e-15, random_state: int = 42)[source]

Bases: RobustRegression

Implementation of the FAST-LTS model based on the R implementation of the ltsReg method in the robustbase R package (cfr. https://www.rdocumentation.org/packages/robustbase/versions/0.93-8/topics/ltsReg) and the python implementation Reweighted-FastLTS (cfr. https://github.com/GiuseppeCannata/Reweighted-FastLTS/blob/master/Reweighted_FastLTS.py).

Parameters:

alpha (float, optional) – Fraction of the data to use as the subset for calculating the trimmed squared error. Must be between 0.5 and 1, with 1 corresponding to classic least squares (LS) regression. Defaults to 0.5.
n_initial_subsets (int, optional) – Number of initial subsets to apply C-steps to (cfr m in original R implementation). Defaults to 500.
n_initial_c_steps (int, optional) – Number of C-steps to apply to each of the n_initial_subsets before the final C-steps until convergence. Defaults to 2.
n_best_models (int, optional) – Number of best models after the initial C-steps to iterate further until convergence. Defaults to 10.
reweighting (bool, optional) – Whether to apply reweighting to the raw estimates. Defaults to True.
tolerance (float, optional) – Acceptable change in loss value between C-steps. If the loss changes by no more than tolerance between two consecutive C-steps, the model is considered converged. Defaults to 1e-15.
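
The role of alpha, the C-steps and tolerance can be sketched for the simplest case of one feature plus an intercept. This is a minimal illustration, not the robpy code: the function names, the subset size h = ceil(alpha * n) and the toy data are assumptions. Each C-step fits least squares on the current subset and then keeps the h observations with the smallest squared residuals, so the trimmed loss cannot increase.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def c_steps(xs, ys, start_subset, alpha=0.5, tolerance=1e-15, max_steps=100):
    """Iterate C-steps from a starting subset; return the last fit and the loss path."""
    h = math.ceil(alpha * len(xs))  # assumed subset size, not the robustbase formula
    subset = list(start_subset)
    losses = []
    for _ in range(max_steps):
        intercept, slope = fit_line([xs[i] for i in subset], [ys[i] for i in subset])
        squared = [(y - intercept - slope * x) ** 2 for x, y in zip(xs, ys)]
        # Keep the h observations with the smallest squared residuals.
        subset = sorted(range(len(xs)), key=squared.__getitem__)[:h]
        losses.append(sum(squared[i] for i in subset))
        if len(losses) > 1 and losses[-2] - losses[-1] <= tolerance:
            break
    return (intercept, slope), losses


# Toy data: y = 2 + 0.5 * x with two gross outliers at the end.
xs = [float(i) for i in range(10)]
ys = [2.0 + 0.5 * x for x in xs]
ys[8], ys[9] = 30.0, -20.0

clean_fit, clean_losses = c_steps(xs, ys, start_subset=[0, 1])
bad_fit, bad_losses = c_steps(xs, ys, start_subset=[8, 9])
print(clean_fit, clean_losses)
print(bad_fit, bad_losses)
```

A single start may still end in a poor local optimum, which is why the parameters above describe running many initial subsets and carrying only the best candidates through to convergence.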

fit(X: ndarray | DataFrame, y: ndarray | Series, initial_weights: ndarray | None = None, verbosity: int = 20) → FastLTSRegression[source]

Fit the model to the data.

Parameters:

X (np.ndarray | pd.DataFrame) – Training features.
y (np.ndarray | pd.Series) – Training labels.
initial_weights (np.ndarray | None, optional) – Fixed initial weights. If provided, all initial subsets start from these same weights, so there is no benefit in setting n_initial_subsets > 1. Defaults to None.
verbosity (int, optional) – Logging level. Defaults to logging.INFO (20).

Returns:

The fitted FastLTS object.

References

Rousseeuw P.J. (1984). Least Median of Squares Regression. Journal of the American Statistical Association, 79(388), 871–880.

predict(X: ndarray | DataFrame) → ndarray[source]

robpy.regression.lts.get_correction_factor(p: int, n: int, alpha: float) → float[source]

Calculate the small sample correction factor for the scale resulting from LTS regression when there is no reweighting.

References

Pison, G., Van Aelst, S., & Willems, G. (2002). Small sample corrections for LTS and MCD. Metrika, 55(1), 111-123.
https://github.com/cran/robustbase/blob/c4b9d21cfc4beb64653bb2ffba9e549e2dbb98ed/R/ltsReg.R

robpy.regression.lts.get_correction_factor_reweighting(p: int, n: int, alpha: float) → float[source]

Calculate the small sample correction factor for the scale resulting from LTS regression when there is reweighting.

References

Pison, G., Van Aelst, S., & Willems, G. (2002). Small sample corrections for LTS and MCD. Metrika, 55(1), 111-123.
https://github.com/cran/robustbase/blob/c4b9d21cfc4beb64653bb2ffba9e549e2dbb98ed/R/ltsReg.R

S Regression

class robpy.regression.s.SRegression(rho: ~robpy.utils.rho.BaseRho = <robpy.utils.rho.TukeyBisquare object>, n_initial_subsets: int = 500, n_initial_i_steps: int = 2, n_best_subsets: int = 5, max_scale_iterations: int = 2, b: float = 0.5, fit_intercept: bool = True, relative_tolerance: float = 1e-07, scale_tolerance: float = 1e-10, random_state: int = 101)[source]

Bases: RobustRegression

S-regression, proposed by Rousseeuw, P. J., & Yohai, V. J. (1984). This code is an implementation of the Fast S algorithm described in Salibian-Barrera, M., & Yohai, V. J. (2006).

Parameters:

rho (BaseRho, optional) – Score function to use on the residuals. Defaults to TukeyBisquare(c=1.547).
n_initial_subsets (int, optional) – Number of initial subsets to sample (N in the original paper). Defaults to 500.
n_initial_i_steps (int, optional) – Number of i-steps to take on the initial subsets (k in the original paper). Defaults to 2.
n_best_subsets (int, optional) – Number of subsets with the best M-scales (residuals transformed by the score function) to keep (t in the original paper). Defaults to 5.
max_scale_iterations (int, optional) – Number of iterative steps to derive M-scale estimates (r in the original paper). Defaults to 2.
b (float, optional) – Constant on the right-hand side of the M-scale equation. Defaults to 0.5.
fit_intercept (bool, optional) – Whether an intercept should be included in the linear regression. Defaults to True.
relative_tolerance (float, optional) – Determines the stopping criterion for the i-steps until convergence (difference in beta norm should be higher than relative_tolerance * max(relative_tolerance, beta_norm)). Defaults to 1e-7.
scale_tolerance (float, optional) – If the difference between 2 subsequent scale estimates is below this threshold, the iterations are stopped and it is assumed the scale estimate converged. Defaults to 1e-10.
random_state (int, optional) – Can be used to provide a random state. Defaults to 101.

References

Rousseeuw, P. J., & Yohai, V. J. (1984). Robust Regression by Means of S-Estimators. In: Franke, J., Härdle, W., Martin, D. (eds) Robust and Nonlinear Time Series Analysis. Lecture Notes in Statistics, vol 26. Springer, New York, NY.
Salibian-Barrera, M., & Yohai, V. J. (2006). A Fast Algorithm for S-Regression Estimates. Journal of Computational and Graphical Statistics, 15(2), 414–427.
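
The M-scale that the parameters b, max_scale_iterations and scale_tolerance refer to can be sketched as a fixed-point iteration. The code below is a standalone illustration under stated assumptions, not the robpy implementation: it assumes a Tukey bisquare rho normalized to a maximum of 1, solves mean(rho(r_i / s)) = b for s, and uses hypothetical residuals and a hypothetical starting value.

```python
from statistics import median


def tukey_bisquare_rho(u, c=1.547):
    """Tukey bisquare rho, normalized so that rho(u) = 1 for |u| >= c (assumed form)."""
    if abs(u) >= c:
        return 1.0
    t = (u / c) ** 2
    return 1.0 - (1.0 - t) ** 3


def mean_rho(residuals, scale, c=1.547):
    """Left-hand side of the M-scale equation for a given scale."""
    return sum(tukey_bisquare_rho(r / scale, c) for r in residuals) / len(residuals)


def m_scale(residuals, b=0.5, c=1.547, scale_tolerance=1e-10, max_iterations=200):
    """Solve mean(rho(r / s)) = b for s by fixed-point iteration."""
    # Starting value: normalized median of absolute residuals (an assumption).
    scale = median([abs(r) for r in residuals]) / 0.6745
    for _ in range(max_iterations):
        new_scale = scale * (mean_rho(residuals, scale, c) / b) ** 0.5
        if abs(new_scale - scale) <= scale_tolerance:
            return new_scale
        scale = new_scale
    return scale


# Hypothetical residuals with one gross outlier.
residuals = [0.3, -1.2, 0.8, 2.5, -0.4, 0.1, -0.9, 15.0, 1.1, -0.6]
s = m_scale(residuals)
print(s, mean_rho(residuals, s))
```

Because rho is bounded, the outlier at 15.0 contributes at most 1 to the sum no matter how large it is, which is what keeps the resulting scale robust.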

fit(X, y, verbosity=30)[source]

predict(X) → ndarray[source]

MM Regression

class robpy.regression.mm.MMRegression(initial_estimator: ~robpy.regression.base.RobustRegression = SRegression(), rho: ~robpy.utils.rho.BaseRho = <robpy.utils.rho.TukeyBisquare object>, max_iterations: int = 500, epsilon: float = 1e-07)[source]

Bases: RobustRegression

Implementation of MM-regression estimator of Yohai, V. J. (1987).

Parameters:

initial_estimator (RobustRegression, optional) – Initial regression estimator. Defaults to SRegression.
rho (BaseRho, optional) – The rho-function used for the MM-estimate. Defaults to TukeyBisquare(c=3.44).
max_iterations (int, optional) – Maximum number of iterations. Defaults to 500.
epsilon (float, optional) – If the absolute difference between the new and old residuals is below epsilon for all observations in an iteration, the computation stops. Defaults to 1e-7.

References

Yohai, V. J. (1987). High breakdown-point and high efficiency robust estimates for regression. The Annals of Statistics, 15(2), 642–656.
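
How max_iterations and epsilon interact can be sketched with iteratively reweighted least squares for a single feature. This is an illustration under assumptions, not the robpy code: the scale is held fixed at a value that would normally come from the initial estimator, the weights are the Tukey bisquare weights (1 - (u/c)^2)^2, and the starting coefficients, scale and data are hypothetical.

```python
def bisquare_weight(u, c=3.44):
    """Tukey bisquare weight: positive inside [-c, c], zero outside."""
    if abs(u) >= c:
        return 0.0
    return (1.0 - (u / c) ** 2) ** 2


def weighted_line_fit(xs, ys, ws):
    """Weighted least squares for y = intercept + slope * x (closed form)."""
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mm_irls(xs, ys, initial_coef, scale, c=3.44, max_iterations=500, epsilon=1e-7):
    """Reweight and refit until no residual changes by epsilon or more."""
    intercept, slope = initial_coef
    residuals = [y - intercept - slope * x for x, y in zip(xs, ys)]
    for _ in range(max_iterations):
        weights = [bisquare_weight(r / scale, c) for r in residuals]
        intercept, slope = weighted_line_fit(xs, ys, weights)
        new_residuals = [y - intercept - slope * x for x, y in zip(xs, ys)]
        change = max(abs(n - o) for n, o in zip(new_residuals, residuals))
        residuals = new_residuals
        if change < epsilon:
            break
    return intercept, slope


# Toy data: y = 1 + 2 * x with one gross outlier in the last observation.
xs = [float(i) for i in range(10)]
ys = [1.0 + 2.0 * x for x in xs]
ys[9] = 50.0

# Hypothetical initial estimate and fixed scale, standing in for an S-estimate.
intercept, slope = mm_irls(xs, ys, initial_coef=(0.8, 2.1), scale=0.5)
print(intercept, slope)
```

The outlier sits far beyond c times the scale, so it receives weight zero and the refit is driven entirely by the remaining observations.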

fit(X: ndarray | DataFrame, y: ndarray | Series) → MMRegression[source]

predict(X: ndarray | DataFrame) → ndarray[source]