robpy.preprocessing
Data Cleaning
- class robpy.preprocessing.data_cleaner.DataCleaner(max_missing_frac_cols: float = 0.5, max_missing_frac_rows: float = 0.5, min_unique_values: int = 3, min_abs_scale: float = 1e-12, clean_na_first: str = 'automatic', min_n_rows: int = 3)
Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator
Cleans a dataset before an analysis.
Typically used before DDC, cellMCD, transfo…
Based on the R function checkDataSet in the package cellWise: [https://rdrr.io/cran/cellWise/man/checkDataSet.html]
- Parameters:
max_missing_frac_cols (float, optional) – Keep only the columns that have a proportion of missing values lower than this threshold. Defaults to 0.5.
max_missing_frac_rows (float, optional) – Keep only the rows that have a proportion of missing values lower than this threshold. Defaults to 0.5.
min_unique_values (int, optional) – Any column with min_unique_values or fewer unique values will be classified as discrete and excluded from the cleaned dataset. Defaults to 3.
min_abs_scale (float, optional) – Only columns whose scale is larger than min_abs_scale will be considered (scale is measured by the MAD). Defaults to 1e-12.
clean_na_first (str, optional) – One of “automatic”, “columns”, “rows”. Decides whether columns or rows are checked for NAs first. If “automatic”, columns are checked first if p >= 5n (for data with n rows and p columns), otherwise rows are checked first. Defaults to “automatic”.
min_n_rows (int, optional) – Minimum number of rows (observations) required in the input data. Defaults to 3.
- property dropped_columns: dict[str, list]
Return the names of the columns that were dropped during the cleaning process.
- Returns:
Mapping from reason for dropping to list of column names.
- Return type:
dict[str, list]
- Raises:
NotFittedError – if the dropped column attributes weren’t set yet.
- property dropped_rows: dict[str, list]
Return the indices of the rows that were dropped during the cleaning process.
- Returns:
Mapping from reason for dropping to list of row indices.
- Return type:
dict[str, list]
- Raises:
NotFittedError – if the dropped row attributes weren’t set yet.
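The column-level rules described above can be illustrated with a minimal pure-Python sketch. This is not robpy's implementation: the helper names, the use of None to mark a missing value, and the reason labels in the returned mapping are assumptions made for illustration, and the scale and row checks are left out.

```python
def missing_fraction(values):
    """Fraction of entries that are None (None marks a missing value here)."""
    return sum(v is None for v in values) / len(values)


def na_check_order(n_rows, n_cols, clean_na_first="automatic"):
    """Return which axis is checked for missing values first."""
    if clean_na_first == "automatic":
        # Documented rule: columns first if p >= 5n, otherwise rows first.
        return "columns" if n_cols >= 5 * n_rows else "rows"
    return clean_na_first


def select_columns(columns, max_missing_frac_cols=0.5, min_unique_values=3):
    """Split column names into kept ones and dropped ones grouped by reason.

    `columns` maps a column name to a list of values. The reason labels are
    hypothetical and need not match the keys of `dropped_columns`.
    """
    kept = []
    dropped = {"too_many_missing": [], "discrete": []}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        if missing_fraction(values) >= max_missing_frac_cols:
            # Only columns with a missing fraction below the threshold are kept.
            dropped["too_many_missing"].append(name)
        elif len(set(observed)) <= min_unique_values:
            # min_unique_values or fewer distinct values: treated as discrete.
            dropped["discrete"].append(name)
        else:
            kept.append(name)
    return kept, dropped


data = {
    "height": [1.62, 1.75, 1.80, 1.68, 1.91, 1.77],
    "group": [0, 1, 0, 1, 1, 0],
    "sparse": [None, None, 3.1, None, None, 2.4],
}
kept, dropped = select_columns(data)
print(kept)
print(dropped)
print(na_check_order(n_rows=6, n_cols=3))
```

In this toy input, "sparse" has four of six values missing and "group" has only two distinct values, so only "height" passes both checks.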
Data Scaling
- class robpy.preprocessing.scaling.RobustScaler(scale_estimator: robpy.univariate.base.RobustScale = UnivariateMCD(), with_centering: bool = True, with_scaling: bool = True)
Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator
Scales features using a univariate RobustScale estimator.
- Parameters:
scale_estimator (RobustScale, optional) – Robust scale estimator to scale the data with. Defaults to UnivariateMCD().
with_centering (boolean, optional) – Whether to center the data. Defaults to True.
with_scaling (boolean, optional) – Whether to scale the data. Defaults to True.
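To make the roles of with_centering and with_scaling concrete, here is a minimal sketch of robust scaling for a single feature. It deliberately swaps in the median and the MAD as the location and scale estimates, which is simpler than the default UnivariateMCD estimator, so it is an illustration of the idea rather than what RobustScaler computes. The function names are hypothetical.

```python
from statistics import median


def mad(values, center):
    """Median absolute deviation around `center`.

    The factor 1.4826 makes the estimate consistent for normal data.
    """
    return 1.4826 * median([abs(v - center) for v in values])


def robust_scale(values, with_centering=True, with_scaling=True):
    """Center and/or scale one feature with median and MAD (stand-in estimator)."""
    location = median(values)
    scale = mad(values, location)
    shift = location if with_centering else 0.0
    divisor = scale if with_scaling else 1.0
    return [(v - shift) / divisor for v in values]


# One gross outlier (55.0) barely affects the median and the MAD,
# so it remains clearly visible after scaling.
feature = [9.8, 10.1, 10.0, 9.9, 10.2, 55.0]
scaled = robust_scale(feature)
print(scaled)
```

Because the location and scale are estimated robustly, the regular points end up close to zero while the outlier receives a very large scaled value instead of inflating the scale.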
Data Transforming
- class robpy.preprocessing.transfo.RobustPowerTransformer(method: Literal['boxcox', 'yeojohnson', 'auto'] = 'yeojohnson', standardize: bool = True, lambda_range: tuple[float, float] = (-4.0, 6.0), quantile: float = 0.99, nsteps: int = 2)
Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator
Apply a robust power transformation using reweighted maximum likelihood to transform the features closer to normality. Uses the Box-Cox or the Yeo-Johnson transformation.
- Parameters:
method (Literal str, optional) – Method used for the power transformation. Can be “boxcox” for Box-Cox, “yeojohnson” for Yeo-Johnson, or “auto” to pick whichever of the two gives the lowest objective function value. Box-Cox can only be used for strictly positive features. Defaults to “yeojohnson”.
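standardize (boolean, optional) – Whether to standardize the features before and after the power transformation. Defaults to True.
lambda_range (tuple[float, float], optional) – Range of values considered for the transformation parameter lambda. Defaults to (-4.0, 6.0).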
quantile (float, optional) – Quantile used to calculate the weights. Defaults to 0.99.
nsteps (int, optional) – Number of steps used in the reweighted maximum likelihood. Defaults to 2.
References
Raymaekers, J., & Rousseeuw, P. J. (2024). Transforming variables to central normality. Machine Learning, 113(8), 4953-4975.
- fit(x: ndarray)
Estimates the transformation parameter lambda for the chosen method.
- Parameters:
x (np.ndarray) – The data.
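For reference, the two transformation families that lambda parameterizes can be written out for a single value. The sketch below uses the standard textbook definitions of Box-Cox and Yeo-Johnson for a fixed lambda. It does not reproduce the reweighted maximum likelihood estimation of lambda, and the function names are hypothetical, not part of robpy.

```python
import math


def box_cox(x, lam):
    """Box-Cox transformation of a strictly positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x**lam - 1.0) / lam


def yeo_johnson(x, lam):
    """Yeo-Johnson transformation, defined for any real x."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)
        return ((1.0 + x) ** lam - 1.0) / lam
    # Negative branch uses the exponent 2 - lambda.
    if lam == 2:
        return -math.log1p(-x)
    return -((1.0 - x) ** (2.0 - lam) - 1.0) / (2.0 - lam)


# With lambda = 1, Yeo-Johnson leaves the data unchanged and
# Box-Cox only shifts it by one.
for value in (-3.0, 0.0, 3.0):
    print(value, yeo_johnson(value, 1.0))
print(box_cox(3.0, 1.0))
```

The restriction of Box-Cox to strictly positive features follows directly from the logarithm and the fractional powers in its definition, which is why Yeo-Johnson is the safer choice for data containing zeros or negative values.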
Utils
- robpy.preprocessing.utils.wrapping_transformation(X: np.ndarray, b: float = 1.5, c: float = 4.0, q1: float = 1.540793, q2: float = 0.8622731, rescale: bool = True, location_estimator: Callable[[np.ndarray, int], np.ndarray] = np.median, scale_estimator: Callable[[np.ndarray, int], np.ndarray] = median_abs_deviation) -> np.ndarray
Implementation of the wrapping transformation using the following function:
\[\begin{split}\Psi_{b, c}(z) = \begin{cases} z & \text{if } \ 0 \leq |z| < b, \\ q_1 \tanh\left(q_2 (c - |z|)\right) \mathrm{sign}(z) & \text{if } \ b \leq |z| \leq c, \\ 0 & \text{if } \ c < |z|. \end{cases}\end{split}\]
- Parameters:
X (np.ndarray) – Data to be transformed, must have shape (N, D).
b (float, optional) – Lower cutoff. Defaults to 1.5.
c (float, optional) – Upper cutoff. Defaults to 4.0.
q1 (float, optional) – Transformation parameter (see formula). Defaults to 1.540793.
q2 (float, optional) – Transformation parameter (see formula). Defaults to 0.8622731.
rescale (bool, optional) – Whether to rescale the wrapped data such that the robust location and scale of the transformed data are the same as the original data. Defaults to True.
location_estimator (Callable[[np.ndarray, int], np.ndarray], optional) – Function to estimate the location of the data. It should accept an array-like input as its first argument and a keyword argument axis. Defaults to np.median.
scale_estimator (Callable[[np.ndarray, int], np.ndarray], optional) – Function to estimate the scale of the data. It should accept an array-like input as its first argument and a keyword argument axis. Defaults to median_abs_deviation.
- Returns:
The transformed data.
- Return type:
np.ndarray
References
Raymaekers, J., & Rousseeuw, P. J. (2021). Fast robust correlation for high-dimensional data. Technometrics, 63(2), 184-198.
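The piecewise function above can be checked with a small scalar sketch. It evaluates the formula for one already standardized value z using the documented default constants. The robust standardization of X and the optional rescaling step are not shown, and the function name psi is chosen here for illustration only.

```python
import math


def psi(z, b=1.5, c=4.0, q1=1.540793, q2=0.8622731):
    """Evaluate the wrapping function for a single standardized value z."""
    a = abs(z)
    if a < b:
        # Central part: values are left unchanged.
        return z
    if a <= c:
        # Intermediate part: smoothly pulled back towards zero.
        return q1 * math.tanh(q2 * (c - a)) * math.copysign(1.0, z)
    # Far outliers are mapped to zero.
    return 0.0


for z in (0.7, 1.5, 2.5, -2.5, 4.0, 10.0):
    print(z, psi(z))
```

With the default q1 and q2 the two pieces meet at |z| = b, so the function is continuous there, and it decreases to zero at |z| = c. Values in the central region pass through unchanged, which is why the transformation has little effect on clean data.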