robpy.preprocessing

Data Cleaning

class robpy.preprocessing.data_cleaner.DataCleaner(max_missing_frac_cols: float = 0.5, max_missing_frac_rows: float = 0.5, min_unique_values: int = 3, min_abs_scale: float = 1e-12, clean_na_first: str = 'automatic', min_n_rows: int = 3)[source]

Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator

Cleans a dataset before an analysis.

Typically used before DDC, cellMCD, transfo…

Based on the R function checkDataSet in the package cellWise: [https://rdrr.io/cran/cellWise/man/checkDataSet.html]

Parameters:
  • max_missing_frac_cols (float, optional) – Keep only the columns that have a proportion of missing values lower than this threshold. Defaults to 0.5.

  • max_missing_frac_rows (float, optional) – Keep only the rows that have a proportion of missing values lower than this threshold. Defaults to 0.5.

  • min_unique_values (int, optional) – Any column with min_unique_values or fewer unique values will be classified as discrete and excluded from the cleaned dataset. Defaults to 3.

  • min_abs_scale (float, optional) – Only columns whose scale is larger than min_abs_scale will be considered (scale is measured by the median absolute deviation, MAD). Defaults to 1e-12.

  • clean_na_first (str, optional) – One of “automatic”, “columns”, “rows”. Decides whether columns or rows are checked for NAs first. If “automatic”, columns are checked first if p >= 5n, else rows are checked first. Defaults to “automatic”.

  • min_n_rows (int, optional) – Minimum number of rows (observations) the input data should have. Defaults to 3.
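The thresholds above can be illustrated with a small plain-Python sketch. It is not robpy's implementation: the helper names (`na_check_order`, `columns_below_missing_threshold`, `is_discrete`), the dict-of-lists table layout, and the use of `None` for a missing value are assumptions made for illustration. The strict `<` comparison is inferred from the wording "lower than this threshold", and p and n are assumed to be the number of columns and rows.

```python
def na_check_order(n_rows, n_cols, clean_na_first="automatic"):
    """Return "columns" or "rows": which axis is checked for NAs first."""
    if clean_na_first not in ("automatic", "columns", "rows"):
        raise ValueError(f"unknown clean_na_first: {clean_na_first!r}")
    if clean_na_first != "automatic":
        return clean_na_first
    # Assumption: p is the number of columns and n the number of rows.
    return "columns" if n_cols >= 5 * n_rows else "rows"


def columns_below_missing_threshold(table, max_missing_frac_cols=0.5):
    """Names of columns whose fraction of missing values is below the threshold."""
    kept = []
    for name, values in table.items():
        missing_frac = sum(v is None for v in values) / len(values)
        if missing_frac < max_missing_frac_cols:
            kept.append(name)
    return kept


def is_discrete(values, min_unique_values=3):
    """True if the column has min_unique_values or fewer distinct values.

    Assumption: missing values are not counted as a distinct value.
    """
    distinct = {v for v in values if v is not None}
    return len(distinct) <= min_unique_values


# Hypothetical toy table: column name -> list of values.
table = {
    "height": [1.70, 1.82, None, 1.65, 1.91],
    "mostly_missing": [None, None, None, 3.0, None],
    "flag": [0, 1, 0, 1, 1],
}
kept = columns_below_missing_threshold(table)
continuous = [name for name in kept if not is_discrete(table[name])]
```

In this toy table, "mostly_missing" fails the missing-value threshold and "flag" is treated as discrete, so only "height" remains.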

property dropped_columns: dict[str, list]

Return the names of the columns that were dropped during the cleaning process.

Returns:

Mapping from reason for dropping to list of column names.

Return type:

dict[str, list]

Raises:

NotFittedError – if the dropped column attributes weren’t set yet.

property dropped_rows: dict[str, list]

Return the indices of the rows that were dropped during the cleaning process.

Returns:

Mapping from reason for dropping to list of row indices.

Return type:

dict[str, list]

Raises:

NotFittedError – if the dropped row attributes weren’t set yet.

fit(X: DataFrame)[source]
Parameters:

X (pd.DataFrame) – The input dataset.

transform(X: DataFrame)[source]
Parameters:

X (pd.DataFrame) – The input dataset.

Data Scaling

class robpy.preprocessing.scaling.RobustScaler(scale_estimator: RobustScale = <robpy.univariate.mcd.UnivariateMCD object>, with_centering: bool = True, with_scaling: bool = True)[source]

Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator

Scales features using a univariate robust scale estimator (RobustScale).

Parameters:
  • scale_estimator (RobustScale, optional) – Robust scale estimator to scale the data with. Defaults to UnivariateMCD().

  • with_centering (boolean, optional) – Whether to center the data. Defaults to True.

  • with_scaling (boolean, optional) – Whether to scale the data. Defaults to True.

fit(X: ndarray | DataFrame, ignore_nan: bool = False)[source]
inverse_transform(X: ndarray | DataFrame)[source]
transform(X: ndarray | DataFrame)[source]
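The interplay of with_centering, with_scaling and the fit / transform / inverse_transform cycle can be sketched as follows. This is a minimal illustration, not robpy's code: it swaps the default UnivariateMCD estimator for the median and the MAD (multiplied by 1.4826, the usual consistency factor at the normal distribution), works on a single column, and the function names are hypothetical.

```python
from statistics import median


def fit_median_mad(values):
    """Estimate a robust location (median) and scale (normalized MAD)."""
    loc = median(values)
    mad = median([abs(v - loc) for v in values])
    return loc, 1.4826 * mad


def robust_transform(values, loc, scale, with_centering=True, with_scaling=True):
    """Center first, then scale, each step being optional."""
    out = list(values)
    if with_centering:
        out = [v - loc for v in out]
    if with_scaling:
        out = [v / scale for v in out]
    return out


def robust_inverse_transform(values, loc, scale, with_centering=True, with_scaling=True):
    """Undo the steps of robust_transform in reverse order."""
    out = list(values)
    if with_scaling:
        out = [v * scale for v in out]
    if with_centering:
        out = [v + loc for v in out]
    return out


# The outlier 100.0 barely influences the median and the MAD.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
loc, scale = fit_median_mad(data)
scaled = robust_transform(data, loc, scale)
restored = robust_inverse_transform(scaled, loc, scale)
```

Because the location and scale come from robust estimators, the outlier keeps a large standardized value instead of inflating the scale and masking itself.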

Data Transforming

class robpy.preprocessing.transfo.RobustPowerTransformer(method: Literal['boxcox', 'yeojohnson', 'auto'] = 'yeojohnson', standardize: bool = True, lambda_range: tuple[float, float] = (-4.0, 6.0), quantile: float = 0.99, nsteps: int = 2)[source]

Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator

Apply a robust power transformation using reweighted maximum likelihood to transform the features closer to normality. Uses the Box-Cox or the Yeo-Johnson transformation.

Parameters:
  • method (Literal str, optional) – Method used for the power transformation. Can be “boxcox” for Box-Cox, “yeojohnson” for Yeo-Johnson, or “auto” for the solution with the lowest objective function. Box-Cox can only be used for strictly positive features. Defaults to “yeojohnson”.

  • standardize (boolean, optional) – Whether to standardize the features before and after the power transformation. Defaults to True.

  • quantile (float, optional) – Quantile used to calculate the weights. Defaults to 0.99.

  • nsteps (int, optional) – Number of steps used in the reweighted maximum likelihood. Defaults to 2.
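For orientation, the two transformations named in method can be written out for a single value. The sketch below uses the standard textbook definitions of Box-Cox and Yeo-Johnson; the function names are illustrative, and the robust (reweighted maximum likelihood) estimation of lambda and the optional standardization are not shown.

```python
import math


def box_cox(x, lam):
    """Box-Cox transformation of a strictly positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def yeo_johnson(x, lam):
    """Yeo-Johnson transformation, defined for any real x."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)  # log(x + 1)
        return ((x + 1.0) ** lam - 1.0) / lam
    if lam == 2:
        return -math.log1p(-x)  # -log(1 - x)
    return -((1.0 - x) ** (2.0 - lam) - 1.0) / (2.0 - lam)
```

With lam = 1, Yeo-Johnson leaves the value unchanged and Box-Cox only shifts it by one, which is why lambda values near 1 indicate that little transformation is needed.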

References

  • Raymaekers, J., & Rousseeuw, P. J. (2024). Transforming variables to central normality. Machine Learning, 113(8), 4953-4975.

fit(x: ndarray)[source]

Calculates lambda, the transformation parameter, for the chosen method.

Parameters:

x (np.ndarray) – The data.

inverse_transform(x: ndarray) → ndarray[source]

Transforms the data back using inverse Yeo-Johnson/Box-Cox. The previously fitted lambda estimate and the corresponding method are used.

Parameters:

x (np.ndarray) – The data.
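As an illustration of what "transforming back" means, the inverses of the standard Box-Cox and Yeo-Johnson transformations can be written for a single value as below. This is a sketch with hypothetical function names; it ignores the standardization step, which would also have to be undone when standardize is True.

```python
import math


def inverse_box_cox(y, lam):
    """Invert y = (x**lam - 1) / lam, or y = log(x) when lam == 0."""
    if lam == 0:
        return math.exp(y)
    # Requires lam * y + 1 > 0, which holds for any y produced by Box-Cox.
    return (lam * y + 1.0) ** (1.0 / lam)


def inverse_yeo_johnson(y, lam):
    """Invert the Yeo-Johnson transformation; y has the same sign as x."""
    if y >= 0:
        if lam == 0:
            return math.expm1(y)  # exp(y) - 1
        return (lam * y + 1.0) ** (1.0 / lam) - 1.0
    if lam == 2:
        return -math.expm1(-y)  # 1 - exp(-y)
    return 1.0 - (1.0 - (2.0 - lam) * y) ** (1.0 / (2.0 - lam))
```

For example, the Yeo-Johnson transform of -1 with lambda = 0 is -1.5, and `inverse_yeo_johnson(-1.5, 0.0)` maps it back to -1.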

transform(x: ndarray) → ndarray[source]

Transforms the data using the calculated lambda estimate and the corresponding method.

Parameters:

x (np.ndarray) – The data.

Utils

robpy.preprocessing.utils.wrapping_transformation(X: ndarray, b: float = 1.5, c: float = 4.0, q1: float = 1.540793, q2: float = 0.8622731, rescale: bool = True, location_estimator: Callable[[ndarray, int], ndarray] = <function median>, scale_estimator: Callable[[ndarray, int], ndarray] = <function median_abs_deviation>) → ndarray[source]

Implementation of the wrapping transformation using the following function:

\[\begin{split}\Psi_{b, c}(z) = \begin{cases} z & \text{if } \ 0 \leq |z| < b, \\ q_1 \tanh\left(q_2 (c - |z|)\right) \mathrm{sign}(z) & \text{if } \ b \leq |z| \leq c, \\ 0 & \text{if } \ c < |z|. \end{cases}\end{split}\]
Parameters:
  • X (np.ndarray) – Data to be transformed, must have shape (N, D).

  • b (float, optional) – Lower cutoff. Defaults to 1.5.

  • c (float, optional) – Upper cutoff. Defaults to 4.0.

  • q1 (float, optional) – Transformation parameter (see formula). Defaults to 1.540793.

  • q2 (float, optional) – Transformation parameter (see formula). Defaults to 0.8622731.

  • rescale (bool, optional) – Whether to rescale the wrapped data such that the robust location and scale of the transformed data are the same as the original data. Defaults to True.

  • location_estimator (Callable[[np.ndarray, int], np.ndarray], optional) – Function to estimate the location of the data. It should accept an array-like input as its first argument and a keyword argument axis. Defaults to np.median.

  • scale_estimator (Callable[[np.ndarray, int], np.ndarray], optional) – Function to estimate the scale of the data. It should accept an array-like input as its first argument and a keyword argument axis. Defaults to median_abs_deviation.

Returns:

The transformed data.

Return type:

np.ndarray
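The piecewise function above can be sketched for a single column in plain Python. This is an illustration under stated assumptions, not robpy's implementation: the names `psi` and `wrap_column` are hypothetical, the column is standardized with the median and the unnormalized MAD, and rescale is read as mapping the wrapped values back with the same location and scale.

```python
import math
from statistics import median


def psi(z, b=1.5, c=4.0, q1=1.540793, q2=0.8622731):
    """Wrapping function: identity in the center, smooth redescent to 0."""
    a = abs(z)
    if a < b:
        return z
    if a <= c:
        return q1 * math.tanh(q2 * (c - a)) * math.copysign(1.0, z)
    return 0.0


def wrap_column(values, rescale=True):
    """Standardize robustly, apply psi, optionally map back to the data scale."""
    loc = median(values)
    scale = median([abs(v - loc) for v in values])  # unnormalized MAD
    if scale == 0:
        raise ValueError("scale estimate is zero; cannot standardize")
    wrapped = [psi((v - loc) / scale) for v in values]
    if rescale:
        # Assumption: rescaling reuses the location and scale estimated above.
        wrapped = [w * scale + loc for w in wrapped]
    return wrapped


# The far outlier 100.0 is pulled to the robust center of the column.
column = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
wrapped = wrap_column(column)
```

With the default q1 and q2, the middle branch meets the identity at |z| = b and reaches 0 at |z| = c, so the function is continuous; central values are left unchanged while far outliers are mapped to the center.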

References

  • Raymaekers, J., & Rousseeuw, P. J. (2021). Fast robust correlation for high-dimensional data. Technometrics, 63(2), 184-198.