robpy.preprocessing

Data Cleaning

class robpy.preprocessing.data_cleaner.DataCleaner(max_missing_frac_cols: float = 0.5, max_missing_frac_rows: float = 0.5, min_unique_values: int = 3, min_abs_scale: float = 1e-12, clean_na_first: str = 'automatic', min_n_rows: int = 3)[source]

Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator

Cleans a dataset before an analysis.

Typically used before DDC, cellMCD, transfo…

Based on checkDataSet from the R package cellWise: https://rdrr.io/cran/cellWise/man/checkDataSet.html

Initialize DataCleaner

Parameters:

max_missing_frac_cols (float, optional) – Keep only the columns that have a proportion of missing values lower than this threshold. Defaults to 0.5.
max_missing_frac_rows (float, optional) – Keep only the rows that have a proportion of missing values lower than this threshold. Defaults to 0.5.
min_unique_values (int, optional) – Any column with min_unique_values or fewer unique values will be classified as discrete and excluded from the cleaned dataset. Defaults to 3.
min_abs_scale (float, optional) – Only columns whose scale is larger than min_abs_scale are kept (scale is measured by the MAD, the median absolute deviation). Defaults to 1e-12.
clean_na_first (str, optional) – One of “automatic”, “columns”, “rows”. Decides whether columns or rows are checked for NAs first. If “automatic”, columns are checked first if p >= 5n (p columns, n rows), else rows are checked first. Defaults to “automatic”.
min_n_rows (int, optional) – Integer specifying the minimum number of rows/observations wanted for the input data. Defaults to 3.
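
The column rules described by these parameters can be sketched in plain Python. This is an illustrative reading of the parameter descriptions above, not robpy's implementation: the helper names (`is_missing`, `mad`, `column_drop_reason`, `na_check_order`) and the reason keys are hypothetical, and the MAD is left unnormalized.

```python
import math
from statistics import median


def is_missing(value):
    """Treat float NaN as a missing value (assumption for this sketch)."""
    return isinstance(value, float) and math.isnan(value)


def mad(values):
    """Unnormalized median absolute deviation of the observed values."""
    center = median(values)
    return median([abs(v - center) for v in values])


def column_drop_reason(values, max_missing_frac=0.5, min_unique_values=3,
                       min_abs_scale=1e-12):
    """Return a (hypothetical) reason for dropping a column, or None to keep it."""
    observed = [v for v in values if not is_missing(v)]
    missing_frac = 1 - len(observed) / len(values)
    if missing_frac >= max_missing_frac:
        return "too_many_missing"   # proportion of NAs is not below the threshold
    if len(set(observed)) <= min_unique_values:
        return "discrete"           # min_unique_values or fewer unique values
    if mad(observed) <= min_abs_scale:
        return "zero_scale"         # scale is not larger than min_abs_scale
    return None


def na_check_order(n_rows, n_cols):
    """The "automatic" rule: columns first if p >= 5n, else rows first."""
    return "columns" if n_cols >= 5 * n_rows else "rows"


nan = math.nan
columns = {
    "height": [1.70, 1.82, 1.65, 1.90, 1.75, 1.68, 1.77, 1.60],
    "mostly_missing": [nan, nan, nan, nan, nan, 2.0, 1.0, 3.0],
    "binary": [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0],
    "near_constant": [5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 3.0, 9.0],
}

kept, dropped = [], {}
for name, values in columns.items():
    reason = column_drop_reason(values)
    if reason is None:
        kept.append(name)
    else:
        dropped.setdefault(reason, []).append(name)
```

The `dropped` dictionary here mimics the shape documented for `dropped_columns` (reason mapped to a list of column names); the actual keys used by DataCleaner may differ.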

property dropped_columns: dict[str, list]

Return the column names that were dropped during the cleaning process.

Returns: Mapping from reason for dropping to list of column names.
Return type: dict[str, list]
Raises: NotFittedError – if the dropped column attributes weren’t set yet.

property dropped_rows: dict[str, list]

Return the row indices that were dropped during the cleaning process.

Returns: Mapping from reason for dropping to list of row indices.
Return type: dict[str, list]
Raises: NotFittedError – if the dropped row attributes weren’t set yet.

fit(X: DataFrame)[source]

Parameters: X (pd.DataFrame) – input dataset.

transform(X: DataFrame)[source]

Parameters: X (pd.DataFrame) – input dataset.

Data Scaling

class robpy.preprocessing.scaling.RobustScaler(scale_estimator: ~robpy.univariate.base.RobustScaleEstimator = <robpy.univariate.mcd.UnivariateMCDEstimator object>, with_centering: bool = True, with_scaling: bool = True)[source]

Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator

Scales features using a RobustScaleEstimator (by default a UnivariateMCDEstimator).

fit(X: ndarray | DataFrame, ignore_nan: bool = False)[source]

inverse_transform(X: ndarray | DataFrame)[source]

transform(X: ndarray | DataFrame)[source]
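
To illustrate the fit / transform / inverse_transform cycle, here is a minimal stand-in that centers by the median and scales by the normalized MAD. It is a sketch under stated assumptions: `MedianMadScaler` is a hypothetical class, it works on a single column given as a list, and it swaps in median/MAD for the UnivariateMCDEstimator that RobustScaler uses by default.

```python
from statistics import median


class MedianMadScaler:
    """Hypothetical stand-in for a robust scaler; not robpy's RobustScaler."""

    def __init__(self, with_centering=True, with_scaling=True):
        self.with_centering = with_centering
        self.with_scaling = with_scaling

    def fit(self, column):
        # Robust location: the median. Robust scale: MAD times 1.4826,
        # the usual consistency factor for normally distributed data.
        self.location_ = median(column)
        self.scale_ = 1.4826 * median([abs(v - self.location_) for v in column])
        return self

    def _params(self):
        loc = self.location_ if self.with_centering else 0.0
        scale = self.scale_ if self.with_scaling else 1.0
        return loc, scale

    def transform(self, column):
        loc, scale = self._params()
        return [(v - loc) / scale for v in column]

    def inverse_transform(self, column):
        loc, scale = self._params()
        return [v * scale + loc for v in column]


data = [10.0, 11.0, 12.0, 13.0, 1000.0]   # one gross outlier
scaler = MedianMadScaler().fit(data)
scaled = scaler.transform(data)
restored = scaler.inverse_transform(scaled)
```

Because the median and MAD are barely affected by the value 1000.0, the regular points keep small scaled values while the outlier stands out clearly.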

Data Transforming

class robpy.preprocessing.transfo.RobustPowerTransformer(method: Literal['boxcox', 'yeojohnson', 'auto'] = 'yeojohnson', standardize: bool = True, lambda_range: tuple[float, float] = (-4.0, 6.0), quantile: float = 0.99, nsteps: int = 2)[source]

Bases: OneToOneFeatureMixin, TransformerMixin, BaseEstimator

Apply a robust power transformation using reweighted maximum likelihood to transform the features closer to normality. Uses the Box-Cox or the Yeo-Johnson transformation.

Parameters:

method (Literal str, optional) – method used for the power transformation. Can be “boxcox” for Box-Cox, “yeojohnson” for Yeo-Johnson, or “auto” to pick whichever of the two gives the best objective value. Box-Cox can only be used for strictly positive features. Defaults to “yeojohnson” (per the class signature).
standardize (boolean, optional) – whether to standardize the features before and after the power transformation. Defaults to True.
quantile (float, optional) – quantile used to calculate the weights. Defaults to 0.99.
nsteps (int, optional) – number of steps used in the reweighted maximum likelihood. Defaults to 2.
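
For reference, the two transformation families named above can be written out for a single value. This sketch uses the standard textbook definitions of Box-Cox and Yeo-Johnson; the function names are hypothetical, and the robust (reweighted maximum likelihood) estimation of lambda that the class performs is not reproduced here.

```python
import math


def box_cox(x, lam):
    """Standard Box-Cox transform of a strictly positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


def yeo_johnson(x, lam):
    """Standard Yeo-Johnson transform; defined for any real x."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)
        return ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-x)
    return -((1 - x) ** (2 - lam) - 1) / (2 - lam)


# With lambda = 1 Yeo-Johnson leaves the data unchanged, and Box-Cox
# only shifts it by one; lambda = 0 gives a log-type transform.
examples = [yeo_johnson(v, 1.0) for v in (-3.0, 0.0, 3.0)]
```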

fit(x: ndarray)[source]

Calculates lambda, the transformation parameter, for the chosen method.

Parameters: x (np.array) – data.

inverse_transform(x: ndarray) → ndarray[source]

Transforms the data back using the inverse Yeo-Johnson or Box-Cox transformation, with the previously fitted lambda estimate and the corresponding method.

Parameters: x (np.array) – data.
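
The inverse maps follow from solving the standard Box-Cox and Yeo-Johnson definitions for the original value. The sketch below is a hedged, single-value illustration with hypothetical function names; it does not undo any standardization step and does not guard against inputs outside the image of the forward transform.

```python
import math


def inverse_box_cox(y, lam):
    """Invert the standard Box-Cox transform for a given lambda."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1) ** (1 / lam)


def inverse_yeo_johnson(y, lam):
    """Invert the standard Yeo-Johnson transform for a given lambda.

    The forward transform preserves sign, so the branch can be chosen
    from the sign of the transformed value y.
    """
    if y >= 0:
        if lam == 0:
            return math.expm1(y)
        return (lam * y + 1) ** (1 / lam) - 1
    if lam == 2:
        return 1 - math.exp(-y)
    return 1 - (1 - (2 - lam) * y) ** (1 / (2 - lam))


# lambda = 1 is the identity for Yeo-Johnson, so the inverse is too.
recovered = [inverse_yeo_johnson(v, 1.0) for v in (-3.0, 0.0, 3.0)]
```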

transform(x: ndarray) → ndarray[source]

Transforms the data using the calculated lambda estimate and the corresponding method.

Parameters: x (np.array) – data.

Utils

robpy.preprocessing.utils.wrapping_transformation(X: ~numpy.ndarray, b: float = 1.5, c: float = 4.0, q1: float = 1.540793, q2: float = 0.8622731, rescale: bool = False, location_estimator: ~typing.Callable[[~numpy.ndarray, int], ~numpy.ndarray] = <function median>, scale_estimator: ~typing.Callable[[~numpy.ndarray, int], ~numpy.ndarray] = <function median_abs_deviation>) → ndarray[source]

Implementation of wrapping using this transformation function:

\[\begin{split}\Psi_{b, c}(z) = \begin{cases} z & \text{if } 0 \leq |z| < b \\ q_1 \tanh\left(q_2 (c - |z|)\right) \mathrm{sign}(z) & \text{if } b \leq |z| \leq c \\ 0 & \text{if } c < |z| \end{cases}\end{split}\]

Parameters:

X – data to be transformed, must have shape (N, D)
b – lower cutoff
c – upper cutoff
q1 – transformation parameter (see formula)
q2 – transformation parameter (see formula)
rescale – whether to rescale the wrapped data so that the robust location and scale of the transformed data are the same as those of the original data
location_estimator – function used to compute the location estimates of the columns of X (optional, defaults to the median)
scale_estimator – function used to compute the scale estimates of the columns of X (optional, defaults to the median absolute deviation)

Returns:

transformed data
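
The piecewise function above can be evaluated directly for a single standardized value. This is a minimal sketch of the formula only, using the default b, c, q1 and q2 from the signature; the function name `wrap` is hypothetical, and the column-wise standardization and optional rescaling done by wrapping_transformation are not reproduced.

```python
import math


def wrap(z, b=1.5, c=4.0, q1=1.540793, q2=0.8622731):
    """Evaluate Psi_{b,c}(z) for one standardized value z."""
    abs_z = abs(z)
    if abs_z < b:
        # Central region: values are left unchanged.
        return z
    if abs_z <= c:
        # Intermediate region: smoothly redescend towards zero.
        return q1 * math.tanh(q2 * (c - abs_z)) * math.copysign(1.0, z)
    # Far outliers are mapped to zero.
    return 0.0


standardized = [-6.0, -2.0, -1.0, 0.0, 1.0, 1.5, 3.0, 4.0, 10.0]
wrapped = [wrap(z) for z in standardized]
```

With the default constants the two branches meet at |z| = b (the second branch evaluates to roughly 1.5 there), and the function reaches zero at |z| = c, so the transformation is continuous and bounded.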