robpy.preprocessing
Data Cleaning
- class robpy.preprocessing.data_cleaner.DataCleaner(max_missing_frac_cols: float = 0.5, max_missing_frac_rows: float = 0.5, min_unique_values: int = 3, min_abs_scale: float = 1e-12, clean_na_first: str = 'automatic', min_n_rows: int = 3)[source]
Bases:
OneToOneFeatureMixin, TransformerMixin, BaseEstimator
Cleans a dataset before an analysis.
Typically used before DDC, cellMCD, transfo…
Based on: [https://rdrr.io/cran/cellWise/man/checkDataSet.html]
Initialize DataCleaner
- Parameters:
max_missing_frac_cols (float, optional) – Keep only the columns that have a proportion of missing values lower than this threshold. Defaults to 0.5.
max_missing_frac_rows (float, optional) – Keep only the rows that have a proportion of missing values lower than this threshold. Defaults to 0.5.
min_unique_values (int, optional) – Any column with min_unique_values or fewer unique values will be classified as discrete and excluded from the cleaned dataset. Defaults to 3.
min_abs_scale (float, optional) – Only columns whose scale is larger than min_abs_scale will be considered (scale is measured by the MAD). Defaults to 1e-12.
clean_na_first (str, optional) – One of “automatic”, “columns”, “rows”. Decides whether columns or rows are checked for NAs first. If “automatic”, columns are checked first if p >= 5n (p columns, n rows), else rows are checked first. Defaults to “automatic”.
min_n_rows (int, optional) – Integer specifying the minimum number of rows/observations wanted for the input data. Defaults to 3.
- property dropped_columns: dict[str, list]
Return the column names that were dropped during the cleaning process.
- Returns:
Mapping from reason for dropping to list of column names.
- Return type:
dict[str, list]
- Raises:
NotFittedError – if the dropped column attributes weren’t set yet.
- property dropped_rows: dict[str, list]
Return the row indices that were dropped during the cleaning process.
- Returns:
Mapping from reason for dropping to list of row indices.
- Return type:
dict[str, list]
- Raises:
NotFittedError – if the dropped row attributes weren’t set yet.
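The column rules described by the parameters above can be illustrated with a small plain-Python sketch. This is not robpy's implementation: the helper names (`column_drop_reason`, `na_check_order`) and the reason labels are hypothetical, the MAD is left unscaled as an assumption, and only the thresholds and comparisons follow the parameter descriptions.

```python
import math
from statistics import median


def missing_fraction(values):
    """Fraction of entries that are NaN."""
    return sum(math.isnan(v) for v in values) / len(values)


def mad(values):
    """Unscaled median absolute deviation of the non-missing entries."""
    observed = [v for v in values if not math.isnan(v)]
    center = median(observed)
    return median([abs(v - center) for v in observed])


def column_drop_reason(values, max_missing_frac=0.5,
                       min_unique_values=3, min_abs_scale=1e-12):
    """Return a (hypothetical) reason label, or None if the column is kept."""
    # Keep only columns whose missing fraction is lower than the threshold.
    if missing_fraction(values) >= max_missing_frac:
        return "too_many_missing"
    # Columns with min_unique_values or fewer distinct values are discrete.
    observed = [v for v in values if not math.isnan(v)]
    if len(set(observed)) <= min_unique_values:
        return "discrete"
    # Keep only columns whose scale (MAD) is larger than min_abs_scale.
    if mad(values) <= min_abs_scale:
        return "near_zero_scale"
    return None


def na_check_order(n_rows, n_cols, clean_na_first="automatic"):
    """Decide whether columns or rows are checked for NAs first."""
    if clean_na_first == "automatic":
        return "columns" if n_cols >= 5 * n_rows else "rows"
    return clean_na_first


nan = float("nan")
columns = {
    "height": [1.2, 3.4, 2.2, 5.1, 4.0, 2.9, 3.3, 4.6],
    "flag": [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    "sparse": [nan, nan, nan, nan, nan, 1.0, 2.0, 3.0],
    "almost_constant": [7.0, 7.0, 7.0, 7.0, 7.0, 6.9, 7.1, 7.2],
}
reasons = {name: column_drop_reason(vals) for name, vals in columns.items()}
for name, reason in reasons.items():
    print(name, "->", reason or "kept")
```

In this toy data only `height` survives: `flag` has two distinct values, `sparse` is more than half missing, and `almost_constant` has a MAD of zero even though it has four distinct values.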
Data Scaling
- class robpy.preprocessing.scaling.RobustScaler(scale_estimator: robpy.univariate.base.RobustScaleEstimator = <robpy.univariate.mcd.UnivariateMCDEstimator object>, with_centering: bool = True, with_scaling: bool = True)[source]
Bases:
OneToOneFeatureMixin, TransformerMixin, BaseEstimator
Scales features using a RobustScaleEstimator.
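As a rough illustration of what robust centering and scaling does to one feature, the sketch below uses the median and the normalized MAD. This is a swapped-in estimator chosen for brevity, not the default `UnivariateMCDEstimator`, and the function name `robust_scale` is hypothetical; only the `with_centering` and `with_scaling` flags mirror the signature above.

```python
from statistics import median


def robust_scale(values, with_centering=True, with_scaling=True):
    """Center by the median and scale by the normalized MAD (assumed estimators)."""
    center = median(values)
    # 1.4826 makes the MAD consistent with the standard deviation
    # for normally distributed data.
    scale = 1.4826 * median([abs(v - center) for v in values])
    result = []
    for v in values:
        if with_centering:
            v = v - center
        if with_scaling:
            v = v / scale
        result.append(v)
    return result


values = [1.0, 2.0, 3.0, 4.0, 100.0]  # one gross outlier
scaled = robust_scale(values)
print(scaled)
```

Because the median and MAD ignore the outlier at 100, the four regular points end up within about 1.35 units of zero while the outlier stays far away, which is the behaviour a non-robust mean/standard-deviation scaling would not give.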
Data Transforming
- class robpy.preprocessing.transfo.RobustPowerTransformer(method: Literal['boxcox', 'yeojohnson', 'auto'] = 'yeojohnson', standardize: bool = True, lambda_range: tuple[float, float] = (-4.0, 6.0), quantile: float = 0.99, nsteps: int = 2)[source]
Bases:
OneToOneFeatureMixin, TransformerMixin, BaseEstimator
Apply a robust power transformation using reweighted maximum likelihood to transform the features closer to normality. Uses the Box-Cox or the Yeo-Johnson transformation.
- Parameters:
method (Literal str, optional) – method used for the power transformation. Can be “boxcox” for Box-Cox, “yeojohnson” for Yeo-Johnson, or “auto” to use whichever of the two gives the better objective value. Box-Cox can only be used for strictly positive features. Defaults to “yeojohnson”.
standardize (boolean, optional) – whether to standardize the features before and after the power transformation. Defaults to True.
quantile (float, optional) – quantile used to calculate the weights. Defaults to 0.99.
nsteps (int, optional) – number of steps used in the reweighted maximum likelihood. Defaults to 2.
- fit(x: ndarray)[source]
Calculates lambda, the transformation parameter, for the chosen method.
- Parameters:
x (np.array) – data.
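For reference, the two transformation families named above have the following standard textbook definitions for a given lambda. The sketch only applies the transformations; it does not estimate lambda by reweighted maximum likelihood, and the function names are illustrative rather than part of robpy.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def yeo_johnson(x, lam):
    """Yeo-Johnson transform, defined for any real value."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)
        return ((x + 1.0) ** lam - 1.0) / lam
    # Negative branch uses the exponent 2 - lambda.
    if lam == 2:
        return -math.log1p(-x)
    return -(((1.0 - x) ** (2.0 - lam) - 1.0) / (2.0 - lam))


for value in (-3.0, 0.0, 3.0):
    print(value, yeo_johnson(value, 0.5))
```

With lambda = 1, Box-Cox shifts the data by one and Yeo-Johnson leaves it unchanged, which is why lambda = 1 corresponds to "no transformation needed".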
Utils
- robpy.preprocessing.utils.wrapping_transformation(X: numpy.ndarray, b: float = 1.5, c: float = 4.0, q1: float = 1.540793, q2: float = 0.8622731, rescale: bool = False, location_estimator: typing.Callable[[numpy.ndarray, int], numpy.ndarray] = <function median>, scale_estimator: typing.Callable[[numpy.ndarray, int], numpy.ndarray] = <function median_abs_deviation>) → ndarray[source]
Implementation of wrapping using this transformation function:
\[\Psi_{b, c}(z) = \begin{cases} z & \text{if } 0 \leq |z| < b \\ q_1 \tanh\left(q_2 (c - |z|)\right) \mathrm{sign}(z) & \text{if } b \leq |z| \leq c \\ 0 & \text{if } c < |z| \end{cases}\]
- Parameters:
X – data to be transformed, must have shape (N, D)
b – lower cutoff
c – upper cutoff
q1 – transformation parameter (see formula)
q2 – transformation parameter (see formula)
rescale – whether to rescale the wrapped data so the robust location and scale of the transformed data are the same as the original data
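location_estimator – function used to compute the location estimates of the columns of X (optional, defaults to the median)
scale_estimator – function used to compute the scale estimates of the columns of X (optional, defaults to the median absolute deviation)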
- Returns:
transformed data
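The piecewise formula above can be evaluated directly. The following is a minimal scalar sketch of the function Ψ with the default constants from the signature; it assumes its input is already robustly standardized and does not reproduce the column-wise standardization or the `rescale` option. The name `wrap` is illustrative.

```python
import math


def wrap(z, b=1.5, c=4.0, q1=1.540793, q2=0.8622731):
    """Evaluate the wrapping function at a standardized value z."""
    a = abs(z)
    if a < b:
        # Central region: values are left untouched.
        return z
    if a <= c:
        # Transition region: smoothly pulled back toward zero.
        return math.copysign(q1 * math.tanh(q2 * (c - a)), z)
    # Far outliers are mapped to zero.
    return 0.0


for z in (0.5, 1.5, 2.0, 3.0, 4.0, 10.0, -2.0):
    print(z, round(wrap(z), 4))
```

With the default q1 and q2 the two branches meet at |z| = b (the tanh branch evaluates to approximately 1.5 there), so the transformation is continuous, and the output decreases smoothly to zero as |z| approaches c.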