TopGear Data

This notebook demonstrates most of the functionality offered by the RobPy package through an application to the TopGear dataset. Utility functions and univariate estimators are not demonstrated separately, as they are mostly implemented as helpers to the multivariate estimators.

Imports

[1]:

import json import matplotlib.pyplot as plt import numpy as np from robpy.datasets import load_topgear from robpy.preprocessing import DataCleaner, RobustPowerTransformer, RobustScaler from robpy.covariance import FastMCD, OGK from robpy.pca import ROBPCA from robpy.outliers import DDC from robpy.regression import MMRegression from robpy.univariate import adjusted_boxplot %load_ext autoreload %autoreload 2

Load data

RobPy has a dataset module which allows the user quick access to a few common datasets used in robust statistics literature. In this demo, we will work with the TopGear dataset.

[2]:
data = load_topgear(as_frame=True)

print(data.DESCR)
.. _topgear_dataset:

TopGear dataset
--------------------

**Data Set Characteristics:**

- Number of Instances: 297
- Number of Attributes: 32 (13 numeric, 19 categorical)
- Attribute Information:
    * Make (str): the car brand.
    * Model (str): the car model.
    * Type (str): the exact model type.
    * Fuel (str): the type of fuel ("Diesel" or "Petrol").
    * Price (float): the list price (in UK pounds)
    * Cylinders (float): the number of cylinders in the engine.
    * Displacement (float): the displacement of the engine (in cc).
    * DriveWheel (str): the type of drive wheel ("4WD", "Front" or "Rear").
    * BHP (float): the power of the engine (in bhp).
    * Torque (float): the torque of the engine (in lb/ft).
    * Acceleration (float): the time it takes the car to get from 0 to 62 mph (in seconds).
    * TopSpeed (float): the car's top speed (in mph).
    * MPG (float): the combined fuel consuption (urban + extra urban; in miles per gallon).
    * Weight (float): the car's curb weight (in kg).
    * Length (float): the car's length (in mm).
    * Width (float): the car's width (in mm).
    * Height (float): the car's height (in mm).
    * AdaptiveHeadlights (str): whether the car has adaptive headlights ("no", "optional" or "standard").
    * AdjustableSteering (str): whether the car has adjustable steering ("no" or "standard").
    * AlarmSystem (str) whether the car has an alarm system ("no/optional" or "standard").
    * Automatic (str) whether the car has an automatic transmission ("no", "optional" or "standard").
    * Bluetooth (str) whether the car has bluetooth ("no", "optional" or "standard").
    * ClimateControl (str) whether the car has climate control ("no", "optional" or "standard").
    * CruiseControl (str) whether the car has cruise control ("no", "optional" or "standard").
    * ElectricSeats (str) whether the car has electric seats ("no", "optional" or "standard").
    * Leather (str) whether the car has a leather interior ("no", "optional" or "standard").
    * ParkingSensors (str) whether the car has parking sensors ("no", "optional" or "standard").
    * PowerSteering (str) whether the car has power steering ("no" or "standard").
    * SatNav (str) whether the car has a satellite navigation system ("no", "optional" or "standard").
    * ESP (str) whether the car has ESP ("no", "optional" or "standard").
    * Verdict (float) review score between 1 (lowest) and 10 (highest).
    * Origin (str) the origin of the car maker ("Asia", "Europe" or "USA").

- Creator: BBC TopGear
- Source: https://rdrr.io/cran/robustHD/man/TopGear.html

**Data Set Description:**
The data set contains information on cars featured on the website of the popular BBC television show Top Gear.
The data were scraped from http://www.topgear.com/uk/ on 2014-02-24. Variable Origin was added based on the car maker information.



The DESCR attribute of the data object contains all metadata and a description of the dataset.

[3]:
data.data.head()
[3]:
Make Model Type Fuel Price Cylinders Displacement DriveWheel BHP Torque ... ClimateControl CruiseControl ElectricSeats Leather ParkingSensors PowerSteering SatNav ESP Verdict Origin
0 Alfa Romeo Giulietta Giulietta 1.6 JTDM-2 105 Veloce 5d Diesel 21250.0 4.0 1598.0 Front 105.0 236.0 ... standard standard optional optional optional standard optional standard 6.0 Europe
1 Alfa Romeo MiTo MiTo 1.4 TB MultiAir 105 Distinctive 3d Petrol 15155.0 4.0 1368.0 Front 105.0 95.0 ... optional standard no optional standard standard optional standard 5.0 Europe
2 Aston Martin Cygnet Cygnet 1.33 Standard 3d Petrol 30995.0 4.0 1329.0 Front 98.0 92.0 ... standard standard no no no standard standard standard 7.0 Europe
3 Aston Martin DB9 DB9 6.0 517 Standard 2d 13MY Petrol 131995.0 12.0 5935.0 Rear 517.0 457.0 ... standard standard standard standard standard standard standard standard 7.0 Europe
4 Aston Martin DB9 Volante DB9 6.0 V12 517 Volante 2d 13MY Petrol 141995.0 12.0 5935.0 Rear 517.0 457.0 ... standard standard standard standard standard standard standard standard 7.0 Europe

5 rows × 32 columns

Preprocess data

Cleaning

We use DataCleaner to remove non-numeric columns, as well as columns and rows with too many missing values, discrete columns, columns with a scale of zero and columns corresponding to the case numbers.

[4]:
cleaner = DataCleaner().fit(data.data)
clean_data = cleaner.transform(data.data)

We can inspect the dropped columns by type:

[5]:
print(json.dumps(cleaner.dropped_columns, indent=4))
{
    "non_numeric_cols": [
        "Make",
        "Model",
        "Type",
        "Fuel",
        "DriveWheel",
        "AdaptiveHeadlights",
        "AdjustableSteering",
        "AlarmSystem",
        "Automatic",
        "Bluetooth",
        "ClimateControl",
        "CruiseControl",
        "ElectricSeats",
        "Leather",
        "ParkingSensors",
        "PowerSteering",
        "SatNav",
        "ESP",
        "Origin"
    ],
    "cols_rownumbers": [],
    "cols_discrete": [
        "Fuel",
        "DriveWheel",
        "AdaptiveHeadlights",
        "AdjustableSteering",
        "AlarmSystem",
        "Automatic",
        "Bluetooth",
        "ClimateControl",
        "CruiseControl",
        "ElectricSeats",
        "Leather",
        "ParkingSensors",
        "PowerSteering",
        "SatNav",
        "ESP",
        "Origin"
    ],
    "cols_bad_scale": [
        "Cylinders"
    ],
    "cols_missings": []
}

As well as the dropped rows:

[6]:
cleaner.dropped_rows
[6]:
{'rows_missings': [69, 95]}

Let’s also exclude the subjective variable Verdict from the features for further analysis.

[7]:
clean_data = clean_data.drop(columns=['Verdict'])

Transforming

The Price variable has a long tail as can be seen from its adjusted boxplot and its histogram. It’s a good idea to apply a power transformation to it to make it more symmetric.

[8]:
adjusted_boxplot(clean_data['Price'],figsize=(2,2))
[8]:
[Boxplot(median=26505.0, q1=18902.5, q3=44252.5, upper_whisker=191797.28737567586, lower_whisker=12666.161330745548)]
../_images/notebooks_topgear_21_1.png
[9]:
clean_data['Price'].hist(bins=20, figsize=(4,2))
[9]:
<Axes: >
../_images/notebooks_topgear_22_1.png
[10]:
price_transformer = RobustPowerTransformer(method='auto').fit(clean_data['Price'])

clean_data['Price_transformed'] = price_transformer.transform(clean_data['Price'])

We can inspect which method was selected as well as which lambda value was applied for the transformation:

[11]:
price_transformer.method, price_transformer.lambda_rew
[11]:
('boxcox', -0.42354039562300644)
[12]:
adjusted_boxplot(clean_data['Price_transformed'],figsize=(2,2))
[12]:
[Boxplot(median=-0.03153400648124502, q1=-0.5679781382767539, q3=0.6485562387582648, upper_whisker=2.7449937256642296, lower_whisker=-2.0160158561602897)]
../_images/notebooks_topgear_26_1.png
[13]:
clean_data['Price_transformed'].hist(bins=20, figsize=(4,2))
[13]:
<Axes: >
../_images/notebooks_topgear_27_1.png

It’s best to also transform displacement, HP, Torque and Topspeed as these variables are also skewed.

[14]:
fig, axs = plt.subplots(2, 2, figsize=(15, 8))
for col, ax in zip(['Displacement', 'BHP', 'Torque', 'TopSpeed'], axs.flatten()):
    clean_data[col].hist(ax=ax, bins=20, alpha=0.3)
    transformer = RobustPowerTransformer(method='auto').fit(clean_data[col].dropna())
    clean_data.loc[~np.isnan(clean_data[col]), col] = transformer.transform(clean_data[col].dropna())
    ax2=ax.twiny()
    clean_data[col].hist(ax=ax2, bins=20, label='transformed', color='orange', alpha=0.3)
    ax.grid(False)
    ax2.grid(False)
    ax2.legend(loc='upper right')
    ax.set_title(f'{col}: method = {transformer.method}, lambda = {transformer.lambda_rew:.3f}')
fig.tight_layout()
../_images/notebooks_topgear_29_0.png

Dropping NA

Some methods require the data to be entirely NA free.

[15]:
clean_data2 = clean_data.dropna()

Covariance

First we study the covariance of the dataset by calculating a robust covariance matrix on the numeric features.

MCD: Minimum Covariance Determinant

[16]:
mcd = FastMCD().fit(clean_data2.drop(columns=['Price']))
[17]:
fig = mcd.distance_distance_plot()
../_images/notebooks_topgear_37_0.png

OGK: Orthogonalized Gnanadesikan-Kettenring covariance

We can do the same analysis with other robust covariance estimators, e.g. the OGK covariance:

[18]:
ogk = OGK().fit(clean_data2.drop(columns=['Price']))
[19]:
fig = ogk.distance_distance_plot()
../_images/notebooks_topgear_41_0.png

Let’s have a look at the point that seems to lie very far from the majority of the data:

[20]:
data.data.loc[
    clean_data2.index[(ogk._robust_distances > 80) & (ogk._mahalanobis_distances > 12)],
    ['Make', 'Model'] + list(set(clean_data2.columns).intersection(set(data.data.columns)))
]
[20]:
Make Model Price Torque Displacement BHP Acceleration Height Length TopSpeed Weight MPG Width
41 BMW i3 33830.0 184.0 647.0 170.0 7.9 1578.0 3999.0 93.0 1315.0 470.0 1775.0

CellMCD

A robust covariance estimator that can handle missing values is the CellMCD. For more details, we refer to the separate notebook on this topic.

PCA

Next, we apply a well-known robust PCA method to the data, specifically ROBPCA. As the variables of interest have different measurement units and different scales, we first scale the data.

[21]:
scaled_data = RobustScaler(with_centering=False).fit_transform(clean_data2.drop(columns=['Price']))
pca = ROBPCA().fit(scaled_data)
[22]:
score_distances, orthogonal_distances, score_cutoff, od_cutoff = pca.plot_outlier_map(scaled_data, return_distances=True)
../_images/notebooks_topgear_49_0.png

We can inspect the bad leverage points:

[23]:
data.data.loc[
    clean_data2.loc[(score_distances > score_cutoff) & (orthogonal_distances > od_cutoff)].index,
    ['Make', 'Model'] + list(set(clean_data2.columns).intersection(set(data.data.columns)))
]
[23]:
Make Model Price Torque Displacement BHP Acceleration Height Length TopSpeed Weight MPG Width
41 BMW i3 33830.0 184.0 647.0 170.0 7.9 1578.0 3999.0 93.0 1315.0 470.0 1775.0
49 Bugatti Veyron 1139985.0 922.0 7993.0 987.0 2.5 1204.0 4462.0 252.0 1990.0 10.0 1998.0
135 Land Rover Defender 28195.0 265.0 2198.0 122.0 14.7 1790.0 4785.0 90.0 2120.0 25.0 1790.0
164 Mercedes-Benz G-Class 82945.0 398.0 2987.0 211.0 9.1 1951.0 4662.0 108.0 2500.0 25.0 1760.0
196 Pagani Huayra 990000.0 811.0 5980.0 730.0 3.3 1169.0 4605.0 230.0 1350.0 23.0 2036.0

Outlier detection

We can also detect cellwise outliers with the DDC estimator. Here we can use the data with NAs, as DDC can handle missing values.

[24]:
ddc = DDC().fit(clean_data.drop(columns=['Price']))
[25]:
ddc.cellmap(clean_data.drop(columns=['Price']), figsize=(15,30))
[25]:
<Axes: >
../_images/notebooks_topgear_55_1.png
[26]:
row_indices = np.array([ 11,  41,  55,  73,  81,  94,  99, 135, 150, 164, 176, 198, 209,
       215, 234, 241, 277])
ax = ddc.cellmap(clean_data.drop(columns=['Price']), figsize=(8,10), row_zoom=row_indices)
cars = data.data.apply(lambda row: f"{row['Make']} {row['Model']}", axis=1).tolist()
ax.set_yticklabels([cars[i] for i in row_indices], rotation=0);
../_images/notebooks_topgear_56_0.png

We can additionally zoom in on 2 BMWs:

[27]:
ax = ddc.cellmap(clean_data.drop(columns=['Price']),row_zoom=[31,41], figsize=(7,1))
ax.set_yticklabels([cars[i] for i in [31,41]], rotation=0);
../_images/notebooks_topgear_58_0.png

total flagged cells:

[28]:
ddc.cellwise_outliers_.sum()
[28]:
88

DDC also predicts rowwise outliers:

[29]:
clean_data.loc[ddc.predict(clean_data.drop(columns=['Price']), rowwise=True)]
[29]:
Price Displacement BHP Torque Acceleration TopSpeed MPG Weight Length Width Height Price_transformed
50 44995.0 0.219169 0.701029 -0.220938 3.1 0.921638 NaN 575.0 3300.0 1685.0 1140.0 0.668256
51 22995.0 -0.569636 -0.524540 -1.071051 5.9 -0.748566 NaN 550.0 3380.0 1575.0 1115.0 -0.247651
61 30245.0 0.613284 0.109823 0.297162 12.8 -0.590064 35.0 2305.0 5218.0 1998.0 1818.0 0.157953
124 25995.0 0.613284 0.276559 0.782252 12.9 -1.033160 34.0 2075.0 4223.0 1873.0 1840.0 -0.060330
135 28195.0 0.161107 -0.571367 0.297162 14.7 -2.246128 25.0 2120.0 4785.0 1790.0 1790.0 0.058518
145 36200.0 NaN NaN NaN 0.0 NaN NaN 876.0 3785.0 1850.0 1117.0 0.399504
180 29045.0 NaN -1.988366 -0.920764 15.9 -2.161801 NaN 1110.0 3475.0 1475.0 1610.0 0.100958
184 30000.0 -0.075638 -0.775908 -1.348513 4.5 -0.344059 NaN 490.0 NaN NaN NaN 0.146581
219 6950.0 NaN -6.267469 -2.504821 0.0 -8.518526 NaN 450.0 2337.0 1237.0 1461.0 -2.690158
234 17995.0 -0.042303 -0.128325 0.297162 0.0 -1.154841 38.0 2059.0 5125.0 1915.0 1845.0 -0.652657

Finally, DDC can impute missing data:

[30]:
ddc.impute(clean_data.drop(columns=['Price'])).loc[ddc.predict(clean_data.drop(columns=['Price']), rowwise=True), :].style.format('{:.2f}')
[30]:
  Displacement BHP Torque Acceleration TopSpeed MPG Weight Length Width Height Price_transformed
50 0.22 -0.04 -0.22 8.61 0.92 41.83 1364.54 4331.18 1685.00 1140.00 -0.26
51 -0.57 -0.52 -1.07 11.65 -0.75 55.07 550.00 3380.00 1722.44 1482.87 -1.19
61 0.61 0.11 0.30 7.88 -0.59 35.00 2305.00 5218.00 1869.97 1818.00 0.16
124 0.61 0.28 0.78 8.73 -1.03 34.00 2075.00 4223.00 1873.00 1840.00 -0.06
135 0.16 -0.57 0.30 9.85 -0.23 25.00 1483.18 4785.00 1790.00 1790.00 0.06
145 0.21 0.49 0.12 8.79 0.87 36.24 1515.46 3785.00 1850.00 1117.00 0.40
180 -1.54 -1.99 -0.92 15.90 -2.16 65.02 1110.00 3475.00 1693.47 1610.00 -1.91
184 -0.08 -0.78 -1.35 11.16 -0.34 50.55 1391.73 4163.08 1757.12 1447.99 0.15
219 -1.47 -1.12 -1.12 15.50 -0.61 58.45 450.00 3667.15 1654.07 1461.00 -0.89
234 -0.04 0.56 0.30 9.32 0.50 38.00 2059.00 5125.00 1915.00 1428.56 0.63

Regression

We can use robust regression to predict the (transformed) price using the remaining variables. For regression, we again have to drop all missings.

[31]:
X = clean_data2.drop(columns=['Price', 'Price_transformed'])
y = clean_data2['Price_transformed']

As an example, we use the MM-estimator of regression:

[32]:
estimator = MMRegression().fit(X, y)
estimator.model.coef_
[32]:
array([ 2.71338919e-01,  4.24224370e-01,  2.04835077e-01,  3.63925688e-02,
        8.75191584e-02,  3.77330897e-03,  4.65624836e-04, -2.97257613e-04,
        8.37238866e-04, -9.42993337e-04])

We can now get a diagnostic plot and ask for the underlying data

[33]:
resid, std_resid, distances, vt, ht = estimator.outlier_map(X, y.to_numpy(), return_data=True)
../_images/notebooks_topgear_71_0.png

Now we can get an overview of the bad leverage points:

[34]:
bad_leverage_idx = (np.abs(std_resid) > vt) & (distances > ht)
data.data.loc[
    clean_data2[bad_leverage_idx].index, ['Make', 'Model', 'Price']
].assign(predicted_price=price_transformer.inverse_transform(estimator.predict(X.loc[bad_leverage_idx])).round())
[34]:
Make Model Price predicted_price
2 Aston Martin Cygnet 30995.0 15326.0
5 Aston Martin V12 Zagato 396000.0 117938.0
164 Mercedes-Benz G-Class 82945.0 34576.0
222 Rolls-Royce Phantom 352720.0 116166.0
223 Rolls-Royce Phantom Coupe 333130.0 111003.0
253 Toyota Prius 24045.0 16272.0