TopGear Data

This notebook demonstrates most of the functionality offered by the RobPy package through an application to the TopGear dataset. Utility functions and univariate estimators are not demonstrated separately, as they are mostly implemented as helpers to the multivariate estimators.

Imports

[1]:

import json
import matplotlib.pyplot as plt
import numpy as np

from robpy.datasets import load_topgear
from robpy.preprocessing import DataCleaner, RobustPowerTransformer, RobustScaler
from robpy.covariance import FastMCD, OGK
from robpy.pca import ROBPCA
from robpy.outliers import DDC
from robpy.regression import MMRegression
from robpy.univariate import adjusted_boxplot

%load_ext autoreload
%autoreload 2

Load data

RobPy has a dataset module which allows the user quick access to a few common datasets used in robust statistics literature. In this demo, we will work with the TopGear dataset.

[2]:

data = load_topgear(as_frame=True)

print(data.DESCR)

.. _topgear_dataset:

TopGear dataset
--------------------

**Data Set Characteristics:**

- Number of Instances: 297
- Number of Attributes: 32 (13 numeric, 19 categorical)
- Attribute Information:
    * Make (str): the car brand.
    * Model (str): the car model.
    * Type (str): the exact model type.
    * Fuel (str): the type of fuel ("Diesel" or "Petrol").
    * Price (float): the list price (in UK pounds)
    * Cylinders (float): the number of cylinders in the engine.
    * Displacement (float): the displacement of the engine (in cc).
    * DriveWheel (str): the type of drive wheel ("4WD", "Front" or "Rear").
    * BHP (float): the power of the engine (in bhp).
    * Torque (float): the torque of the engine (in lb/ft).
    * Acceleration (float): the time it takes the car to get from 0 to 62 mph (in seconds).
    * TopSpeed (float): the car's top speed (in mph).
    * MPG (float): the combined fuel consuption (urban + extra urban; in miles per gallon).
    * Weight (float): the car's curb weight (in kg).
    * Length (float): the car's length (in mm).
    * Width (float): the car's width (in mm).
    * Height (float): the car's height (in mm).
    * AdaptiveHeadlights (str): whether the car has adaptive headlights ("no", "optional" or "standard").
    * AdjustableSteering (str): whether the car has adjustable steering ("no" or "standard").
    * AlarmSystem (str) whether the car has an alarm system ("no/optional" or "standard").
    * Automatic (str) whether the car has an automatic transmission ("no", "optional" or "standard").
    * Bluetooth (str) whether the car has bluetooth ("no", "optional" or "standard").
    * ClimateControl (str) whether the car has climate control ("no", "optional" or "standard").
    * CruiseControl (str) whether the car has cruise control ("no", "optional" or "standard").
    * ElectricSeats (str) whether the car has electric seats ("no", "optional" or "standard").
    * Leather (str) whether the car has a leather interior ("no", "optional" or "standard").
    * ParkingSensors (str) whether the car has parking sensors ("no", "optional" or "standard").
    * PowerSteering (str) whether the car has power steering ("no" or "standard").
    * SatNav (str) whether the car has a satellite navigation system ("no", "optional" or "standard").
    * ESP (str) whether the car has ESP ("no", "optional" or "standard").
    * Verdict (float) review score between 1 (lowest) and 10 (highest).
    * Origin (str) the origin of the car maker ("Asia", "Europe" or "USA").

- Creator: BBC TopGear
- Source: https://rdrr.io/cran/robustHD/man/TopGear.html

**Data Set Description:**
The data set contains information on cars featured on the website of the popular BBC television show Top Gear.
The data were scraped from http://www.topgear.com/uk/ on 2014-02-24. Variable Origin was added based on the car maker information.

The DESCR attribute of the data object contains all metadata and a description of the dataset.

[3]:

data.data.head()

[3]:

	Make	Model	Type	Fuel	Price	Cylinders	Displacement	DriveWheel	BHP	Torque	...	ClimateControl	CruiseControl	ElectricSeats	Leather	ParkingSensors	PowerSteering	SatNav	ESP	Verdict	Origin
0	Alfa Romeo	Giulietta	Giulietta 1.6 JTDM-2 105 Veloce 5d	Diesel	21250.0	4.0	1598.0	Front	105.0	236.0	...	standard	standard	optional	optional	optional	standard	optional	standard	6.0	Europe
1	Alfa Romeo	MiTo	MiTo 1.4 TB MultiAir 105 Distinctive 3d	Petrol	15155.0	4.0	1368.0	Front	105.0	95.0	...	optional	standard	no	optional	standard	standard	optional	standard	5.0	Europe
2	Aston Martin	Cygnet	Cygnet 1.33 Standard 3d	Petrol	30995.0	4.0	1329.0	Front	98.0	92.0	...	standard	standard	no	no	no	standard	standard	standard	7.0	Europe
3	Aston Martin	DB9	DB9 6.0 517 Standard 2d 13MY	Petrol	131995.0	12.0	5935.0	Rear	517.0	457.0	...	standard	standard	standard	standard	standard	standard	standard	standard	7.0	Europe
4	Aston Martin	DB9 Volante	DB9 6.0 V12 517 Volante 2d 13MY	Petrol	141995.0	12.0	5935.0	Rear	517.0	457.0	...	standard	standard	standard	standard	standard	standard	standard	standard	7.0	Europe

5 rows × 32 columns

Preprocess data

Cleaning

We use DataCleaner to remove non-numeric columns, as well as columns and rows with too many missing values, discrete columns, columns with a scale of zero and columns corresponding to the case numbers.

[4]:

cleaner = DataCleaner().fit(data.data)
clean_data = cleaner.transform(data.data)

We can inspect the dropped columns by type:

[5]:

print(json.dumps(cleaner.dropped_columns, indent=4))

{
    "non_numeric_cols": [
        "Make",
        "Model",
        "Type",
        "Fuel",
        "DriveWheel",
        "AdaptiveHeadlights",
        "AdjustableSteering",
        "AlarmSystem",
        "Automatic",
        "Bluetooth",
        "ClimateControl",
        "CruiseControl",
        "ElectricSeats",
        "Leather",
        "ParkingSensors",
        "PowerSteering",
        "SatNav",
        "ESP",
        "Origin"
    ],
    "cols_rownumbers": [],
    "cols_discrete": [
        "Fuel",
        "DriveWheel",
        "AdaptiveHeadlights",
        "AdjustableSteering",
        "AlarmSystem",
        "Automatic",
        "Bluetooth",
        "ClimateControl",
        "CruiseControl",
        "ElectricSeats",
        "Leather",
        "ParkingSensors",
        "PowerSteering",
        "SatNav",
        "ESP",
        "Origin"
    ],
    "cols_bad_scale": [
        "Cylinders"
    ],
    "cols_missings": []
}

As well as the dropped rows:

[6]:

cleaner.dropped_rows

[6]:

{'rows_missings': [69, 95]}

Let’s also exclude the subjective variable Verdict from the features for further analysis.

[7]:

clean_data = clean_data.drop(columns=['Verdict'])

Transforming

The Price variable has a long tail as can be seen from its adjusted boxplot and its histogram. It’s a good idea to apply a power transformation to it to make it more symmetric.

[8]:

adjusted_boxplot(clean_data['Price'],figsize=(2,2))

[8]:

[Boxplot(median=26505.0, q1=18902.5, q3=44252.5, upper_whisker=191797.28737567586, lower_whisker=12666.161330745548)]

[9]:

clean_data['Price'].hist(bins=20, figsize=(4,2))

[9]:

<Axes: >

[10]:

price_transformer = RobustPowerTransformer(method='auto').fit(clean_data['Price'])

clean_data['Price_transformed'] = price_transformer.transform(clean_data['Price'])

We can inspect which method was selected as well as which lambda value was applied for the transformation:

[11]:

price_transformer.method, price_transformer.lambda_rew

[11]:

('boxcox', -0.42354039562300644)

[12]:

adjusted_boxplot(clean_data['Price_transformed'],figsize=(2,2))

[12]:

[Boxplot(median=-0.03153400648124502, q1=-0.5679781382767539, q3=0.6485562387582648, upper_whisker=2.7449937256642296, lower_whisker=-2.0160158561602897)]

[13]:

clean_data['Price_transformed'].hist(bins=20, figsize=(4,2))

[13]:

<Axes: >

It’s best to also transform displacement, HP, Torque and Topspeed as these variables are also skewed.

[14]:

fig, axs = plt.subplots(2, 2, figsize=(15, 8))
for col, ax in zip(['Displacement', 'BHP', 'Torque', 'TopSpeed'], axs.flatten()):
    clean_data[col].hist(ax=ax, bins=20, alpha=0.3)
    transformer = RobustPowerTransformer(method='auto').fit(clean_data[col].dropna())
    clean_data.loc[~np.isnan(clean_data[col]), col] = transformer.transform(clean_data[col].dropna())
    ax2=ax.twiny()
    clean_data[col].hist(ax=ax2, bins=20, label='transformed', color='orange', alpha=0.3)
    ax.grid(False)
    ax2.grid(False)
    ax2.legend(loc='upper right')
    ax.set_title(f'{col}: method = {transformer.method}, lambda = {transformer.lambda_rew:.3f}')
fig.tight_layout()

Dropping NA

Some methods require the data to be entirely NA free.

[15]:

clean_data2 = clean_data.dropna()

Covariance

First we study the covariance of the dataset by calculating a robust covariance matrix on the numeric features.

MCD: Minimum Covariance Determinant

[16]:

mcd = FastMCD().fit(clean_data2.drop(columns=['Price']))

[17]:

fig = mcd.distance_distance_plot()

OGK: Orthogonalized Gnanadesikan-Kettenring covariance

We can do the same analysis with other robust covariance estimators, e.g. the OGK covariance:

[18]:

ogk = OGK().fit(clean_data2.drop(columns=['Price']))

[19]:

fig = ogk.distance_distance_plot()

Let’s have a look at the point that seems to lie very far from the majority of the data:

[20]:

data.data.loc[
    clean_data2.index[(ogk._robust_distances > 80) & (ogk._mahalanobis_distances > 12)],
    ['Make', 'Model'] + list(set(clean_data2.columns).intersection(set(data.data.columns)))
]

[20]:

	Make	Model	Price	Torque	Displacement	BHP	Acceleration	Height	Length	TopSpeed	Weight	MPG	Width
41	BMW	i3	33830.0	184.0	647.0	170.0	7.9	1578.0	3999.0	93.0	1315.0	470.0	1775.0

CellMCD

A robust covariance estimator that can handle missing values is the CellMCD. For more details, we refer to the separate notebook on this topic.

PCA

Next, we apply a well-known robust PCA method to the data, specifically ROBPCA. As the variables of interest have different measurement units and different scales, we first scale the data.

[21]:

scaled_data = RobustScaler(with_centering=False).fit_transform(clean_data2.drop(columns=['Price']))
pca = ROBPCA().fit(scaled_data)

[22]:

score_distances, orthogonal_distances, score_cutoff, od_cutoff = pca.plot_outlier_map(scaled_data, return_distances=True)

We can inspect the bad leverage points:

[23]:

data.data.loc[
    clean_data2.loc[(score_distances > score_cutoff) & (orthogonal_distances > od_cutoff)].index,
    ['Make', 'Model'] + list(set(clean_data2.columns).intersection(set(data.data.columns)))
]

[23]:

	Make	Model	Price	Torque	Displacement	BHP	Acceleration	Height	Length	TopSpeed	Weight	MPG	Width
41	BMW	i3	33830.0	184.0	647.0	170.0	7.9	1578.0	3999.0	93.0	1315.0	470.0	1775.0
49	Bugatti	Veyron	1139985.0	922.0	7993.0	987.0	2.5	1204.0	4462.0	252.0	1990.0	10.0	1998.0
135	Land Rover	Defender	28195.0	265.0	2198.0	122.0	14.7	1790.0	4785.0	90.0	2120.0	25.0	1790.0
164	Mercedes-Benz	G-Class	82945.0	398.0	2987.0	211.0	9.1	1951.0	4662.0	108.0	2500.0	25.0	1760.0
196	Pagani	Huayra	990000.0	811.0	5980.0	730.0	3.3	1169.0	4605.0	230.0	1350.0	23.0	2036.0

Outlier detection

We can also detect cellwise outliers with the DDC estimator. Here we can use the data with NAs, as DDC can handle missing values.

[24]:

ddc = DDC().fit(clean_data.drop(columns=['Price']))

[25]:

ddc.cellmap(clean_data.drop(columns=['Price']), figsize=(15,30))

[25]:

<Axes: >

[26]:

row_indices = np.array([ 11,  41,  55,  73,  81,  94,  99, 135, 150, 164, 176, 198, 209,
       215, 234, 241, 277])
ax = ddc.cellmap(clean_data.drop(columns=['Price']), figsize=(8,10), row_zoom=row_indices)
cars = data.data.apply(lambda row: f"{row['Make']} {row['Model']}", axis=1).tolist()
ax.set_yticklabels([cars[i] for i in row_indices], rotation=0);

We can additionally zoom in on 2 BMWs:

[27]:

ax = ddc.cellmap(clean_data.drop(columns=['Price']),row_zoom=[31,41], figsize=(7,1))
ax.set_yticklabels([cars[i] for i in [31,41]], rotation=0);

total flagged cells:

[28]:

ddc.cellwise_outliers_.sum()

[28]:

DDC also predicts rowwise outliers:

[29]:

clean_data.loc[ddc.predict(clean_data.drop(columns=['Price']), rowwise=True)]

[29]:

	Price	Displacement	BHP	Torque	Acceleration	TopSpeed	MPG	Weight	Length	Width	Height	Price_transformed
50	44995.0	0.219169	0.701029	-0.220938	3.1	0.921638	NaN	575.0	3300.0	1685.0	1140.0	0.668256
51	22995.0	-0.569636	-0.524540	-1.071051	5.9	-0.748566	NaN	550.0	3380.0	1575.0	1115.0	-0.247651
61	30245.0	0.613284	0.109823	0.297162	12.8	-0.590064	35.0	2305.0	5218.0	1998.0	1818.0	0.157953
124	25995.0	0.613284	0.276559	0.782252	12.9	-1.033160	34.0	2075.0	4223.0	1873.0	1840.0	-0.060330
135	28195.0	0.161107	-0.571367	0.297162	14.7	-2.246128	25.0	2120.0	4785.0	1790.0	1790.0	0.058518
145	36200.0	NaN	NaN	NaN	0.0	NaN	NaN	876.0	3785.0	1850.0	1117.0	0.399504
180	29045.0	NaN	-1.988366	-0.920764	15.9	-2.161801	NaN	1110.0	3475.0	1475.0	1610.0	0.100958
184	30000.0	-0.075638	-0.775908	-1.348513	4.5	-0.344059	NaN	490.0	NaN	NaN	NaN	0.146581
219	6950.0	NaN	-6.267469	-2.504821	0.0	-8.518526	NaN	450.0	2337.0	1237.0	1461.0	-2.690158
234	17995.0	-0.042303	-0.128325	0.297162	0.0	-1.154841	38.0	2059.0	5125.0	1915.0	1845.0	-0.652657

Finally, DDC can impute missing data:

[30]:

ddc.impute(clean_data.drop(columns=['Price'])).loc[ddc.predict(clean_data.drop(columns=['Price']), rowwise=True), :].style.format('{:.2f}')

[30]:

	Displacement	BHP	Torque	Acceleration	TopSpeed	MPG	Weight	Length	Width	Height	Price_transformed
50	0.22	-0.04	-0.22	8.61	0.92	41.83	1364.54	4331.18	1685.00	1140.00	-0.26
51	-0.57	-0.52	-1.07	11.65	-0.75	55.07	550.00	3380.00	1722.44	1482.87	-1.19
61	0.61	0.11	0.30	7.88	-0.59	35.00	2305.00	5218.00	1869.97	1818.00	0.16
124	0.61	0.28	0.78	8.73	-1.03	34.00	2075.00	4223.00	1873.00	1840.00	-0.06
135	0.16	-0.57	0.30	9.85	-0.23	25.00	1483.18	4785.00	1790.00	1790.00	0.06
145	0.21	0.49	0.12	8.79	0.87	36.24	1515.46	3785.00	1850.00	1117.00	0.40
180	-1.54	-1.99	-0.92	15.90	-2.16	65.02	1110.00	3475.00	1693.47	1610.00	-1.91
184	-0.08	-0.78	-1.35	11.16	-0.34	50.55	1391.73	4163.08	1757.12	1447.99	0.15
219	-1.47	-1.12	-1.12	15.50	-0.61	58.45	450.00	3667.15	1654.07	1461.00	-0.89
234	-0.04	0.56	0.30	9.32	0.50	38.00	2059.00	5125.00	1915.00	1428.56	0.63

Regression

We can use robust regression to predict the (transformed) price using the remaining variables. For regression, we again have to drop all missings.

[31]:

X = clean_data2.drop(columns=['Price', 'Price_transformed'])
y = clean_data2['Price_transformed']

As an example, we use the MM-estimator of regression:

[32]:

estimator = MMRegression().fit(X, y)
estimator.model.coef_

[32]:

array([ 2.71338919e-01,  4.24224370e-01,  2.04835077e-01,  3.63925688e-02,
        8.75191584e-02,  3.77330897e-03,  4.65624836e-04, -2.97257613e-04,
        8.37238866e-04, -9.42993337e-04])

We can now get a diagnostic plot and ask for the underlying data

[33]:

resid, std_resid, distances, vt, ht = estimator.outlier_map(X, y.to_numpy(), return_data=True)

Now we can get an overview of the bad leverage points:

[34]:

bad_leverage_idx = (np.abs(std_resid) > vt) & (distances > ht)
data.data.loc[
    clean_data2[bad_leverage_idx].index, ['Make', 'Model', 'Price']
].assign(predicted_price=price_transformer.inverse_transform(estimator.predict(X.loc[bad_leverage_idx])).round())

[34]:

	Make	Model	Price	predicted_price
2	Aston Martin	Cygnet	30995.0	15326.0
5	Aston Martin	V12 Zagato	396000.0	117938.0
164	Mercedes-Benz	G-Class	82945.0	34576.0
222	Rolls-Royce	Phantom	352720.0	116166.0
223	Rolls-Royce	Phantom Coupe	333130.0	111003.0
253	Toyota	Prius	24045.0	16272.0