Basic Data Cleaning with eBay Car Sales Data

I clean a messy dataset of eBay car sales data before performing simple analyses about car prices.
python
pandas
matplotlib
seaborn
Author

Migs Germar

Published

May 20, 2021

Unsplash | Parker Gibbs

Overview

In 2016, Kaggle user orgezleka scraped data about used car advertisements on eBay Kleinanzeigen, which is the eBay Classifieds website of Germany. The full dataset is still available on Used Cars Data (data.world 2016), with over 370,000 datapoints.

The project aims to answer the following questions:

  • Which brands of the used cars are most expensive?
  • Does a used car’s age affect its price?
  • Does a used car’s mileage affect its price?
  • Does the presence of unrepaired damage affect a car’s price?

More importantly, this project aims to showcase basic data cleaning procedures. Thus, even columns beyond the scope of the research questions will be considered for cleaning. The assumption is that analysis will eventually be performed on all columns.

Note

I wrote this notebook for the Dataquest course’s Guided Project: Exploring eBay Car Sales Data. The general project flow came from Dataquest. However, all of the text and code here are written by me unless stated otherwise.

Package Installs

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Data Overview

Below is the information about each column in the dataset, taken from Used Cars Data (data.world 2016).

  • dateCrawled : when this ad was first crawled, all field-values are taken from this date
  • name : “name” of the car
  • seller : private or dealer
  • offerType
  • price : the price on the ad to sell the car
  • abtest
  • vehicleType
  • yearOfRegistration : at which year the car was first registered
  • gearbox
  • powerPS : power of the car in PS
  • model
  • kilometer : how many kilometers the car has driven
  • monthOfRegistration : at which month the car was first registered
  • fuelType
  • brand
  • notRepairedDamage : if the car has a damage which is not repaired yet
  • dateCreated : the date for which the ad at ebay was created
  • nrOfPictures : number of pictures in the ad
  • postalCode
  • lastSeenOnline : when the crawler saw this ad last online

Let us view the first 5 rows of the dataset below.

autos = pd.read_csv("./private/2021-05-20-BDC-Files/autos.csv", encoding = "Windows-1252")

autos.head()
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model kilometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-24 11:52:17 Golf_3_1.6 privat Angebot 480 test NaN 1993 manuell 0 golf 150000 0 benzin volkswagen NaN 2016-03-24 00:00:00 0 70435 2016-04-07 03:16:57
1 2016-03-24 10:58:45 A5_Sportback_2.7_Tdi privat Angebot 18300 test coupe 2011 manuell 190 NaN 125000 5 diesel audi ja 2016-03-24 00:00:00 0 66954 2016-04-07 01:46:50
2 2016-03-14 12:52:21 Jeep_Grand_Cherokee_"Overland" privat Angebot 9800 test suv 2004 automatik 163 grand 125000 8 diesel jeep NaN 2016-03-14 00:00:00 0 90480 2016-04-05 12:47:46
3 2016-03-17 16:54:04 GOLF_4_1_4__3TÜRER privat Angebot 1500 test kleinwagen 2001 manuell 75 golf 150000 6 benzin volkswagen nein 2016-03-17 00:00:00 0 91074 2016-03-17 17:40:17
4 2016-03-31 17:25:20 Skoda_Fabia_1.4_TDI_PD_Classic privat Angebot 3600 test kleinwagen 2008 manuell 69 fabia 90000 7 diesel skoda nein 2016-03-31 00:00:00 0 60437 2016-04-06 10:17:21

We can get additional information about each column using DataFrame.info().

autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64 
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64 
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64 
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64 
 12  monthOfRegistration  371528 non-null  int64 
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-null  object
 16  dateCreated          371528 non-null  object
 17  nrOfPictures         371528 non-null  int64 
 18  postalCode           371528 non-null  int64 
 19  lastSeen             371528 non-null  object
dtypes: int64(7), object(13)
memory usage: 56.7+ MB

The dataset has 371,528 rows and 20 columns. Each column contains one of two types of data:

  • int64: 64-bit integers
  • object: text strings

Also, some columns have a non-null count smaller than the number of entries. This means that there are missing values in these columns. The column notRepairedDamagehas the most missing values.

Data Cleaning

Column Labels

As seen earlier, all of the column labels are written in camel case. This means that the first letter of each word is capitalized, like CamelCase.

Column labels should ideally be written in snake_case so that these are easier to use. Thus, I will convert all column labels to snake case.

cols = list(autos.columns)

new_cols = []

for label in cols:
    # List of indices of the first letter of each word.
    cap_inds = [0] + [index for index, character in enumerate(label) if character.isupper()]
    
    # List of 2-tuples. Each tuple contains the start index of the current word
    # and the start index of the next word.
    zipped = list(zip(cap_inds, cap_inds[1:] + [None]))
    
    # Split the label into a list of words.
    # Make them lowercase and combine them with underscores.
    word_list = [label[i:j] for i, j in zipped]
    word_list = [word.lower() for word in word_list]
    new_label = "_".join(word_list)
    
    new_cols.append(new_label)
    
autos.columns = new_cols

autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   date_crawled           371528 non-null  object
 1   name                   371528 non-null  object
 2   seller                 371528 non-null  object
 3   offer_type             371528 non-null  object
 4   price                  371528 non-null  int64 
 5   abtest                 371528 non-null  object
 6   vehicle_type           333659 non-null  object
 7   year_of_registration   371528 non-null  int64 
 8   gearbox                351319 non-null  object
 9   power_p_s              371528 non-null  int64 
 10  model                  351044 non-null  object
 11  kilometer              371528 non-null  int64 
 12  month_of_registration  371528 non-null  int64 
 13  fuel_type              338142 non-null  object
 14  brand                  371528 non-null  object
 15  not_repaired_damage    299468 non-null  object
 16  date_created           371528 non-null  object
 17  nr_of_pictures         371528 non-null  int64 
 18  postal_code            371528 non-null  int64 
 19  last_seen              371528 non-null  object
dtypes: int64(7), object(13)
memory usage: 56.7+ MB

Now, all of the column labels are in snake case. However, I will manually change a few of them to make them neater.

convert = {
    "power_p_s": "power_ps",
    "nr_of_pictures": "num_pictures",
    "year_of_registration": "year_reg",
    "kilometer": "mileage_km",
    "month_of_registration": "month_reg",
    "not_repaired_damage": "damage",
}

autos.columns = pd.Series(autos.columns).replace(convert)

autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   date_crawled  371528 non-null  object
 1   name          371528 non-null  object
 2   seller        371528 non-null  object
 3   offer_type    371528 non-null  object
 4   price         371528 non-null  int64 
 5   abtest        371528 non-null  object
 6   vehicle_type  333659 non-null  object
 7   year_reg      371528 non-null  int64 
 8   gearbox       351319 non-null  object
 9   power_ps      371528 non-null  int64 
 10  model         351044 non-null  object
 11  mileage_km    371528 non-null  int64 
 12  month_reg     371528 non-null  int64 
 13  fuel_type     338142 non-null  object
 14  brand         371528 non-null  object
 15  damage        299468 non-null  object
 16  date_created  371528 non-null  object
 17  num_pictures  371528 non-null  int64 
 18  postal_code   371528 non-null  int64 
 19  last_seen     371528 non-null  object
dtypes: int64(7), object(13)
memory usage: 56.7+ MB

All of the column labels are now easy to use.

Uninformative Columns

Next, we will look for columns filled with mostly 1 value. The other values in them are underrepresented, so we wouldn’t be able to compare groups reliably.

We can inspect the columns using DataFrame.describe().

autos.describe(include = "all")
# Include all columns, including text.
date_crawled name seller offer_type price abtest vehicle_type year_reg gearbox power_ps model mileage_km month_reg fuel_type brand damage date_created num_pictures postal_code last_seen
count 371528 371528 371528 371528 3.715280e+05 371528 333659 371528.000000 351319 371528.000000 351044 371528.000000 371528.000000 338142 371528 299468 371528 371528.0 371528.00000 371528
unique 280500 233531 2 2 NaN 2 8 NaN 2 NaN 251 NaN NaN 7 40 2 114 NaN NaN 182806
top 2016-03-24 14:49:47 Ford_Fiesta privat Angebot NaN test limousine NaN manuell NaN golf NaN NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:45:59
freq 7 657 371525 371516 NaN 192585 95894 NaN 274214 NaN 30070 NaN NaN 223857 79640 263182 14450 NaN NaN 17
mean NaN NaN NaN NaN 1.729514e+04 NaN NaN 2004.577997 NaN 115.549477 NaN 125618.688228 5.734445 NaN NaN NaN NaN 0.0 50820.66764 NaN
std NaN NaN NaN NaN 3.587954e+06 NaN NaN 92.866598 NaN 192.139578 NaN 40112.337051 3.712412 NaN NaN NaN NaN 0.0 25799.08247 NaN
min NaN NaN NaN NaN 0.000000e+00 NaN NaN 1000.000000 NaN 0.000000 NaN 5000.000000 0.000000 NaN NaN NaN NaN 0.0 1067.00000 NaN
25% NaN NaN NaN NaN 1.150000e+03 NaN NaN 1999.000000 NaN 70.000000 NaN 125000.000000 3.000000 NaN NaN NaN NaN 0.0 30459.00000 NaN
50% NaN NaN NaN NaN 2.950000e+03 NaN NaN 2003.000000 NaN 105.000000 NaN 150000.000000 6.000000 NaN NaN NaN NaN 0.0 49610.00000 NaN
75% NaN NaN NaN NaN 7.200000e+03 NaN NaN 2008.000000 NaN 150.000000 NaN 150000.000000 9.000000 NaN NaN NaN NaN 0.0 71546.00000 NaN
max NaN NaN NaN NaN 2.147484e+09 NaN NaN 9999.000000 NaN 20000.000000 NaN 150000.000000 12.000000 NaN NaN NaN NaN 0.0 99998.00000 NaN

The top row gives the most frequent value in the column. The freq row tells exactly how often the top value occurs.

We can see that in the seller column, "privat" appears 371,525 times; this is close to the total number of entries. Also, in the offer_type column, "Angebot" appears 371,516 times.

Let us first check the unique values of the seller column.

autos["seller"].value_counts()
privat        371525
gewerblich         3
Name: seller, dtype: int64

In English, these mean “private” and “commercial.” It is likely that private listings are put up by individuals, whereas commercial listings are put up by used car dealer companies. Most listings scraped were apparently private.

One idea is to compare private listing prices to commercial listing prices. However, there are only 3 commercial listings, so this comparison would be unreliable. Therefore, we can drop this column.

autos.drop(columns = "seller", inplace = True)

Secondly, let’s check the unique values of the offer_type column.

autos["offer_type"].value_counts()
Angebot    371516
Gesuch         12
Name: offer_type, dtype: int64

In English, these mean “offer” and “request.” This isn’t very informative. It could have something to do with the system of sending Offers to buyers in eBay. In any case, it doesn’t seem interesting to analyze, and there are only 12 request listings. Thus, we can also drop this column.

autos.drop(columns = "offer_type", inplace = True)

Next, the numerical num_pictures column has a minimum of 0 and a maximum of 0.

autos["num_pictures"].describe()
count    371528.0
mean          0.0
std           0.0
min           0.0
25%           0.0
50%           0.0
75%           0.0
max           0.0
Name: num_pictures, dtype: float64

This means that none of the listings for used cars were found to have pictures. This data may be true or it may have come from an error, but either way, it is not useful. Thus, it will be dropped.

autos.drop(columns = "num_pictures", inplace = True)

Outliers in Numerical Columns

Next, we inspect the ranges of numerical columns to see if there are any unusually high or low values.

autos.describe(include = np.number)
price year_reg power_ps mileage_km month_reg postal_code
count 3.715280e+05 371528.000000 371528.000000 371528.000000 371528.000000 371528.00000
mean 1.729514e+04 2004.577997 115.549477 125618.688228 5.734445 50820.66764
std 3.587954e+06 92.866598 192.139578 40112.337051 3.712412 25799.08247
min 0.000000e+00 1000.000000 0.000000 5000.000000 0.000000 1067.00000
25% 1.150000e+03 1999.000000 70.000000 125000.000000 3.000000 30459.00000
50% 2.950000e+03 2003.000000 105.000000 150000.000000 6.000000 49610.00000
75% 7.200000e+03 2008.000000 150.000000 150000.000000 9.000000 71546.00000
max 2.147484e+09 9999.000000 20000.000000 150000.000000 12.000000 99998.00000

The following unusual characteristics are noticeable:

  • price: The minimum price is 0 (free) and the maximum price is larger than 2 billion.
  • year_reg: The earliest year of registration is 1000, and the latest year is 9999.
  • power_ps: The lowest power is 0 PS. The highest is 20,000 PS, which is much higher than the 75th percentile (only 150 PS).
  • month_reg: The month numbers range from 0 to 12, not 1 to 12.

All of these columns shall be cleaned of inaccurate data.

Price in Euros

It was seen that the minimum value for price data was 0, and the maximum is over 2,000,000,000. Since this data was scraped from the German eBay, we can assume that these prices are in euros.

Both the minimum and maximum values seem to be very unrealistic for the price of a used car, so we want to clean this data.

Let us create a frequency table of the prices. We will first sort it by price, ascending.

autos["price"].value_counts().sort_index()
0             10778
1              1189
2                12
3                 8
4                 1
              ...  
32545461          1
74185296          1
99000000          1
99999999         15
2147483647        1
Name: price, Length: 5597, dtype: int64

Two things are noticeable:

  • There are over 10,000 instances of 0 euro prices.
  • There are several extremely high outliers, not just the one above 2 billion.

Since it is unlikely that the cars are actually free, 0 may represent missing values. As for the high outliers, these may have resulted from inaccurate scraping.

The next question is, what is the most frequent price value? Is it 0?

autos["price"].value_counts().sort_values(ascending = False)
0        10778
500       5670
1500      5394
1000      4649
1200      4594
         ...  
2610         1
20620        1
16689        1
9964         1
10985        1
Name: price, Length: 5597, dtype: int64

Indeed, the most frequent price is 0, which makes it likely that it represents missing values. The next most frequent prices are 500, 1500, and 1000, most likely because these are easy for buyers to remember.

Now that we know about the low and high outliers, we can clean rows using the interquartile range or IQR. The IQR is bounded by the 25th and 75th percentiles. Reasonable lower and upper bounds can be approximated by multiplying the IQR length by 1.5 and:

  • subtracting it from the 25th percentile
  • adding it to the 75th percentile

Reference: Detection and Removal of Outliers in Python – An Easy to Understand Guide

Below, I use this method to find a lower and upper bound, and I drop rows outside of them.

autos_desc = autos["price"].describe()

q25, q75 = autos_desc[["25%", "75%"]]

iqr = q75 - q25

lower_bound = q25 - 1.5 * iqr
upper_bound = q75 + 1.5 * iqr

autos = autos.loc[autos["price"].between(lower_bound, upper_bound)]

print("Lower bound:", lower_bound)
print("Upper bound:", upper_bound)
print("New shape:", autos.shape)
Lower bound: -7925.0
Upper bound: 16275.0
New shape: (343420, 17)

Now, there are 343,420 rows remaining in the dataset. This is still quite large; not too many datapoints were removed.

Let us view the new distribution of prices in a histogram.

ax = sns.histplot(
    data = autos,
    x = "price",
)

ax.set_title("Used Car Prices, Cleaned Using IQR")
ax.grid(True)

plt.show()

Unfortunately, since the lower bound was negative, unrealistically low prices close to 0 were still accepted within the range.

Remember from earlier that the 2nd most frequent price value was found to be 500 euros. This makes it likely that these were correct prices, rather than incorrectly scraped values. Thus, let us keep all price values greater than or equal to 500.

autos = autos.loc[autos["price"] >= 500]

autos.shape
(307358, 17)

Now, 307,358 datapoints remain. Let us look at the final distribution of prices:

ax = sns.histplot(
    data = autos,
    x = "price",
)

ax.set_title("Used Car Prices, Cleaned of Low Values")
ax.grid(True)

plt.show()

The distribution is still right-skewed, but at least the price range in the dataset is more reasonable now.

Metric Horsepower

We saw earlier that some power values were at 0 PS, and others were as high as 20,000 PS.

Note

A measure of 1 PS (metric horsepower) is equivalent to 0.98 HP (horsepower). (Deriquito 2020)

autos["power_ps"].describe()
count    307358.000000
mean        112.551842
std         191.557378
min           0.000000
25%          75.000000
50%         105.000000
75%         143.000000
max       20000.000000
Name: power_ps, dtype: float64

Above are the new descriptive statistics for metric horsepower, after the transformations that we have done recently.

According to “What is the Average Car Horsepower?” (2021), cars usually have a horsepower of 100 HP to 400 HP, though some have less. Faster cars have 400 HP to 1,000 HP, and some supercars have over 1,000 HP. The equivalent values in PS would be slightly higher.

Considering that these are used cars, it is unlikely for them to be very fast much less to be supercars. Thus, we will keep cars with metric horsepowers between 50 and 450, in order to take as many of the realistic values as possible.

autos = autos.loc[autos["power_ps"].between(50, 450)]

autos["power_ps"].describe()
count    274514.000000
mean        121.166931
std          50.535320
min          50.000000
25%          82.000000
50%         114.000000
75%         150.000000
max         450.000000
Name: power_ps, dtype: float64

Let us view the distribution of power values in a histogram.

ax = sns.histplot(
    data = autos,
    x = "power_ps",
)

ax.set_title("Metric Horsepower (PS) of Used Cars")
ax.grid(True)

plt.show()

Like the price data, the power data are right-skewed. Lower values are more frequent. Still, we managed to capture most (274,514) of the datapoints using this range.

Year of Registration

Earlier, we noticed that the years of registration ranged from 1000 to 9999.

autos["year_reg"].describe()
count    274514.000000
mean       2003.658039
std          29.529992
min        1000.000000
25%        1999.000000
50%        2004.000000
75%        2007.000000
max        9999.000000
Name: year_reg, dtype: float64

Logically, the years of registration should only range from when automobiles were first mass-produced, in the 1800s or 1900s (Encyclopedia Britannica), to the year when the ads were last seen by the web crawler.

We can know when the ads were last seen using the last_seen column of the dataset. The dates are shown as strings with years first, so we can sort them in descending order to find the most recent date.

autos["last_seen"].sort_values(ascending = False)
343211    2016-04-07 14:58:51
5069      2016-04-07 14:58:50
72231     2016-04-07 14:58:50
62402     2016-04-07 14:58:50
122068    2016-04-07 14:58:50
                 ...         
244077    2016-03-05 14:35:28
177326    2016-03-05 14:25:59
136842    2016-03-05 14:15:39
275196    2016-03-05 14:15:16
311225    2016-03-05 14:15:08
Name: last_seen, Length: 274514, dtype: object

It looks like all of the ads were last seen in 2016, so the cars could not have been registered beyond that year.

Therefore, a reasonable range of years would be from 1900 to 2016. Let us keep all rows within this range.

autos = autos.loc[autos["year_reg"].between(1900, 2016)]

autos["year_reg"].describe()
count    264977.000000
mean       2002.977877
std           6.097207
min        1910.000000
25%        1999.000000
50%        2003.000000
75%        2007.000000
max        2016.000000
Name: year_reg, dtype: float64

Now, the years of registration range from 1910 to 2016. There are also 264,977 datapoints remaining.

Months of Registration

Lastly, we noticed that the months of registration included integers from 0 to 12, not 1 to 12. Let us make a discrete histogram in order to understand the distribution.

ax = sns.histplot(
    data = autos,
    x = "month_reg",
    discrete = True,
)

ax.set_title("Used Cars' Months of Registration")
ax.grid(True)

plt.show()

The graph above shows that 0 is the least frequent value, appearing under 15,000 times. It seems very likely that it is a placeholder for unknown months. Therefore, we can remove rows with month 0.

autos = autos.loc[autos["month_reg"] >= 1]

autos["month_reg"].describe()
count    251198.000000
mean          6.396237
std           3.346424
min           1.000000
25%           4.000000
50%           6.000000
75%           9.000000
max          12.000000
Name: month_reg, dtype: float64

Missing Values

Now, we shall deal with the null values in the dataset. First, how many null values are there per column?

autos.isnull().sum()
date_crawled        0
name                0
price               0
abtest              0
vehicle_type     5871
year_reg            0
gearbox          3103
power_ps            0
model            7422
mileage_km          0
month_reg           0
fuel_type        9057
brand               0
damage          28718
date_created        0
postal_code         0
last_seen           0
dtype: int64

There are thousands of missing values for the vehicle type, gearbox, model, fuel type, and presence of damage. None of these can be easily determined from other columns.

If we were to remove all rows with these missing values, we would be left with:

autos.dropna().shape
(208538, 17)

We would have 208,538 rows left. This is just over half of the original number of datapoints we started with, 371,528.

This project is meant to showcase data cleaning, and we are assuming that all of the columns in the dataset will be used for analysis. Thus, we will have to delete all rows with missing values.

autos.dropna(inplace = True)

Below are the final descriptive statistics for the dataset after all of the cleaning.

autos.describe(include = "all")
date_crawled name price abtest vehicle_type year_reg gearbox power_ps model mileage_km month_reg fuel_type brand damage date_created postal_code last_seen
count 208538 208538 208538.000000 208538 208538 208538.000000 208538 208538.000000 208538 208538.000000 208538.000000 208538 208538 208538 208538 208538.000000 208538
unique 177458 114617 NaN 2 8 NaN 2 NaN 248 NaN NaN 7 39 2 98 NaN 116677
top 2016-04-02 22:54:55 BMW_318i NaN test limousine NaN manuell NaN golf NaN NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN 2016-04-07 09:45:41
freq 5 616 NaN 108258 62894 NaN 166213 NaN 17505 NaN NaN 136145 44157 189720 8271 NaN 14
mean NaN NaN 4984.662522 NaN NaN 2003.205953 NaN 122.635031 NaN 128250.990227 6.381225 NaN NaN NaN NaN 51907.756020 NaN
std NaN NaN 3990.028529 NaN NaN 5.696916 NaN 50.515054 NaN 35781.848172 3.350349 NaN NaN NaN NaN 25766.228717 NaN
min NaN NaN 500.000000 NaN NaN 1937.000000 NaN 50.000000 NaN 5000.000000 1.000000 NaN NaN NaN NaN 1067.000000 NaN
25% NaN NaN 1749.000000 NaN NaN 2000.000000 NaN 85.000000 NaN 125000.000000 3.000000 NaN NaN NaN NaN 31319.000000 NaN
50% NaN NaN 3750.000000 NaN NaN 2004.000000 NaN 116.000000 NaN 150000.000000 6.000000 NaN NaN NaN NaN 51371.000000 NaN
75% NaN NaN 7300.000000 NaN NaN 2007.000000 NaN 150.000000 NaN 150000.000000 9.000000 NaN NaN NaN NaN 72649.000000 NaN
max NaN NaN 16270.000000 NaN NaN 2016.000000 NaN 450.000000 NaN 150000.000000 12.000000 NaN NaN NaN NaN 99998.000000 NaN

Data Transformation

Strings to Datetime Objects

As we saw earlier, the date_crawled, date_created, and last_seen columns contain dates and times in string format. It would be better to store these as datetime objects so that each part (year, month, etc.) can be accessed in analysis.

For this section, I used “Convert the column type from string to datetime format in Pandas dataframe” (2020) as a reference.

First, though, we have to make a format string so that the numbers can be parsed properly.

autos[["date_crawled", "date_created", "last_seen"]].head()
date_crawled date_created last_seen
3 2016-03-17 16:54:04 2016-03-17 00:00:00 2016-03-17 17:40:17
4 2016-03-31 17:25:20 2016-03-31 00:00:00 2016-04-06 10:17:21
5 2016-04-04 17:36:23 2016-04-04 00:00:00 2016-04-06 19:17:07
6 2016-04-01 20:48:51 2016-04-01 00:00:00 2016-04-05 18:18:39
10 2016-03-26 19:54:18 2016-03-26 00:00:00 2016-04-06 10:45:34

All 3 columns seem to follow the same format:

{4 digit year}-{2 digit month}-{2 digit day} {24-hour time hour}:{minute}:{second}

The equivalent format string is below.

format_str = "%Y-%m-%d %H:%M:%S"

We can use this string to parse the dates. This can be done quickly by using the pd.to_datetime() vectorized function, instead of using a for-loop with the datetime module.

autos["date_crawled"] = pd.to_datetime(
    autos["date_crawled"],
    format = format_str,
)

autos["date_crawled"]
3        2016-03-17 16:54:04
4        2016-03-31 17:25:20
5        2016-04-04 17:36:23
6        2016-04-01 20:48:51
10       2016-03-26 19:54:18
                 ...        
371516   2016-04-04 09:57:12
371517   2016-03-28 13:48:07
371520   2016-03-19 19:53:49
371524   2016-03-05 19:56:21
371525   2016-03-19 18:57:12
Name: date_crawled, Length: 208538, dtype: datetime64[ns]

We can see that the values look similar to how they looked before, but the dtype is now datetime64[ns].

For example, we can now access the month and day from the datetime objects.

autos["date_crawled"].iloc[0].month
3
autos["date_crawled"].iloc[0].day
17

We can see that the date in the first row has a month of 3 (March) and a day of 17.

Let us do the same for the other 2 date columns.

autos["date_created"] = pd.to_datetime(
    autos["date_created"],
    format = format_str,
)

autos["last_seen"] = pd.to_datetime(
    autos["last_seen"],
    format = format_str,
)

autos[["date_crawled", "date_created", "last_seen"]].describe(
    datetime_is_numeric = True,
)
date_crawled date_created last_seen
count 208538 208538 208538
mean 2016-03-21 13:39:21.839721984 2016-03-20 19:44:12.604321536 2016-03-30 09:43:36.598633984
min 2016-03-05 14:06:23 2015-08-07 00:00:00 2016-03-05 14:15:16
25% 2016-03-13 11:06:28 2016-03-13 00:00:00 2016-03-23 20:52:41.500000
50% 2016-03-21 18:46:44.500000 2016-03-21 00:00:00 2016-04-04 14:39:49.500000
75% 2016-03-29 15:36:22.500000 2016-03-29 00:00:00 2016-04-06 11:45:07
max 2016-04-07 14:36:58 2016-04-07 00:00:00 2016-04-07 14:58:51

We can see that since the 3 columns are in datetime format, they are now treated as numerical values. They have a mean and a list of percentiles.

Data transformation is done, so we can move on to data analysis.

Data Analysis

Statistical analyses with hypothesis testing are outside of the scope of this project, since it is centered on data cleaning. Thus, I will perform some simple aggregates and visualize them in graphs.

Most Expensive Car Brands

Below, a bar graph shows each car brand and the median price. The mean is not used because we know that the price data are right-skewed, not normal.

plt.figure(figsize = (14, 5))

ax = sns.barplot(
    data = autos,
    x = "brand",
    y = "price",
    estimator = np.median,
    ci = None,
)

plt.title("Median Prices of Used Cars by Brand")
plt.grid(True)
plt.xticks(rotation = 90)

plt.show()

The graph shows that Porsche has the highest median price of used cars, at almost 12,000 euros. This is followed by Mini, Land Rover, and Jeep.

Used Car Prices by Age

The line plot below shows the relationship between a car’s year of registration (later -> newer) and the median price.

plt.figure(figsize = (14, 5))

ax = sns.lineplot(
    data = autos,
    x = "year_reg",
    y = "price",
    estimator = np.median,
    ci = None,
)

plt.title("Median Prices of Used Cars by Year of Registration")
plt.grid(True)
plt.xticks(rotation = 90)

plt.show()

The line plot shows that used car prices are lowest when the year of registration is in the mid-1990s.

To the right of the minimum, as the year of registration gets later, the median price steadily becomes higher. This makes sense because newer cars would have a higher value, even if they’re used.

On the other hand, to the left of the minimum, the median price increases (albeit erratically) as the year of registration gets earlier. This suggests that very old cars are considered to be valuable because they are rare. It is likely that the avid car collectors are the ones who pay such high prices for these cars.

Used Car Prices by Mileage

The line plot below shows the relationship between used cars’ mileage in kilometers and the median price.

plt.figure(figsize = (14, 5))

ax = sns.lineplot(
    data = autos,
    x = "mileage_km",
    y = "price",
    estimator = np.median,
    ci = None,
)

plt.title("Median Prices of Used Cars by Mileage in Kilometers")
plt.grid(True)
plt.xticks(rotation = 90)

plt.show()

As expected, price decreases as mileage increases. What is unexpected is that there is a sharp increase in price from around 500 km to 1000 km. This was likely caused by an outlier datapoint which had a low mileage and a low price.

Effect of Presence of Damage on Used Car Price

Lastly, we will analyze the effect of the presence of unrepaired damage on the median price of used cars.

ax = sns.barplot(
    data = autos,
    x = "damage",
    y = "price",
    estimator = np.median,
    ci = None,
)

plt.title("Effect of Presence of Damage on Median Used Car Price")
plt.grid(True)
plt.xticks(rotation = 90)

plt.show()

It can be seen that the median price of used cars without damage is 4000 euros. The median price is 1500 euros for those with unrepaired damage.

Therefore, cars with damage generally have lower prices. This makes sense because if the buyer plans to drive the car, they must shoulder the expense of repairing it.

Conclusion

In this project, we did cleaning, transformation, and simple analyses on data about used cars. Below are the insights we gained with regards to each research question. Note that prices were aggregated using the median, not the mean.

  • Which brands of the used cars are most expensive?

The top 4 most expensive brands of used cars are Porsche, Mini, Land Rover, and Jeep (descending order).


  • Does a used car’s age affect its price?

Yes. Cars registered in the mid-1990s have the lowest prices, with a median of 1000 euros.

As cars get newer from that point, their median price increases. As cars get older from the same point, their median price also increases. Very old cars are rare and therefore valuable to collectors.


  • Does a used car’s mileage affect its price?

Yes, as mileage increases, price decreases.


  • Does the presence of unrepaired damage affect a car’s price?

Yes, the median price of damaged used cars is much lower than that of cars without damage.



Thanks for reading!