
Drive the Future: Predicting Used Car Prices

Objective

This repository offers an in-depth analysis of a reduced-size Kaggle dataset containing information on 426,000 used cars. Its objective is to identify the factors that influence used car pricing through exploratory data analysis and machine learning models, following the CRISP-DM framework (https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining). Accurate price predictions are intended to benefit stakeholders including car dealerships, individual sellers and buyers, and financial institutions.

Executive Summary

Key Findings

No linear relationship was found between the price of a used car and its other features. Non-linear relationships, including polynomial features of the independent variables, do significantly influence the target price.

The features ranked by permutation importance include the combined impact of price and year of manufacture, followed by the year of manufacture alone. Other significant factors are the number of cylinders, the interactions between year and cylinders, year and odometer, price and odometer, and price and cylinders. Additionally, fuel type (diesel and gas), specific states such as Oklahoma, truck models, manufacturers like Mitsubishi and Porsche, and transmission types (automatic and manual), especially in sedan variants, also play important roles.
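As a rough illustration of how a permutation-importance ranking like the one above can be produced, here is a minimal sketch using scikit-learn's `permutation_importance`. The data is synthetic and the feature names are hypothetical placeholders, so the scores say nothing about the actual car dataset.

```python
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the car data (assumption): 5 features, 2 carry signal.
X, y = make_regression(n_samples=500, n_features=5, n_informative=2,
                       noise=5.0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = Ridge(alpha=1.0).fit(X_train, y_train)

# Shuffle each column in turn and measure how much the test score drops.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)

# Rank features from most to least important.
ranking = sorted(zip(feature_names, result.importances_mean),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Features whose shuffling barely changes the score end up near the bottom of the ranking, which is the sense in which a feature is described as not significantly influencing price.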

Insights and Implications

Notably, electric cars, paint colors, and a clean title status were not found to significantly influence the price of cars manufactured between 2000 and 2020, the years studied.

Interesting Findings

| Feature | Recommendation | Feature Coefficient | Impact on Used Car Price | Interpretation |
| --- | --- | --- | --- | --- |
| Price and Year | Consider pricing strategies that account for both features comprehensively | 10841.8756 (price and year combined); 3880.2149 (year only) | Positive influence | Year of manufacture tends to increase used car price |
| Cylinders | Promote higher-cylinder car models in the inventory, as they are perceived to offer better performance with increased power | 3207.6857 | Positive influence | Higher number of cylinders increases used car price |
| Odometer and Year | Promote models in inventory that have lower mileage and a recent year of manufacture, as they are perceived to have a longer potential lifespan | 463.6157 | Positive influence | Lower mileage on a relatively recent car increases its price |
| Fuel: diesel | With better fuel efficiency, diesel cars in the inventory will attract buyers looking to save on fuel cost over the long term | 3.1872 | Positive influence | Higher demand for diesel cars results in higher resale price for a used car |
| Type: Truck (with drive 4wd), Convertible | Increase the inventory of trucks that offer off-road performance and convertibles with their seasonal appeal | 1.9728, 0.8858; Drive 4wd: 0.5652 | Positive influence | Features that contribute to higher resale prices |
| State: OK, AR | Research regional demand and the local market before pricing cars in inventory | 1.8968, 0.7263 | Positive influence | Regional preferences increase used car prices |
| Manufacturer: Mitsubishi, Aston Martin, Porsche | With perceived reliability and lower maintenance, having branded and luxury cars in the inventory will increase sales | 1.4486, 1.3874, 1.1066 | Positive influence | Brands that result in appreciation of used car prices |
| Year and Cylinders, Higher odometer mileage, Odometer and Cylinders | Older cars with fewer cylinders and higher mileage tend to sell less. Consider careful pricing strategies that also factor in state/region demand | -3201.4618, -462.7157, -26.1842 | Negative influence | Older cars with higher mileage and low cylinder count result in depreciation of used car prices |
| Automatic Transmission, Fuel type of gas and hybrids, Sedan Type | Consider including these depreciating features in your pricing strategy. Incentives and limited-time promotions can make these used cars attractive to buyers, increasing sales | -1.3445; Gas: -2.5235; Hybrids: -0.8815; Sedan type: -1.0686 | Negative influence | Features that result in depreciation of used car prices |
| Manufacturer: Hyundai, Nissan | Factor in brands that decrease resale prices when pricing the inventory | -0.0027, -0.0053 | Negative influence | Brands that result in depreciation of used car prices |

[Figures: features with a negative impact on price; features with a positive impact on price]

Recommendations

Based on features identified as having higher importance in influencing the price of used cars, here are some recommendations.

Continue reading for an in-depth analysis of the model evaluation that underpins the findings and recommendations discussed.

GitHub Repository Structure

Deep Dives: Unveiling Insights with the CRISP-DM Framework

Business Use Case: Predicting Used Car Prices

The used car market is highly dynamic, with prices influenced by numerous factors such as vehicle year of manufacture, odometer mileage, condition, make and model, location, and market trends. A used car prediction model helps with the following objectives:

Data Understanding

Dataset Description

The original Kaggle dataset of 3 million used cars was reduced to 426,880 entries to speed up data processing. The reduced dataset has 18 features:

  1. id: A unique identifier for each car listing.

  2. region: The geographic region where the car is listed.

  3. price: The listed price of the car in dollars.

  4. year: The manufacturing year of the car.

  5. manufacturer: The manufacturer or brand of the car (Ford, Dodge etc).

  6. model: The model name of the car.

  7. condition: The condition of the car (New, Good, Fair, Salvage).

  8. cylinders: The number of cylinders in the car’s engine.

  9. fuel: The type of fuel the car uses (Gas, Diesel, Electric, Hybrid).

  10. odometer: The mileage of the car, i.e. the distance traveled in miles.

  11. title_status: The status of the car’s title (Clean, Salvage, Rebuilt).

  12. transmission: The type of transmission (Automatic, Manual, Other).

  13. VIN: The Vehicle Identification Number, a unique code used to identify individual motor vehicles.

  14. drive: The type of drivetrain (4wd, fwd, rwd).

  15. size: The size category of the car (Compact, Mid-size, Full-size).

  16. type: The type or category of the car (Sedan, SUV, truck).

  17. paint_color: The exterior color of the car’s paint.

  18. state: The state where the car is listed.

Dataset Exploration

Explored distribution of Numeric Features

Observations

[Figure: distributions of the numeric features]

Explored distributions of all Categorical Features

Observations with unique values

Observations with Distribution of Year and Price of used cars across category features

In a 10,000-row sample, neither car price nor year of manufacture showed a visible linear relationship with any of the categorical or numerical features.

[Figures: scatter plots of price and year of manufacture across the categorical features]

Data Preparation and Preprocessing

This section walks through the data preparation and preprocessing steps that were performed, in order.

To address the skewed price distribution, only listings priced between 2,000 and 80,000 USD were retained, which removed all prices outside those lower and upper bounds. The dataset shrank from 426,880 entries to about 372,276. The resulting price distribution is far less skewed, as seen in the plot below.
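The price filter described above can be sketched with pandas as follows. The toy frame and the inclusive treatment of both bounds are assumptions; the original notebook may handle the endpoints differently.

```python
import pandas as pd

# Tiny illustrative frame (assumption); the real data has 426,880 rows.
cars = pd.DataFrame({
    "price": [0, 1500, 2000, 15000, 80000, 250000],
    "year": [2005, 2010, 2012, 2018, 2020, 2019],
})

LOWER, UPPER = 2_000, 80_000  # bounds stated in the text, in USD

# Series.between is inclusive on both ends by default.
filtered = cars[cars["price"].between(LOWER, UPPER)].reset_index(drop=True)

print(f"kept {len(filtered)} of {len(cars)} rows")
```

Fixed bounds are easy to explain to stakeholders; a quantile-based or IQR-based rule would be an alternative if the thresholds needed to adapt to the data.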

[Figures: price distribution after filtering]

Model Building

With the used car dataset preprocessed and its features transformed, Ridge and Lasso regression were evaluated, along with PCA and K-Means clustering combined with Ridge regression. The Ridge and Lasso regression models performed better, both (a) when using a subset of features highly correlated with price and (b) when using all 157 features.
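A comparison of this kind can be set up with scikit-learn pipelines. The sketch below is illustrative only: it uses synthetic data, arbitrary alpha values, and an arbitrary number of PCA components, and it omits the K-Means variant. None of these choices are taken from the repository.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed car features (assumption).
X, y = make_regression(n_samples=1000, n_features=20, n_informative=8,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Candidate models; alphas and n_components are placeholder values.
candidates = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    "pca+ridge": make_pipeline(StandardScaler(), PCA(n_components=5),
                               Ridge(alpha=1.0)),
}

# Fit each candidate on the training split and score it on the held-out split.
test_mse = {}
for name, pipeline in candidates.items():
    pipeline.fit(X_train, y_train)
    test_mse[name] = mean_squared_error(y_test, pipeline.predict(X_test))
    print(f"{name}: test MSE = {test_mse[name]:.2f}")
```

Wrapping the scaler and PCA inside each pipeline keeps the transformations fitted on the training split only, which avoids leaking test information into the comparison.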

Models considered for evaluation

Models #2 and #5 were trained and tested with an 80/20 train/test split. The regularization hyperparameter (alpha) was tuned with grid search combined with 5-fold cross-validation to optimize model performance and prevent overfitting.
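The split-and-tune procedure could look roughly like the following. This is a sketch under assumptions: the data is synthetic, and the alpha grid and random seeds are hypothetical rather than the values used in the repository.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the preprocessed car features (assumption).
X, y = make_regression(n_samples=1000, n_features=20, noise=10.0,
                       random_state=0)

# 80/20 train/test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Hypothetical alpha grid; each value is scored with 5-fold cross-validation.
param_grid = {"alpha": np.logspace(-3, 3, 7)}
search = GridSearchCV(Ridge(), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

# GridSearchCV refits the best model on the full training split by default.
best_alpha = search.best_params_["alpha"]
train_mse = mean_squared_error(y_train, search.predict(X_train))
test_mse = mean_squared_error(y_test, search.predict(X_test))
print(f"best alpha={best_alpha}, train MSE={train_mse:.2f}, "
      f"test MSE={test_mse:.2f}")
```

The test split is touched only once, after tuning, so the reported test MSE is not influenced by the choice of alpha.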

Best Model

The best model is the Ridge regression model (Model #2) using all features, with a test MSE of 286.8824 and a train MSE of 265.29658.

Models discarded for evaluation

Conclusion

We concluded that regularized regression, such as Lasso with its built-in feature selection, outperforms dimensionality reduction techniques like PCA when predicting used car prices. Lasso drives the coefficients of less relevant features to zero and Ridge shrinks them toward zero, which led to better model performance than reducing dimensionality with PCA.

Model Evaluation of Best Model with Ridge Regression

This best model performs well in predicting used car prices, exhibiting moderate variance.

The train and test MSE are both low and close to each other, which indicates that the model strikes a good balance between bias and variance, with no significant overfitting or underfitting.

Using 5-fold cross-validation, the best Train and best Test MSE values for the k-folds are also close to each other, further confirming the model’s robustness and generalizability.
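One way to obtain per-fold train and test MSE values for such a comparison is scikit-learn's `cross_validate` with `return_train_score=True`. The sketch below uses synthetic data and a placeholder alpha, so the numbers it prints are unrelated to the reported results.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the preprocessed car features (assumption).
X, y = make_regression(n_samples=1000, n_features=20, noise=10.0,
                       random_state=0)

scores = cross_validate(Ridge(alpha=1.0), X, y, cv=5,
                        scoring="neg_mean_squared_error",
                        return_train_score=True)

# scikit-learn reports negated MSE so that higher is better; flip the sign.
fold_train_mse = -scores["train_score"]
fold_test_mse = -scores["test_score"]

for fold, (train_err, test_err) in enumerate(
        zip(fold_train_mse, fold_test_mse), start=1):
    print(f"fold {fold}: train MSE={train_err:.2f}, test MSE={test_err:.2f}")
```

A small and stable gap between the two columns across folds is the pattern the paragraph above describes.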

Learning curves also help visualize how this model trades off bias and variance across different training set sizes. The plot shows both training and validation errors decreasing and eventually converging, suggesting that the model generalizes well with reduced variance.
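The numbers behind a learning curve can be computed with scikit-learn's `learning_curve`. This is a minimal sketch on synthetic data with a placeholder alpha; plotting is left out so that the example has no dependency beyond NumPy and scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the preprocessed car features (assumption).
X, y = make_regression(n_samples=1000, n_features=20, noise=10.0,
                       random_state=0)

# Evaluate the model at 5 training-set sizes, each with 5-fold CV.
train_sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0), X, y, cv=5,
    scoring="neg_mean_squared_error",
    train_sizes=np.linspace(0.1, 1.0, 5))

# Average over folds and flip the sign to get MSE per training-set size.
train_mse_curve = -train_scores.mean(axis=1)
val_mse_curve = -val_scores.mean(axis=1)

for size, train_err, val_err in zip(train_sizes, train_mse_curve,
                                    val_mse_curve):
    print(f"n={size}: train MSE={train_err:.2f}, "
          f"validation MSE={val_err:.2f}")
```

Plotting `train_mse_curve` and `val_mse_curve` against `train_sizes` gives the usual picture: two curves that approach each other as more data is used indicate low variance.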

Evaluation Metrics

[Figures: evaluation metrics for the best model]

Learning Curves

[Figure: learning curves]

Actual vs Predicted Prices

[Figure: actual vs. predicted prices]

Conclusion

The best model evaluation provided several key insights into the features that influence used car prices. These are outlined in the Summary section at the start of this report.

Future Work