  • onur baltacı

Car Purchase Decision Classification

Updated: 7 Mar

In this project we examine car purchase decisions using a data set taken from Kaggle, and we build several machine learning models on it. Data set link

Columns : User ID, Gender, Age, Annual Salary, Purchase Decision (No = 0; Yes = 1)


We have 1000 rows and 5 columns in our data set


We are not going to use the User ID column, so we can drop it.

df = df.drop('User ID', axis ="columns")


We have 516 female and 484 male customers in our data set.


598 customers did not buy the car and 402 did.
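Counts like these can be read off with `value_counts`. The sketch below uses a tiny made-up frame (`toy` is a hypothetical name, not the Kaggle data) only to show the call; the column names follow the ones used in the code of this post.

```python
import pandas as pd

# Tiny made-up frame for illustration; the real data set has 1000 rows.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Female", "Male"],
    "Purchased": [0, 1, 0, 1, 0],
})

# Number of rows per category in each column
gender_counts = toy["Gender"].value_counts()
purchase_counts = toy["Purchased"].value_counts()

print(gender_counts)
print(purchase_counts)
```

Running the same two `value_counts` calls on the real `df` is one way to obtain the gender and purchase totals quoted above.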

sns.histplot(x = "AnnualSalary", data = df, hue = "Gender")

It seems that female customers earn higher salaries than male customers in our data set.

sns.histplot(x = "Age", data = df, hue = "Gender")

Female customers tend to be older than male customers in our data set.
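Impressions taken from histograms can be checked numerically with a grouped summary. This is a minimal sketch on made-up numbers (the `toy` frame and its values are assumptions for illustration), showing the median age and salary per gender.

```python
import pandas as pd

# Made-up values; only the column names mirror the post.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "Age": [45, 50, 38, 30, 41, 35],
    "AnnualSalary": [80000, 95000, 60000, 55000, 70000, 62000],
})

# Median of each numeric column within each gender group
summary = toy.groupby("Gender")[["Age", "AnnualSalary"]].median()
print(summary)
```

The median is used here rather than the mean because it is less sensitive to a few very high salaries.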

sns.histplot(x = "AnnualSalary", data = df, hue = "Purchased")

As might be expected, customers with higher incomes are more likely to buy the car.
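One way to put a number on this trend is the purchase rate per salary band. The sketch below assumes made-up data and arbitrary band edges (`toy`, `SalaryBand`, and the bin limits are all illustrative choices, not part of the original analysis).

```python
import pandas as pd

# Made-up salaries and decisions for illustration
toy = pd.DataFrame({
    "AnnualSalary": [20000, 35000, 50000, 65000, 80000, 95000, 110000, 125000],
    "Purchased":    [0,     0,     0,     1,     0,     1,     1,      1],
})

# Cut salaries into three bands; intervals are closed on the right by default
toy["SalaryBand"] = pd.cut(
    toy["AnnualSalary"],
    bins=[0, 50000, 100000, 150000],
    labels=["low", "mid", "high"],
)

# The mean of a 0/1 column is the share of buyers in each band
rate_by_band = toy.groupby("SalaryBand", observed=True)["Purchased"].mean()
print(rate_by_band)
```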

sns.histplot(x = "Age", data = df, hue = "Purchased")

Older customers are more likely to purchase. Now we can create dummy variables for the Gender column so that we can build our classification models.

df = pd.get_dummies(df, drop_first=False)
df = df.drop("Gender_Male",axis ="columns")
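To make the effect of these two lines concrete, here is a small sketch on a made-up frame (`toy` is hypothetical). It shows the columns that `get_dummies` produces and what remains after dropping `Gender_Male`.

```python
import pandas as pd

toy = pd.DataFrame({"Gender": ["Female", "Male", "Female"], "Age": [30, 40, 50]})

# One indicator column per category; non-categorical columns are kept as they are
encoded = pd.get_dummies(toy, drop_first=False)
print(list(encoded.columns))

# Keep a single indicator: Gender_Female is 1/True for female customers
reduced = encoded.drop("Gender_Male", axis="columns")
print(reduced)
```

Note that `drop_first=True` would instead remove the first category (`Gender_Female`) and keep `Gender_Male`, which is why the explicit drop is used when the female indicator is the one wanted.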

Now we can compute the correlation coefficients between the purchase decision and our features.


Age is the feature most correlated with the purchase decision, followed by annual salary. Gender does not seem to be strongly correlated with the purchase decision. Let's define our X and y and split them into training and test sets.

X = df[["AnnualSalary","Age","Gender_Female"]].copy()
y = df[["Purchased"]].copy()

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.30,random_state=100)

We will scale our feature sets.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)  # reuse the statistics fitted on the training set
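A common convention is to fit the scaler on the training split only and apply the same mean and standard deviation to the test split, so that no information from the test data enters the preprocessing. The sketch below illustrates this on a made-up single-feature array (`X_train_demo` and `X_test_demo` are hypothetical names).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_demo = np.array([[20.0], [30.0], [40.0]])
X_test_demo = np.array([[30.0], [50.0]])

scaler_demo = StandardScaler()
# Learn mean and standard deviation from the training data, then scale it
scaled_train_demo = scaler_demo.fit_transform(X_train_demo)
# Apply the training statistics to the test data without refitting
scaled_test_demo = scaler_demo.transform(X_test_demo)

print(scaler_demo.mean_, scaler_demo.scale_)
print(scaled_test_demo.ravel())
```

The test value 30 maps to 0 because it equals the training mean, which shows that the test set is expressed in the training set's units.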

We will define a function for comparing our models' performance in terms of accuracy score.

from sklearn.metrics import accuracy_score
def modelperformance(predictions):
  print("Accuracy score in model is {}".format(accuracy_score(y_test,predictions)))
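For reference, the accuracy score is simply the share of predictions that match the true labels. This small sketch with made-up labels checks a manual computation against `accuracy_score`.

```python
from sklearn.metrics import accuracy_score

# Made-up labels: 4 of the 5 predictions are correct
y_true_demo = [0, 1, 1, 0, 1]
y_pred_demo = [0, 1, 0, 0, 1]

# Fraction of positions where prediction and truth agree
manual_accuracy = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
library_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, library_accuracy)
```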

Now we can start building classification models with logistic regression.

from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression().fit(scaled_X_train, y_train.values.ravel())
log_predictions = log_model.predict(scaled_X_test)
modelperformance(log_predictions)

The accuracy score of logistic regression is 0.79, so logistic regression performs reasonably well on this classification task. Now let's see how K-nearest neighbors classification performs. To choose K we will use both the elbow method and grid search and pick the better of the two. We will start with the elbow method.

from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
test_errors = []
for k in range(1,30):
  knn_model = KNeighborsClassifier(n_neighbors=k).fit(scaled_X_train, y_train.values.ravel())
  knn_pred = knn_model.predict(scaled_X_test)
  test_error_rate = 1 - accuracy_score(y_test,knn_pred)
  test_errors.append(test_error_rate)  # keep the error for this K
plt.plot(range(1,30), test_errors)
plt.ylabel("Error Rate")
plt.xlabel("K Neighbors")
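Besides reading the elbow off the plot, the K with the lowest test error can be picked programmatically. The sketch below runs the same kind of loop on synthetic data from `make_classification` (an assumption, used only so the example is self-contained) and takes the position of the minimum error.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled features and labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=100)

k_values = list(range(1, 30))
errors = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors.append(1 - accuracy_score(y_te, model.predict(X_te)))

# np.argmin returns the first position of the smallest error
best_k = k_values[int(np.argmin(errors))]
print(best_k, min(errors))
```

When several values of K tie for the lowest error, this picks the smallest of them.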

It seems that we get the lowest error with K = 10. Let's build our model with K = 10 and check its accuracy score.

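knn_elbow = KNeighborsClassifier(n_neighbors = 10).fit(scaled_X_train, y_train.values.ravel())
knn_pred = knn_elbow.predict(scaled_X_test)  # predict with the K = 10 model, not the last model from the loop
modelperformance(knn_pred)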
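The grid-search alternative for choosing K is not shown in this post. A minimal sketch of how it might look follows, assuming 5-fold cross-validation on the training split and synthetic data in place of the car data set; all names ending in `_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled features and labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=100)

# Try every K from 1 to 29 with cross-validation on the training data only
param_grid_demo = {"n_neighbors": list(range(1, 30))}
grid_demo = GridSearchCV(KNeighborsClassifier(), param_grid_demo, cv=5, scoring="accuracy")
grid_demo.fit(X_tr, y_tr)

best_k_demo = grid_demo.best_params_["n_neighbors"]
test_accuracy_demo = grid_demo.score(X_te, y_te)
print(best_k_demo, test_accuracy_demo)
```

Unlike the elbow loop, this selects K without looking at the test set, so the test accuracy stays an independent check. The numbers from this synthetic sketch are unrelated to the results reported for the car data set.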

Our accuracy score with 10 neighbors is 0.91, so our KNN model performed well. Let's build a support vector machine classification model.

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
svm = SVC()
param_grid_svr = {'C':[0.01,0.1,0.5,1],'kernel':['linear','rbf','poly']}
gridsvr = GridSearchCV(svm,param_grid_svr).fit(scaled_X_train, y_train.values.ravel())
pred_svr = gridsvr.predict(scaled_X_test)
modelperformance(pred_svr)
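After a grid search it can be useful to see how every parameter combination scored, not only the winner. The sketch below uses the same `C` and `kernel` grid on synthetic data (an assumption for self-containment) and lists the cross-validated scores from `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

param_grid_demo = {"C": [0.01, 0.1, 0.5, 1], "kernel": ["linear", "rbf", "poly"]}
grid_demo = GridSearchCV(SVC(), param_grid_demo).fit(X_demo, y_demo)

# One row per parameter combination, best mean cross-validation score first
results = pd.DataFrame(grid_demo.cv_results_)[["param_C", "param_kernel", "mean_test_score"]]
results = results.sort_values("mean_test_score", ascending=False)

print(results.head())
print(grid_demo.best_params_)
```

Because `GridSearchCV` refits the best combination on all the training data by default, calling `predict` on the fitted grid object uses that best model. The scores from this synthetic sketch are unrelated to the results reported for the car data set.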

The accuracy score is 0.89. Let's build a decision tree classification model.

from sklearn.tree import DecisionTreeClassifier
treemodel = DecisionTreeClassifier().fit(scaled_X_train, y_train.values.ravel())
treepred = treemodel.predict(scaled_X_test)
modelperformance(treepred)

The accuracy score is 0.87. Let's build a random forest classification model.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
rfr_model = RandomForestClassifier()
n_estimators = [32,64,128,256]
max_features = [2,3,4]
bootstrap = [True,False]
# combinations with bootstrap=False and oob_score=True cannot be fitted; grid search scores them as NaN with a warning
oob_score = [True,False]
param_grid_rfr = {'n_estimators':n_estimators,'max_features':max_features,'bootstrap':bootstrap,'oob_score':oob_score}
grid_rfr = GridSearchCV(rfr_model,param_grid_rfr).fit(scaled_X_train, y_train.values.ravel())
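One detail of this grid is worth illustrating: the out-of-bag score is only defined when bootstrap sampling is on, so `bootstrap=False` together with `oob_score=True` is rejected at fit time. The sketch below shows this on synthetic data (an assumption for self-containment), along with a valid combination.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Invalid combination: no bootstrap samples means no out-of-bag rows to score
try:
    RandomForestClassifier(n_estimators=10, bootstrap=False, oob_score=True).fit(X_demo, y_demo)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print("fit failed:", err)

# Valid combination: the out-of-bag accuracy is available after fitting
forest_demo = RandomForestClassifier(
    n_estimators=50, bootstrap=True, oob_score=True, random_state=0
).fit(X_demo, y_demo)
print(forest_demo.oob_score_)
```

Inside `GridSearchCV` such failing fits do not stop the search with the default settings; they simply never become the best combination.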

Next we build our model with the hyperparameters found by the grid search.

rfc = RandomForestClassifier(max_features = 3, n_estimators = 256, oob_score=True).fit(scaled_X_train, y_train.values.ravel())
predsrfc = rfc.predict(scaled_X_test)
modelperformance(predsrfc)

The accuracy score is 0.87. The best-performing classification model was the k-nearest neighbors algorithm with K = 10, which we found with the elbow method.
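The accuracy scores reported in this post can be collected in one place for a side-by-side comparison. The sketch below only tabulates those reported figures; it does not retrain anything, and the dictionary name is a hypothetical choice.

```python
# Accuracy scores as reported in this post
reported_accuracy = {
    "Logistic regression": 0.79,
    "K-nearest neighbors (K = 10)": 0.91,
    "Support vector machine": 0.89,
    "Decision tree": 0.87,
    "Random forest": 0.87,
}

# Print the models from best to worst
for name, acc in sorted(reported_accuracy.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<30} {acc:.2f}")

# Name of the model with the highest reported accuracy
best_model = max(reported_accuracy, key=reported_accuracy.get)
print("Best:", best_model)
```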
