
Car Purchase Decision Classification

Updated: Jun 1, 2023



In this project we will examine car purchase decisions using a data set I took from Kaggle, and we will build machine learning models with it. Data set link

Columns: User ID, Gender, Age, Annual Salary, Purchase Decision (No = 0; Yes = 1)
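
The snippets below assume the standard imports and that the CSV has been loaded into a DataFrame named df. A minimal setup sketch (the file name car_data.csv is an assumption):


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Kaggle CSV; the file name here is a placeholder
df = pd.read_csv("car_data.csv")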


df.shape

We have 1000 rows and 5 columns in our data set


df.head()


We are not going to use the User ID column, so we can drop it


df = df.drop("User ID", axis="columns")

df["Gender"].value_counts()

We have 516 female and 484 male customers in our data set


df["Purchased"].value_counts()

598 customers didn't buy the car and 402 did


sns.histplot(x = "AnnualSalary", data = df, hue = "Gender")


It seems female customers earn higher salaries than male customers in our data set


sns.histplot(x = "Age", data = df, hue = "Gender")


Female customers tend to be older than male customers in our data set


sns.histplot(x = "AnnualSalary", data = df, hue = "Purchased")


As can be expected, customers with higher incomes are more likely to buy the car


sns.histplot(x = "Age", data = df, hue = "Purchased")


Older customers are more likely to purchase. Now we can create dummy variables for the Gender column to train our classification models


# One-hot encode Gender; keep only Gender_Female to avoid a redundant dummy column
df = pd.get_dummies(df, drop_first=False)
df = df.drop("Gender_Male", axis="columns")

Now we can look at the correlation coefficients between the purchase decision and our features


df.corr()["Purchased"].sort_values()


Age is the feature most correlated with the purchase decision, followed by annual salary. Gender does not seem strongly correlated with the purchase decision. Let's define our X and y and split them into train and test sets


X = df[["AnnualSalary","Age","Gender_Female"]].copy()
y = df[["Purchased"]].copy()

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=100)

We will scale our X sets, fitting the scaler on the training set only


from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
# Reuse the scaler fitted on the training set; refitting on the test set would leak its statistics
scaled_X_test = scaler.transform(X_test)

We will define a function for comparing our models' performance in terms of accuracy score


from sklearn.metrics import accuracy_score
def modelperformance(predictions):
  print("Accuracy score in model is {}".format(accuracy_score(y_test,predictions)))

Now we can start training classification models, beginning with logistic regression.


from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()
log_model.fit(scaled_X_train,y_train.values.ravel())
log_predictions = log_model.predict(scaled_X_test)
modelperformance(log_predictions)

The accuracy score for logistic regression is 0.79, so it performs reasonably well on this task. Now let's see how K-nearest neighbors classification performs. To pick the number of neighbors we will use both the elbow method and grid search and take the better of the two. We will start with the elbow method.


from sklearn.neighbors import KNeighborsClassifier

test_errors = []
for k in range(1,30):
  knn_model = KNeighborsClassifier(n_neighbors=k)
  knn_model.fit(scaled_X_train, y_train.values.ravel())
  knn_pred = knn_model.predict(scaled_X_test)
  test_error_rate = 1 - accuracy_score(y_test, knn_pred)
  test_errors.append(test_error_rate)

# Plot once after the loop instead of redrawing on every iteration
plt.plot(range(1,30), test_errors)
plt.ylabel("Error Rate")
plt.xlabel("K Neighbors")


It seems we get the lowest error with K = 10. Let's train our model with K = 10 and check our accuracy score


knn_elbow = KNeighborsClassifier(n_neighbors=10)
knn_elbow.fit(scaled_X_train, y_train.values.ravel())
knn_pred = knn_elbow.predict(scaled_X_test)  # predict with the K=10 model, not the last model from the elbow loop
modelperformance(knn_pred)

Our accuracy score with 10 neighbors is 0.91, so our KNN model performed well. Since we planned to try both the elbow method and grid search, a grid-search cross-check is sketched below.
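
A minimal grid-search sketch over the same K range as the elbow loop (the 5-fold cv setting is an assumption, not part of the original run):


from sklearn.model_selection import GridSearchCV
# Search the same K range as the elbow loop; 5-fold CV is an assumption
param_grid_knn = {"n_neighbors": list(range(1,30))}
grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5)
grid_knn.fit(scaled_X_train, y_train.values.ravel())
print(grid_knn.best_params_)
modelperformance(grid_knn.predict(scaled_X_test))

If grid search agrees with the elbow method, K = 10 is a safe choice. Now let's train a support vector machine classification model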


from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

svm = SVC()
param_grid_svc = {'C':[0.01,0.1,0.5,1],'kernel':['linear','rbf','poly']}
grid_svc = GridSearchCV(svm, param_grid_svc)
grid_svc.fit(scaled_X_train, y_train.values.ravel())
pred_svc = grid_svc.predict(scaled_X_test)
modelperformance(pred_svc)

Accuracy score is 0.89. Let's train a decision tree classification model


from sklearn.tree import DecisionTreeClassifier

# Tree models don't require feature scaling, but reusing the scaled sets keeps the pipeline uniform
treemodel = DecisionTreeClassifier()
treemodel.fit(scaled_X_train, y_train.values.ravel())
treepred = treemodel.predict(scaled_X_test)
modelperformance(treepred)

Accuracy score is 0.87. Let's train a random forest classification model


from sklearn.ensemble import RandomForestClassifier

rfc_model = RandomForestClassifier()
n_estimators = [32,64,128,256]
max_features = [2,3,4]
bootstrap = [True,False]
oob_score = [True,False]
# Note: oob_score=True is only valid together with bootstrap=True;
# recent scikit-learn versions record the invalid combinations as failed fits
param_grid_rfc = {'n_estimators':n_estimators,'max_features':max_features,'bootstrap':bootstrap,'oob_score':oob_score}
grid_rfc = GridSearchCV(rfc_model, param_grid_rfc)
grid_rfc.fit(scaled_X_train, y_train.values.ravel())
print(grid_rfc.best_params_)

Now we train our model with the hyperparameters found by the grid search


rfc = RandomForestClassifier(max_features=3, n_estimators=256, oob_score=True)  # bootstrap defaults to True, which oob_score requires
rfc.fit(scaled_X_train, y_train.values.ravel())
predsrfc = rfc.predict(scaled_X_test)
modelperformance(predsrfc)

Accuracy score is 0.87. The best performing classification model was the k-nearest neighbors algorithm with K = 10, which we found with the elbow method.
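
To recap the comparison in one place, a small summary sketch (the accuracy values are simply the ones reported above):


# Test-set accuracies reported in the sections above
results = {
    "Logistic Regression": 0.79,
    "KNN (K=10)": 0.91,
    "SVM": 0.89,
    "Decision Tree": 0.87,
    "Random Forest": 0.87,
}
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.2f}")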

