onur baltacı
Car Purchase Decision Classification
Updated: June 1
In this project we will examine car purchase decisions using a data set I took from Kaggle, and we will build machine learning models with it. Data set link
Columns: User ID, Gender, Age, Annual Salary, Purchase Decision (No = 0; Yes = 1)
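Before exploring the data we need to import our libraries and load the file. A minimal setup sketch is below; the file name car_data.csv is an assumption, so adjust it to match the file you downloaded from Kaggle.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data set; "car_data.csv" is an assumed file name
df = pd.read_csv("car_data.csv")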
df.shape
We have 1000 rows and 5 columns in our data set
df.head()

We are not going to use the User ID column, so we can drop it
df = df.drop("User ID", axis="columns")
df["Gender"].value_counts()
We have 516 female and 484 male customers in our data set
df["Purchased"].value_counts()
598 customers didn't buy the car and 402 did
sns.histplot(x = "AnnualSalary", data = df, hue = "Gender")

It seems that female customers earn higher salaries than male customers in our data set
sns.histplot(x = "Age", data = df, hue = "Gender")

Female customers tend to be older than male customers in our data set
sns.histplot(x = "AnnualSalary", data = df, hue = "Purchased")

As might be expected, customers with higher incomes are more likely to buy the car
sns.histplot(x = "Age", data = df, hue = "Purchased")

Older customers are more likely to purchase. Now we can create dummy variables for the gender column before training our classification models
# One-hot encode Gender, then keep only Gender_Female as the single indicator column
df = pd.get_dummies(df, drop_first=False)
df = df.drop("Gender_Male", axis="columns")
Now we can look at the correlation coefficients between the purchase decision and our features
df.corr()["Purchased"].sort_values()

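For a visual version of the same information, a quick heatmap sketch (assuming the imports above):
# Correlation matrix as a heatmap; annot=True writes the coefficient in each cell
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()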
Age is the feature most correlated with the purchase decision, and annual salary follows it. Gender does not seem strongly correlated with the purchase decision. Let's define our X and y and split them into train and test sets
X = df[["AnnualSalary","Age","Gender_Female"]].copy()
y = df[["Purchased"]].copy()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.30,random_state=100)
We will scale our feature sets. Note that the scaler must be fit on the training set only and then applied to the test set, so no test information leaks into training
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
# Only transform the test set, reusing the statistics learned from the training set
scaled_X_test = scaler.transform(X_test)
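An alternative that makes this kind of leakage impossible is to bundle the scaler and the model into a Pipeline. A minimal sketch with logistic regression as the final step:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# fit() computes the scaling statistics from X_train only;
# predict() on new data reuses those same statistics
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train.values.ravel())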
We will define a function for comparing our models' performances in terms of accuracy score
from sklearn.metrics import accuracy_score
def modelperformance(predictions):
    print("Accuracy score of the model is {}".format(accuracy_score(y_test, predictions)))
Now we can start training classification models, beginning with logistic regression.
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()
log_model.fit(scaled_X_train,y_train.values.ravel())
log_predictions = log_model.predict(scaled_X_test)
modelperformance(log_predictions)
The accuracy score of logistic regression is 0.79, so it performs reasonably well on this task. Now let's see how K-nearest neighbors classification performs. To decide on the number of neighbors we will use both the elbow method and grid search, and pick the better result. We will start with the elbow method.
from sklearn.neighbors import KNeighborsClassifier
test_errors = []
for k in range(1, 30):
    knn_model = KNeighborsClassifier(n_neighbors=k)
    knn_model.fit(scaled_X_train, y_train.values.ravel())
    knn_pred = knn_model.predict(scaled_X_test)
    test_error_rate = 1 - accuracy_score(y_test, knn_pred)
    test_errors.append(test_error_rate)
plt.plot(range(1,30),test_errors)
plt.ylabel("Error Rate")
plt.xlabel("K Neighbors")

It looks like we get the lowest error with K = 10. Let's train our model with K = 10 and check our accuracy score
knn_elbow = KNeighborsClassifier(n_neighbors = 10)
knn_elbow.fit(scaled_X_train,y_train.values.ravel())
knn_pred = knn_elbow.predict(scaled_X_test)
modelperformance(knn_pred)
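The grid-search alternative mentioned above can be sketched as follows; it cross-validates K on the training set instead of reading it off the test-error curve:
from sklearn.model_selection import GridSearchCV

param_grid_knn = {"n_neighbors": list(range(1, 30))}
grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn)
grid_knn.fit(scaled_X_train, y_train.values.ravel())
print(grid_knn.best_params_)
modelperformance(grid_knn.predict(scaled_X_test))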
Our accuracy score with 10 neighbors is 0.91, so our KNN model performed well. Let's train a support vector machine classification model
from sklearn.svm import SVC
svm = SVC()
param_grid_svr = {'C':[0.01,0.1,0.5,1],'kernel':['linear','rbf','poly']}
from sklearn.model_selection import GridSearchCV
gridsvr = GridSearchCV(svm, param_grid_svr)
gridsvr.fit(scaled_X_train,y_train.values.ravel())
pred_svr = gridsvr.predict(scaled_X_test)
modelperformance(pred_svr)
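To see which kernel and C value the search selected, we can print the best parameters:
print(gridsvr.best_params_)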
The accuracy score is 0.89. Let's train a decision tree classification model
from sklearn.tree import DecisionTreeClassifier
treemodel = DecisionTreeClassifier()
treemodel.fit(scaled_X_train,y_train.values.ravel())
treepred = treemodel.predict(scaled_X_test)
modelperformance(treepred)
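To understand what the tree learned, we can draw its first few levels. A sketch using scikit-learn's plot_tree; max_depth=2 just keeps the figure readable:
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 6))
plot_tree(treemodel, feature_names=["AnnualSalary", "Age", "Gender_Female"],
          class_names=["No", "Yes"], filled=True, max_depth=2)
plt.show()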
The accuracy score is 0.87. Let's train a random forest classification model
from sklearn.ensemble import RandomForestClassifier
rfr_model = RandomForestClassifier()
n_estimators = [32,64,128,256]
max_features = [2,3,4]
# oob_score=True is only valid when bootstrap=True, so the grid is split into two valid parts
param_grid_rfr = [
    {'n_estimators': n_estimators, 'max_features': max_features, 'bootstrap': [True], 'oob_score': [True, False]},
    {'n_estimators': n_estimators, 'max_features': max_features, 'bootstrap': [False], 'oob_score': [False]},
]
grid_rfr = GridSearchCV(rfr_model,param_grid_rfr)
grid_rfr.fit(scaled_X_train,y_train.values.ravel())
print(grid_rfr.best_params_)
Now we can train our model with the hyperparameters we found from the grid search
rfc = RandomForestClassifier(max_features = 3, n_estimators = 256, oob_score=True)
rfc.fit(scaled_X_train,y_train.values.ravel())
predsrfc = rfc.predict(scaled_X_test)
modelperformance(predsrfc)
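As a quick sanity check on what the forest relies on, we can inspect the feature importances (the order matches the columns of X):
# Importance of each feature as learned by the tuned forest
print(pd.Series(rfc.feature_importances_, index=X.columns).sort_values())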
The accuracy score is 0.87. The best performing classification model was the k-nearest neighbors algorithm with K = 10, which we found with the elbow method.