onur baltacı
Car Purchase Decision Classification
Updated: June 1
In this project we will examine car purchase decisions using a data set I took from Kaggle, and we will build machine learning models with it. Data set link
Columns: User ID, Gender, Age, Annual Salary, Purchase Decision (No = 0; Yes = 1)
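Before exploring the data we need to import our libraries and load the file. A minimal setup sketch is below; the file name car_data.csv is an assumption, so adjust it to match the file you downloaded from Kaggle.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data set; "car_data.csv" is an assumed file name
df = pd.read_csv("car_data.csv")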
df.shape
We have 1000 rows and 5 columns in our data set
df.head()

We are not going to use the User ID column, so we can drop it
df = df.drop("User ID", axis="columns")
df["Gender"].value_counts()
We have 516 female and 484 male customers in our data set
df["Purchased"].value_counts()
598 customers didn't buy the car and 402 did
sns.histplot(x = "AnnualSalary", data = df, hue = "Gender")

It seems that female customers earn higher salaries than male customers in our data set
sns.histplot(x = "Age", data = df, hue = "Gender")

Female customers tend to be older than male customers in our data set
sns.histplot(x = "AnnualSalary", data = df, hue = "Purchased")

As might be expected, customers with higher incomes are more likely to buy the car
sns.histplot(x = "Age", data = df, hue = "Purchased")

Older customers are more likely to purchase. Now we can create dummy variables for the gender column before training our classification models
# One-hot encode Gender, then keep only Gender_Female as the single indicator column
df = pd.get_dummies(df, drop_first=False)
df = df.drop("Gender_Male", axis="columns")
Now we can look at the correlation coefficients between the purchase decision and our features
df.corr()["Purchased"].sort_values()

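For a visual version of the same information, a quick heatmap sketch (assuming the imports above):
# Correlation matrix as a heatmap; annot=True writes the coefficient in each cell
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()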
Age is the feature most correlated with the purchase decision, and annual salary follows it. Gender does not seem strongly correlated with the purchase decision. Let's define our X and y and split them into train and test sets
X = df[["AnnualSalary","Age","Gender_Female"]].copy()
y = df[["Purchased"]].copy()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.30,random_state=100)
We will scale our feature sets. Note that the scaler must be fit on the training set only and then applied to the test set, so no test information leaks into training
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
# Only transform the test set, reusing the statistics learned from the training set
scaled_X_test = scaler.transform(X_test)
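An alternative that makes this kind of leakage impossible is to bundle the scaler and the model into a Pipeline. A minimal sketch with logistic regression as the final step:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# fit() computes the scaling statistics from X_train only;
# predict() on new data reuses those same statistics
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train.values.ravel())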
We will define a function for comparing our models' performances in terms of accuracy score
from sklearn.metrics import accuracy_score
def modelperformance(predictions):
    print("Accuracy score of the model is {}".format(accuracy_score(y_test, predictions)))
Now we can start training classification models, beginning with logistic regression.
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()
log_model.fit(scaled_X_train,y_train.values.ravel())
log_predictions = log_model.predict(scaled_X_test)
modelperformance(log_predictions)
The accuracy score of logistic regression is 0.79, so it performs reasonably well on this task. Now let's see how K-nearest neighbors classification performs. To decide on the number of neighbors we will use both the elbow method and grid search, and pick the better result. We will start with the elbow method.
from sklearn.neighbors import KNeighborsClassifier
test_errors = []
for k in range(1, 30):
    knn_model = KNeighborsClassifier(n_neighbors=k)
    knn_model.fit(scaled_X_train, y_train.values.ravel())
    knn_pred = knn_model.predict(scaled_X_test)
    test_error_rate = 1 - accuracy_score(y_test, knn_pred)
    test_errors.append(test_error_rate)
plt.plot(range(1,30),test_errors)
plt.ylabel("Error Rate")
plt.xlabel("K Neighbors")

It looks like we get the lowest error with K = 10. Let's train our model with K = 10 and check our accuracy score
knn_elbow = KNeighborsClassifier(n_neighbors = 10)
knn_elbow.fit(scaled_X_train,y_train.values.ravel())
knn_pred = knn_elbow.predict(scaled_X_test)
modelperformance(knn_pred)
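The grid-search alternative mentioned above can be sketched as follows; it cross-validates K on the training set instead of reading it off the test-error curve:
from sklearn.model_selection import GridSearchCV

param_grid_knn = {"n_neighbors": list(range(1, 30))}
grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn)
grid_knn.fit(scaled_X_train, y_train.values.ravel())
print(grid_knn.best_params_)
modelperformance(grid_knn.predict(scaled_X_test))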
Our accuracy score with 10 neighbors is 0.91, so our KNN model performed well. Let's train a support vector machine classification model
from sklearn.svm import SVC
svm = SVC()
param_grid_svr = {'C':[0.01,0.1,0.5,1],'kernel':['linear','rbf','poly']}
from sklearn.model_selection import GridSearchCV
gridsvr = GridSearchCV(svm, param_grid_svr)
gridsvr.fit(scaled_X_train,y_train.values.ravel())
pred_svr = gridsvr.predict(scaled_X_test)
modelperformance(pred_svr)
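To see which kernel and C value the search selected, we can print the best parameters:
print(gridsvr.best_params_)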
The accuracy score is 0.89. Let's train a decision tree classification model
from sklearn.tree import DecisionTreeClassifier
treemodel = DecisionTreeClassifier()
treemodel.fit(scaled_X_train,y_train.values.ravel())
treepred = treemodel.predict(scaled_X_test)
modelperformance(treepred)
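To understand what the tree learned, we can draw its first few levels. A sketch using scikit-learn's plot_tree; max_depth=2 just keeps the figure readable:
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 6))
plot_tree(treemodel, feature_names=["AnnualSalary", "Age", "Gender_Female"],
          class_names=["No", "Yes"], filled=True, max_depth=2)
plt.show()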
The accuracy score is 0.87. Let's train a random forest classification model
from sklearn.ensemble import RandomForestClassifier
rfr_model = RandomForestClassifier()
n_estimators = [32,64,128,256]
max_features = [2,3,4]
# oob_score=True is only valid when bootstrap=True, so the grid is split into two valid parts
param_grid_rfr = [
    {'n_estimators': n_estimators, 'max_features': max_features, 'bootstrap': [True], 'oob_score': [True, False]},
    {'n_estimators': n_estimators, 'max_features': max_features, 'bootstrap': [False], 'oob_score': [False]},
]
grid_rfr = GridSearchCV(rfr_model,param_grid_rfr)
grid_rfr.fit(scaled_X_train,y_train.values.ravel())
print(grid_rfr.best_params_)
Now we can train our model with the hyperparameters we found from the grid search
rfc = RandomForestClassifier(max_features = 3, n_estimators = 256, oob_score=True)
rfc.fit(scaled_X_train,y_train.values.ravel())
predsrfc = rfc.predict(scaled_X_test)
modelperformance(predsrfc)
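As a quick sanity check on what the forest relies on, we can inspect the feature importances (the order matches the columns of X):
# Importance of each feature as learned by the tuned forest
print(pd.Series(rfc.feature_importances_, index=X.columns).sort_values())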
The accuracy score is 0.87. The best performing classification model was the k-nearest neighbors algorithm with K = 10, which we found with the elbow method.