
Analysis of Disneyland Reviews

Hello, and welcome to my new blog post. In this post, we are going to analyze and visualize Disneyland reviews using Python. After that, we will build a text classification model to predict the sentiment of new reviews. The data set is taken from the following link: Data-link. We will start by reading our data and checking the first 5 entries.
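Throughout this post we will use pandas, matplotlib, seaborn and scikit-learn, so let's import everything we need up front so the later snippets run as-is:


import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn pieces for the text classification model
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay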


data = pd.read_csv("DisneylandReviews.csv",encoding='latin-1')
data.head() 


As we can see, we have columns such as rating, review text, and branch. Let's see how many rows we have.


print("Number of rows in data : {}".format(data.shape[0]))

The output is "Number of rows in data: 42656". Let's look at the value counts of the ratings.


data["Rating"].value_counts(ascending = True)

The value counts are: 1: 1499, 2: 2127, 3: 5109, 4: 10775, 5: 23146. We can say that the Disneyland parks are mostly liked according to our data set. Let's visualize this.


ax = data['Rating'].value_counts(ascending = True).plot(kind = 'bar', title = 'Count by Ratings', figsize = (20,12), color = "Purple")
ax.set_xlabel('Stars')
plt.show()


We can see visually that positive reviews far outnumber negative ones. Let's look at the review count for each branch.


sns.countplot(y = "Branch", data=data,
                   facecolor=(1,1,1,1),
                   linewidth=6,
                   edgecolor=sns.color_palette("dark", 4))


Most of the reviews were written for Disneyland California. Let's see which Disneyland is liked the most.


sns.barplot(y=data['Branch'],x=data['Rating'])



Disneyland California is the most liked one. Let's remove the columns that we won't use and reorder the ones that we are going to keep.


# drop the columns we won't use (by position), keeping only Rating and Review_Text
data.drop(data.columns[[0,2,3,5]], axis = 1, inplace=True)
# put the review text first and the rating second
data = data[["Review_Text","Rating"]]
data


Now our data looks the way we want. Let's check for NA values and duplicated rows.


data.isnull().sum()

There are no rows with NA values. Let's check for duplicated ones.


data.duplicated().sum()

We have 23 duplicated rows. Let's remove them.


data.drop_duplicates(inplace = True)

Now we are going to write a for loop to check for blank entries. In text data like this, there can be reviews that contain only whitespace, which the isnull() method won't catch. Let's check.


blanks = []
for review in data["Review_Text"]:
  if review.isspace():  # the review contains only whitespace
    blanks.append(review)
print(blanks)

Our list is empty, so there are no blank reviews. Since we want to predict whether a review is positive or negative rather than its exact star rating, we first need a binary label column; then we can assign X and y and split the data with train_test_split.
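Here is a minimal sketch of how such a Rating_posorneg column can be built, assuming ratings of 4 and 5 count as "Positive" and 1 to 3 as "Negative" (the exact cutoff is a choice, and a different one could be used):


# assumption: ratings of 4-5 are "Positive", 1-3 are "Negative"
data["Rating_posorneg"] = data["Rating"].apply(lambda r: "Positive" if r >= 4 else "Negative")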


X = data["Review_Text"]
y = data["Rating_posorneg"]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33)  

Now we can train our model on the training data. We will use a pipeline that turns the review text into TF-IDF features and feeds them to a LinearSVC classifier.


textclf = Pipeline([('tfidf',TfidfVectorizer()),('clf',LinearSVC())])
textclf.fit(X_train,y_train)

Now we can predict labels for X_test and store them in a variable. Then we will compare the predictions with the actual y_test values to see how our model performs.


preds = textclf.predict(X_test)
print(accuracy_score(y_test,preds))

Our accuracy score is 0.93, which is a great score. Let's visualize the results with a confusion matrix.


cm = confusion_matrix(y_test, preds)
matplotlib.rc('figure', figsize=(20, 10))
# class order follows confusion_matrix, which sorts the labels alphabetically
cm_display = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = ["Negative", "Positive"])
cm_display.plot()
plt.show()


Now we can try to predict the sentiment of a new review with our model.


newreview = ["Disneyland was perfect. I liked it so much"]
textclf.predict(newreview)

The outcome is "Positive". Now let's try one more.


newreview = ["I didn't like the Disneyland. I won't visit again."]
textclf.predict(newreview)

The outcome is "Negative". We can see that our model works well. Thanks for reading, have a nice day :)


