For this section of my project, I used the Tweets data, since SVM works well on high-dimensional data such as text. The Tweets were previously gathered via the Twitter API and contain the keywords "Apple", "Home Depot", and "Goldman Sachs".
To prepare the data for SVM, I ran the CountVectorizer function on the text data, as I did when training the Naive Bayes model, to generate the features. Once the data were vectorized, I divided them into variables X and y: X consists of the vectorized words, while y is the list of company labels. I then split X and y into a training set (80%) and a test set (20%).
Support vector machine, commonly abbreviated as SVM, is an algorithm that finds a hyperplane in an N-dimensional space that separates the data points into classes. In this definition, N is the number of features in the data. The objective of SVM is to find the kernel and hyperplane that give the greatest margin between the different classes, and hence the most accurate predictions. I will try four kernels, or sets of mathematical functions, to find the best fit: linear, polynomial, sigmoid, and radial basis function (rbf). Because the data contain 3 labels ("Apple", "Home Depot", and "Goldman Sachs"), the SVM must compute multiple separating hyperplanes instead of only one.
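As a quick, hedged sketch of how these four kernels are specified in scikit-learn (the SVC class and kernel names are standard; the actual fitting and evaluation for each kernel appears later in this section):

from sklearn.svm import SVC
# One untrained SVC per kernel; only the kernel argument changes, while the rest of the
# pipeline (fit, predict, score) stays identical for all four.
kernel_models = {k: SVC(kernel=k, C=1.0) for k in ['linear', 'poly', 'sigmoid', 'rbf']}
print(kernel_models)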
# Load essential libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt
# Read csv files as df
df1 = pd.read_csv('../data/cleaned_goldman_sachs_tweets.csv', on_bad_lines='skip')
df2 = pd.read_csv('../data/cleaned_apple_tweets.csv', on_bad_lines='skip')
df3 = pd.read_csv('../data/cleaned_home_depot_tweets.csv', on_bad_lines='skip')
# Print the head of each df
print(df1.head())
print(df2.head())
print(df3.head())
    goldman
0     sachs
1   laporta
2  tomorrow
3   goldman
4     sachs

       email
0       want
1  something
2       must
3   approach
4     slowly

       koch
0  industry
1  american
2   crystal
3     sugar
4      home
# X
# Push column name of df1 into values
df1.loc[-1] = df1.columns.values
df1.sort_index(inplace=True)
df1.reset_index(drop=True, inplace=True)
df1.rename(columns={"goldman": "tweets"}, inplace=True) #rename column
df1['company'] = 'goldman_sachs' #add new column with company label
print(df1)
# Do the same for df2
df2.loc[-1] = df2.columns.values
df2.sort_index(inplace=True)
df2.reset_index(drop=True, inplace=True)
df2.rename(columns={"email": "tweets"}, inplace=True) #rename column
df2['company'] = 'apple' #add new column with company label
print(df2)
# Do the same for df3
df3.loc[-1] = df3.columns.values
df3.sort_index(inplace=True)
df3.reset_index(drop=True, inplace=True)
df3.rename(columns={"koch": "tweets"}, inplace=True) #rename column
df3['company'] = 'home_depot' #add new column with company label
print(df3)
# Concat dataframes
df = pd.concat([df1, df2, df3])
print(df.head())
               tweets        company
0             goldman  goldman_sachs
1               sachs  goldman_sachs
2             laporta  goldman_sachs
3            tomorrow  goldman_sachs
4             goldman  goldman_sachs
...               ...            ...
1977               uk  goldman_sachs
1978              efc  goldman_sachs
1979  catskillfishing  goldman_sachs
1980  catskillfishing  goldman_sachs
1981  catskillfishing  goldman_sachs

[1982 rows x 2 columns]

       tweets company
0       email   apple
1        want   apple
2   something   apple
3        must   apple
4    approach   apple
...       ...     ...
1426    razak   apple
1427     nata   apple
1428    geoff   apple
1429     nata   apple
1430    chris   apple

[1431 rows x 2 columns]

            tweets     company
0             koch  home_depot
1         industry  home_depot
2         american  home_depot
3          crystal  home_depot
4            sugar  home_depot
...            ...         ...
1923    lmleeminki  home_depot
1924             w  home_depot
1925   kurtishanni  home_depot
1926   kurtishanni  home_depot
1927  edealinfousa  home_depot

[1928 rows x 2 columns]

     tweets        company
0   goldman  goldman_sachs
1     sachs  goldman_sachs
2   laporta  goldman_sachs
3  tomorrow  goldman_sachs
4   goldman  goldman_sachs
The classes that we are trying to predict with the SVM model are "goldman_sachs", "apple", and "home_depot", which are encoded as 1, 0, and 2 respectively. The class distribution is slightly uneven: "goldman_sachs" contains slightly more words than "home_depot", and both contain noticeably more than "apple". I will keep this imbalance in mind when analyzing results, since accuracy for the better-represented labels may be higher simply because there are more data for those companies to begin with.
## Examine class distribution
sns.set(font_scale=1.2)
# Visualize class distribution
df['company'] = pd.factorize(df.company, sort=True)[0]
fig, ax1 = plt.subplots(ncols=1, figsize=(10,6))
df['company'].value_counts().plot(kind='bar', rot=0, fontsize=15, ax=ax1)
ax1.set_title('Distribution of Tweets', fontsize=20)
ax1.set_xlabel('Tweet Label', fontsize=15)
ax1.set_ylabel('Number of Words', fontsize=15)
# save picture
fig1 = ax1.get_figure()
fig1.savefig("../501-project-website/images/SVM_tweet_labels_distribution.png")
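If this slight imbalance turns out to bias the results, one option is to let the SVM reweight classes inversely to their frequency. This is only a sketch of that idea; the models fitted below do not use it.

from sklearn.svm import SVC
# class_weight='balanced' scales the misclassification penalty for each class
# inversely to its frequency, so under-represented labels are not systematically ignored.
weighted_model = SVC(C=1.0, kernel='linear', class_weight='balanced')
# weighted_model.fit(X_train, y_train)  # would be fit exactly like the unweighted models below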
To understand how accurate my upcoming model is, I created a baseline model using a uniform random number generation process. Since the actual SVM model predicts 3 categories, I generated 3 different labels for this baseline model. The baseline classifier's accuracy, precision, recall, and F-scores are below.
## Baseline model
import random
from collections import Counter
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support
def generate_label_data(class_labels, weights, N=10000):
    y = random.choices(class_labels, weights=weights, k=N)
    print("-----GENERATING DATA-----")
    print("unique entries:", Counter(y).keys())
    print("count of labels:", Counter(y).values())  # counts each label's frequency
    print("probability of labels:", np.fromiter(Counter(y).values(), dtype=float)/len(y))  # proportion of each label
    return y
## Generate random classifier
def random_classifier(y_data):
    ypred = []
    max_label = np.max(y_data)
    for i in range(0, len(y_data)):
        ypred.append(int(np.floor((max_label+1)*np.random.uniform(0, 1))))
    print("-----RANDOM CLASSIFIER-----")
    print("count of prediction:", Counter(ypred).values())  # counts each predicted label's frequency
    print("probability of prediction:", np.fromiter(Counter(ypred).values(), dtype=float)/len(y_data))  # proportion of each predicted label
    print("accuracy", accuracy_score(y_data, ypred))
    print("precision, recall, fscore:", precision_recall_fscore_support(y_data, ypred))
# Random classifier
print("\nMULTI-CLASS: UNIFORM LOAD")
y = generate_label_data([0, 1, 2], [1/3, 1/3, 1/3], 10000)
random_classifier(y)
MULTI-CLASS: UNIFORM LOAD
-----GENERATING DATA-----
unique entries: dict_keys([2, 1, 0])
count of labels: dict_values([3357, 3340, 3303])
probability of labels: [0.3357 0.334  0.3303]
-----RANDOM CLASSIFIER-----
count of prediction: dict_values([3266, 3380, 3354])
probability of prediction: [0.3266 0.338  0.3354]
accuracy 0.3375
precision, recall, fscore: (array([0.33461538, 0.34292713, 0.33512224]), array([0.34241599, 0.33532934, 0.33482276]), array([0.33847075, 0.33908568, 0.33497243]), array([3303, 3340, 3357]))
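For comparison, scikit-learn's DummyClassifier provides an equivalent random baseline in fewer lines. This is a sketch only and was not used to produce the numbers above; it reuses the random labels y generated in the previous cell and a placeholder feature matrix, since the dummy classifier ignores the features anyway.

from sklearn.dummy import DummyClassifier
# DummyClassifier with strategy='uniform' guesses each of the 3 labels with equal
# probability, mirroring the hand-rolled random classifier above (~33% expected accuracy).
X_placeholder = np.zeros((len(y), 1))  # features are ignored by the dummy model
dummy = DummyClassifier(strategy="uniform", random_state=0)
dummy.fit(X_placeholder, y)
print("Dummy baseline accuracy:", accuracy_score(y, dummy.predict(X_placeholder)))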
Since I am dealing with text data, there are not many features I need to select by hand. Each row of the vectorized data corresponds to a single word from a Tweet (labeled with the company it mentions), while each column corresponds to a vocabulary term. Previously, in the data cleaning stage, many words were filtered out in order to arrive at the current set of features: I removed English stop words (such as "I", "an", "the"), hashtags, punctuation, and symbols such as emojis that do not carry significant meaning for this analysis. That cleaning process effectively served as the feature selection for this model.
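For reference, much of this filtering could also be folded directly into the vectorizer itself. The sketch below assumes the stop-word removal were done at vectorization time; it is not what the cells below run, since my cleaning was done beforehand.

from sklearn.feature_extraction.text import CountVectorizer
# stop_words='english' drops common English words at vectorization time, and
# min_df drops terms that appear in less than 1% of the documents.
alt_vectorizer = CountVectorizer(stop_words='english', min_df=0.01)
# alt_X = alt_vectorizer.fit_transform(corpus)  # corpus is built in the next cell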
# X
corpus=df["tweets"].to_list()
# Initialize count vectorizer
# Ignore terms that appear in less than 1% of the documents
vectorizer=CountVectorizer(min_df=0.01)
# Run count vectorizer on corpus
Xs = vectorizer.fit_transform(corpus)
X=np.array(Xs.todense())
# Convert to one-hot vectors
maxs=np.max(X,axis=0)
X=np.ceil(X/maxs)
# Double check
print(X.shape)
print("DATA POINT-0:",X[0,0:10])
(5341, 29)
DATA POINT-0: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
## SET X AND Y
# y: CONVERT FROM STRING LABELS TO INTEGERS
labels = []
y = []
for label in df["company"]:
    if label not in labels:
        labels.append(label)
        print("index =", len(labels)-1, ": label =", label)
    for i in range(0, len(labels)):
        if label == labels[i]:
            y.append(i)
y = np.array(y)
# The 'company' column was already integer-encoded by pd.factorize above,
# so use it directly as the target vector
y = df['company']
# Double check
print(X.shape,y.shape)
print("DATA POINT-0:",X[0,0:10],"y =",y[0])
index = 0 : label = 1
index = 1 : label = 0
index = 2 : label = 2
(5341, 29) (5341,)
DATA POINT-0: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] y = 0    1
0    0
0    2
Name: company, dtype: int64
# Partition data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=4)
# Check
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(4272, 29)
(4272,)
(1069, 29)
(1069,)
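Given the mild class imbalance noted earlier, a stratified split would keep the three label proportions identical across the training and test sets. The results below use the plain split above; this is just a sketch of the alternative.

# stratify=y keeps each company label's share the same in the train and test sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=4, stratify=y)
print(y_train_s.value_counts(normalize=True))
print(y_test_s.value_counts(normalize=True))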
The C parameter tells the SVM optimizer how much to avoid misclassifying each training data point. A high value of C gives a smaller-margin hyperplane, while a small value of C makes the optimizer pick a larger-margin separating hyperplane. The C value essentially dictates the extent to which the model overfits or underfits. Examining the C values below shows that the model is actually "stuck": the accuracies on the training and test data are identical across the various C values. Ideally, different C parameters would give different scores, and I would pick the C parameter with the highest accuracy, precision, and recall. This warrants further examination as I dig deeper into the project.
# Tune model with various C values
# Import SVC library
from sklearn.svm import SVC
col_headers = ['c_param', 'accuracy_train', 'accuracy_test']
all_results = []
for i in np.arange(1, 500, 50):
    clf = SVC(C=i, kernel='linear')
    clf.fit(X_train, y_train)
    yp_train = clf.predict(X_train)
    yp_test = clf.predict(X_test)
    # Append results for this C value
    results = [i,
               accuracy_score(y_train, yp_train),
               accuracy_score(y_test, yp_test)]
    all_results.append(results)
# Create a dataframe of results
svm_results_df = pd.DataFrame(all_results, columns=col_headers)
svm_results_df.head()
|   | c_param | accuracy_train | accuracy_test |
|---|---------|----------------|---------------|
| 0 | 1       | 0.671582       | 0.642657      |
| 1 | 51      | 0.671582       | 0.642657      |
| 2 | 101     | 0.671582       | 0.642657      |
| 3 | 151     | 0.671582       | 0.642657      |
| 4 | 201     | 0.671582       | 0.642657      |
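As a cross-check on these flat accuracies, a cross-validated grid search over C and the kernel would make the "stuck" behavior easier to diagnose. This is a hedged sketch and not part of the results reported here; identical mean scores across the whole grid would confirm the model is insensitive to the hyperparameters rather than the evaluation being at fault.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# 5-fold cross-validation over a small grid of C values and kernels
param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)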
The SVM model produced essentially the same accuracy, about 67% on the training set and 64–65% on the test set, across all four kernels. Please see a visualization of the accuracy scores below. Neither set meets the 85% threshold for a model to be considered accurate. Since it does not make much sense for the scores to be nearly identical across all four kernels, as a next step I will examine where the model is "stuck" and fix the problem.
Please also see a visualization of the confusion matrices below. The diagonal of each confusion matrix is darkly shaded whereas the other cells are faintly colored, indicating a high number of correct predictions for labels 0 and 1. However, there is a significant number of misclassifications involving labels 1 and 2, which warrants further examination as I have more time to see what is not working with the model.
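To quantify the per-label behavior seen in the confusion matrices, a per-class report can be printed alongside each matrix. The sketch below fits one linear-kernel model with the same C used in the cells that follow; it is illustrative rather than part of the reported results.

from sklearn.metrics import classification_report
from sklearn.svm import SVC
# Per-class precision, recall, and F1; low recall on a single label would explain
# the off-diagonal mass visible in the confusion matrices below.
report_model = SVC(C=500, kernel='linear').fit(X_train, y_train)
print(classification_report(y_test, report_model.predict(X_test)))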
# Initialize model
C=500
mykernel = 'sigmoid'
model = SVC(C=C,kernel=mykernel)
# Fit to training data
model.fit(X_train, y_train)
# Predict on X_train
y_train_pred = model.predict(X_train)
# TEST ACCURACY
# Training set
print("Training set")
print("Accuracy: ", accuracy_score(y_train, y_train_pred) * 100) #accuracy score
print("Number of mislabeled points: ", (y_train != y_train_pred).sum()) #mislabeled points
# Test set
y_test_pred = model.predict(X_test)
print("Test set")
print("Accuracy: ", accuracy_score(y_test, y_test_pred)*100) #accuracy score
print("Number of mislabeled points: ", (y_test != y_test_pred).sum()) #mislabeled points
# Plot
model_accuracies = pd.DataFrame({'Set':['Training set','Test set'], 'Accuracy (%)': [accuracy_score(y_train, y_train_pred) * 100, accuracy_score(y_test, y_test_pred)*100]})
sns.barplot(data=model_accuracies, x="Set", y="Accuracy (%)").set(title = 'Accuracy of '+str(mykernel)+' model for training vs test sets' )
plt.savefig("../501-project-website/images/svm_"+str(mykernel)+"_tweets_accuracy.png")
Training set
Accuracy:  67.15823970037454
Number of mislabeled points:  1403
Test set
Accuracy:  64.26566884939196
Number of mislabeled points:  382
# CONFUSION MATRIX
# Import confusion matrix display (plot_confusion_matrix is deprecated)
from sklearn.metrics import ConfusionMatrixDisplay
# Create and plot confusion matrix
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap="Reds")
plt.title("SVM_"+str(mykernel)+"_Confusion Matrix")
plt.savefig("../501-project-website/images/SVM_"+str(mykernel)+"_record_confusion_matrix.png")
# Initialize model
C=500
mykernel = 'linear'
model = SVC(C=C,kernel=mykernel)
model
# Fit to training data
model.fit(X_train, y_train)
# Predict on X_train
y_train_pred = model.predict(X_train)
# TEST ACCURACY
# Training set
print("Training set")
print("Accuracy: ", accuracy_score(y_train, y_train_pred) * 100) #accuracy score
print("Number of mislabeled points: ", (y_train != y_train_pred).sum()) #mislabeled points
# Test set
y_test_pred = model.predict(X_test)
print("Test set")
print("Accuracy: ", accuracy_score(y_test, y_test_pred)*100) #accuracy score
print("Number of mislabeled points: ", (y_test != y_test_pred).sum()) #mislabeled points
# Plot
model_accuracies = pd.DataFrame({'Set':['Training set','Test set'], 'Accuracy (%)': [accuracy_score(y_train, y_train_pred) * 100, accuracy_score(y_test, y_test_pred)*100]})
sns.barplot(data=model_accuracies, x="Set", y="Accuracy (%)").set(title = 'Accuracy of '+str(mykernel)+' model for training vs test sets' )
plt.savefig("../501-project-website/images/svm_"+str(mykernel)+"_tweets_accuracy.png")
Training set
Accuracy:  67.15823970037454
Number of mislabeled points:  1403
Test set
Accuracy:  64.26566884939196
Number of mislabeled points:  382
# CONFUSION MATRIX
# Import confusion matrix display (plot_confusion_matrix is deprecated)
from sklearn.metrics import ConfusionMatrixDisplay
# Create and plot confusion matrix
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap="Reds")
plt.title("SVM_"+str(mykernel)+"_Confusion Matrix")
plt.savefig("../501-project-website/images/SVM_"+str(mykernel)+"_record_confusion_matrix.png")
# Initialize model
C=500
mykernel = 'rbf'
model = SVC(C=C,kernel=mykernel)
model
# Fit to training data
model.fit(X_train, y_train)
# Predict on X_train
y_train_pred = model.predict(X_train)
# TEST ACCURACY
# Training set
print("Training set")
print("Accuracy: ", accuracy_score(y_train, y_train_pred) * 100) #accuracy score
print("Number of mislabeled points: ", (y_train != y_train_pred).sum()) #mislabeled points
# Test set
y_test_pred = model.predict(X_test)
print("Test set")
print("Accuracy: ", accuracy_score(y_test, y_test_pred)*100) #accuracy score
print("Number of mislabeled points: ", (y_test != y_test_pred).sum()) #mislabeled points
# Plot
model_accuracies = pd.DataFrame({'Set':['Training set','Test set'], 'Accuracy (%)': [accuracy_score(y_train, y_train_pred) * 100, accuracy_score(y_test, y_test_pred)*100]})
sns.barplot(data=model_accuracies, x="Set", y="Accuracy (%)").set(title = 'Accuracy of '+str(mykernel)+' model for training vs test sets' )
plt.savefig("../501-project-website/images/svm_"+str(mykernel)+"_tweets_accuracy.png")
Training set
Accuracy:  67.15823970037454
Number of mislabeled points:  1403
Test set
Accuracy:  64.92048643592142
Number of mislabeled points:  375
# CONFUSION MATRIX
# Import confusion matrix display (plot_confusion_matrix is deprecated)
from sklearn.metrics import ConfusionMatrixDisplay
# Create and plot confusion matrix
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap="Reds")
plt.title("SVM_"+str(mykernel)+"_Confusion Matrix")
plt.savefig("../501-project-website/images/SVM_"+str(mykernel)+"_record_confusion_matrix.png")
# Initialize model
C=500
mykernel = 'poly'
model = SVC(C=C,kernel=mykernel)
model
# Fit to training data
model.fit(X_train, y_train)
# Predict on X_train
y_train_pred = model.predict(X_train)
# TEST ACCURACY
# Training set
print("Training set")
print("Accuracy: ", accuracy_score(y_train, y_train_pred) * 100) #accuracy score
print("Number of mislabeled points: ", (y_train != y_train_pred).sum()) #mislabeled points
# Test set
y_test_pred = model.predict(X_test)
print("Test set")
print("Accuracy: ", accuracy_score(y_test, y_test_pred)*100) #accuracy score
print("Number of mislabeled points: ", (y_test != y_test_pred).sum()) #mislabeled points
# Plot
model_accuracies = pd.DataFrame({'Set':['Training set','Test set'], 'Accuracy (%)': [accuracy_score(y_train, y_train_pred) * 100, accuracy_score(y_test, y_test_pred)*100]})
sns.barplot(data=model_accuracies, x="Set", y="Accuracy (%)").set(title = 'Accuracy of '+str(mykernel)+' model for training vs test sets' )
plt.savefig("../501-project-website/images/svm_"+str(mykernel)+"_tweets_accuracy.png")
Training set
Accuracy:  67.15823970037454
Number of mislabeled points:  1403
Test set
Accuracy:  64.26566884939196
Number of mislabeled points:  382
# CONFUSION MATRIX
# Import confusion matrix display (plot_confusion_matrix is deprecated)
from sklearn.metrics import ConfusionMatrixDisplay
# Create and plot confusion matrix
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap="Reds")
plt.title("SVM_"+str(mykernel)+"_Confusion Matrix")
plt.savefig("../501-project-website/images/SVM_"+str(mykernel)+"_record_confusion_matrix.png")
In conclusion, I think this model is not very accurate at predicting which company a given word from a Tweet refers to. The accuracy scores for both the training and test sets fall short of the 85% threshold for a model to be considered accurate.
One future direction is to take the context of a word into account. For example, I hope to include the surrounding Tweet in the classification, so that in the CountVectorizer output each row represents an entire Tweet and each column represents a distinct vocabulary term. To do this, I need to go back to my data cleaning sections and make sure that the full text of each Tweet is preserved in the cleaned datasets.
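As a sketch of what that Tweet-level representation could look like, assuming a hypothetical cleaned dataset df_full with one full Tweet per row in a tweet_text column (which is not what the current cleaned files contain):

# Hypothetical input: df_full has columns 'tweet_text' (one whole Tweet) and 'company'.
# Each row of the resulting matrix is then an entire Tweet rather than a single word,
# so co-occurring words within a Tweet are kept together in the bag-of-words counts.
tweet_vectorizer = CountVectorizer(min_df=0.01)
X_tweets = tweet_vectorizer.fit_transform(df_full['tweet_text']).toarray()
y_tweets = df_full['company']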
Another future direction is to explore where the model is "stuck", since the accuracy scores across the four kernels remain essentially the same, indicating that something about the model and/or dataset is off. It is possible that adjusting the dataset so that CountVectorizer reflects the context of each word within its Tweet would rectify this and help me find the model that works best.
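One quick diagnostic for that "stuck" behavior, sketched below using the last fitted model, is to compare the distribution of predicted labels with the true label distribution; a model that collapses onto one or two labels will sit near the majority-class share no matter which C or kernel is chosen.

# Compare predicted label counts with the true label counts on the test set
pred_labels, pred_counts = np.unique(model.predict(X_test), return_counts=True)
print(dict(zip(pred_labels, pred_counts)))
print(y_test.value_counts())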