Here is an overview of the four steps taken to perform this analysis:
Data Extraction:
The first step in this project is data extraction, where the transcripts of the comedians' stand-up routines are scraped from a website using the requests and BeautifulSoup libraries. The transcripts are then saved in a pickle file for later use. This allows us to easily access and manipulate the data in the subsequent steps of the analysis.
Data Cleaning:
The next step is data cleaning, where the transcripts are combined into a single string of text for each comedian and any irrelevant information is removed. This is done to make it easier to analyze the data and focus on the content of the comedians' routines. The cleaned data is then saved as a Pandas DataFrame.
Exploratory Data Analysis:
Once the data is cleaned, we can begin the EDA process. This involves using various techniques such as word frequency analysis and word clouds to identify common themes and words used by the comedians. These analyses can give us an idea of the overall style and content of the comedians' routines.
Sentiment Analysis:
Finally, we use Sentiment Analysis to measure the overall positivity or negativity of the transcripts. This can give us a sense of the comedians' mood and the emotional impact of their material.
Conclusion:
By combining EDA and Sentiment Analysis, we can gain a deeper understanding of the similarities and differences between the comedians. This can help us appreciate the diversity within the world of stand-up comedy and better understand what makes each performer unique.
You can check the notebooks used for this project right here!
Data Extraction:
The data extraction step involves scraping the transcripts of the comedians' stand-up routines from a website and saving them for later use. To do this, we use the requests library to send a request to the website and retrieve the HTML. Then, we use the BeautifulSoup library to parse the HTML and extract the relevant information, which in this case is the transcript of the stand-up routine.
Here is a code snippet that demonstrates how to extract the transcript from a single webpage:
import requests
from bs4 import BeautifulSoup
# Send a request to the webpage and retrieve the HTML content
page = requests.get(url).text
# Parse the HTML content
soup = BeautifulSoup(page, 'lxml')
# Extract the transcript using the BeautifulSoup object
transcript = [p.text for p in soup.find(attrs={'data-id': '74af9a5b'}).find_all('p')]
In the above code snippet, url is the URL of the webpage that we want to scrape. The page variable stores the HTML content of the webpage, which we then parse into a BeautifulSoup object. We can then extract the transcript by searching for the relevant HTML tags and attributes.
To extract the transcripts from multiple webpages, we can use a loop to iterate through a list of URLs and apply the above process to each one. We can also use try-except blocks to handle any errors that may occur and skip over any URLs that do not work.
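As a rough sketch, the loop could look something like this (here urls is assumed to be a dictionary mapping each comedian's name to the address of their transcript page, and scrape_transcript() is just a hypothetical wrapper around the single-page snippet above):
# Assumed helper that wraps the single-page extraction shown above
def scrape_transcript(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'lxml')
    # The attribute used to locate the transcript may need adjusting per page
    return [p.text for p in soup.find(attrs={'data-id': '74af9a5b'}).find_all('p')]

# Loop over the URLs, skipping any page that cannot be scraped
transcripts = {}
for comedian, url in urls.items():
    try:
        transcripts[comedian] = scrape_transcript(url)
    except Exception as e:
        print(f"Skipping {url}: {e}")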
Once the transcripts have been extracted, we can use the pickle library to save them to a file for later use. This allows us to easily access and manipulate the data in the subsequent steps of the analysis.
import pickle
# Save the transcript to a pickle file
with open('transcript.pkl', 'wb') as f:
    pickle.dump(transcript, f)
In the above code snippet, transcript is the list of strings that contains the transcript. The 'wb' mode indicates that we are opening the file in write binary mode, which allows us to save the pickle object to the file.
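When we need the data again in a later notebook, it can be loaded back just as easily; a minimal sketch:
# Load the transcript back from the pickle file
with open('transcript.pkl', 'rb') as f:
    transcript = pickle.load(f)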
Data Cleaning:
The data cleaning step involves combining the transcripts into a single string of text for each comedian and removing any irrelevant information. This is done to make it easier to analyze the data and focus on the content of the comedians' routines.
To clean the text further, we can use regular expressions to remove any unwanted characters or patterns from the string.
import re
def clean_text_first(text):
    # Remove apostrophes, text in square brackets, non-letter characters,
    # and collapse extra whitespace
    text = re.sub(r"'", "", text)
    text = re.sub(r"\[(.*)\]", "", text)
    text = re.sub(r"[^a-zA-Z]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text
In the above code snippet, the re module provides access to functions for working with regular expressions. The clean_text_first() function uses several regular expression substitutions to remove unwanted characters and patterns from the text input. For example, the first substitution removes single quotation marks from the text, the second substitution removes any text that is enclosed in square brackets, and the third substitution removes any non-letter characters. The fourth substitution replaces multiple consecutive whitespace characters with a single space.
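To see the effect, here is the function applied to a short made-up line (the sample text is only illustrative):
sample = "That's the joke! [audience laughs] 100% true."
print(clean_text_first(sample))
# Prints: Thats the joke true  (apostrophe, brackets, digits and punctuation removed)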
After the initial cleaning, we noticed that additional cleaning steps were needed, so we wrote a second function:
def clean_text_second(text):
    # Remove any remaining non-letter characters and collapse whitespace
    text = re.sub(r"[^a-zA-Z]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text
The clean_text_second() function uses regular expression substitutions similar to the ones in clean_text_first(), but with slightly different patterns. This function removes all non-letter characters and replaces multiple consecutive whitespace characters with a single space.
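In practice, both cleaning passes can be applied to every transcript at once with pandas; a minimal sketch, assuming the combined transcripts live in a DataFrame called df with a transcript column:
import pandas as pd

# Apply both cleaning passes to every comedian's transcript
df['transcript'] = df['transcript'].apply(clean_text_first)
df['transcript'] = df['transcript'].apply(clean_text_second)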
Once the text is cleaned, the next step is to perform tokenization, which involves splitting the text into individual tokens or words. This can be done using the CountVectorizer class from the sklearn library.
from sklearn.feature_extraction.text import CountVectorizer
def create_dtm(text):
    # Build a Document Term Matrix, dropping English stop words
    vectorizer = CountVectorizer(stop_words='english')
    dtm = vectorizer.fit_transform(text)
    return dtm, vectorizer
In the above code snippet, the CountVectorizer class is imported from sklearn. The create_dtm() function creates a Document Term Matrix (DTM) from the input text, which is a list of strings.
The CountVectorizer object is initialized with the stop_words parameter set to 'english', which specifies that the English stop words should be removed from the text before creating the DTM. Stop words are common words that do not add much meaning to the text, such as "the", "a", and "an".
The fit_transform() method of the CountVectorizer object is then used to create the DTM from the input text. The resulting DTM is a matrix where each row represents a document (in this case, a comedian's transcript) and each column represents a word. The cells of the matrix contain the word counts for each word in each document.
The vectorizer object is also returned by the create_dtm() function. This object can be used to transform new text into a DTM using the transform() method.
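To inspect the DTM more comfortably, it can be turned into a labeled pandas DataFrame; a minimal sketch, assuming the cleaned transcripts are in df['transcript'], the DataFrame index holds the comedians' names, and a recent version of scikit-learn is installed (older versions use get_feature_names() instead):
# Build the DTM and label the rows by comedian and the columns by word
dtm, vectorizer = create_dtm(df['transcript'])
df_dtm = pd.DataFrame(dtm.toarray(),
                      columns=vectorizer.get_feature_names_out(),
                      index=df.index)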
Finally, the resulting data can be saved and stored for further use.
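For example, the resulting DataFrames could be pickled just like the raw transcripts (the file names here are only illustrative):
# Persist the cleaned corpus and the DTM for the later steps
df.to_pickle('corpus.pkl')
df_dtm.to_pickle('dtm.pkl')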
EDA:
In this project, we use EDA to identify common themes and words used by the comedians, which can help us understand the overall style and content of their routines.
Here are the techniques used in this project:
To perform EDA on the transcripts, we can use the cleaned and tokenized data that we created in the previous steps. For example, we can use the Document Term Matrix (DTM) to count the frequency of each word and create a word cloud or perform Sentiment Analysis.
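A minimal sketch of the word cloud part, using the wordcloud package and the df_dtm DataFrame built in the cleaning step (both names carry over from the earlier sketches):
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Word counts for one comedian, taken from their row of the DTM
word_counts = df_dtm.iloc[0]
word_counts = word_counts[word_counts > 0].sort_values(ascending=False)

# Generate and display a word cloud from those frequencies
wc = WordCloud(background_color='white').generate_from_frequencies(word_counts.to_dict())
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()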
After building word clouds from the comedians' transcripts, we may also want to summarize the vocabulary in a more quantitative way. One way to do this is to plot the number of unique words used by each comedian and the number of words spoken per minute.
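The plots below assume two lists aligned with a comedians list: unique_words and words_per_minute. A rough sketch of how they could be derived from the DTM; note that the run time of each special is an assumption here (the source does not show it), and word counts from the DTM exclude stop words:
comedians = list(df_dtm.index)

# Unique words = number of non-zero entries in each comedian's DTM row
unique_words = [(df_dtm.loc[c] > 0).sum() for c in comedians]

# Words per minute = total word count divided by the length of the special
# ('run_time_minutes' is a hypothetical column holding each special's length)
words_per_minute = [df_dtm.loc[c].sum() / df['run_time_minutes'][c] for c in comedians]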
Here is an example of how to create these plots using the matplotlib library:
import matplotlib.pyplot as plt
# Plot the number of unique words used by each comedian
plt.figure(figsize=(12,6))
plt.bar(comedians, unique_words)
plt.ylabel("Palabras únicas")
plt.xlabel("Comedian")
plt.xticks(rotation=90)
plt.show()
# Plot the number of words spoken per minute by each comedian
plt.figure(figsize=(12,6))
plt.bar(comedians, words_per_minute)
plt.ylabel("Palabras por minuto")
plt.xlabel("Comedian")
plt.xticks(rotation=90)
plt.show()
Finally, I found something interesting about the comedians: they tend to use a lot of profanity in their routines, so we may want to investigate the number of swear words used by each comedian alongside the overall sentiment of their material.
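The scatter plot below assumes a small DataFrame, df_profanity, with one row per comedian and one column per swear word (the column names are censored, as in the plot labels). A rough sketch of how it could be pulled out of the DTM, with placeholder names standing in for the censored words:
# Placeholder column names standing in for the two (censored) swear words
f_word, s_word = 'word_f', 'word_s'

df_profanity = pd.DataFrame({
    'f___': df_dtm[f_word],
    's___': df_dtm[s_word],
})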
To investigate this relationship, we can create a scatter plot using the matplotlib library.
for i, comedian in enumerate(df_profanity.index):
    x = df_profanity.f___.loc[comedian]
    y = df_profanity.s___.loc[comedian]
    plt.scatter(x, y, label=comedian, color='blue')
    plt.text(x+1.5, y+1, full_names[i], fontsize=10)
plt.ylim(-5,70)
plt.title('Number of profanity words per comedian', fontsize = 20)
plt.xlabel('Number of F words')
plt.ylabel('Number of S words')
plt.show()
Sentiment Analysis:
Sentiment Analysis is the process of using natural language processing and machine learning techniques to extract and analyze the sentiment or emotion expressed in text. In this project, we can use Sentiment Analysis to measure the overall positivity or negativity of the transcripts of the comedians.
To perform Sentiment Analysis, we can use the TextBlob library, which provides a simple interface for performing various natural language processing tasks, including Sentiment Analysis.
from textblob import TextBlob

# Lambda functions that return the polarity and subjectivity of a text
pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

df['polarity'] = df['transcript'].apply(pol)
df['subjectivity'] = df['transcript'].apply(sub)
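To get a feel for the two scores, here is what TextBlob returns for a made-up sentence (polarity ranges from -1 to 1, subjectivity from 0 to 1):
example = TextBlob("I absolutely love this show, it is hilarious!")
print(example.sentiment.polarity)      # a positive value, since the sentence is favorable
print(example.sentiment.subjectivity)  # a fairly high value, since it is an opinion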
After applying these functions to the corpus DataFrame, we can visualize the results:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 8)
for i, comedian in enumerate(df.index):
    x = df.polarity.loc[comedian]
    y = df.subjectivity.loc[comedian]
    plt.scatter(x, y, color='blue')
    plt.text(x+0.001, y+0.001, df.full_name[i], fontsize=10)
plt.xlim(-0.01, .12)
plt.title('Sentiment Analysis', fontsize=20)
plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)
plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)
plt.show()
In conclusion, this analysis revealed several interesting insights about the comedians and their routines. Using web scraping and text cleaning, we extracted and prepared the transcript of each comedian's routine. We then used EDA to identify common themes and words used by the comedians, and sentiment analysis to measure the overall positivity or negativity of the transcripts. Finally, we used visualizations to explore the relationship between the amount of profanity used by each comedian and the overall sentiment of their routines. Together, these insights give us a better understanding of each comedian's content and comedic style.
Feel free to contact me with any questions :)
I'll be getting back to you as soon as possible
🟥⬜🟥