index | subreddit | average_toxic_score |
---|---|---|
0 | ASRoma | 1.126782 |
1 | borussiadortmund | 1.171257 |
2 | Gunners | 1.137281 |
3 | Barca | 1.126622 |
4 | Juve | 1.163329 |
5 | fcbayern | 1.122368 |
6 | atletico | 1.172624 |
7 | reddevils | 1.164603 |
8 | psg | 1.113934 |
9 | chelseafc | 1.110307 |
10 | realmadrid | 1.152929 |
11 | ACMilan | 1.150824 |
12 | schalke04 | 1.095682 |
NLP - Reddit Data about Soccer
Topics:
- Business Goal: Identify the teams linked to the highest levels of toxicity in Reddit conversations.
- Technical Proposal: Create a manual dictionary mapping major team and league names to their respective subreddit names. Gather Reddit posts from a variety of subreddits listed in the dictionary. Establish a processing pipeline to clean and tokenize the text of these posts. Train a toxicity content classifier using a labeled dataset obtained from Kaggle. Apply the trained model to the preprocessed posts and visualize the results using a bar plot.
- Business Goal: Determine if there is a relationship between the changes of sentiment in soccer subreddits and match results (e.g., wins, losses).
- Technical Proposal: Match the Reddit data with key soccer events using dates and team/player mentions. Utilize NLP techniques to assign sentiment scores to each post and comment, and label the dataset with sentiment scores. Investigate sentiment before, during, and after major events. Display sentiment trends over time with event markers.
- Business Goal: Examine how soccer fans feel about the 2022 World Cup, and how their feelings evolve over time.
- Technical Proposal: First extract all the posts relevant to the 2022 World Cup from all the available records. Then apply sentiment analysis to each post to examine the feeling towards the World Cup. Finally track the sentiment over time to see how soccer fans' feelings change.
Executive summary
Business Goal 1:
The study reveals varying toxicity levels in soccer club subreddits, with Atletico’s having the highest, possibly due to the club’s violent reputation. Toxicity peaks on weekends and mid-week, aligning with league and UEFA games, but dips during national team matches, indicating fluctuating fan behavior.
Business Goal 2:
The NLP analysis indicates a relationship between the sentiment of fans' comments and match results. It examines the sentiment of Arsenal, Manchester United and Chelsea fans, and the results show that fans of all three clubs were mostly negative, especially when their team lost or drew a match.
Business Goal 3:
This NLP analysis suggests that, of all the sub-topics about the 2022 World Cup, people discussed the players most frequently. Additionally, a large number of people appear dissatisfied with certain referee decisions during the World Cup. People's general attitude towards the 2022 World Cup also did not change much over time.
Analysis report
Business Goal 1: Identify the teams linked to the highest levels of toxicity in Reddit conversations.
External Data Source: Toxic Comment Classification
To achieve this business goal, we trained a classifier on a Wikipedia comments dataset in which human raters labeled toxic behavior. The dataset distinguishes six types of toxicity, to which we assigned points as follows: toxic (1 point), insult (2 points), obscene (2 points), identity hate (3 points), threat (3 points), and severe toxicity (4 points). The text is processed by a classification pipeline: a DocumentAssembler prepares the raw text, a SentenceDetector and a Tokenizer split it into sentences and tokens, the UniversalSentenceEncoder generates sentence embeddings, and these embeddings are fed into the MultiClassifierDLApproach for classification.
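The point scheme above can be sketched as a small scoring function. This is only an illustration: the label names follow the column names of the Kaggle dataset, and adding up the points of all predicted labels is an assumption, since the report does not state how overlapping labels are combined.

```python
# Point values per toxicity label, as listed in the text above.
TOXICITY_WEIGHTS = {
    "toxic": 1,
    "insult": 2,
    "obscene": 2,
    "identity_hate": 3,
    "threat": 3,
    "severe_toxic": 4,
}


def toxicity_score(predicted_labels):
    """Sum the point values of the labels predicted for one comment.

    Assumption: points from several labels are added together, and a
    label predicted more than once is counted only once.
    """
    return sum(TOXICITY_WEIGHTS[label] for label in set(predicted_labels))


# A comment classified as both "toxic" and "insult" gets 1 + 2 points.
example_score = toxicity_score(["toxic", "insult"])
print(example_score)
```

A comment with no predicted label scores 0 under this sketch.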
According to the classification result, the first table reveals varying levels of toxicity in comments across soccer club subreddits, with the atletico subreddit showing the highest average toxicity score (1.173) and schalke04 the lowest (1.096). Atletico's top ranking is plausible, since the club is widely known as one of the most violent in Europe.
index | day_of_week | average_toxic_score |
---|---|---|
0 | 1 | 1.259511 |
1 | 2 | 1.119115 |
2 | 3 | 1.155359 |
3 | 4 | 1.196851 |
4 | 5 | 1.204669 |
5 | 6 | 1.126858 |
6 | 7 | 1.248906 |
We also found that the toxicity score varies by day of the week. On weekends (days 1 and 7 in the table, assuming Sunday is numbered 1), average toxicity is higher than on most weekdays, because league games are played on weekends and people tend to post hostile comments during and after games. The scores on Wednesday and Thursday (days 4 and 5) are also relatively high, because UEFA Champions League and Europa League matches are played on those days.
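The day-of-week aggregation can be sketched as follows. The sketch assumes the table numbers days from 1 (Sunday) to 7 (Saturday), which is the convention of Spark SQL's `dayofweek`; the sample rows are hypothetical and only illustrate the grouping.

```python
from collections import defaultdict
from datetime import datetime


def day_of_week_sunday_first(ts):
    """Map a datetime to 1 (Sunday) through 7 (Saturday)."""
    # isoweekday() returns 1 (Monday) through 7 (Sunday).
    return ts.isoweekday() % 7 + 1


def average_score_by_day(rows):
    """Average toxicity score per day of week.

    rows: iterable of (datetime, toxicity_score) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ts, score in rows:
        day = day_of_week_sunday_first(ts)
        totals[day] += score
        counts[day] += 1
    return {day: totals[day] / counts[day] for day in sorted(totals)}


# Hypothetical scored comments, for illustration only.
sample = [
    (datetime(2023, 3, 5, 15, 0), 2.0),   # a Sunday
    (datetime(2023, 3, 5, 18, 30), 1.0),  # the same Sunday
    (datetime(2023, 3, 8, 21, 0), 3.0),   # a Wednesday
]
by_day = average_score_by_day(sample)
print(by_day)
```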
The line plot shows a clear weekly periodic pattern, consistent with the table of toxicity scores by day of week. From March 20 to March 28, national team games were held instead of club games, so toxicity is fairly low during this period.
Business Goal 2: Determine if there is a relationship between the changes of sentiment in soccer subreddits and match results (e.g., wins, losses).
External Data Source: Match Result Data
To implement this task, we first matched the Reddit data of the most-commented clubs with match result data by date. We then used NLP techniques to assign a sentiment score to each post and comment, and labeled the dataset with those scores. Finally, we examined sentiment before, during, and after matches, plotting sentiment trends over time with event markers aligned to match results to study the relationship between the two.
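A minimal sketch of the date-based matching step is shown below. All dates, results, and sentiment labels are hypothetical placeholders, and the field names are chosen for illustration rather than taken from the project.

```python
from collections import Counter
from datetime import date

# Hypothetical match results keyed by date ("W" win, "D" draw, "L" loss).
match_results = {date(2023, 1, 15): "W", date(2023, 2, 4): "L"}

# Hypothetical (comment date, sentiment label) pairs.
comments = [
    (date(2023, 1, 15), "positive"),
    (date(2023, 1, 15), "negative"),
    (date(2023, 2, 4), "negative"),
    (date(2023, 2, 4), "negative"),
    (date(2023, 2, 5), "positive"),
]


def daily_sentiment_with_results(comments, match_results):
    """Count sentiment labels per day and attach that day's result, if any."""
    counts = {}
    for day, label in comments:
        counts.setdefault(day, Counter())[label] += 1
    return [
        {
            "date": day,
            "positive": c["positive"],
            "negative": c["negative"],
            "result": match_results.get(day),  # None on non-match days
        }
        for day, c in sorted(counts.items())
    ]


rows = daily_sentiment_with_results(comments, match_results)
for row in rows:
    print(row)
```

Keeping non-match days in the output (with `result` set to `None`) makes it possible to compare match days against the baseline.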
This is a combined line and bar plot, showing the number of comments from Arsenal fans of positive and negative sentiments as lines, and the match result as bars (also showing the average number of positive and negative comments on matchdays).
We can draw some interesting findings from this sentiment analysis. First, the number of comments is much higher on days when Arsenal played than on days without a match. Second, Arsenal fans show fairly balanced sentiment on non-match days, but when Arsenal lost or drew, negative comments always outnumbered positive ones, often by a wide margin. Even when Arsenal won, fans did not show much more positivity, and negative comments sometimes still outnumbered positive ones. Considering that Arsenal has historically struggled from January to March, and that the club lost its leading position in the Premier League during this period in 2023, it is understandable that its fans are not very optimistic.
A positive/negative ratio can be easier to read than two separate lines, so we also produced a similar chart in which a single line shows the ratio of positive to negative comments.
This plot shows the ratio of positive to negative comments on each day for Arsenal fans, along with Arsenal's match results. As before, the days with the lowest sentiment ratio are always days on which the team lost or drew. This is reasonable, since a bad result upsets the fans. However, a win does not always produce a high sentiment ratio, and the highest ratios often occur on non-match days. We can infer that, at least for Arsenal fans, sentiment is more neutral or positive on non-match days, and that even when the team wins, fans may still be upset about its performance.
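The ratio discussed above is simply the daily count of positive comments divided by the daily count of negative comments. The sketch below uses hypothetical counts, and returning `None` for a day without negative comments is an assumption, not something the report specifies.

```python
def sentiment_ratio(positive, negative):
    """Positive-to-negative comment ratio; None when there are no negatives."""
    if negative == 0:
        return None
    return positive / negative


# Hypothetical daily (positive, negative) counts, for illustration only.
daily_counts = {
    "match day (loss)": (40, 100),
    "match day (win)": (90, 100),
    "non-match day": (30, 25),
}
ratios = {day: sentiment_ratio(p, n) for day, (p, n) in daily_counts.items()}
print(ratios)
```

A ratio below 1 means negative comments outnumber positive ones on that day.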
Business Goal 3: Examine how soccer fans feel about the 2022 World Cup, and how their feelings evolve over time.
This section analyzes sentiment over time for Reddit comments related to the 2022 World Cup. First, a regex pattern built from keywords filtered the relevant posts. Then, three dummy variables were introduced to flag whether a post relates to champions, players, or referees. A pre-trained sentiment analysis model was used to categorize the sentiment of each post. Here is the resulting cross-tabulation showing post counts across sentiment categories within each World Cup sub-topic; the total row covers all filtered posts, so it exceeds the sum of the three sub-topic rows.
index | category | negative | neutral | positive |
---|---|---|---|---|
0 | champion | 4976 | 637 | 5414 |
1 | players | 19230 | 2104 | 14753 |
2 | referee | 3963 | 307 | 1877 |
3 | total | 85860 | 8408 | 63339 |
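The filtering and flagging steps that produce this table can be sketched as follows. The keyword patterns are hypothetical, since the report does not list the keywords it actually used.

```python
import re

# Hypothetical keyword patterns; the report's real keywords are not given.
WORLD_CUP = re.compile(r"world\s*cup|qatar\s*2022", re.IGNORECASE)
SUBTOPICS = {
    "champion": re.compile(r"\bchampions?\b|\bwinners?\b", re.IGNORECASE),
    "players": re.compile(r"\bplayers?\b|\bmessi\b", re.IGNORECASE),
    "referee": re.compile(r"\breferees?\b|\brefs?\b|\bVAR\b", re.IGNORECASE),
}


def flag_post(text):
    """Return None for posts unrelated to the World Cup.

    Otherwise return a dict of 0/1 dummy variables, one per sub-topic.
    """
    if not WORLD_CUP.search(text):
        return None
    return {name: int(bool(p.search(text))) for name, p in SUBTOPICS.items()}


print(flag_post("The referee ruined the World Cup final"))
print(flag_post("Great weather today"))
```

Because the dummy variables are independent, one post can belong to several sub-topics or to none of them.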
A grouped barplot is also created to visualize the sentiment by category.
This plot reveals a notably higher post volume about players compared to other sub-topics. Additionally, the champion-related posts had a greater share of positive sentiment, while referee-related posts skewed negative, hinting at dissatisfaction with referees’ decisions.
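The shares behind these observations can be recomputed directly from the cross-tabulation. The sketch below uses the counts from the table and divides each count by its row sum.

```python
# Counts copied from the cross-tabulation above.
crosstab = {
    "champion": {"negative": 4976, "neutral": 637, "positive": 5414},
    "players": {"negative": 19230, "neutral": 2104, "positive": 14753},
    "referee": {"negative": 3963, "neutral": 307, "positive": 1877},
    "total": {"negative": 85860, "neutral": 8408, "positive": 63339},
}


def sentiment_shares(counts):
    """Convert raw sentiment counts into fractions of the row total."""
    row_total = sum(counts.values())
    return {label: n / row_total for label, n in counts.items()}


shares = {category: sentiment_shares(c) for category, c in crosstab.items()}
for category, s in shares.items():
    print(f"{category:9s} positive={s['positive']:.1%} negative={s['negative']:.1%}")
```

Roughly 49% of champion-related posts are positive, compared with about 41% for players and about 31% for referees, while close to 64% of referee-related posts are negative.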
In the subsequent section, sentiment analysis was conducted on a weekly basis, calculating the total, positive, negative, and neutral post counts for each week within the timeframe. Here is a table summarizing this weekly data and a time series plot for visualization.
index | week | total_posts | negative | neutral | positive | positive_percentage(%) |
---|---|---|---|---|---|---|
0 | 2023-01-01 | 1611 | 868 | 83 | 660 | 40.968343 |
1 | 2023-01-08 | 12624 | 7095 | 648 | 4881 | 38.664449 |
2 | 2023-01-15 | 12644 | 7010 | 658 | 4976 | 39.354635 |
3 | 2023-01-22 | 12486 | 6883 | 676 | 4927 | 39.460195 |
4 | 2023-01-29 | 10791 | 5666 | 611 | 4514 | 41.831156 |
5 | 2023-02-05 | 8043 | 4374 | 423 | 3246 | 40.358075 |
6 | 2023-02-12 | 12529 | 6796 | 642 | 5091 | 40.633730 |
7 | 2023-02-19 | 15376 | 8656 | 821 | 5899 | 38.364984 |
8 | 2023-02-26 | 12475 | 6697 | 739 | 5039 | 40.392786 |
9 | 2023-03-05 | 11827 | 6126 | 620 | 5081 | 42.961021 |
10 | 2023-03-12 | 12381 | 6655 | 656 | 5070 | 40.949843 |
11 | 2023-03-19 | 14720 | 7740 | 801 | 6179 | 41.976902 |
12 | 2023-03-26 | 12656 | 7183 | 693 | 4780 | 37.768647 |
13 | 2023-04-02 | 7444 | 4111 | 337 | 2996 | 40.247179 |
The time series plot shows no distinct trend in post volume throughout the period. The numbers of positive and negative posts paralleled the overall post trend, while the count of neutral posts remained low and steady. The proportion of positive posts was relatively unchanged, indicating that general attitudes towards the World Cup did not significantly fluctuate over time.
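The `positive_percentage(%)` column is the positive count divided by the weekly total, times 100. The sketch below recomputes it from the table's counts and reports the range over the period.

```python
# (week, total_posts, positive) copied from the weekly table above.
weekly = [
    ("2023-01-01", 1611, 660),
    ("2023-01-08", 12624, 4881),
    ("2023-01-15", 12644, 4976),
    ("2023-01-22", 12486, 4927),
    ("2023-01-29", 10791, 4514),
    ("2023-02-05", 8043, 3246),
    ("2023-02-12", 12529, 5091),
    ("2023-02-19", 15376, 5899),
    ("2023-02-26", 12475, 5039),
    ("2023-03-05", 11827, 5081),
    ("2023-03-12", 12381, 5070),
    ("2023-03-19", 14720, 6179),
    ("2023-03-26", 12656, 4780),
    ("2023-04-02", 7444, 2996),
]

# Percentage of positive posts per week.
positive_pct = {week: 100 * positive / total for week, total, positive in weekly}

lowest = min(positive_pct.values())
highest = max(positive_pct.values())
print(f"positive share ranges from {lowest:.2f}% to {highest:.2f}%")
```

The weekly positive share stays within a band of about five percentage points, which matches the observation that attitudes did not shift much.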