EDA - Reddit Data about Soccer
EDA Topics
Business Goal: Evaluate the relative popularity of teams and leagues by ranking the number of posts published between January 2023 and March 2023.
Technical Proposal: Develop a manual dictionary that associates major team and league names with their respective subreddit counterparts. Extract Reddit posts from a diverse set of subreddits listed in the dictionary. Subsequently, calculate the post count for each subreddit and visualize the findings using a bar plot.
Business Goal: Determine which days see the most activity from soccer fans, and whether this is driven by match days.
Technical Proposal: Obtain an external soccer match schedule for the same time frame, then merge it with the Reddit activity data by date. Count daily posts and comments in soccer subreddits, flag match days using the external schedule, and assess the relationship between match days and activity spikes. To visualize the results, plot the volume of posts and comments over time and compare average daily activity on match days versus non-match days.
Business Goal: Identify which aspects of soccer are discussed most frequently in different subreddits and which kinds of topics are most likely to earn high scores.
Technical Proposal: Use NLP to identify posts that mention different aspects of soccer, including tactics, league policies, transfers, and fan culture. Then count the posts and calculate the average score for each aspect. Create several plots, such as pie charts or bar plots, to visualize the counts and scores of the most popular topics.
1. Basic Information Exploration
Filtered Data Shape: 5,617,852 rows x 9 columns.
Filtered Data Time Range:
The minimum created_utc is: 2023-01-01 00:00:00
The maximum created_utc is: 2023-03-31 23:59:59
Filtered Data Schema:
|-- author: string (nullable = true)
|-- author_flair_text: string (nullable = true)
|-- body: string (nullable = true)
|-- controversiality: long (nullable = true)
|-- created_utc: timestamp (nullable = true)
|-- gilded: long (nullable = true)
|-- score: long (nullable = true)
|-- stickied: boolean (nullable = true)
|-- subreddit: string (nullable = true)
2. Find out the Most Popular Leagues and Clubs on Reddit
Figure 1 shows varying levels of activity across the major soccer league subreddits. “PremierLeague” and “MLS” stand out as highly active communities, and “worldcup” also has a substantial presence. In contrast, “seriea”, “Bundesliga”, and “LaLiga” show comparatively low engagement.
Figure 2 is a bar plot of comment counts for popular European soccer clubs (three clubs each from the Premier League, La Liga, Serie A, and the Bundesliga). Premier League clubs dominate, reflecting active discussion among fans of English clubs. La Liga’s “Barca” has a robust presence, and Serie A’s “ACMilan” also attracts dedicated discussion. Compared with the Premier League and La Liga, Serie A and the Bundesliga are less popular on Reddit, but major clubs such as Milan and FC Bayern still stand out from the other clubs in their leagues.
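The counting step behind Figures 1 and 2 can be sketched in plain Python. This is only an illustration, not the pipeline used for the report. The keys of `LEAGUE_SUBREDDITS` and the function name are assumptions; the subreddit names are the ones mentioned above.

```python
from collections import Counter

# Hypothetical manual dictionary: league name -> subreddit name.
# The keys are illustrative; the subreddit names come from the text above.
LEAGUE_SUBREDDITS = {
    "Premier League": "PremierLeague",
    "Major League Soccer": "MLS",
    "World Cup": "worldcup",
    "Serie A": "seriea",
    "Bundesliga": "Bundesliga",
    "La Liga": "LaLiga",
}


def count_by_subreddit(rows, mapping):
    """Count rows per listed subreddit, most active first.

    Subreddit names are matched case-insensitively; rows from
    subreddits that are not in the mapping are ignored.
    """
    canonical = {sub.lower(): sub for sub in mapping.values()}
    counts = Counter()
    for row in rows:
        key = row["subreddit"].lower()
        if key in canonical:
            counts[canonical[key]] += 1
    return counts.most_common()


# Invented rows for illustration only.
toy_rows = [
    {"subreddit": "PremierLeague"},
    {"subreddit": "MLS"},
    {"subreddit": "premierleague"},
    {"subreddit": "cooking"},
]
ranking = count_by_subreddit(toy_rows, LEAGUE_SUBREDDITS)
print(ranking)
```

The ranked list of `(subreddit, count)` pairs is the input a bar plot would need.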
Common Words of Comments on Soccer Subreddit
The updated word cloud (generated with TF-IDF) shows the relative importance of words. Instead of counting words directly, the TF-IDF model calculates a relevance score for each word in the text. According to the word cloud, Reddit users in the soccer subreddit care about the performance of teams and players. Leagues and clubs such as the Premier League and Real Madrid also feature prominently in users’ discussions.
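To make the difference between raw counts and TF-IDF concrete, here is a minimal sketch of one common textbook variant: term frequency times the logarithm of the inverse document frequency. The report does not state which variant or library was used, so the formula, the whitespace tokenizer, and the function name are assumptions. Libraries such as scikit-learn apply a smoothed version by default and would give different numbers.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = occurrences of the term / number of tokens in the document
    idf = ln(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append(
            {
                term: (count / total) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return scores


# Invented comments for illustration only.
toy_comments = [
    "the striker scored",
    "the keeper saved",
    "the striker missed",
]
toy_scores = tf_idf(toy_comments)
print(toy_scores[0])
```

A word that appears in every comment, such as "the" here, gets a score of zero, which is why TF-IDF surfaces distinctive words rather than merely frequent ones.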
3. Explore the relationship between user activity and match days
This table shows the number of matches played on each day in Europe’s top five leagues and the Champions League.
| | date | PremierLeague | LaLiga | Bundesliga | SerieA | Ligue1 | ChampionsLeague | TotalMatches |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-01 | 2 | 0 | 0 | 0 | 6 | 0 | 8 |
| 1 | 2023-01-02 | 1 | 0 | 0 | 0 | 4 | 0 | 5 |
| 2 | 2023-01-03 | 4 | 0 | 0 | 0 | 0 | 0 | 4 |
| 3 | 2023-01-04 | 4 | 0 | 0 | 10 | 0 | 0 | 14 |
| 4 | 2023-01-05 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 2023-01-06 | 0 | 2 | 0 | 0 | 0 | 0 | 2 |
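A schedule like the one above can be joined to daily comment counts by date in order to compare match days with non-match days. The sketch below assumes `created_utc` values are available as `datetime` objects and that the schedule is a mapping from date to `TotalMatches`. The function names and the toy values are hypothetical.

```python
from collections import Counter
from datetime import date, datetime
from statistics import mean


def daily_comment_counts(timestamps):
    """Count comments per calendar day from created_utc timestamps."""
    return Counter(ts.date() for ts in timestamps)


def average_activity_by_day_type(daily_counts, total_matches):
    """Return (mean comments on match days, mean on non-match days).

    A day missing from total_matches is treated as having no matches.
    A group with no days yields None.
    """
    match_days, other_days = [], []
    for day, n_comments in daily_counts.items():
        if total_matches.get(day, 0) > 0:
            match_days.append(n_comments)
        else:
            other_days.append(n_comments)
    return (
        mean(match_days) if match_days else None,
        mean(other_days) if other_days else None,
    )


# Invented values for illustration only, not taken from the dataset.
toy_timestamps = [
    datetime(2000, 1, 1, 12, 0),
    datetime(2000, 1, 1, 20, 30),
    datetime(2000, 1, 1, 21, 15),
    datetime(2000, 1, 2, 9, 45),
]
toy_total_matches = {date(2000, 1, 1): 8, date(2000, 1, 2): 0}

on_match_days, on_other_days = average_activity_by_day_type(
    daily_comment_counts(toy_timestamps), toy_total_matches
)
print(on_match_days, on_other_days)
```

Days with no comments at all never appear in the counter, so they are left out of both averages. With several million comments over three months this is unlikely to matter, but it is worth checking on smaller samples.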
Explore the relationship between the Daily Comments Counts and the Number of the Top League Matches Played
Line Plot
This line plot shows the daily number of Reddit comments in soccer subreddits alongside the daily number of matches played in the top leagues. The two lines share similar patterns. For example, on 2023-01-30 and 2023-01-31 few or no matches were played, and the comment counts for those two days were also low. Every few days there are two days on which around 20 matches are played, and on these days the number of Reddit comments also peaks. The time series therefore suggests some degree of correlation between the daily number of comments and the daily number of matches.
To quantify this, the correlation coefficient between the daily number of comments and the daily number of matches was calculated to be 0.614, which indicates a moderate positive correlation. To explore this further, a scatter plot of the two counts with a regression line is shown below.
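For reference, a correlation coefficient of this kind can be computed as follows. The sketch assumes the Pearson coefficient, which the report does not name explicitly, and the function name is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented daily counts for illustration only.
toy_matches = [0, 5, 10, 20]
toy_comments = [100, 200, 300, 500]
print(round(pearson_r(toy_matches, toy_comments), 3))
```

The coefficient ranges from -1 to 1, so a value of 0.614 sits in the range usually described as moderate.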
Scatter Plot
Based on the scatter plot, the regression line shows a positive relationship between the number of matches and the number of Reddit comments per day, and the correlation is moderately strong.
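A regression line of this kind is typically an ordinary least-squares fit. The sketch below assumes simple linear regression with one predictor (matches per day) and is not necessarily how the plot in the report was produced.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented points for illustration only: they lie exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2], [1, 3, 5])
print(slope, intercept)
```

In this setting the slope would estimate how many additional comments accompany each additional match on a given day.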
4. Identify the most frequently discussed topic about soccer
For this problem, six different topics have been selected, which are commonly discussed in soccer forums:
- Soccer tactics
- Transfers
- Player performances
- Player rankings
- Match analysis and previews
- Managerial decisions
Each topic is associated with a series of relevant keywords. A search is conducted in the ‘body’ column to determine whether each post relates to a particular topic. Six dummy variables have been created to indicate whether each post contains information on these topics: ‘soccer_tactics’, ‘transfers’, ‘player_performances’, ‘player_rankings’, ‘match_analysis_previews’, ‘managerial_decisions’.
Subsequently, calculations are performed to determine the total number, average score, and gild percentage of each topic. Finally, various visualizations are generated to present the data output.
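The flagging and aggregation steps can be sketched as follows. The keyword lists are hypothetical, since the report does not list the actual keywords. The sketch also assumes that "gild percentage" means the share of matching rows with `gilded` greater than zero. Column names follow the schema in Section 1.

```python
# Hypothetical keyword lists; only two of the six topics are shown.
TOPIC_KEYWORDS = {
    "soccer_tactics": ["formation", "pressing", "counter-attack"],
    "transfers": ["transfer", "signing", "loan"],
}


def flag_topics(body, topic_keywords):
    """Return a 0/1 dummy per topic using case-insensitive substring search."""
    text = (body or "").lower()
    return {
        topic: int(any(keyword in text for keyword in keywords))
        for topic, keywords in topic_keywords.items()
    }


def summarize_topics(rows, topic_keywords):
    """Per topic: number of rows, average score, gild percentage."""
    totals = {
        topic: {"rows": 0, "score_sum": 0, "gilded": 0}
        for topic in topic_keywords
    }
    for row in rows:
        flags = flag_topics(row["body"], topic_keywords)
        for topic, hit in flags.items():
            if hit:
                totals[topic]["rows"] += 1
                totals[topic]["score_sum"] += row["score"]
                if row["gilded"] > 0:
                    totals[topic]["gilded"] += 1
    summary = {}
    for topic, t in totals.items():
        n = t["rows"]
        summary[topic] = {
            "rows": n,
            "avg_score": t["score_sum"] / n if n else None,
            "gild_pct": 100 * t["gilded"] / n if n else None,
        }
    return summary


# Invented rows for illustration only.
toy_rows = [
    {"body": "Their pressing was great", "score": 10, "gilded": 0},
    {"body": "Big transfer news, new signing", "score": 20, "gilded": 1},
    {"body": "Transfer and formation talk", "score": 30, "gilded": 0},
    {"body": None, "score": 5, "gilded": 0},
]
topic_summary = summarize_topics(toy_rows, TOPIC_KEYWORDS)
print(topic_summary)
```

A row can match several topics and is then counted once under each. Plain substring search is simple but over-matches (for example, "loan" inside an unrelated word), so word-boundary matching may be preferable in practice.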
| | topic | number of rows | average score | gild percentage (%) |
|---|---|---|---|---|
| 0 | Soccer Tactics | 104584 | 12.28 | 0.030 |
| 1 | Transfers | 224669 | 16.02 | 0.024 |
| 2 | Player Performances | 327807 | 15.93 | 0.053 |
| 3 | Player Rankings | 75402 | 15.37 | 0.036 |
| 4 | Match Analysis and Previews | 44770 | 15.12 | 0.054 |
| 5 | Managerial Decisions | 184505 | 15.89 | 0.027 |
Barplot: Number of Rows per Topic
Pie chart: Proportion of Posts per Topic
According to the two charts above, Player Performances is the most discussed topic in soccer-related Reddit posts, while Match Analysis and Previews is discussed least often.
Barplot: Average Score per Topic
From this plot, we can see that the average score of posts related to Soccer Tactics (12.28) is noticeably lower than that of the other topics, whereas the average scores of the other five topics are quite similar (15.12 to 16.02). This suggests that Reddit users are less interested in posts about Soccer Tactics than in posts about other soccer topics.
Barplot: Gild Percentage per Topic
On Reddit, “gilding” refers to giving an award to a user’s post or comment as a token of appreciation or recognition. Unlike the average score, the gild percentage varies considerably across topics. Posts related to Player Performances or Match Analysis and Previews are the most likely to be gilded. Interestingly, although Match Analysis and Previews has fewer posts than any other topic, its gild percentage is the highest of all.
Comment Number Change by Hour
According to the line plot above, the number of comments rises steadily until about 3 pm and peaks around 8 pm.
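The hourly profile behind such a plot can be sketched as below, assuming `created_utc` values are available as `datetime` objects. The report does not say which time zone the hours refer to, so the sketch uses the hour exactly as stored. The function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def comments_per_hour(timestamps):
    """Return a 24-element list: number of comments in each hour of the day."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def peak_hour(hourly_counts):
    """Hour of day (0-23) with the most comments; earliest hour wins ties."""
    return max(range(24), key=lambda hour: hourly_counts[hour])


# Invented timestamps for illustration only.
toy_timestamps = [
    datetime(2000, 1, 1, 15, 5),
    datetime(2000, 1, 1, 20, 10),
    datetime(2000, 1, 2, 20, 45),
]
hourly = comments_per_hour(toy_timestamps)
print(peak_hour(hourly))
```

If the hours are meant to reflect users' local time, the timestamps would need to be converted before grouping, which shifts the whole curve.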