Network Analysis of Subreddits: Graph Theory and Community Detection
Reddit's ecosystem of interconnected communities forms a rich network structure. By analyzing relationships between subreddits—through cross-posting, shared users, and topic similarity—we can discover hidden connections, influential hubs, and community clusters.
What You'll Learn
How to build subreddit relationship graphs, compute centrality measures, run community detection algorithms, and visualize the results using NetworkX and Python.
Types of Subreddit Networks
| Network Type | Edge Definition | Use Case |
|---|---|---|
| Cross-posting network | Posts shared between subreddits | Content flow analysis |
| User overlap network | Shared active users | Audience similarity |
| Comment reference network | Subreddit mentions in comments | Community awareness |
| Topic similarity network | Semantic similarity of content | Topic clustering |
| Moderation network | Shared moderators | Governance patterns |
Building the Network
```bash
pip install networkx==3.2.1 python-louvain==0.16
```
```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

import networkx as nx
import numpy as np
import pandas as pd


class SubredditNetworkBuilder:
    """Build subreddit relationship networks from Reddit data."""

    def __init__(self):
        self.graph = nx.Graph()

    def build_crosspost_network(
        self,
        posts_df: pd.DataFrame,
        min_crossposts: int = 5
    ) -> nx.Graph:
        """
        Build network from crosspost relationships.

        Args:
            posts_df: DataFrame with 'subreddit' and 'crosspost_parent_list' columns
            min_crossposts: Minimum crossposts to create edge
        """
        edge_weights = defaultdict(int)

        for _, post in posts_df.iterrows():
            source = post['subreddit']
            # Extract crosspost destinations
            if post.get('crosspost_parent_list'):
                for parent in post['crosspost_parent_list']:
                    target = parent.get('subreddit')
                    if target and source != target:
                        edge = tuple(sorted([source, target]))
                        edge_weights[edge] += 1

        # Create graph with filtered edges
        self.graph = nx.Graph()
        for (source, target), weight in edge_weights.items():
            if weight >= min_crossposts:
                self.graph.add_edge(source, target, weight=weight)

        print(f"Network: {self.graph.number_of_nodes()} nodes, "
              f"{self.graph.number_of_edges()} edges")
        return self.graph

    def build_user_overlap_network(
        self,
        posts_df: pd.DataFrame,
        min_shared_users: int = 10
    ) -> nx.Graph:
        """
        Build network based on shared users between subreddits.

        Args:
            posts_df: DataFrame with 'subreddit' and 'author' columns
            min_shared_users: Minimum shared users to create edge
        """
        # Get users per subreddit
        subreddit_users = posts_df.groupby('subreddit')['author'].apply(set).to_dict()
        subreddits = list(subreddit_users.keys())

        self.graph = nx.Graph()

        # Calculate pairwise overlaps
        for i, sub1 in enumerate(subreddits):
            for sub2 in subreddits[i + 1:]:
                overlap = len(subreddit_users[sub1] & subreddit_users[sub2])
                if overlap >= min_shared_users:
                    # Jaccard similarity for normalization
                    union_size = len(subreddit_users[sub1] | subreddit_users[sub2])
                    similarity = overlap / union_size if union_size > 0 else 0
                    self.graph.add_edge(
                        sub1, sub2,
                        weight=overlap,
                        similarity=similarity
                    )

        print(f"Network: {self.graph.number_of_nodes()} nodes, "
              f"{self.graph.number_of_edges()} edges")
        return self.graph

    def build_mention_network(
        self,
        comments_df: pd.DataFrame,
        min_mentions: int = 3
    ) -> nx.DiGraph:
        """
        Build directed network from subreddit mentions in comments.

        Args:
            comments_df: DataFrame with 'subreddit' and 'body' columns
            min_mentions: Minimum mentions to create edge
        """
        edge_weights = defaultdict(int)
        pattern = re.compile(r'r/([a-zA-Z0-9_]+)')

        for _, comment in comments_df.iterrows():
            source = comment['subreddit']
            body = comment.get('body', '')
            for mention in pattern.findall(body):
                if mention.lower() != source.lower():
                    edge_weights[(source, mention)] += 1

        # Create directed graph
        self.graph = nx.DiGraph()
        for (source, target), weight in edge_weights.items():
            if weight >= min_mentions:
                self.graph.add_edge(source, target, weight=weight)

        print(f"Directed network: {self.graph.number_of_nodes()} nodes, "
              f"{self.graph.number_of_edges()} edges")
        return self.graph

    def add_node_attributes(self, subreddit_stats: pd.DataFrame):
        """Add subreddit metadata as node attributes."""
        for _, row in subreddit_stats.iterrows():
            sub = row['subreddit']
            if sub in self.graph.nodes():
                self.graph.nodes[sub]['subscribers'] = row.get('subscribers', 0)
                self.graph.nodes[sub]['post_count'] = row.get('post_count', 0)
                self.graph.nodes[sub]['category'] = row.get('category', 'unknown')


# Usage
builder = SubredditNetworkBuilder()
G = builder.build_user_overlap_network(posts_df, min_shared_users=20)
```
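The table at the top also lists a topic similarity network, which the builder above doesn't implement. Below is a minimal sketch of one way to build it, using TF-IDF vectors of post titles and cosine similarity via scikit-learn; the 'title' column and the 0.3 similarity cutoff are illustrative assumptions, not part of the original builder.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_topic_similarity_network(posts_df, min_similarity=0.3):
    # Concatenate each subreddit's post titles into one document
    # (assumes a 'title' column; adjust for your schema)
    docs = posts_df.groupby('subreddit')['title'].apply(' '.join)
    subs = docs.index.tolist()

    # TF-IDF vectors and pairwise cosine similarity
    tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
    sim = cosine_similarity(tfidf.fit_transform(docs))

    # Connect subreddit pairs above the (illustrative) cutoff
    graph = nx.Graph()
    for i in range(len(subs)):
        for j in range(i + 1, len(subs)):
            if sim[i, j] >= min_similarity:
                graph.add_edge(subs[i], subs[j], weight=float(sim[i, j]))
    return graph
```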
Centrality Analysis
Identify influential subreddits using different centrality measures:
```python
class SubredditCentralityAnalyzer:
    """Analyze subreddit importance using centrality measures."""

    def __init__(self, graph: nx.Graph):
        self.graph = graph
        self.centralities = {}

    def calculate_all_centralities(self) -> pd.DataFrame:
        """Calculate multiple centrality measures."""
        # Degree centrality: number of connections
        self.centralities['degree'] = nx.degree_centrality(self.graph)

        # Weighted degree (strength)
        self.centralities['strength'] = dict(self.graph.degree(weight='weight'))

        # Betweenness: bridges between communities
        self.centralities['betweenness'] = nx.betweenness_centrality(self.graph)

        # Closeness: average distance to all nodes
        self.centralities['closeness'] = nx.closeness_centrality(self.graph)

        # Eigenvector: connected to important nodes
        try:
            self.centralities['eigenvector'] = nx.eigenvector_centrality(
                self.graph, max_iter=1000
            )
        except nx.PowerIterationFailedConvergence:
            self.centralities['eigenvector'] = {n: 0 for n in self.graph.nodes()}

        # PageRank: importance via link structure
        self.centralities['pagerank'] = nx.pagerank(self.graph, weight='weight')

        # Combine into DataFrame
        nodes = list(self.graph.nodes())
        df = pd.DataFrame({
            'subreddit': nodes,
            'degree': [self.centralities['degree'][n] for n in nodes],
            'strength': [self.centralities['strength'][n] for n in nodes],
            'betweenness': [self.centralities['betweenness'][n] for n in nodes],
            'closeness': [self.centralities['closeness'][n] for n in nodes],
            'eigenvector': [self.centralities['eigenvector'][n] for n in nodes],
            'pagerank': [self.centralities['pagerank'][n] for n in nodes]
        })
        return df

    def get_top_subreddits(
        self,
        metric: str = 'pagerank',
        n: int = 20
    ) -> pd.DataFrame:
        """Get top subreddits by specified metric."""
        df = self.calculate_all_centralities()
        return df.nlargest(n, metric)[['subreddit', metric]]

    def identify_bridge_subreddits(self) -> List[str]:
        """Find subreddits that connect different communities."""
        betweenness = self.centralities.get('betweenness')
        if betweenness is None:
            betweenness = nx.betweenness_centrality(self.graph)

        threshold = np.percentile(list(betweenness.values()), 90)
        return [n for n, v in betweenness.items() if v > threshold]


# Usage
analyzer = SubredditCentralityAnalyzer(G)
centralities_df = analyzer.calculate_all_centralities()

print("Top subreddits by PageRank:")
print(analyzer.get_top_subreddits('pagerank', 10))

print("\nBridge subreddits:")
print(analyzer.identify_bridge_subreddits())
```
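Centrality measures on the same graph often rank the same hubs near the top. Before reporting several of them, it can help to check how redundant they are; here is a small sketch that correlates the rankings from the centralities_df computed above:

```python
# Spearman rank correlation between centrality measures: values near 1.0
# mean two measures rank subreddits almost identically, so reporting
# both adds little information.
metric_cols = ['degree', 'strength', 'betweenness',
               'closeness', 'eigenvector', 'pagerank']
rank_corr = centralities_df[metric_cols].corr(method='spearman')
print(rank_corr.round(2))
```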
Community Detection
Discover clusters of related subreddits:
```python
import community as community_louvain


class SubredditCommunityDetector:
    """Detect communities in subreddit networks."""

    def __init__(self, graph: nx.Graph):
        self.graph = graph
        self.communities = {}

    def louvain_detection(self, resolution: float = 1.0) -> Dict[str, int]:
        """
        Detect communities using the Louvain algorithm.

        Args:
            resolution: Higher = more communities
        """
        partition = community_louvain.best_partition(
            self.graph,
            resolution=resolution,
            random_state=42
        )
        self.communities['louvain'] = partition
        return partition

    def label_propagation(self) -> Dict[str, int]:
        """Detect communities using label propagation."""
        communities_generator = nx.algorithms.community.label_propagation_communities(self.graph)
        partition = {}
        for idx, comm in enumerate(communities_generator):
            for node in comm:
                partition[node] = idx
        self.communities['label_prop'] = partition
        return partition

    def get_community_summary(self, method: str = 'louvain') -> pd.DataFrame:
        """Summarize detected communities."""
        partition = self.communities.get(method)
        if partition is None:
            raise ValueError(f"Run {method} detection first")

        # Group subreddits by community
        community_members = defaultdict(list)
        for sub, comm in partition.items():
            community_members[comm].append(sub)

        # Create summary with per-community metrics
        summaries = []
        for comm_id, members in community_members.items():
            subgraph = self.graph.subgraph(members)
            summaries.append({
                'community_id': comm_id,
                'size': len(members),
                'members': ', '.join(members[:5]) + ('...' if len(members) > 5 else ''),
                'density': nx.density(subgraph),
                'internal_edges': subgraph.number_of_edges()
            })

        return pd.DataFrame(summaries).sort_values('size', ascending=False)

    def calculate_modularity(self, method: str = 'louvain') -> float:
        """Calculate modularity score for partition quality."""
        partition = self.communities.get(method)
        if partition is None:
            raise ValueError(f"Run {method} detection first")

        # Convert to community-sets format
        community_sets = defaultdict(set)
        for node, comm in partition.items():
            community_sets[comm].add(node)

        communities_list = list(community_sets.values())
        return nx.algorithms.community.modularity(self.graph, communities_list)

    def find_similar_subreddits(
        self,
        subreddit: str,
        method: str = 'louvain'
    ) -> List[str]:
        """Find subreddits in the same community."""
        partition = self.communities.get(method)
        if partition is None:
            raise ValueError(f"Run {method} detection first")

        if subreddit not in partition:
            return []

        target_community = partition[subreddit]
        return [s for s, c in partition.items()
                if c == target_community and s != subreddit]


# Usage
detector = SubredditCommunityDetector(G)
communities = detector.louvain_detection(resolution=1.0)

print(f"Modularity: {detector.calculate_modularity():.3f}")
print("\nCommunity Summary:")
print(detector.get_community_summary())

# Find similar subreddits
similar = detector.find_similar_subreddits('python')
print(f"\nSubreddits similar to r/python: {similar[:10]}")
```
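The resolution parameter controls how finely Louvain splits the graph, so it's worth sweeping a few values before settling on one. A quick sketch reusing the detector above; the sweep range is an illustrative choice:

```python
# Sweep Louvain resolution and compare partition granularity vs. quality.
for res in [0.5, 1.0, 1.5, 2.0]:
    partition = detector.louvain_detection(resolution=res)
    n_comms = len(set(partition.values()))
    mod = detector.calculate_modularity()
    print(f"resolution={res}: {n_comms} communities, modularity={mod:.3f}")
```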
Network Visualization
```python
import matplotlib.pyplot as plt
from pyvis.network import Network


class SubredditNetworkVisualizer:
    """Visualize subreddit networks."""

    def __init__(self, graph: nx.Graph, communities: Dict[str, int] = None):
        self.graph = graph
        self.communities = communities

    def create_interactive_viz(
        self,
        output_file: str = 'subreddit_network.html',
        height: str = '800px'
    ):
        """Create interactive HTML visualization using pyvis."""
        net = Network(height=height, width='100%',
                      bgcolor='#1A1A1B', font_color='white')

        # Color palette for communities
        colors = [
            '#FF4500', '#0079D3', '#00A6A5', '#FF8717',
            '#7193FF', '#46D160', '#FF66AC', '#FFD635'
        ]

        # Add nodes, sized by degree and colored by community
        for node in self.graph.nodes():
            size = min(self.graph.degree(node) * 2 + 10, 50)
            if self.communities:
                comm_id = self.communities.get(node, 0)
                color = colors[comm_id % len(colors)]
            else:
                color = '#FF4500'

            net.add_node(
                node,
                label=node,
                size=size,
                color=color,
                title=f"r/{node}\nConnections: {self.graph.degree(node)}"
            )

        # Add edges weighted by relationship strength
        for source, target, data in self.graph.edges(data=True):
            weight = data.get('weight', 1)
            net.add_edge(source, target, value=weight)

        # Physics settings
        net.set_options("""
        var options = {
          "physics": {
            "forceAtlas2Based": {
              "gravitationalConstant": -50,
              "centralGravity": 0.01,
              "springLength": 100
            },
            "solver": "forceAtlas2Based"
          }
        }
        """)

        net.save_graph(output_file)
        print(f"Interactive visualization saved to {output_file}")

    def plot_static(
        self,
        figsize: Tuple[int, int] = (16, 12),
        layout: str = 'spring'
    ) -> plt.Figure:
        """Create static matplotlib visualization."""
        fig, ax = plt.subplots(figsize=figsize)

        # Calculate layout
        if layout == 'spring':
            pos = nx.spring_layout(self.graph, k=2, iterations=50)
        elif layout == 'kamada_kawai':
            pos = nx.kamada_kawai_layout(self.graph)
        else:
            pos = nx.circular_layout(self.graph)

        # Node sizes based on degree
        node_sizes = [self.graph.degree(n) * 50 + 100 for n in self.graph.nodes()]

        # Node colors based on community
        if self.communities:
            colors = [self.communities.get(n, 0) for n in self.graph.nodes()]
        else:
            colors = '#FF4500'

        # Draw network
        nx.draw_networkx_edges(self.graph, pos, alpha=0.3, ax=ax)
        nx.draw_networkx_nodes(
            self.graph, pos,
            node_size=node_sizes,
            node_color=colors,
            cmap=plt.cm.Set3,
            ax=ax
        )
        nx.draw_networkx_labels(
            self.graph, pos,
            font_size=8,
            font_color='white',
            ax=ax
        )

        ax.set_facecolor('#1A1A1B')
        fig.patch.set_facecolor('#1A1A1B')
        ax.axis('off')
        plt.tight_layout()
        return fig


# Usage
viz = SubredditNetworkVisualizer(G, communities)
viz.create_interactive_viz('subreddit_network.html')
fig = viz.plot_static()
plt.show()
```
Pro Tip: Filtering for Clarity
Large networks are hard to visualize. Filter to the k-core (nodes with at least k connections) or top N nodes by centrality before visualization. This preserves the most important structure while improving readability.
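Both filters are short to write with NetworkX built-ins. A minimal sketch, where k=5 and n=100 are illustrative values to tune against your data:

```python
import networkx as nx


def filter_k_core(graph: nx.Graph, k: int = 5) -> nx.Graph:
    # k-core: iteratively drop nodes with fewer than k connections
    return nx.k_core(graph, k=k)


def filter_top_n(graph: nx.Graph, n: int = 100) -> nx.Graph:
    # Keep the n highest-degree nodes and the edges among them
    top = sorted(graph.degree, key=lambda pair: pair[1], reverse=True)[:n]
    keep = {node for node, _ in top}
    return graph.subgraph(keep).copy()


G_viz = filter_k_core(G, k=5)
```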
Discover Related Subreddits
reddapi.dev provides automatic subreddit discovery based on semantic similarity. Find relevant communities for your topic without building network graphs.
Network Metrics Summary
| Metric | Measures | Reddit Interpretation |
|---|---|---|
| Degree Centrality | Number of direct connections | Well-connected subreddits |
| Betweenness Centrality | Control over information flow | Bridge between communities |
| Closeness Centrality | Average distance to all nodes | Central to the ecosystem |
| PageRank | Importance via link structure | Connected to important subreddits |
| Modularity | Community structure quality | How clustered the network is |
| Clustering Coefficient | Local connectivity | Tight-knit community groups |
Frequently Asked Questions
What's the best way to define subreddit relationships?
It depends on your question. For audience overlap, use shared users. For content flow, use crossposts. For topic similarity, use semantic embeddings of post content. User overlap is most common because it directly measures which subreddits share the same audience.
How do I handle very large subreddit networks?
For networks with 10k+ nodes: (1) filter out edges below a minimum weight threshold, (2) extract the k-core (nodes with k+ connections), (3) sample edges while preserving structure, (4) use faster approximate algorithms (label propagation rather than Louvain), or (5) focus the analysis on specific communities of interest.
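As a concrete example of option (1), here is a minimal sketch that prunes weak edges and then drops any nodes left isolated; the min_weight threshold is an illustrative assumption:

```python
import networkx as nx


def prune_weak_edges(graph: nx.Graph, min_weight: int = 10) -> nx.Graph:
    # Keep only edges at or above the weight threshold
    pruned = nx.Graph()
    pruned.add_edges_from(
        (u, v, d) for u, v, d in graph.edges(data=True)
        if d.get('weight', 1) >= min_weight
    )
    # Drop nodes left with no connections
    pruned.remove_nodes_from(list(nx.isolates(pruned)))
    return pruned
```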
What modularity score indicates good community detection?
Modularity ranges from -0.5 to 1.0. Scores above 0.3 indicate significant community structure, and real-world networks typically fall between 0.3 and 0.7. Higher isn't always better: very high modularity may indicate overfitting to noise or too many small communities.
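For reference, the quantity being scored is the standard Newman modularity:

$$Q = \frac{1}{2m}\sum_{ij}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$$

where $A_{ij}$ is the adjacency matrix, $k_i$ is the degree of node $i$, $m$ is the total number of edges, and $\delta(c_i, c_j)$ equals 1 when nodes $i$ and $j$ are assigned to the same community and 0 otherwise.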
How do I interpret bridge subreddits?
Bridge subreddits (high betweenness) connect otherwise separate communities. They're often general-interest subreddits (r/AskReddit, r/pics) or subreddits that span multiple topics. These are valuable for understanding how information flows across Reddit and for finding audiences that span niches.
Can I predict which subreddits will form connections?
Yes. Link prediction works with features such as shared neighbors (Jaccard similarity), topic similarity, user overlap trends, and network position. Machine learning models trained on historical edge formation can predict future connections, which is useful for subreddit recommendation systems.
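As a starting point, NetworkX ships several of these heuristics directly. A minimal sketch that scores currently unconnected subreddit pairs by the Jaccard coefficient of their shared neighbors; note that scoring every non-edge is quadratic, so on large graphs pass an explicit candidate list via the ebunch argument:

```python
import networkx as nx

# Score unconnected pairs by neighbor overlap; higher scores suggest
# a more plausible future connection. jaccard_coefficient yields
# (u, v, score) for all non-edges by default.
candidates = sorted(
    nx.jaccard_coefficient(G),
    key=lambda triple: triple[2],
    reverse=True
)
for u, v, score in candidates[:10]:
    print(f"r/{u} <-> r/{v}: {score:.3f}")
```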