
Network Analysis of Subreddits: Graph Theory and Community Detection

By @network_scientist | February 19, 2026 | 24 min read

Reddit's ecosystem of interconnected communities forms a rich network structure. By analyzing relationships between subreddits—through cross-posting, shared users, and topic similarity—we can discover hidden connections, influential hubs, and community clusters.

What You'll Learn

Building subreddit relationship graphs, centrality measures, community detection algorithms, and visualization techniques using NetworkX and Python.

Types of Subreddit Networks

Network Type              | Edge Definition                 | Use Case
Cross-posting network     | Posts shared between subreddits | Content flow analysis
User overlap network      | Shared active users             | Audience similarity
Comment reference network | Subreddit mentions in comments  | Community awareness
Topic similarity network  | Semantic similarity of content  | Topic clustering
Moderation network        | Shared moderators               | Governance patterns
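
The sections below build the first three of these. For the topic-similarity variant, here is a minimal sketch, assuming you have already computed one embedding vector per subreddit (for example, averaged post embeddings); the subreddit_vectors dict and the 0.5 threshold are illustrative, not part of a specific pipeline.

import networkx as nx
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def build_topic_similarity_network(
    subreddit_vectors: dict,       # {subreddit: embedding vector}, assumed precomputed
    min_similarity: float = 0.5    # illustrative threshold
) -> nx.Graph:
    """Connect subreddits whose content embeddings are semantically close."""
    names = list(subreddit_vectors.keys())
    matrix = np.vstack([subreddit_vectors[name] for name in names])
    sims = cosine_similarity(matrix)  # pairwise cosine similarities

    graph = nx.Graph()
    graph.add_nodes_from(names)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if sims[i, j] >= min_similarity:
                graph.add_edge(names[i], names[j], weight=float(sims[i, j]))
    return graph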

Building the Network

$ pip install networkx python-louvain pyvis
Successfully installed networkx-3.2.1 python-louvain-0.16
import re
from collections import defaultdict
from typing import Dict, List, Tuple

import networkx as nx
import numpy as np
import pandas as pd

class SubredditNetworkBuilder:
    """Build subreddit relationship networks from Reddit data."""

    def __init__(self):
        self.graph = nx.Graph()

    def build_crosspost_network(
        self,
        posts_df: pd.DataFrame,
        min_crossposts: int = 5
    ) -> nx.Graph:
        """
        Build network from crosspost relationships.

        Args:
            posts_df: DataFrame with 'subreddit' and 'crosspost_parent_list' columns
            min_crossposts: Minimum crossposts to create edge
        """
        edge_weights = defaultdict(int)

        for _, post in posts_df.iterrows():
            source = post['subreddit']

            # Extract crosspost destinations
            if post.get('crosspost_parent_list'):
                for parent in post['crosspost_parent_list']:
                    target = parent.get('subreddit')
                    if target and source != target:
                        edge = tuple(sorted([source, target]))
                        edge_weights[edge] += 1

        # Create graph with filtered edges
        self.graph = nx.Graph()
        for (source, target), weight in edge_weights.items():
            if weight >= min_crossposts:
                self.graph.add_edge(source, target, weight=weight)

        print(f"Network: {self.graph.number_of_nodes()} nodes, {self.graph.number_of_edges()} edges")
        return self.graph

    def build_user_overlap_network(
        self,
        posts_df: pd.DataFrame,
        min_shared_users: int = 10
    ) -> nx.Graph:
        """
        Build network based on shared users between subreddits.

        Args:
            posts_df: DataFrame with 'subreddit' and 'author' columns
            min_shared_users: Minimum shared users to create edge
        """
        # Get users per subreddit (drop deleted accounts, which would inflate overlap)
        active = posts_df[posts_df['author'] != '[deleted]']
        subreddit_users = active.groupby('subreddit')['author'].apply(set).to_dict()
        subreddits = list(subreddit_users.keys())

        self.graph = nx.Graph()

        # Calculate overlaps
        for i, sub1 in enumerate(subreddits):
            for sub2 in subreddits[i+1:]:
                overlap = len(subreddit_users[sub1] & subreddit_users[sub2])
                if overlap >= min_shared_users:
                    # Jaccard similarity for normalization
                    union_size = len(subreddit_users[sub1] | subreddit_users[sub2])
                    similarity = overlap / union_size if union_size > 0 else 0

                    self.graph.add_edge(
                        sub1, sub2,
                        weight=overlap,
                        similarity=similarity
                    )

        print(f"Network: {self.graph.number_of_nodes()} nodes, {self.graph.number_of_edges()} edges")
        return self.graph

    def build_mention_network(
        self,
        comments_df: pd.DataFrame,
        min_mentions: int = 3
    ) -> nx.DiGraph:
        """
        Build directed network from subreddit mentions in comments.

        Args:
            comments_df: DataFrame with 'subreddit' and 'body' columns
            min_mentions: Minimum mentions to create edge
        """
        edge_weights = defaultdict(int)
        # \b avoids matching "r/" inside a longer word (e.g. "hr/...") while
        # still matching both "r/python" and "/r/python"
        pattern = re.compile(r'\br/([A-Za-z0-9_]+)')

        for _, comment in comments_df.iterrows():
            source = comment['subreddit']
            body = comment.get('body', '')

            mentions = pattern.findall(body)
            for mention in mentions:
                if mention.lower() != source.lower():
                    edge_weights[(source, mention)] += 1

        # Create directed graph
        self.graph = nx.DiGraph()
        for (source, target), weight in edge_weights.items():
            if weight >= min_mentions:
                self.graph.add_edge(source, target, weight=weight)

        print(f"Directed Network: {self.graph.number_of_nodes()} nodes, {self.graph.number_of_edges()} edges")
        return self.graph

    def add_node_attributes(
        self,
        subreddit_stats: pd.DataFrame
    ):
        """Add subreddit metadata as node attributes."""
        for _, row in subreddit_stats.iterrows():
            sub = row['subreddit']
            if sub in self.graph.nodes():
                self.graph.nodes[sub]['subscribers'] = row.get('subscribers', 0)
                self.graph.nodes[sub]['post_count'] = row.get('post_count', 0)
                self.graph.nodes[sub]['category'] = row.get('category', 'unknown')

# Usage
builder = SubredditNetworkBuilder()
G = builder.build_user_overlap_network(posts_df, min_shared_users=20)
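
If you have per-subreddit metadata, attach it before analysis. Here subreddit_stats is assumed to be a DataFrame with the 'subreddit', 'subscribers', 'post_count', and 'category' columns the docstring above expects:

# Attach metadata so later steps can size or color nodes by it
builder.add_node_attributes(subreddit_stats)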

Centrality Analysis

Identify influential subreddits using different centrality measures:

class SubredditCentralityAnalyzer:
    """Analyze subreddit importance using centrality measures."""

    def __init__(self, graph: nx.Graph):
        self.graph = graph
        self.centralities = {}

    def calculate_all_centralities(self) -> pd.DataFrame:
        """Calculate multiple centrality measures."""

        # Degree centrality: number of connections
        self.centralities['degree'] = nx.degree_centrality(self.graph)

        # Weighted degree (strength)
        self.centralities['strength'] = dict(
            self.graph.degree(weight='weight')
        )

        # Betweenness: bridges between communities
        self.centralities['betweenness'] = nx.betweenness_centrality(self.graph)

        # Closeness: average distance to all nodes
        self.centralities['closeness'] = nx.closeness_centrality(self.graph)

        # Eigenvector: connected to important nodes
        try:
            self.centralities['eigenvector'] = nx.eigenvector_centrality(self.graph, max_iter=1000)
        except nx.PowerIterationFailedConvergence:
            # Fall back to zeros if the power iteration fails to converge
            self.centralities['eigenvector'] = {n: 0.0 for n in self.graph.nodes()}

        # PageRank: importance via link structure
        self.centralities['pagerank'] = nx.pagerank(self.graph, weight='weight')

        # Combine into DataFrame
        df = pd.DataFrame({
            'subreddit': list(self.graph.nodes()),
            'degree': [self.centralities['degree'][n] for n in self.graph.nodes()],
            'strength': [self.centralities['strength'][n] for n in self.graph.nodes()],
            'betweenness': [self.centralities['betweenness'][n] for n in self.graph.nodes()],
            'closeness': [self.centralities['closeness'][n] for n in self.graph.nodes()],
            'eigenvector': [self.centralities['eigenvector'][n] for n in self.graph.nodes()],
            'pagerank': [self.centralities['pagerank'][n] for n in self.graph.nodes()]
        })

        return df

    def get_top_subreddits(
        self,
        metric: str = 'pagerank',
        n: int = 20
    ) -> pd.DataFrame:
        """Get top subreddits by specified metric."""
        df = self.calculate_all_centralities()
        return df.nlargest(n, metric)[['subreddit', metric]]

    def identify_bridge_subreddits(self) -> List[str]:
        """Find subreddits that connect different communities."""
        betweenness = self.centralities.get('betweenness')
        if betweenness is None:
            betweenness = nx.betweenness_centrality(self.graph)

        threshold = np.percentile(list(betweenness.values()), 90)
        bridges = [n for n, v in betweenness.items() if v > threshold]

        return bridges

# Usage
analyzer = SubredditCentralityAnalyzer(G)
centralities_df = analyzer.calculate_all_centralities()

print("Top subreddits by PageRank:")
print(analyzer.get_top_subreddits('pagerank', 10))

print("\\nBridge subreddits:")
print(analyzer.identify_bridge_subreddits())

Community Detection

Discover clusters of related subreddits:

import community as community_louvain

class SubredditCommunityDetector:
    """Detect communities in subreddit networks."""

    def __init__(self, graph: nx.Graph):
        self.graph = graph
        self.communities = {}

    def louvain_detection(self, resolution: float = 1.0) -> Dict[str, int]:
        """
        Detect communities using Louvain algorithm.

        Args:
            resolution: Higher = more communities
        """
        partition = community_louvain.best_partition(
            self.graph,
            resolution=resolution,
            random_state=42
        )

        self.communities['louvain'] = partition
        return partition

    def label_propagation(self) -> Dict[str, int]:
        """Detect communities using label propagation."""
        communities_generator = nx.algorithms.community.label_propagation_communities(self.graph)

        partition = {}
        for idx, community in enumerate(communities_generator):
            for node in community:
                partition[node] = idx

        self.communities['label_prop'] = partition
        return partition

    def get_community_summary(self, method: str = 'louvain') -> pd.DataFrame:
        """Summarize detected communities."""
        partition = self.communities.get(method)
        if partition is None:
            raise ValueError(f"Run {method} detection first")

        # Group subreddits by community
        community_members = defaultdict(list)
        for sub, comm in partition.items():
            community_members[comm].append(sub)

        # Create summary
        summaries = []
        for comm_id, members in community_members.items():
            # Calculate community metrics
            subgraph = self.graph.subgraph(members)

            summaries.append({
                'community_id': comm_id,
                'size': len(members),
                'members': ', '.join(members[:5]) + ('...' if len(members) > 5 else ''),
                'density': nx.density(subgraph),
                'internal_edges': subgraph.number_of_edges()
            })

        return pd.DataFrame(summaries).sort_values('size', ascending=False)

    def calculate_modularity(self, method: str = 'louvain') -> float:
        """Calculate modularity score for partition quality."""
        partition = self.communities.get(method)
        if partition is None:
            raise ValueError(f"Run {method} detection first")

        # Convert to community sets format
        community_sets = defaultdict(set)
        for node, comm in partition.items():
            community_sets[comm].add(node)

        communities_list = list(community_sets.values())
        modularity = nx.algorithms.community.modularity(self.graph, communities_list)

        return modularity

    def find_similar_subreddits(
        self,
        subreddit: str,
        method: str = 'louvain'
    ) -> List[str]:
        """Find subreddits in the same community."""
        partition = self.communities.get(method)
        if partition is None:
            raise ValueError(f"Run {method} detection first")

        if subreddit not in partition:
            return []

        target_community = partition[subreddit]
        similar = [s for s, c in partition.items() if c == target_community and s != subreddit]

        return similar

# Usage
detector = SubredditCommunityDetector(G)
communities = detector.louvain_detection(resolution=1.0)

print(f"Modularity: {detector.calculate_modularity():.3f}")
print("\\nCommunity Summary:")
print(detector.get_community_summary())

# Find similar subreddits
similar = detector.find_similar_subreddits('python')
print(f"\\nSubreddits similar to r/python: {similar[:10]}")

Network Visualization

from typing import Dict, Optional, Tuple

import matplotlib.pyplot as plt
from pyvis.network import Network

class SubredditNetworkVisualizer:
    """Visualize subreddit networks."""

    def __init__(self, graph: nx.Graph, communities: Optional[Dict[str, int]] = None):
        self.graph = graph
        self.communities = communities

    def create_interactive_viz(
        self,
        output_file: str = 'subreddit_network.html',
        height: str = '800px'
    ):
        """Create interactive HTML visualization using pyvis."""
        net = Network(height=height, width='100%', bgcolor='#1A1A1B', font_color='white')

        # Color palette for communities
        colors = [
            '#FF4500', '#0079D3', '#00A6A5', '#FF8717',
            '#7193FF', '#46D160', '#FF66AC', '#FFD635'
        ]

        # Add nodes
        for node in self.graph.nodes():
            size = min(self.graph.degree(node) * 2 + 10, 50)

            if self.communities:
                comm_id = self.communities.get(node, 0)
                color = colors[comm_id % len(colors)]
            else:
                color = '#FF4500'

            net.add_node(
                node,
                label=node,
                size=size,
                color=color,
                title=f"r/{node}\\nConnections: {self.graph.degree(node)}"
            )

        # Add edges
        for source, target, data in self.graph.edges(data=True):
            weight = data.get('weight', 1)
            net.add_edge(source, target, value=weight)

        # Physics settings
        net.set_options("""
        var options = {
            "physics": {
                "forceAtlas2Based": {
                    "gravitationalConstant": -50,
                    "centralGravity": 0.01,
                    "springLength": 100
                },
                "solver": "forceAtlas2Based"
            }
        }
        """)

        net.save_graph(output_file)
        print(f"Interactive visualization saved to {output_file}")

    def plot_static(
        self,
        figsize: Tuple[int, int] = (16, 12),
        layout: str = 'spring'
    ) -> plt.Figure:
        """Create static matplotlib visualization."""
        fig, ax = plt.subplots(figsize=figsize)

        # Calculate layout
        if layout == 'spring':
            pos = nx.spring_layout(self.graph, k=2, iterations=50)
        elif layout == 'kamada_kawai':
            pos = nx.kamada_kawai_layout(self.graph)
        else:
            pos = nx.circular_layout(self.graph)

        # Node sizes based on degree
        node_sizes = [self.graph.degree(n) * 50 + 100 for n in self.graph.nodes()]

        # Node colors based on community
        if self.communities:
            colors = [self.communities.get(n, 0) for n in self.graph.nodes()]
        else:
            colors = '#FF4500'

        # Draw network
        nx.draw_networkx_edges(self.graph, pos, alpha=0.3, ax=ax)
        nx.draw_networkx_nodes(
            self.graph, pos,
            node_size=node_sizes,
            node_color=colors,
            cmap=plt.cm.Set3,
            ax=ax
        )
        nx.draw_networkx_labels(
            self.graph, pos,
            font_size=8,
            font_color='white',
            ax=ax
        )

        ax.set_facecolor('#1A1A1B')
        fig.patch.set_facecolor('#1A1A1B')
        ax.axis('off')

        plt.tight_layout()
        return fig

# Usage
viz = SubredditNetworkVisualizer(G, communities)
viz.create_interactive_viz('subreddit_network.html')
fig = viz.plot_static()
plt.show()

Pro Tip: Filtering for Clarity

Large networks are hard to visualize. Filter to the k-core (nodes with at least k connections) or top N nodes by centrality before visualization. This preserves the most important structure while improving readability.
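
A minimal sketch of both filters on the graph G built earlier (the k=3 and n=50 defaults are illustrative):

def k_core_filter(graph: nx.Graph, k: int = 3) -> nx.Graph:
    """Keep only nodes with at least k connections (assumes no self-loops)."""
    return nx.k_core(graph, k=k)

def top_n_filter(graph: nx.Graph, n: int = 50) -> nx.Graph:
    """Keep the n highest-degree-centrality nodes as an induced subgraph."""
    centrality = nx.degree_centrality(graph)
    top_nodes = sorted(centrality, key=centrality.get, reverse=True)[:n]
    return graph.subgraph(top_nodes).copy()

# Visualize the core instead of the full network
core = k_core_filter(G, k=3)
viz = SubredditNetworkVisualizer(core, communities)
viz.create_interactive_viz('subreddit_core.html')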

Discover Related Subreddits

reddapi.dev provides automatic subreddit discovery based on semantic similarity. Find relevant communities for your topic without building network graphs.


Network Metrics Summary

Metric                 | Measures                       | Reddit Interpretation
Degree Centrality      | Number of direct connections   | Well-connected subreddits
Betweenness Centrality | Control over information flow  | Bridges between communities
Closeness Centrality   | Average distance to all nodes  | Central to the ecosystem
PageRank               | Importance via link structure  | Connected to important subreddits
Modularity             | Community structure quality    | How clustered the network is
Clustering Coefficient | Local connectivity             | Tight-knit community groups
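
The clustering coefficient is the one metric above that earlier sections don't compute; NetworkX provides it directly, shown here on the same graph G:

# Per-node clustering: the fraction of a node's neighbor pairs that are
# themselves connected (1.0 means the neighbors form a clique)
clustering = nx.clustering(G)

# Global average across all nodes
print(f"Average clustering coefficient: {nx.average_clustering(G):.3f}")

# Subreddits embedded in the tightest-knit neighborhoods
tightest = sorted(clustering, key=clustering.get, reverse=True)[:10]
print(f"Most clustered subreddits: {tightest}")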

Frequently Asked Questions

What's the best way to define subreddit relationships?

It depends on your question. For audience overlap, use shared users. For content flow, use crossposts. For topic similarity, use semantic embeddings of post content. User overlap is most common because it directly measures which subreddits share the same audience.

How do I handle very large subreddit networks?

For networks with 10k+ nodes: (1) filter out edges below a minimum weight threshold, (2) extract the k-core (nodes with k or more connections), (3) sample edges while preserving structure, (4) use faster algorithms (label propagation instead of Louvain), or (5) focus the analysis on specific communities of interest. Option (1) is sketched below.
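
A minimal sketch of option (1); the min_weight=50 threshold is illustrative and should be tuned to your data:

def filter_by_edge_weight(graph: nx.Graph, min_weight: int = 50) -> nx.Graph:
    """Keep only edges at or above a weight threshold."""
    filtered = nx.Graph()
    # Nodes with no surviving edges are dropped automatically
    filtered.add_edges_from(
        (u, v, d) for u, v, d in graph.edges(data=True)
        if d.get('weight', 1) >= min_weight
    )
    return filtered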

What modularity score indicates good community detection?

Modularity ranges from -0.5 to 1.0. Scores above 0.3 indicate significant community structure, and real-world networks typically land between 0.3 and 0.7. Higher isn't always better: very high modularity can indicate overfitting to noise or an excess of small communities.
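
For reference, these scores are the standard Newman-Girvan modularity (what nx.algorithms.community.modularity computes), which compares the observed fraction of intra-community edges against a degree-preserving random null model:

Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)

where A_ij is the adjacency matrix, k_i and k_j are node degrees, m is the total number of edges, and \delta(c_i, c_j) is 1 when both nodes share a community and 0 otherwise.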

How do I interpret bridge subreddits?

Bridge subreddits (high betweenness) connect otherwise separate communities. They're often general-interest subreddits (r/AskReddit, r/pics) or subreddits that span multiple topics. These are valuable for understanding how information flows across Reddit and for finding audiences that span niches.

Can I predict which subreddits will form connections?

Yes, link prediction is possible using features like: shared neighbors (Jaccard similarity), topic similarity, user overlap trends, and network position. Machine learning models trained on historical edge formation can predict future connections, useful for subreddit recommendation systems.