Technical Guide
Analysis methods, models, and metrics behind BlogTracker
What Happens After You Create a Tracker
Once a tracker is created, BlogTracker automatically begins a background analysis workflow. This runs asynchronously and does not require you to stay on the page.
Pipeline Steps
- Source pages are fetched and cleaned
- Text is normalized into structured content
- Posts are compared by semantic meaning
- Insights are generated and stored
Content Collection & Processing
BlogTracker automatically collects new posts from your added sources and extracts the main article content while removing non-essential elements like navigation and ads.
Processing Steps
- Posts collected automatically from RSS feeds and URLs
- Article content extracted, navigation and ads removed
- Text cleaned, standardized, and stored
- Source and time metadata preserved
- Consistent foundation created for clustering, sentiment, and narrative detection
Sentiment Analysis
Analyzes the overall emotional tone of posts as positive, neutral, or negative. Goes beyond keywords by evaluating full sentence structure and contextual meaning, enabling tracking of emotional shifts across discussions over time.
How Sentiment Is Calculated
- Each post is processed using a large language model that evaluates semantic context
- The model considers word choice, phrasing, and sentence relationships together
- A continuous sentiment score is produced on a negative-to-positive spectrum
- Scores are normalized and mapped to clear sentiment labels
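The final normalization step can be sketched as a simple mapping from a continuous score to a label. The ±0.2 cutoffs below are hypothetical placeholders for illustration, not BlogTracker's actual thresholds:

```python
def label_sentiment(score: float) -> str:
    """Map a normalized sentiment score in [-1, 1] to a label.

    The +/-0.2 cutoffs are illustrative assumptions; the production
    mapping may use different boundaries.
    """
    if score >= 0.2:
        return "positive"
    if score <= -0.2:
        return "negative"
    return "neutral"
```

Keeping the continuous score alongside the label preserves intensity information for the timeline and heatmap views.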
Models & Libraries Used
- gpt-oss-120b - transformer-based language model used for contextual sentiment understanding
- Semantic interpretation used instead of rule-based or keyword-only approaches
Sentiment Visualizations
1. Sentiment Timeline: Tracks sentiment scores over time, highlighting emotional spikes and shifts aligned with major events
2. Sentiment Heatmap: Displays individual posts across time, where color indicates sentiment polarity and bubble size represents posting volume
3. Sentiment Distribution: Shows the proportion of positive, neutral, and negative posts for a high-level emotional snapshot
4. Toxicity Levels: Groups posts by toxicity intensity to separate mild negativity from more extreme language
Psycholinguistic Radar Analysis
Extends sentiment analysis by examining how language is used, not just how it feels. Breaks down posts into cognitive, emotional, and behavioral dimensions visualized on radar charts.
Radar Dimensions
- Toxicity: Detects insults, profanity, threats, identity attacks, and sexually explicit language
- Personal Content: Measures references to work, money, home, religion, leisure, and death
- Time Orientation: Identifies focus on past events, present urgency, or future predictions
- Core Drives: Captures achievement, risk prevention, power, reward focus, and affiliation signals
- Cognitive Process: Reflects certainty, tentativeness, insight, and causal thinking
- Summary Variables: High-level indicators - analytical thinking, clout, authenticity, emotional tone
- Emotion: Detects anger, anxiety, sadness, negative emotion, and positive emotion
Posting Frequency Analysis
Tracks publishing patterns across blogs over time. Identifies peak activity periods and seasonal trends, and surfaces important entities mentioned across blogs.
Temporal Analysis Methods
1. Main Timeline Analysis: shows publication volume over time with blog-level breakdown
2. Day-of-Week Patterns: reveals preferred publishing days across the week
3. Monthly Aggregations: shows seasonal trends across all years combined
4. Interactive Zooming: allows detailed examination of specific time periods
5. Word Cloud: shows the most common words across blogs
6. Entity Extraction: identifies key people, places, and organizations mentioned across blogs
Interactive Features
- Click legend items to filter specific blogs in/out of visualizations
- Drag-select on timeline charts to zoom into specific date ranges
- Stacked/overlay toggles for different visualization perspectives
- Multiple granularity options (day, week, month, quarter)
How are entities extracted?
- We use SpaCy's Named Entity Recognition model to identify and categorize named entities (people, organizations, locations, etc.) from input text
- Each entity is displayed with its type and frequency across the tracker
- We use SpaCy's most advanced pre-trained model (en_core_web_trf) which is based on transformer architecture for high accuracy in entity recognition
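The type-and-frequency display described above can be sketched as a simple aggregation. The SpaCy call is shown only in comments (it assumes the en_core_web_trf model is installed); the sample tuples below are illustrative, not real extraction output:

```python
from collections import Counter

# In production the (text, label) tuples come from SpaCy, e.g.:
#   import spacy
#   nlp = spacy.load("en_core_web_trf")  # requires the model to be downloaded
#   entities = [(ent.text, ent.label_) for ent in nlp(post_text).ents]
# Here we use a pre-extracted sample so the sketch is self-contained.
entities = [
    ("OpenAI", "ORG"), ("Paris", "GPE"),
    ("OpenAI", "ORG"), ("Ada Lovelace", "PERSON"),
]

# Frequency of each (entity, type) pair across the tracker
freq = Counter(entities)
for (text, label), count in freq.most_common():
    print(f"{text} ({label}): {count}")
```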
Export & Reporting
- Full-page PDF export with all visualizations and insights
- Individual chart saving to custom reports
- Interactive data table with filtering and sorting capabilities
- Cross-filtering between visualizations for deeper analysis
Topic Distribution Analysis
Identifies and tracks themes across blog content using advanced topic modeling. Shows how topics distribute across different blogs and over time, and reveals relationships and co-occurrence between topics.
How Topics Are Calculated
- Each blog's posts are analyzed using an LLM combined with statistical topic modeling
- Each post receives topic weights indicating multiple theme contributions
- Topics automatically labeled based on keyword patterns and semantic meaning
- Dominant topics identified for each post based on highest weight
Models & Libraries Used
- Hybrid Approach: configurable LLM (`process.env.LLM_MODEL`, fallback `gpt-oss-120b`) for topic extraction and consolidation
- scikit-learn: used for statistical topic modeling and topic-weight calculations
- D3.js: used for chord diagram visualization of topic relationships
Topic Relationships (Chord Diagram)
- Visualizes how topics co-occur in the same posts
- Arc length represents topic dominance across all content
- Ribbon thickness indicates co-occurrence strength between topics
- Interactive hover reveals detailed relationship percentages
Blog Topic Distribution Patterns
- Small multiples show topic distribution across individual blogs
- Stacked bar charts reveal each blog's thematic focus
- Pagination handles analysis of large blog collections
- Identifies blogs with specialized vs. broad content focus
Blog Distribution Analysis
Analyzes publishing patterns across multiple blogs within a tracker. Combines posting frequency with topic dynamics analysis, enabling comparison between blogs' publishing behavior and topic dominance.
Analysis Levels
1. Tracker-Wide Analysis - shows overall patterns across all blogs in the tracker
2. Blog-Level Analysis - focuses on an individual blog's publishing behavior and topic focus
3. Topic-Specific Filtering - allows drilling down into specific thematic areas
Topic Dynamics Metrics
- Novelty - measures how different content is from recent posts
- Resonance - quantifies lasting impact and influence of content
- Transience - tracks how quickly topics change after publication
- Trend lines show topic evolution patterns over time
Advanced Metrics Display
- Total post count with active blog breakdown
- Topic dominance percentage showing thematic focus areas
- Content freshness indicator showing days since latest publication
- Real-time updates based on filter selections
Blogger Distribution Analysis
Analyzes content patterns at the individual blogger/author level. Provides insights into individual content creator behaviors and writing styles, enabling comparison between different bloggers within the same tracker.
Analysis Levels
1. Tracker-Wide Analysis - overall patterns across all bloggers in the tracker
2. Blogger-Level Analysis - an individual blogger's publishing behavior and topic focus
3. Individual Performance Metrics - tracks each blogger's unique contribution patterns
Blogger Topic Dynamics Metrics
- Novelty: how different each blogger's content is from their recent posts
- Resonance: lasting impact and influence of each blogger's content
- Transience: how quickly topics change after each blogger's publications
- Trend lines show each blogger's topic evolution patterns over time
Publishing Pattern Analysis
- Day-of-week patterns showing when bloggers prefer to publish
- Monthly aggregated patterns revealing seasonal blogging behaviors
- Interactive tooltips with blogger-specific statistics
Keyword Analysis
Keyword analysis helps BlogTracker identify and track the most important topics and terms within a set of posts or blogs, making it easier to see which themes are emerging, gaining traction, or declining over time.
How Keywords Are Processed
- Each post is processed to extract and count keywords along with their contexts
- Frequency is measured across posts, blogs, and time periods
- Total frequency, post count, and distinct blog count computed per keyword
- Signals distinguish widely-discussed topics from source-concentrated ones
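The per-keyword aggregation above (total frequency, post count, distinct blog count) can be sketched as follows, using hypothetical post data in place of the real storage layer:

```python
from collections import defaultdict

# Hypothetical input: (blog_id, {keyword: count_in_post}) per post
posts = [
    ("blog_a", {"ai": 3, "privacy": 1}),
    ("blog_a", {"ai": 1}),
    ("blog_b", {"privacy": 2}),
]

stats = defaultdict(lambda: {"total": 0, "posts": 0, "blogs": set()})
for blog_id, counts in posts:
    for kw, n in counts.items():
        stats[kw]["total"] += n     # total frequency across all posts
        stats[kw]["posts"] += 1     # number of posts mentioning the keyword
        stats[kw]["blogs"].add(blog_id)  # distinct sources

# "ai" is frequent but concentrated in one blog; "privacy" appears
# in two distinct blogs, i.e. it is more widely discussed.
for kw, s in sorted(stats.items()):
    print(kw, s["total"], s["posts"], len(s["blogs"]))
```

Comparing post count against distinct blog count is what separates widely-discussed topics from source-concentrated ones.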
Aggregation & Trend Detection
- Keyword data aggregated across selected timeframes or sources
- Enables comparison of topics and observation of attention shifts over time
- Highlights which topics matter most and how interest rises or falls
- Reveals how narratives take shape across the broader content landscape
Morality Analysis
Examines content through the lens of Moral Foundations Theory, identifying how different moral values are expressed across posts. Evaluates both virtues (positive moral foundations) and vices (negative counterparts).
Moral Foundations Framework
1. Care/Harm - compassion and empathy vs. cruelty, violence, and neglect
2. Fairness/Cheating - justice and rights vs. deception, corruption, and exploitation
3. Loyalty/Betrayal - solidarity and patriotism vs. treachery and disloyalty
4. Authority/Subversion - respect for tradition vs. rebellion, disrespect, and chaos
5. Sanctity/Degradation - purity and sacredness vs. contamination and profanity
How Morality Scores Are Calculated
- Each post analyzed using `gpt-oss-120b`
- Model identifies moral-relevant language and assigns scores across all ten dimensions
- Scores normalized to percentages for consistent comparison
- Virtue scores represent positive moral expression; vice scores capture negative framing
Diverging Bar Chart Visualization
- Virtues (positive values) extend to the right with green coloring
- Vices (negative values) extend to the left with red coloring
- The zero line represents moral neutrality
- Chart width indicates the strength of moral emphasis
Post-Level Deep Analysis
- Click any post to reveal a two-panel detailed view
- Left Panel: full post content with sentiment, toxicity, and moral metadata
- Right Panel: interactive radar chart showing the post's complete moral profile
- Radar chart navigation allows switching between moral dimension categories
Key Academic References
Graham, J., Haidt, J., & Nosek, B. A. (2009). Liberals and conservatives rely on different sets of moral foundations. Journal of Personality and Social Psychology, 96(5), 1029–1046. https://doi.org/10.1037/a0015141
Okeke, O., Cakmak, M. C., Onyepunuka, U., Spann, B., & Agarwal, N. (2023). Evaluating emotion and morality bias in YouTube's recommendation algorithm for the China–Uyghur crisis. In Proceedings of the International Conference on Social, Behavioral, and Economic Sciences (SBP-BRiMS 2023), Pittsburgh, PA, United States.
Mbila-Uma, S., Umoga, I., Alassad, M., & Agarwal, N. (2023). Conducting morality and emotion analysis on South China Sea blog discourse. In Proceedings of the International Conference on Collaboration Technologies and Social Computing (CollabTech 2023), Osaka, Japan.
Influence Analysis
Quantifies the persuasive impact and reach of bloggers and their content by combining engagement metrics with AI-assessed content quality.
How Influence Score Is Calculated
The influence score combines quantitative engagement metrics with qualitative content assessment to produce a holistic measure of a post's impact. Each post receives an individual score, which is then aggregated to the blogger and blog levels for comparative analysis.
Post-Level Formula
influence_score = (num_inlinks - num_outlinks + num_comments) × content_quality_score
Where content_quality_score is an LLM-assessed value between 0.0 and 1.0 based on persuasive writing quality.
Content Quality Assessment
- Each post is evaluated by a configured large language model for persuasive power and articulation
- Assessment dimensions include persuasive power, intellectual depth, readability, authenticity, and memorability
- Scores range from 0.0 (negligible impact) to 1.0 (exceptionally influential)
- The quality score acts as a multiplier, amplifying the influence of well-crafted content
- Evaluation focuses purely on content substance, ignoring metadata or source popularity
Aggregation Logic
- Blogger influence: Average of all their posts' influence scores
- Blog influence: Average influence of all posts on that blog
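The post-level formula and the aggregation step translate directly to code; the engagement numbers below are illustrative:

```python
def influence_score(num_inlinks: int, num_outlinks: int, num_comments: int,
                    content_quality_score: float) -> float:
    """Post-level influence, following the formula above.

    content_quality_score is the LLM-assessed value in [0.0, 1.0],
    acting as a multiplier on the engagement balance.
    """
    return (num_inlinks - num_outlinks + num_comments) * content_quality_score

# Blogger influence = average of that blogger's post scores
posts = [  # (inlinks, outlinks, comments, quality) -- illustrative values
    (10, 2, 5, 0.8),  # well-linked, well-written post
    (3, 1, 0, 0.5),   # modest post
]
scores = [influence_score(*p) for p in posts]
blogger_influence = sum(scores) / len(scores)
```

Note that a high-quality post with little engagement still scores low, and a heavily-linked post written poorly is dampened by the multiplier.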
Core Visualizations
1. Influence Over Time: Area chart tracking individual blogger influence scores across customizable time grains (daily, monthly, quarterly, yearly)
2. Blogger Activity vs Influence: Scatter plot positioning bloggers by post volume against influence score, with quadrant analysis highlighting high-impact voices
3. Top Keywords: Word cloud visualization showing the most frequent and significant terms used by selected bloggers
4. Posts Table: Sortable list of all posts with filtering by blogger, date range, and search terms
Quadrant Analysis
- High Activity · High Influence: Key opinion leaders driving both volume and impact
- Low Activity · High Influence: Selective but highly persuasive voices (quality over quantity)
- High Activity · Low Influence: Prolific but less impactful contributors
- Low Activity · Low Influence: Emerging or peripheral voices
Interactive Features
- Multi-blogger selection: Compare influence trends across multiple bloggers simultaneously
- Time grain control: Toggle between daily, monthly, quarterly, and yearly aggregation
- Zoom functionality: Click and drag on the main chart to focus on specific time ranges
- Click-to-select: Select bloggers directly from the scatter plot or legend
- Post reader: View full post content alongside the analysis table
Blog Network Analysis
Maps the connections between blogs in two ways: by cross-linking behavior and by shared named entities. Reveals which blogs reference each other, what subjects they cover in common, and how tightly they cluster around real-world people, organizations, and places.
Two View Modes
1. Link Graph — shows directed connections between blogs based on how many posts from one blog link to another. Arrow direction tells you which blog did the linking.
2. Entity Network — connects blogs that write about the same named entities (people, organizations, locations). Edge strength is measured by Jaccard similarity between their entity sets.
How the Link Graph Works
- Each blog is a node; each arrow is a directed edge from the linking blog to the linked blog
- Edge weight = number of posts that contain the cross-blog link
- Number displayed on each arrow shows the post count
- Bidirectional links are drawn with a curve so both directions are visible
- Click any edge to see which specific posts created that connection, with dates and full post reader
How the Entity Network Works
- Named entities (people, organizations, locations) are extracted from every post using SpaCy
- Two blogs are connected if they share at least one entity
- Edge strength is measured by Jaccard similarity between each blog's entity set
- Higher similarity (closer to 1.0) means the blogs consistently cover the same real-world subjects
- Click any edge to see the shared entities, then click an entity to open knowledge-graph data and linked posts
Jaccard Similarity Formula
J(A, B) = |A ∩ B| / |A ∪ B|
A and B are the entity sets for two blogs. |A ∩ B| is the number of shared entities; |A ∪ B| is the total unique entities across both blogs. Result ranges from 0 (no overlap) to 1 (identical entity coverage).
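The formula translates directly to code; the entity sets below are made-up examples:

```python
def jaccard(a: set, b: set) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

blog_a = {"NATO", "Brussels", "Ursula von der Leyen"}
blog_b = {"NATO", "Brussels", "Washington"}
# 2 shared entities out of 4 unique entities overall -> 0.5
print(jaccard(blog_a, blog_b))
```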
Similarity Score Histogram
- Shows the distribution of Jaccard scores across all blog pairs, split into 10 equal bins from 0 to 1
- Helps you see at a glance whether blog pairs are loosely or tightly connected
- Use the Min Similarity slider to filter out weak connections and focus on the strongest overlaps
Interactive Features
1. Blog Filter — select a subset of blogs to reduce graph clutter and focus on specific sources
2. Date Range Filter — narrow both graphs to connections that appeared within a chosen time window
3. Entity Drilldown — click a shared entity to view knowledge-graph data (type, description, categories, relationships) and all posts from both blogs mentioning it
4. Post Reader — open any linked post in a full reader panel without leaving the page
5. PDF Export — full-page export and individual chart saving to custom reports via the toolbar
Models & Libraries Used
- SpaCy (en_core_web_trf) — transformer-based Named Entity Recognition for extracting people, organizations, and locations from post content
- react-force-graph-2d — physics-based force-directed graph layout rendered on canvas
- D3.js force simulation — charge, link distance, and centering forces for readable node positioning
Clustering Analysis
Groups posts based on semantic meaning rather than shared keywords. Brings together discussions that use different language to express similar ideas, and automatically determines the optimal number of discussion groups.
How Clusters Are Calculated
- Each post is embedded into a dense vector x_i that captures semantic meaning
- Vectors are L2-normalized: x_hat_i = x_i / ||x_i|| so cosine comparisons are stable
- Candidate values of K are adaptively tested and scored with silhouette score
- KMeans (or MiniBatchKMeans for larger datasets) assigns each post to cluster c_i and minimizes within-cluster distance
- Each cluster center (centroid) is recomputed as the mean vector of its assigned posts
- Post-processing refines groups by splitting incoherent clusters and merging near-duplicate clusters
KMeans Objective Function
J = Sum_i ||x_i - mu_(c_i)||^2
KMeans chooses assignments c_i and centroids mu_k that minimize total within-cluster variance.
Centroid Formula
mu_k = (1 / N_k) * Sum_(i in C_k) x_i
mu_k is the centroid of cluster k, C_k is the set of posts in that cluster, and N_k is cluster size.
Silhouette Score (Choosing K)
s(i) = (b(i) - a(i)) / max(a(i), b(i))
a(i) is average distance to the same cluster, b(i) is best average distance to a different cluster. Higher is better.
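A minimal sketch of silhouette-based K selection, assuming scikit-learn's KMeans and synthetic stand-in embeddings (the real pipeline scores an adaptive candidate range over actual post vectors):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three random cluster directions stand in for topical groups of posts
centers = rng.normal(size=(3, 8))
X = np.vstack([c + rng.normal(0, 0.1, size=(30, 8)) for c in centers])
X = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize as in the pipeline

best_k, best_score = None, -1.0
for k in range(2, 6):  # candidate K values; the production range is adaptive
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
```

On this toy data the silhouette peaks at the true group count; on real embeddings the curve is flatter, which is why the post-processing split/merge step exists.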
Models & Libraries Used
- Qwen3 Embedding 8B - transformer model to represent posts based on semantic meaning
- scikit-learn - KMeans/MiniBatchKMeans clustering, silhouette scoring, IncrementalPCA, and similarity calculations
- NumPy - vector normalization and centroid computations
- CountVectorizer + TF-IDF style weighting - primary keyword candidate scoring
- KeyBERT - optional refinement over candidate keywords
How Keywords Are Generated
- For each cluster, representative posts near the centroid are sampled
- Candidate keyphrases are generated with c-TF-IDF-style scoring over cleaned n-grams
- KeyBERT may optionally refine the candidate list
- Keyword diversity and low-signal filtering are applied before final topic labels are stored
Graph 1: Cluster Landscape (Scatter Plot Axes)
- High-dimensional post/cluster embeddings are projected to 2D using IncrementalPCA
- X-axis = Thematic Dimension 1 (PCA component 1, no physical unit)
- Y-axis = Thematic Dimension 2 (PCA component 2, no physical unit)
- Points near each other are semantically similar; far points are semantically different
- Large bubbles are cluster centroids; bubble size represents post volume in that cluster
Graph 2: Temporal Evolution (Line/Area Chart Axes)
- X-axis = time (publication date bins)
- Y-axis = number of posts assigned to each cluster at that time
- Each colored series tracks how one cluster grows, shrinks, or stays stable over time
Semantic Similarity Formula
cos(mu_a, mu_b) = (mu_a . mu_b) / (||mu_a|| * ||mu_b||)
Used to measure semantic overlap between clusters a and b (range: -1 to 1, typically 0 to 1 after normalization).
Graph 3: Cluster Relationships (Chord Diagram)
- Each outer arc is one cluster; larger arc span means higher total relationship weight
- A ribbon between two arcs indicates semantic overlap between those two clusters
- Ribbon thickness is proportional to cosine similarity strength
- Very weak similarities are filtered so only meaningful connections are shown
Narrative Detection
Identifies recurring interpretive frames within semantic clusters. Narratives are generated per cluster and linked to posts using embedding-based similarity scoring.
Input Structure
- Narrative detection operates on previously computed semantic clusters
- Each cluster contains a short summary, keywords, and associated posts
- Cluster spread is computed as the average cluster_distance across posts
Narrative Mapping Flow
Cluster → Generate Narratives → Embed Narratives → Compare with Post Embeddings → Compute Cosine Similarity → Assign Post if similarity ≥ 0.35. Each post can link only to narratives within its original cluster.
1. Start with one semantic cluster
2. Generate structured narratives for that cluster
3. Convert each narrative into an embedding vector
4. Compare narrative vectors with post vectors using cosine similarity
5. Assign a post to a narrative if similarity ≥ 0.35
Spread-Based Narrative Count
Spread_c = mean(cluster_distance_i)
Average semantic distance of posts from their cluster centroid determines cluster diversity.
Narrative Scaling Rule
- If Spread < 0.08 → 2 narratives
- If 0.08 ≤ Spread < 0.15 → 3 narratives
- If 0.15 ≤ Spread < 0.22 → 4 narratives
- If Spread ≥ 0.22 → 5 narratives
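The scaling rule above is a direct threshold lookup:

```python
def narrative_count(spread: float) -> int:
    """Map cluster spread (mean distance from centroid) to the number
    of narratives to generate, per the scaling rule."""
    if spread < 0.08:
        return 2
    if spread < 0.15:
        return 3
    if spread < 0.22:
        return 4
    return 5
```

Tighter clusters get fewer narratives because a low spread already signals a single coherent discussion.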
Narrative Representation
v_n = Embed(title + description + keywords)
Each generated narrative is converted into a dense embedding vector using the same embedding model applied to posts.
Post Representation
x_i = Embed(post_title + post_content)
Post embeddings are reused from clustering and represent semantic meaning.
Narrative–Post Similarity
cos(v_n, x_i) = (v_n · x_i) / (||v_n|| ||x_i||)
Cosine similarity measures semantic alignment between narrative and post.
Assignment Rule
Assign(n, i) if cos(v_n, x_i) ≥ 0.35
Only posts exceeding the similarity threshold (0.35) are linked to a narrative.
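A sketch of the assignment rule, with small NumPy arrays standing in for the real narrative and post embeddings:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.35  # from the assignment rule above

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def assign_posts(narrative_vec: np.ndarray, post_vecs: np.ndarray) -> list:
    """Indices of posts whose similarity to the narrative meets the threshold."""
    return [i for i, x in enumerate(post_vecs)
            if cosine(narrative_vec, x) >= SIMILARITY_THRESHOLD]

# Toy 3-d vectors for illustration (real embeddings are high-dimensional)
narrative = np.array([1.0, 0.0, 0.0])
posts = np.array([[0.9, 0.1, 0.0],   # strongly aligned -> assigned
                  [0.0, 1.0, 0.0],   # orthogonal -> not assigned
                  [0.5, 0.5, 0.0]])  # cos ≈ 0.71 -> assigned
assigned = assign_posts(narrative, posts)
```

Because candidates are restricted to the post's original cluster, the threshold only has to separate narratives within an already-coherent group.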
Models Used
- Qwen3 Embedding 8B - semantic vector representations
- Qwen3-VL 32B - generation of structured narrative objects per cluster
- scikit-learn - cosine similarity computation
Method Summary
Narrative detection builds directly on semantic clustering. Clusters define coherent discussion groups. A language model generates structured narrative candidates per cluster. Narrative embeddings are then compared to post embeddings using cosine similarity. Posts are linked to narratives through a fixed similarity threshold, ensuring assignment is quantitatively grounded.
Reports & Exports
Reports let you save insights without rerunning analysis. You can export findings or build collections over time.
Report Workflow
1. Click the bookmark icon on any chart to add it to a report
2. Name and save your report from the Reports page
3. Continue exploring or return to add more insights later
4. Export as PDF to share with colleagues in one click
Key Benefits
- Save insights without rerunning the full analysis pipeline
- Build collections of charts over time for weekly or monthly roundups
- Share findings with colleagues via PDF export
- Snapshot-based - reports reflect data at time of saving
Topic Analysis Methodology
Hybrid LDA + LLM pipeline: Latent Dirichlet Allocation probabilistically discovers topic clusters from all posts, modeling each document as a mixture of topics and each topic as a mixture of words. A single LLM call then names them. Topic weights are assigned via TF-IDF cosine similarity, and information-theoretic NRT metrics (Novelty, Resonance, Transience) measure how topics evolve over time. Multi-level cascade analysis at tracker, blog, and blogger levels with vectorized NumPy computation.
Analysis Pipeline
1. Data Collection — retrieve posts with full text, metadata, and temporal ordering. HTML, JS, CSS, ad-tech tokens, and encoded strings are stripped during preprocessing
2. LDA Topic Discovery — build a term-frequency matrix (max_features=8000, ngram_range=(1,2)) from all posts via CountVectorizer, then fit Latent Dirichlet Allocation (online learning, doc_topic_prior=1/k, topic_word_prior=0.01) to discover 5 topic clusters with top-15 keywords each
3. LLM Topic Naming — a single LLM call receives keyword lists and returns concise 2–4 word labels for each topic. If the LLM fails, auto-labels are generated from the top keywords
4. Topic Distribution — uses the fitted LDA model's document-topic distributions directly. Each post gets a probability vector over topics, normalized to sum to 1. For cascade reuse, subsets index into the pre-computed distribution matrix
5. NRT Dynamics — compute Novelty, Resonance, and Transience for each post using vectorized all-pairs KL divergence with a sliding window of 20 posts
6. Database Storage — batch INSERT results with cascade support: tracker → blog → blogger, reusing tracker-level topics downstream
LDA Generative Model
P(w|d) = Σₖ P(w|zₖ) · P(zₖ|d)
Each document d is a mixture of topics z with Dirichlet prior α, and each topic is a distribution over words with Dirichlet prior β. The model is fit via online variational Bayes for scalability.
LDA Topic Distribution
P(topic_k | post_i) = θ_ik where θ ~ Dir(α)
The document-topic distribution θ is inferred by LDA during fitting. Each row of the W matrix is a post's topic mixture, already a valid probability distribution.
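A toy version of the fit described above, assuming scikit-learn; the corpus, the choice of 2 components, and the priors mirror the pipeline's shape but not its scale:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "central bank raises interest rates amid inflation",
    "inflation pressures push bank policy and rates higher",
    "new transformer model tops language benchmark",
    "open source language model released with benchmark results",
]

# Term-frequency matrix; production uses max_features=8000, ngram_range=(1, 2)
tf = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# k=2 for the toy corpus; production discovers 5 topics with online learning
lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                doc_topic_prior=1 / 2, topic_word_prior=0.01,
                                random_state=0)
W = lda.fit_transform(tf)  # document-topic matrix θ, one row per post

# Each row is a post's topic mixture and already sums to 1
assert np.allclose(W.sum(axis=1), 1.0)
```

Subsetting rows of W is what makes the blog- and blogger-level cascade reuse cheap: no refitting is required.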
KL Divergence
D_KL(P || Q) = Σᵢ P(i) · log₂(P(i) / Q(i))
All-pairs KL divergence is vectorized via NumPy broadcasting.
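A sketch of the broadcast trick, assuming row-stochastic distribution matrices; the eps clipping is a common zero-guard, not necessarily the pipeline's exact handling:

```python
import numpy as np

def all_pairs_kl(P: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """D[i, j] = KL(P[i] || P[j]) in bits, for rows of distribution matrix P.

    Broadcasting (n, 1, k) against (1, n, k) materializes all pairs at once.
    eps guards against log(0) and division by zero.
    """
    P = np.clip(P, eps, None)
    P = P / P.sum(axis=1, keepdims=True)   # re-normalize rows after clipping
    ratio = P[:, None, :] / P[None, :, :]  # shape (n, n, k)
    return np.sum(P[:, None, :] * np.log2(ratio), axis=2)

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.2, 0.7]])
D = all_pairs_kl(P)
# The diagonal is zero (a distribution diverges from itself by 0);
# KL is asymmetric in general, so D[i, j] != D[j, i] need not hold.
```

For very large n the (n, n, k) intermediate is memory-heavy, which is one reason the pipeline combines broadcasting with a bounded sliding window.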
NRT Metric Definitions
1. Novelty(i) = mean( D_KL(post_i || post_j) ) for j in [max(0, i−20), i) — average divergence from the preceding 20 posts. Higher values indicate greater departure from recent discussion
2. Transience(i) = mean( D_KL(post_j || post_i) ) for j in (i, min(n, i+21)] — average divergence of following posts from the current post. Higher values indicate less lasting influence
3. Resonance(i) = Novelty(i) − Transience(i) — positive values indicate new topics that persist in subsequent discourse; negative values indicate topics that fade quickly
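The three definitions can be sketched with a plain loop over an all-pairs KL matrix D (the production code is vectorized; this version favors readability):

```python
import numpy as np

def nrt_metrics(D: np.ndarray, window: int = 20):
    """Novelty, Transience, Resonance from an all-pairs KL matrix D,
    where D[i, j] = KL(post_i || post_j) and posts are in temporal order.

    Posts without a full past or future window use whatever neighbors
    exist; posts with none get NaN.
    """
    n = D.shape[0]
    novelty = np.full(n, np.nan)
    transience = np.full(n, np.nan)
    for i in range(n):
        past = range(max(0, i - window), i)
        future = range(i + 1, min(n, i + window + 1))
        if past:
            novelty[i] = np.mean([D[i, j] for j in past])    # divergence FROM the past
        if future:
            transience[i] = np.mean([D[j, i] for j in future])  # divergence OF the future from post i
    resonance = novelty - transience
    return novelty, transience, resonance
```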
What NRT Patterns Reveal
- High Novelty + High Resonance — posts introducing new, lasting topics (topic leadership)
- High Novelty + Low Resonance — posts introducing topics that don't catch on (failed innovations)
- Low Novelty + Low Transience — posts reinforcing existing stable topics (discussion maintenance)
- High Transience — topics that appear briefly then disappear (trends or noise)
Cascade Analysis System
1. Tracker-Level Analysis — LDA discovers broad topics across all blogs in the tracker and produces document-topic distributions for every post in a single fit
2. Blog-Level Analysis — reuses tracker-level topics (cascade_reuse_topics=True), slices the pre-computed LDA distribution matrix per blog, and computes NRT dynamics independently per blog's temporal ordering
3. Blogger-Level Analysis — same cascade reuse, groups posts by author, computes per-blogger topic distributions and NRT metrics
4. Parallel Processing — blogs and bloggers are analyzed simultaneously via ThreadPoolExecutor with configurable worker count (default 20 threads)
Models & Libraries
- Configurable LLM (default `gemma3-27b`) — used only for topic naming (single call per analysis); falls back to auto-generated labels from LDA keywords if unavailable
- scikit-learn — term-frequency vectorization (`CountVectorizer`) for LDA input, Latent Dirichlet Allocation (`LatentDirichletAllocation`) for both topic discovery and post-topic weight assignment
- NumPy/SciPy — vectorized KL divergence computation and sparse matrix operations for memory-efficient processing of 100k–200k posts
Academic Foundations
Stine, Z. K., & Agarwal, N.
Barron, A. T. J., Huang, J., Spang, R. L., & DeDeo, S. (2018). Individuals, institutions, and innovation in the debates of the French Revolution. Proceedings of the National Academy of Sciences, 115(18), 4607–4612. https://doi.org/10.1073/pnas.1717729115