Technical Guide
Analysis methods, models, and metrics behind BlogTracker
What Happens After You Create a Tracker
Once a tracker is created, BlogTracker automatically begins a background analysis workflow. This runs asynchronously and does not require you to stay on the page.
Pipeline Steps
- Source pages are fetched and cleaned
- Text is normalized into structured content
- Posts are compared by semantic meaning
- Insights are generated and stored
Content Collection & Processing
BlogTracker automatically collects new posts from your added sources and extracts the main article content while removing non-essential elements like navigation and ads.
Processing Steps
- Posts collected automatically from RSS feeds and URLs
- Article content extracted, navigation and ads removed
- Text cleaned, standardized, and stored
- Source and time metadata preserved
- Consistent foundation created for clustering, sentiment, and narrative detection
Sentiment Analysis
Analyzes the overall emotional tone of posts as positive, neutral, or negative. Goes beyond keywords by evaluating full sentence structure and contextual meaning, enabling tracking of emotional shifts across discussions over time.
How Sentiment Is Calculated
- Each post is processed using a large language model that evaluates semantic context
- The model considers word choice, phrasing, and sentence relationships together
- A continuous sentiment score is produced on a negative-to-positive spectrum
- Scores are normalized and mapped to clear sentiment labels
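The final normalization step can be sketched as a simple mapping from a continuous score to a label. The ±0.2 cutoffs below are hypothetical placeholders for illustration, not BlogTracker's actual thresholds:

```python
def label_sentiment(score: float) -> str:
    """Map a normalized sentiment score in [-1, 1] to a label.

    The +/-0.2 cutoffs are illustrative assumptions; the production
    mapping may use different boundaries.
    """
    if score >= 0.2:
        return "positive"
    if score <= -0.2:
        return "negative"
    return "neutral"
```

Keeping the continuous score alongside the label preserves intensity information for the timeline and heatmap views.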
Models & Libraries Used
- gpt-oss-120b - transformer-based language model used for contextual sentiment understanding
- Semantic interpretation used instead of rule-based or keyword-only approaches
Sentiment Visualizations
1. Sentiment Timeline: Tracks sentiment scores over time, highlighting emotional spikes and shifts aligned with major events
2. Sentiment Heatmap: Displays individual posts across time, where color indicates sentiment polarity and bubble size represents posting volume
3. Sentiment Distribution: Shows the proportion of positive, neutral, and negative posts for a high-level emotional snapshot
4. Toxicity Levels: Groups posts by toxicity intensity to separate mild negativity from more extreme language
Psycholinguistic Radar Analysis
Extends sentiment analysis by examining how language is used, not just how it feels. Breaks down posts into cognitive, emotional, and behavioral dimensions visualized on radar charts.
Radar Dimensions
- Toxicity: Detects insults, profanity, threats, identity attacks, and sexually explicit language
- Personal Content: Measures references to work, money, home, religion, leisure, and death
- Time Orientation: Identifies focus on past events, present urgency, or future predictions
- Core Drives: Captures achievement, risk prevention, power, reward focus, and affiliation signals
- Cognitive Process: Reflects certainty, tentativeness, insight, and causal thinking
- Summary Variables: High-level indicators - analytical thinking, clout, authenticity, emotional tone
- Emotion: Detects anger, anxiety, sadness, negative emotion, and positive emotion
Posting Frequency Analysis
Tracks publishing patterns across blogs over time. Identifies peak activity periods and seasonal trends, and surfaces important entities mentioned across blogs.
Temporal Analysis Methods
1. Main Timeline Analysis: shows publication volume over time with blog-level breakdown
2. Day-of-Week Patterns: reveals preferred publishing days across the week
3. Monthly Aggregations: shows seasonal trends across all years combined
4. Interactive Zooming: allows detailed examination of specific time periods
5. Word Cloud: shows the most common words across blogs
6. Entity Extraction: identifies key people, places, and organizations mentioned across blogs
Interactive Features
- Click legend items to filter specific blogs in/out of visualizations
- Drag-select on timeline charts to zoom into specific date ranges
- Stacked/overlay toggles for different visualization perspectives
- Multiple granularity options (day, week, month, quarter)
How are entities extracted?
- We use SpaCy's Named Entity Recognition model to identify and categorize named entities (people, organizations, locations, etc.) from input text
- Each entity is displayed with its type and frequency across the tracker
- We use SpaCy's most advanced pre-trained model (en_core_web_trf) which is based on transformer architecture for high accuracy in entity recognition
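The type-and-frequency display described above can be sketched as a simple aggregation. The SpaCy call is shown only in comments (it assumes the en_core_web_trf model is installed); the sample tuples below are illustrative, not real extraction output:

```python
from collections import Counter

# In production the (text, label) tuples come from SpaCy, e.g.:
#   import spacy
#   nlp = spacy.load("en_core_web_trf")  # requires the model to be downloaded
#   entities = [(ent.text, ent.label_) for ent in nlp(post_text).ents]
# Here we use a pre-extracted sample so the sketch is self-contained.
entities = [
    ("OpenAI", "ORG"), ("Paris", "GPE"),
    ("OpenAI", "ORG"), ("Ada Lovelace", "PERSON"),
]

# Frequency of each (entity, type) pair across the tracker
freq = Counter(entities)
for (text, label), count in freq.most_common():
    print(f"{text} ({label}): {count}")
```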
Export & Reporting
- Full-page PDF export with all visualizations and insights
- Individual chart saving to custom reports
- Interactive data table with filtering and sorting capabilities
- Cross-filtering between visualizations for deeper analysis
Topic Distribution Analysis
Identifies and tracks themes across blog content using advanced topic modeling. Shows how topics distribute across different blogs and over time, and reveals relationships and co-occurrence between topics.
How Topics Are Calculated
- Each blog's posts are analyzed using an LLM combined with statistical topic modeling
- Each post receives topic weights indicating multiple theme contributions
- Topics automatically labeled based on keyword patterns and semantic meaning
- Dominant topics identified for each post based on highest weight
Models & Libraries Used
- Hybrid Approach: configurable LLM (`process.env.LLM_MODEL`, fallback `gpt-oss-120b`) for topic extraction and consolidation
- scikit-learn: used for statistical topic modeling and topic-weight calculations
- D3.js: used for chord diagram visualization of topic relationships
Topic Relationships (Chord Diagram)
- Visualizes how topics co-occur in the same posts
- Arc length represents topic dominance across all content
- Ribbon thickness indicates co-occurrence strength between topics
- Interactive hover reveals detailed relationship percentages
Blog Topic Distribution Patterns
- Small multiples show topic distribution across individual blogs
- Stacked bar charts reveal each blog's thematic focus
- Pagination handles analysis of large blog collections
- Identifies blogs with specialized vs. broad content focus
Blog Distribution Analysis
Analyzes publishing patterns across multiple blogs within a tracker. Combines posting frequency with topic dynamics analysis, enabling comparison between blogs' publishing behavior and topic dominance.
Analysis Levels
1. Tracker-Wide Analysis - shows overall patterns across all blogs in the tracker
2. Blog-Level Analysis - focuses on an individual blog's publishing behavior and topic focus
3. Topic-Specific Filtering - allows drilling down into specific thematic areas
Topic Dynamics Metrics
- Novelty - measures how different content is from recent posts
- Resonance - quantifies lasting impact and influence of content
- Transience - tracks how quickly topics change after publication
- Trend lines show topic evolution patterns over time
Advanced Metrics Display
- Total post count with active blog breakdown
- Topic dominance percentage showing thematic focus areas
- Content freshness indicator showing days since latest publication
- Real-time updates based on filter selections
Blogger Distribution Analysis
Analyzes content patterns at the individual blogger/author level. Provides insights into individual content creator behaviors and writing styles, enabling comparison between different bloggers within the same tracker.
Analysis Levels
1. Tracker-Wide Analysis - overall patterns across all bloggers in the tracker
2. Blogger-Level Analysis - an individual blogger's publishing behavior and topic focus
3. Individual Performance Metrics - tracks each blogger's unique contribution patterns
Blogger Topic Dynamics Metrics
- Novelty: how different each blogger's content is from their recent posts
- Resonance: lasting impact and influence of each blogger's content
- Transience: how quickly topics change after each blogger's publications
- Trend lines show each blogger's topic evolution patterns over time
Publishing Pattern Analysis
- Day-of-week patterns showing when bloggers prefer to publish
- Monthly aggregated patterns revealing seasonal blogging behaviors
- Interactive tooltips with blogger-specific statistics
Keyword Analysis
Keyword analysis helps BlogTracker identify and track the most important topics and terms within a set of posts or blogs, making it easier to see which themes are emerging, gaining traction, or declining over time.
How Keywords Are Processed
- Each post is processed to extract and count keywords along with their contexts
- Frequency is measured across posts, blogs, and time periods
- Total frequency, post count, and distinct blog count computed per keyword
- Signals distinguish widely-discussed topics from source-concentrated ones
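The per-keyword aggregation above (total frequency, post count, distinct blog count) can be sketched as follows, using hypothetical post data in place of the real storage layer:

```python
from collections import defaultdict

# Hypothetical input: (blog_id, {keyword: count_in_post}) per post
posts = [
    ("blog_a", {"ai": 3, "privacy": 1}),
    ("blog_a", {"ai": 1}),
    ("blog_b", {"privacy": 2}),
]

stats = defaultdict(lambda: {"total": 0, "posts": 0, "blogs": set()})
for blog_id, counts in posts:
    for kw, n in counts.items():
        stats[kw]["total"] += n     # total frequency across all posts
        stats[kw]["posts"] += 1     # number of posts mentioning the keyword
        stats[kw]["blogs"].add(blog_id)  # distinct sources

# "ai" is frequent but concentrated in one blog; "privacy" appears
# in two distinct blogs, i.e. it is more widely discussed.
for kw, s in sorted(stats.items()):
    print(kw, s["total"], s["posts"], len(s["blogs"]))
```

Comparing post count against distinct blog count is what separates widely-discussed topics from source-concentrated ones.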
Aggregation & Trend Detection
- Keyword data aggregated across selected timeframes or sources
- Enables comparison of topics and observation of attention shifts over time
- Highlights which topics matter most and how interest rises or falls
- Reveals how narratives take shape across the broader content landscape
Morality Analysis
Examines content through the lens of Moral Foundations Theory, identifying how different moral values are expressed across posts. Evaluates both virtues (positive moral foundations) and vices (negative counterparts).
Moral Foundations Framework
1. Care/Harm - compassion and empathy vs. cruelty, violence, and neglect
2. Fairness/Cheating - justice and rights vs. deception, corruption, and exploitation
3. Loyalty/Betrayal - solidarity and patriotism vs. treachery and disloyalty
4. Authority/Subversion - respect for tradition vs. rebellion, disrespect, and chaos
5. Sanctity/Degradation - purity and sacredness vs. contamination and profanity
How Morality Scores Are Calculated
- Each post analyzed using `gpt-oss-120b`
- Model identifies moral-relevant language and assigns scores across all ten dimensions
- Scores normalized to percentages for consistent comparison
- Virtue scores represent positive moral expression; vice scores capture negative framing
Diverging Bar Chart Visualization
- Virtues (positive values) extend to the right with green coloring
- Vices (negative values) extend to the left with red coloring
- The zero line represents moral neutrality
- Chart width indicates the strength of moral emphasis
Post-Level Deep Analysis
- Click any post to reveal a two-panel detailed view
- Left Panel: full post content with sentiment, toxicity, and moral metadata
- Right Panel: interactive radar chart showing the post's complete moral profile
- Radar chart navigation allows switching between moral dimension categories
Key Academic References
Graham, J., Haidt, J., & Nosek, B. A. (2009). Liberals and conservatives rely on different sets of moral foundations. Journal of Personality and Social Psychology, 96(5), 1029–1046. https://doi.org/10.1037/a0015141
Okeke, O., Cakmak, M. C., Onyepunuka, U., Spann, B., & Agarwal, N. (2023). Evaluating emotion and morality bias in YouTube's recommendation algorithm for the China–Uyghur crisis. In Proceedings of the International Conference on Social, Behavioral, and Economic Sciences (SBP-BRiMS 2023), Pittsburgh, PA, United States.
Mbila-Uma, S., Umoga, I., Alassad, M., & Agarwal, N. (2023). Conducting morality and emotion analysis on South China Sea blog discourse. In Proceedings of the International Conference on Collaboration Technologies and Social Computing (CollabTech 2023), Osaka, Japan.
Influence Analysis
Quantifies the persuasive impact and reach of bloggers and their content by combining engagement metrics with AI-assessed content quality.
How Influence Score Is Calculated
The influence score combines quantitative engagement metrics with qualitative content assessment to produce a holistic measure of a post's impact. Each post receives an individual score, which is then aggregated to the blogger and blog levels for comparative analysis.
Post-Level Formula
influence_score = (num_inlinks - num_outlinks + num_comments) × content_quality_score
Where content_quality_score is an LLM-assessed value between 0.0 and 1.0 based on persuasive writing quality.
Content Quality Assessment
- Each post is evaluated by a configured large language model for persuasive power and articulation
- Assessment dimensions include persuasive power, intellectual depth, readability, authenticity, and memorability
- Scores range from 0.0 (negligible impact) to 1.0 (exceptionally influential)
- The quality score acts as a multiplier, amplifying the influence of well-crafted content
- Evaluation focuses purely on content substance, ignoring metadata or source popularity
Aggregation Logic
- Blogger influence: Average of all their posts' influence scores
- Blog influence: Average influence of all posts on that blog
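The post-level formula and the aggregation step translate directly to code; the engagement numbers below are illustrative:

```python
def influence_score(num_inlinks: int, num_outlinks: int, num_comments: int,
                    content_quality_score: float) -> float:
    """Post-level influence, following the formula above.

    content_quality_score is the LLM-assessed value in [0.0, 1.0],
    acting as a multiplier on the engagement balance.
    """
    return (num_inlinks - num_outlinks + num_comments) * content_quality_score

# Blogger influence = average of that blogger's post scores
posts = [  # (inlinks, outlinks, comments, quality) -- illustrative values
    (10, 2, 5, 0.8),  # well-linked, well-written post
    (3, 1, 0, 0.5),   # modest post
]
scores = [influence_score(*p) for p in posts]
blogger_influence = sum(scores) / len(scores)
```

Note that a high-quality post with little engagement still scores low, and a heavily-linked post written poorly is dampened by the multiplier.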
Core Visualizations
1. Influence Over Time: Area chart tracking individual blogger influence scores across customizable time grains (daily, monthly, quarterly, yearly)
2. Blogger Activity vs Influence: Scatter plot positioning bloggers by post volume against influence score, with quadrant analysis highlighting high-impact voices
3. Top Keywords: Word cloud visualization showing the most frequent and significant terms used by selected bloggers
4. Posts Table: Sortable list of all posts with filtering by blogger, date range, and search terms
Quadrant Analysis
- High Activity · High Influence: Key opinion leaders driving both volume and impact
- Low Activity · High Influence: Selective but highly persuasive voices (quality over quantity)
- High Activity · Low Influence: Prolific but less impactful contributors
- Low Activity · Low Influence: Emerging or peripheral voices
Interactive Features
- Multi-blogger selection: Compare influence trends across multiple bloggers simultaneously
- Time grain control: Toggle between daily, monthly, quarterly, and yearly aggregation
- Zoom functionality: Click and drag on the main chart to focus on specific time ranges
- Click-to-select: Select bloggers directly from the scatter plot or legend
- Post reader: View full post content alongside the analysis table
Blog Network Analysis
Maps the connections between blogs in two ways: by cross-linking behavior and by shared named entities. Reveals which blogs reference each other, what subjects they cover in common, and how tightly they cluster around real-world people, organizations, and places.
Two View Modes
1. Link Graph — shows directed connections between blogs based on how many posts from one blog link to another. Arrow direction tells you which blog did the linking.
2. Entity Network — connects blogs that write about the same named entities (people, organizations, locations). Edge strength is measured by Jaccard similarity between their entity sets.
How the Link Graph Works
- Each blog is a node; each arrow is a directed edge from the linking blog to the linked blog
- Edge weight = number of posts that contain the cross-blog link
- Number displayed on each arrow shows the post count
- Bidirectional links are drawn with a curve so both directions are visible
- Click any edge to see which specific posts created that connection, with dates and full post reader
How the Entity Network Works
- Named entities (people, organizations, locations) are extracted from every post using SpaCy
- Two blogs are connected if they share at least one entity
- Edge strength is measured by Jaccard similarity between each blog's entity set
- Higher similarity (closer to 1.0) means the blogs consistently cover the same real-world subjects
- Click any edge to see the shared entities, then click an entity to open knowledge-graph data and linked posts
Jaccard Similarity Formula
J(A, B) = |A ∩ B| / |A ∪ B|
A and B are the entity sets for two blogs. |A ∩ B| is the number of shared entities; |A ∪ B| is the total unique entities across both blogs. Result ranges from 0 (no overlap) to 1 (identical entity coverage).
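The formula translates directly to code; the entity sets below are made-up examples:

```python
def jaccard(a: set, b: set) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

blog_a = {"NATO", "Brussels", "Ursula von der Leyen"}
blog_b = {"NATO", "Brussels", "Washington"}
# 2 shared entities out of 4 unique entities overall -> 0.5
print(jaccard(blog_a, blog_b))
```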
Similarity Score Histogram
- Shows the distribution of Jaccard scores across all blog pairs, split into 10 equal bins from 0 to 1
- Helps you see at a glance whether blog pairs are loosely or tightly connected
- Use the Min Similarity slider to filter out weak connections and focus on the strongest overlaps
Interactive Features
1. Blog Filter — select a subset of blogs to reduce graph clutter and focus on specific sources
2. Date Range Filter — narrow both graphs to connections that appeared within a chosen time window
3. Entity Drilldown — click a shared entity to view knowledge-graph data (type, description, categories, relationships) and all posts from both blogs mentioning it
4. Post Reader — open any linked post in a full reader panel without leaving the page
5. PDF Export — full-page export and individual chart saving to custom reports via the toolbar
Models & Libraries Used
- SpaCy (en_core_web_trf) — transformer-based Named Entity Recognition for extracting people, organizations, and locations from post content
- react-force-graph-2d — physics-based force-directed graph layout rendered on canvas
- D3.js force simulation — charge, link distance, and centering forces for readable node positioning
Clustering Analysis
Groups posts based on semantic meaning rather than shared keywords. Brings together discussions that use different language to express similar ideas, and automatically determines the optimal number of discussion groups.
How Clusters Are Calculated
- Each post is embedded into a dense vector x_i that captures semantic meaning
- Vectors are L2-normalized: x_hat_i = x_i / ||x_i|| so cosine comparisons are stable
- Candidate values of K are adaptively tested and scored with silhouette score
- KMeans (or MiniBatchKMeans for larger datasets) assigns each post to cluster c_i and minimizes within-cluster distance
- Each cluster center (centroid) is recomputed as the mean vector of its assigned posts
- Post-processing refines groups by splitting incoherent clusters and merging near-duplicate clusters
KMeans Objective Function
J = Sum_i ||x_i - mu_(c_i)||^2
KMeans chooses assignments c_i and centroids mu_k that minimize total within-cluster variance.
Centroid Formula
mu_k = (1 / N_k) * Sum_(i in C_k) x_i
mu_k is the centroid of cluster k, C_k is the set of posts in that cluster, and N_k is cluster size.
Silhouette Score (Choosing K)
s(i) = (b(i) - a(i)) / max(a(i), b(i))
a(i) is average distance to the same cluster, b(i) is best average distance to a different cluster. Higher is better.
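A minimal sketch of silhouette-based K selection, assuming scikit-learn's KMeans and synthetic stand-in embeddings (the real pipeline scores an adaptive candidate range over actual post vectors):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three random cluster directions stand in for topical groups of posts
centers = rng.normal(size=(3, 8))
X = np.vstack([c + rng.normal(0, 0.1, size=(30, 8)) for c in centers])
X = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize as in the pipeline

best_k, best_score = None, -1.0
for k in range(2, 6):  # candidate K values; the production range is adaptive
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
```

On this toy data the silhouette peaks at the true group count; on real embeddings the curve is flatter, which is why the post-processing split/merge step exists.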
Models & Libraries Used
- Qwen3 Embedding 8B - transformer model to represent posts based on semantic meaning
- scikit-learn - KMeans/MiniBatchKMeans clustering, silhouette scoring, IncrementalPCA, and similarity calculations
- NumPy - vector normalization and centroid computations
- CountVectorizer + TF-IDF style weighting - primary keyword candidate scoring
- KeyBERT - optional refinement over candidate keywords
How Keywords Are Generated
- For each cluster, representative posts near the centroid are sampled
- Candidate keyphrases are generated with c-TF-IDF-style scoring over cleaned n-grams
- KeyBERT may optionally refine the candidate list
- Keyword diversity and low-signal filtering are applied before final topic labels are stored
Graph 1: Cluster Landscape (Scatter Plot Axes)
- High-dimensional post/cluster embeddings are projected to 2D using IncrementalPCA
- X-axis = Thematic Dimension 1 (PCA component 1, no physical unit)
- Y-axis = Thematic Dimension 2 (PCA component 2, no physical unit)
- Points near each other are semantically similar; far points are semantically different
- Large bubbles are cluster centroids; bubble size represents post volume in that cluster
Graph 2: Temporal Evolution (Line/Area Chart Axes)
- X-axis = time (publication date bins)
- Y-axis = number of posts assigned to each cluster at that time
- Each colored series tracks how one cluster grows, shrinks, or stays stable over time
Semantic Similarity Formula
cos(mu_a, mu_b) = (mu_a . mu_b) / (||mu_a|| * ||mu_b||)
Used to measure semantic overlap between clusters a and b (range: -1 to 1, typically 0 to 1 after normalization).
Graph 3: Cluster Relationships (Chord Diagram)
- Each outer arc is one cluster; larger arc span means higher total relationship weight
- A ribbon between two arcs indicates semantic overlap between those two clusters
- Ribbon thickness is proportional to cosine similarity strength
- Very weak similarities are filtered so only meaningful connections are shown
Narrative Detection
Identifies recurring interpretive frames within semantic clusters. Narratives are generated per cluster and linked to posts using embedding-based similarity scoring.
Input Structure
- Narrative detection operates on previously computed semantic clusters
- Each cluster contains a short summary, keywords, and associated posts
- Cluster spread is computed as the average cluster_distance across posts
Narrative Mapping Flow
Cluster → Generate Narratives → Embed Narratives → Compare with Post Embeddings → Compute Cosine Similarity → Assign Post if similarity ≥ 0.35. Each post can link only to narratives within its original cluster.
1. Start with one semantic cluster
2. Generate structured narratives for that cluster
3. Convert each narrative into an embedding vector
4. Compare narrative vectors with post vectors using cosine similarity
5. Assign a post to a narrative if similarity ≥ 0.35
Spread-Based Narrative Count
Spread_c = mean(cluster_distance_i)
Average semantic distance of posts from their cluster centroid determines cluster diversity.
Narrative Scaling Rule
- If Spread < 0.08 → 2 narratives
- If 0.08 ≤ Spread < 0.15 → 3 narratives
- If 0.15 ≤ Spread < 0.22 → 4 narratives
- If Spread ≥ 0.22 → 5 narratives
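The scaling rule above is a direct threshold lookup:

```python
def narrative_count(spread: float) -> int:
    """Map cluster spread (mean distance from centroid) to the number
    of narratives to generate, per the scaling rule."""
    if spread < 0.08:
        return 2
    if spread < 0.15:
        return 3
    if spread < 0.22:
        return 4
    return 5
```

Tighter clusters get fewer narratives because a low spread already signals a single coherent discussion.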
Narrative Representation
v_n = Embed(title + description + keywords)
Each generated narrative is converted into a dense embedding vector using the same embedding model applied to posts.
Post Representation
x_i = Embed(post_title + post_content)
Post embeddings are reused from clustering and represent semantic meaning.
Narrative–Post Similarity
cos(v_n, x_i) = (v_n · x_i) / (||v_n|| ||x_i||)
Cosine similarity measures semantic alignment between narrative and post.
Assignment Rule
Assign(n, i) if cos(v_n, x_i) ≥ 0.35
Only posts exceeding the similarity threshold (0.35) are linked to a narrative.
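A sketch of the assignment rule, with small NumPy arrays standing in for the real narrative and post embeddings:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.35  # from the assignment rule above

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def assign_posts(narrative_vec: np.ndarray, post_vecs: np.ndarray) -> list:
    """Indices of posts whose similarity to the narrative meets the threshold."""
    return [i for i, x in enumerate(post_vecs)
            if cosine(narrative_vec, x) >= SIMILARITY_THRESHOLD]

# Toy 3-d vectors for illustration (real embeddings are high-dimensional)
narrative = np.array([1.0, 0.0, 0.0])
posts = np.array([[0.9, 0.1, 0.0],   # strongly aligned -> assigned
                  [0.0, 1.0, 0.0],   # orthogonal -> not assigned
                  [0.5, 0.5, 0.0]])  # cos ≈ 0.71 -> assigned
assigned = assign_posts(narrative, posts)
```

Because candidates are restricted to the post's original cluster, the threshold only has to separate narratives within an already-coherent group.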
Models Used
- Qwen3 Embedding 8B - semantic vector representations
- Qwen3-VL 32B - generation of structured narrative objects per cluster
- scikit-learn - cosine similarity computation
Method Summary
Narrative detection builds directly on semantic clustering. Clusters define coherent discussion groups. A language model generates structured narrative candidates per cluster. Narrative embeddings are then compared to post embeddings using cosine similarity. Posts are linked to narratives through a fixed similarity threshold, ensuring assignment is quantitatively grounded.
Reports & Exports
Reports let you save insights without rerunning analysis. You can export findings or build collections over time.
Report Workflow
1. Click the bookmark icon on any chart to add it to a report
2. Name and save your report from the Reports page
3. Continue exploring or return to add more insights later
4. Export as PDF to share with colleagues in one click
Key Benefits
- Save insights without rerunning the full analysis pipeline
- Build collections of charts over time for weekly or monthly roundups
- Share findings with colleagues via PDF export
- Snapshot-based - reports reflect data at time of saving
Topic Analysis Methodology
Hybrid LDA + LLM pipeline: Latent Dirichlet Allocation probabilistically discovers topic clusters from all posts, modeling each document as a mixture of topics and each topic as a mixture of words. A single LLM call then names them. Topic weights are assigned via TF-IDF cosine similarity, and information-theoretic NRT metrics (Novelty, Resonance, Transience) measure how topics evolve over time. Multi-level cascade analysis at tracker, blog, and blogger levels with vectorized NumPy computation.
Analysis Pipeline
1. Data Collection — retrieve posts with full text, metadata, and temporal ordering. HTML, JS, CSS, ad-tech tokens, and encoded strings are stripped during preprocessing
2. LDA Topic Discovery — build a term-frequency matrix (max_features=8000, ngram_range=(1,2)) from all posts via CountVectorizer, then fit Latent Dirichlet Allocation (online learning, doc_topic_prior=1/k, topic_word_prior=0.01) to discover 5 topic clusters with top-15 keywords each
3. LLM Topic Naming — a single LLM call receives keyword lists and returns concise 2–4 word labels for each topic. If the LLM fails, auto-labels are generated from the top keywords
4. Topic Distribution — uses the fitted LDA model's document-topic distributions directly. Each post gets a probability vector over topics, normalized to sum to 1. For cascade reuse, subsets index into the pre-computed distribution matrix
5. NRT Dynamics — compute Novelty, Resonance, and Transience for each post using vectorized all-pairs KL divergence with a sliding window of 20 posts
6. Database Storage — batch INSERT results with cascade support: tracker → blog → blogger, reusing tracker-level topics downstream
LDA Generative Model
P(w|d) = Σₖ P(w|zₖ) · P(zₖ|d)
Each document d is a mixture of topics z with Dirichlet prior α, and each topic is a distribution over words with Dirichlet prior β. The model is fit via online variational Bayes for scalability.
LDA Topic Distribution
P(topic_k | post_i) = θ_ik where θ ~ Dir(α)
The document-topic distribution θ is inferred by LDA during fitting. Each row of the W matrix is a post's topic mixture, already a valid probability distribution.
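A toy version of the fit described above, assuming scikit-learn; the corpus, the choice of 2 components, and the priors mirror the pipeline's shape but not its scale:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "central bank raises interest rates amid inflation",
    "inflation pressures push bank policy and rates higher",
    "new transformer model tops language benchmark",
    "open source language model released with benchmark results",
]

# Term-frequency matrix; production uses max_features=8000, ngram_range=(1, 2)
tf = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# k=2 for the toy corpus; production discovers 5 topics with online learning
lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                doc_topic_prior=1 / 2, topic_word_prior=0.01,
                                random_state=0)
W = lda.fit_transform(tf)  # document-topic matrix θ, one row per post

# Each row is a post's topic mixture and already sums to 1
assert np.allclose(W.sum(axis=1), 1.0)
```

Subsetting rows of W is what makes the blog- and blogger-level cascade reuse cheap: no refitting is required.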
KL Divergence
D_KL(P || Q) = Σᵢ P(i) · log₂(P(i) / Q(i))
All-pairs KL divergence is vectorized via NumPy broadcasting.
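A sketch of the broadcast trick, assuming row-stochastic distribution matrices; the eps clipping is a common zero-guard, not necessarily the pipeline's exact handling:

```python
import numpy as np

def all_pairs_kl(P: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """D[i, j] = KL(P[i] || P[j]) in bits, for rows of distribution matrix P.

    Broadcasting (n, 1, k) against (1, n, k) materializes all pairs at once.
    eps guards against log(0) and division by zero.
    """
    P = np.clip(P, eps, None)
    P = P / P.sum(axis=1, keepdims=True)   # re-normalize rows after clipping
    ratio = P[:, None, :] / P[None, :, :]  # shape (n, n, k)
    return np.sum(P[:, None, :] * np.log2(ratio), axis=2)

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.2, 0.7]])
D = all_pairs_kl(P)
# The diagonal is zero (a distribution diverges from itself by 0);
# KL is asymmetric in general, so D[i, j] != D[j, i] need not hold.
```

For very large n the (n, n, k) intermediate is memory-heavy, which is one reason the pipeline combines broadcasting with a bounded sliding window.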
NRT Metric Definitions
1. Novelty(i) = mean( D_KL(post_i || post_j) ) for j in [max(0, i−20), i) — average divergence from the preceding 20 posts. Higher values indicate greater departure from recent discussion
2. Transience(i) = mean( D_KL(post_j || post_i) ) for j in (i, min(n, i+21)] — average divergence of following posts from the current post. Higher values indicate less lasting influence
3. Resonance(i) = Novelty(i) − Transience(i) — positive values indicate new topics that persist in subsequent discourse; negative values indicate topics that fade quickly
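The three definitions can be sketched with a plain loop over an all-pairs KL matrix D (the production code is vectorized; this version favors readability):

```python
import numpy as np

def nrt_metrics(D: np.ndarray, window: int = 20):
    """Novelty, Transience, Resonance from an all-pairs KL matrix D,
    where D[i, j] = KL(post_i || post_j) and posts are in temporal order.

    Posts without a full past or future window use whatever neighbors
    exist; posts with none get NaN.
    """
    n = D.shape[0]
    novelty = np.full(n, np.nan)
    transience = np.full(n, np.nan)
    for i in range(n):
        past = range(max(0, i - window), i)
        future = range(i + 1, min(n, i + window + 1))
        if past:
            novelty[i] = np.mean([D[i, j] for j in past])    # divergence FROM the past
        if future:
            transience[i] = np.mean([D[j, i] for j in future])  # divergence OF the future from post i
    resonance = novelty - transience
    return novelty, transience, resonance
```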
What NRT Patterns Reveal
- High Novelty + High Resonance — posts introducing new, lasting topics (topic leadership)
- High Novelty + Low Resonance — posts introducing topics that don't catch on (failed innovations)
- Low Novelty + Low Transience — posts reinforcing existing stable topics (discussion maintenance)
- High Transience — topics that appear briefly then disappear (trends or noise)
Cascade Analysis System
1. Tracker-Level Analysis — LDA discovers broad topics across all blogs in the tracker and produces document-topic distributions for every post in a single fit
2. Blog-Level Analysis — reuses tracker-level topics (cascade_reuse_topics=True), slices the pre-computed LDA distribution matrix per blog, and computes NRT dynamics independently per blog's temporal ordering
3. Blogger-Level Analysis — same cascade reuse, groups posts by author, computes per-blogger topic distributions and NRT metrics
4. Parallel Processing — blogs and bloggers are analyzed simultaneously via ThreadPoolExecutor with configurable worker count (default 20 threads)
Models & Libraries
- Configurable LLM (default `gemma3-27b`) — used only for topic naming (single call per analysis); falls back to auto-generated labels from LDA keywords if unavailable
- scikit-learn — term-frequency vectorization (`CountVectorizer`) for LDA input, Latent Dirichlet Allocation (`LatentDirichletAllocation`) for both topic discovery and post-topic weight assignment
- NumPy/SciPy — vectorized KL divergence computation and sparse matrix operations for memory-efficient processing of 100k–200k posts
Academic Foundations
Stine, Z. K., & Agarwal, N.
Barron, A. T. J., Huang, J., Spang, R. L., & DeDeo, S. (2018). Individuals, institutions, and innovation in the debates of the French Revolution. Proceedings of the National Academy of Sciences, 115(18), 4607–4612. https://doi.org/10.1073/pnas.1717729115