Methods
This page summarizes how the data was collected, processed, and analyzed to build the Narrative and Explore views of Morning Pulse. The goal is transparency: you can see how raw transcripts become interpretable visual stories.
Data source
Morning Pulse uses official transcripts of Mexican presidential press conferences (“mañaneras”) and related public messages published on the Presidencia de México website.
Each record in the raw JSON includes:
- Date (in Spanish, e.g. “jueves, 16 de octubre de 2025”)
- Title of the speech or conference
- URL pointing to the original page
- Transcript: a list of paragraphs, each optionally tagged with a speaker name
The raw JSON is flattened and transformed into a tabular format before any analysis is performed.
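As a rough illustration, the flattening step might look like the sketch below. The JSON keys used here (date, title, url, transcript, speaker, text) are assumptions for illustration, not necessarily the project's actual schema.

```python
# Minimal flattening sketch; the JSON keys used here are assumptions,
# not the project's actual schema.
import json
import pandas as pd

def flatten_records(path: str) -> pd.DataFrame:
    """One row per intervention, preserving the original paragraph order."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)

    rows = []
    for rec in records:
        for i, para in enumerate(rec["transcript"]):
            rows.append({
                "date": rec["date"],             # still the raw Spanish date string
                "title": rec["title"],
                "url": rec["url"],
                "order": i,                      # position within the conference
                "speaker": para.get("speaker"),  # may be None for untagged paragraphs
                "text": para["text"],
            })
    return pd.DataFrame(rows)
```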
Pre-processing
Before any modeling, the transcripts go through several cleaning and normalization steps implemented in Python.
- Date parsing: Spanish dates as scraped (e.g., “jueves, 16 de octubre de 2025Fecha de publicación”) are cleaned and converted into standard YYYY-MM-DD values (see the sketch after this list).
- Transcript flattening: Each paragraph in the transcript becomes one intervention, preserving the original order. The result is a table with columns like date, title, url, speaker, and text.
- Speaker normalization: Raw speaker labels (e.g. “PRESIDENTA DE MÉXICO, CLAUDIA SHEINBAUM PARDO:”, “SECRETARIO DE…”, “PREGUNTA”) are standardized using regular expressions. The goal is to group them into consistent categories: President, officials, journalists, and others.
- Role grouping: For analysis and visualization, normalized speakers are grouped into broader speaker groups: President/Official vs. Journalist.
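A minimal sketch of the date-parsing and speaker-normalization steps follows; the regular expressions and keyword lists are illustrative, not the exact rules used by Morning Pulse.

```python
# Sketch of date parsing and speaker normalization; the regexes and
# keyword lists here are illustrative, not Morning Pulse's actual rules.
import re

MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4, "mayo": 5, "junio": 6,
    "julio": 7, "agosto": 8, "septiembre": 9, "octubre": 10,
    "noviembre": 11, "diciembre": 12,
}

def parse_spanish_date(raw: str) -> str | None:
    """'jueves, 16 de octubre de 2025Fecha de publicación' -> '2025-10-16'."""
    m = re.search(r"(\d{1,2}) de ([a-záéíóúñ]+) de (\d{4})", raw.lower())
    if not m:
        return None
    day, month_name, year = m.groups()
    month = MONTHS.get(month_name)
    return f"{int(year):04d}-{month:02d}-{int(day):02d}" if month else None

def normalize_speaker(raw: str) -> str:
    """Collapse raw speaker labels into broad, consistent categories."""
    label = raw.upper().strip().rstrip(":")
    if "PRESIDENTA" in label or "PRESIDENTE" in label:
        return "President"
    if label.startswith("PREGUNTA"):
        return "Journalist"
    if any(k in label for k in ("SECRETARIO", "SECRETARIA", "DIRECTOR", "DIRECTORA")):
        return "Official"
    return "Other"
```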
Sentiment analysis
Sentiment is estimated with a Spanish transformer model from the
pysentimiento library. For each intervention, the model outputs probabilities
for three labels: positive (POS), neutral (NEU),
and negative (NEG).
Instead of only using discrete labels, Morning Pulse computes a continuous sentiment score:
sentiment_score = P(POS) − P(NEG) ∈ [−1, 1]
- +1 means very likely positive
- 0 means balanced or neutral
- −1 means very likely negative
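As a rough sketch, and assuming the create_analyzer interface that pysentimiento exposes for Spanish sentiment, the score can be computed like this:

```python
# Sketch of the continuous sentiment score; assumes pysentimiento's
# create_analyzer API with Spanish sentiment probabilities.
from pysentimiento import create_analyzer

analyzer = create_analyzer(task="sentiment", lang="es")

def sentiment_score(text: str) -> float:
    """P(POS) - P(NEG), a value in [-1, 1]."""
    probas = analyzer.predict(text).probas   # keys: "POS", "NEU", "NEG"
    return probas["POS"] - probas["NEG"]
```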
To reduce noise, the score is smoothed with a rolling average over several consecutive interventions, computed separately for each speaker group (President/Official vs. Journalist). This smoothed score is what you see in the Sentiment over time chart in the Explore tab, and what drives the background pulse line in the Narrative view.
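A minimal sketch of this smoothing step with pandas; the column names and the window size are illustrative assumptions, not the project's exact settings.

```python
# Rolling-average sketch, computed separately per speaker group.
# 'speaker_group', 'order', 'sentiment_score' and window=5 are assumptions.
import pandas as pd

def smooth_sentiment(df: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    df = df.sort_values("order").copy()
    df["sentiment_smooth"] = (
        df.groupby("speaker_group")["sentiment_score"]
          .transform(lambda s: s.rolling(window, min_periods=1).mean())
    )
    return df
```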
Topics & states
Morning Pulse uses two complementary approaches to understand what each conference is about.
- Dictionary-based topics: Keywords in Spanish are grouped into broad themes like education, migration, poverty, health, security, and environment. The share of words matching each dictionary is computed daily and then aggregated weekly.
- BERTopic clustering: To avoid relying only on pre-defined words, interventions are also clustered using the BERTopic algorithm. It combines sentence embeddings, dimensionality reduction, and clustering to discover recurring semantic topics directly from the text. (Both topic approaches are sketched below.)
For each conference, the Explore tab displays its most frequent BERTopic topics using simplified human-readable labels.
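The two approaches can be sketched as follows; the keyword dictionaries are illustrative, and the BERTopic configuration shown is just the library defaults rather than the settings actually used here.

```python
# Sketch of both topic approaches: dictionary shares and BERTopic clustering.
# The keyword sets are illustrative, not the project's actual dictionaries.
import re
from bertopic import BERTopic

TOPIC_KEYWORDS = {
    "education": {"educación", "escuela", "maestros", "universidad"},
    "migration": {"migración", "migrantes", "frontera"},
    "health": {"salud", "hospital", "vacunas"},
}

def dictionary_topic_shares(text: str) -> dict[str, float]:
    """Share of words in `text` matching each keyword dictionary."""
    words = re.findall(r"\w+", text.lower())
    total = max(len(words), 1)
    return {
        topic: sum(w in keywords for w in words) / total
        for topic, keywords in TOPIC_KEYWORDS.items()
    }

def fit_bertopic(texts: list[str]) -> BERTopic:
    """Discover recurring topics directly from the interventions' text."""
    topic_model = BERTopic(language="multilingual")
    topic_model.fit_transform(texts)
    return topic_model   # topic_model.get_topic_info() summarizes the clusters
```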
In addition, the model tracks mentions of Mexican states, using normalized matching for state names (e.g. “Veracruz”, “San Luis Potosí”, “Chiapas”). The Top 5 States mentioned chart shows which places are most visible in a given conference.
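A sketch of the state-mention matching, using accent-insensitive comparison against a (deliberately truncated) list of state names:

```python
# State-mention sketch: accent-insensitive matching against state names.
# Only a few states are listed here for brevity.
import re
import unicodedata

STATES = ["Veracruz", "San Luis Potosí", "Chiapas", "Oaxaca"]

def strip_accents(s: str) -> str:
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def count_state_mentions(text: str) -> dict[str, int]:
    normalized = strip_accents(text).lower()
    return {
        state: len(re.findall(
            rf"\b{re.escape(strip_accents(state).lower())}\b", normalized))
        for state in STATES
    }
```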
Behavioral metrics
Beyond text content, Morning Pulse also summarizes how the conversation behaves: who speaks, for how long, and how the dialogue flows.
- Conference length: The total number of words spoken in a conference, aggregating all interventions.
- Turn-taking structure: The number of interventions from the President vs. journalists, both overall and after the first journalist question, to focus on the interactive portion of the session.
- Words by role: Total words spoken per day for each speaker group (President/Official vs. Journalist), used to understand how much space each side occupies.
- Timing within the conference: Interventions are indexed in order, which makes it possible to plot sentiment over the course of the conference, highlighting shifts in tone across the interaction.
All these metrics are computed in Python and then stored in a compact JSON file, explore_data.json, which feeds the Explore tab.
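A simplified sketch of how these metrics could be assembled and exported; the column names and output structure follow the hypothetical schema used in the earlier sketches, not the exact contents of explore_data.json.

```python
# Behavioral-metrics sketch; column names follow the hypothetical schema
# used in the earlier sketches, and the output format is illustrative.
import json
import pandas as pd

def conference_metrics(df: pd.DataFrame) -> dict:
    df = df.sort_values("order").copy()
    df["n_words"] = df["text"].str.split().str.len()

    # Interactive portion: everything from the first journalist turn onward.
    is_journalist = (df["speaker_group"] == "Journalist").to_numpy()
    interactive = df.iloc[is_journalist.argmax():] if is_journalist.any() else df.iloc[0:0]

    return {
        "total_words": int(df["n_words"].sum()),
        "words_by_role": {k: int(v) for k, v in
                          df.groupby("speaker_group")["n_words"].sum().items()},
        "turns_by_role": {k: int(v) for k, v in
                          df["speaker_group"].value_counts().items()},
        "turns_after_first_question": {k: int(v) for k, v in
                                       interactive["speaker_group"].value_counts().items()},
    }

# all_metrics = {str(d): conference_metrics(g) for d, g in interventions.groupby("date")}
# with open("explore_data.json", "w", encoding="utf-8") as f:
#     json.dump(all_metrics, f, ensure_ascii=False)
```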
Narrative & visualization design
The site combines two complementary interfaces built with web technologies:
- Narrative tab: A scroll-driven story based on a single conference. It uses D3 and Scrollama to synchronize text steps with a “pulse line” and speaker silhouettes. As the user scrolls, the visualization highlights key moments, changes in tone, and shifts between speakers.
- Explore tab: An interactive dashboard where users can select any conference and inspect smoothed sentiment over time, states mentioned, and most frequent topics. The page reads from explore_data.json, which centralizes the pre-computed metrics.
The visual design emphasizes a minimal layout, a limited color palette, and clear labels over heavy ornamentation, so that users can focus on the conversation and its structure.
Key definitions
- Intervention: one paragraph or turn in the transcript, spoken by a single speaker.
- Speaker group: a broad category that groups individual speakers into President/Official or Journalist, based on normalized labels.
- Sentiment label: the discrete output of the sentiment model: Positive, Neutral, or Negative.
- Sentiment score: a continuous value between −1 and 1, computed as P(POS) − P(NEG), representing how positive or negative an intervention is.
- Rolling average: a smoothed series where each point is the average of several neighboring interventions, used to reduce noise in the sentiment line.
- Topic: a cluster of semantically similar interventions discovered using BERTopic, or a dictionary-based theme defined by sets of keywords.
- Turn-taking ratio: the ratio of the number of interventions from the President to those from journalists, especially after the first journalist question, to capture the interactive part of the session.
- State mention: a reference in the transcript to a Mexican state, identified through normalized string matching.