Methods
This page summarizes how the data was collected, processed, and analyzed to build the Narrative and Explore views of Morning Pulse. The goal is transparency: you can see how raw transcripts become interpretable visual stories.
Data source
Morning Pulse uses official transcripts of Mexican presidential press conferences (“mañaneras”) and related public messages published on the Presidencia de México website.
Each record in the raw JSON includes:
- Date (in Spanish, e.g. “jueves, 16 de octubre de 2025”)
- Title of the speech or conference
- URL pointing to the original page
- Transcript: a list of paragraphs, each optionally tagged with a speaker name
The raw JSON is flattened and transformed into a tabular format before any analysis is performed.
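As a rough illustration, the flattening step might look like the sketch below. The JSON keys used here (date, title, url, transcript, speaker, text) are assumptions for illustration, not necessarily the project's actual schema.

```python
# Minimal flattening sketch; the JSON keys used here are assumptions,
# not the project's actual schema.
import json
import pandas as pd

def flatten_records(path: str) -> pd.DataFrame:
    """One row per intervention, preserving the original paragraph order."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)

    rows = []
    for rec in records:
        for i, para in enumerate(rec["transcript"]):
            rows.append({
                "date": rec["date"],             # still the raw Spanish date string
                "title": rec["title"],
                "url": rec["url"],
                "order": i,                      # position within the conference
                "speaker": para.get("speaker"),  # may be None for untagged paragraphs
                "text": para["text"],
            })
    return pd.DataFrame(rows)
```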
Pre-processing
Before any modeling, the transcripts go through several cleaning and normalization steps implemented in Python.
- Date parsing: Spanish dates as scraped (e.g., “jueves, 16 de octubre de 2025Fecha de publicación”) are cleaned and converted into standard YYYY-MM-DD values (see the sketch after this list).
- Transcript flattening: Each paragraph in the transcript becomes one intervention, preserving the original order. The result is a table with columns like date, title, url, speaker, and text.
- Speaker normalization: Raw speaker labels (e.g. “PRESIDENTA DE MÉXICO, CLAUDIA SHEINBAUM PARDO:”, “SECRETARIO DE…”, “PREGUNTA”) are standardized using regular expressions. The goal is to group them into consistent categories: President, officials, journalists, and others.
- Role grouping: For analysis and visualization, normalized speakers are grouped into broader speaker groups: President/Official vs. Journalist.
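A minimal sketch of the date-parsing and speaker-normalization steps follows; the regular expressions and keyword lists are illustrative, not the exact rules used by Morning Pulse.

```python
# Sketch of date parsing and speaker normalization; the regexes and
# keyword lists here are illustrative, not Morning Pulse's actual rules.
import re

MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4, "mayo": 5, "junio": 6,
    "julio": 7, "agosto": 8, "septiembre": 9, "octubre": 10,
    "noviembre": 11, "diciembre": 12,
}

def parse_spanish_date(raw: str) -> str | None:
    """'jueves, 16 de octubre de 2025Fecha de publicación' -> '2025-10-16'."""
    m = re.search(r"(\d{1,2}) de ([a-záéíóúñ]+) de (\d{4})", raw.lower())
    if not m:
        return None
    day, month_name, year = m.groups()
    month = MONTHS.get(month_name)
    return f"{int(year):04d}-{month:02d}-{int(day):02d}" if month else None

def normalize_speaker(raw: str) -> str:
    """Collapse raw speaker labels into broad, consistent categories."""
    label = raw.upper().strip().rstrip(":")
    if "PRESIDENTA" in label or "PRESIDENTE" in label:
        return "President"
    if label.startswith("PREGUNTA"):
        return "Journalist"
    if any(k in label for k in ("SECRETARIO", "SECRETARIA", "DIRECTOR", "DIRECTORA")):
        return "Official"
    return "Other"
```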
Sentiment analysis
Sentiment is estimated with a Spanish transformer model from the
pysentimiento library. For each intervention, the model outputs probabilities
for three labels: positive (POS), neutral (NEU),
and negative (NEG).
Instead of only using discrete labels, Morning Pulse computes a continuous sentiment score:
sentiment_score = P(POS) − P(NEG) ∈ [−1, 1]
- +1 means very likely positive
- 0 means balanced or neutral
- −1 means very likely negative
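As a rough sketch, and assuming the create_analyzer interface that pysentimiento exposes for Spanish sentiment, the score can be computed like this:

```python
# Sketch of the continuous sentiment score; assumes pysentimiento's
# create_analyzer API with Spanish sentiment probabilities.
from pysentimiento import create_analyzer

analyzer = create_analyzer(task="sentiment", lang="es")

def sentiment_score(text: str) -> float:
    """P(POS) - P(NEG), a value in [-1, 1]."""
    probas = analyzer.predict(text).probas   # keys: "POS", "NEU", "NEG"
    return probas["POS"] - probas["NEG"]
```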
To reduce noise, the score is smoothed with a rolling average over several consecutive interventions, computed separately for each speaker group (President/Official vs. Journalist). This smoothed score is what you see in the Sentiment over time chart in the Explore tab, and what drives the background pulse line in the Narrative view.
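A minimal sketch of this smoothing step with pandas; the column names and the window size are illustrative assumptions, not the project's exact settings.

```python
# Rolling-average sketch, computed separately per speaker group.
# 'speaker_group', 'order', 'sentiment_score' and window=5 are assumptions.
import pandas as pd

def smooth_sentiment(df: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    df = df.sort_values("order").copy()
    df["sentiment_smooth"] = (
        df.groupby("speaker_group")["sentiment_score"]
          .transform(lambda s: s.rolling(window, min_periods=1).mean())
    )
    return df
```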
Topics & states
Morning Pulse uses two complementary approaches to understand what each conference is about.
- Dictionary-based topics: Keywords in Spanish are grouped into broad themes like education, migration, poverty, health, security, and environment. The share of words matching each dictionary is computed daily and then aggregated weekly.
- BERTopic clustering: To avoid relying only on pre-defined words, interventions are also clustered using the BERTopic algorithm. It combines sentence embeddings, dimensionality reduction, and clustering to discover recurring semantic topics directly from the text. (Both topic approaches are sketched below.)
For each conference, the Explore tab displays its most frequent BERTopic topics using simplified human-readable labels.
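The two approaches can be sketched as follows; the keyword dictionaries are illustrative, and the BERTopic configuration shown is just the library defaults rather than the settings actually used here.

```python
# Sketch of both topic approaches: dictionary shares and BERTopic clustering.
# The keyword sets are illustrative, not the project's actual dictionaries.
import re
from bertopic import BERTopic

TOPIC_KEYWORDS = {
    "education": {"educación", "escuela", "maestros", "universidad"},
    "migration": {"migración", "migrantes", "frontera"},
    "health": {"salud", "hospital", "vacunas"},
}

def dictionary_topic_shares(text: str) -> dict[str, float]:
    """Share of words in `text` matching each keyword dictionary."""
    words = re.findall(r"\w+", text.lower())
    total = max(len(words), 1)
    return {
        topic: sum(w in keywords for w in words) / total
        for topic, keywords in TOPIC_KEYWORDS.items()
    }

def fit_bertopic(texts: list[str]) -> BERTopic:
    """Discover recurring topics directly from the interventions' text."""
    topic_model = BERTopic(language="multilingual")
    topic_model.fit_transform(texts)
    return topic_model   # topic_model.get_topic_info() summarizes the clusters
```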
In addition, the model tracks mentions of Mexican states, using normalized matching for state names (e.g. “Veracruz”, “San Luis Potosí”, “Chiapas”). The Top 5 States mentioned chart shows which places are most visible in a given conference.
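A sketch of the state-mention matching, using accent-insensitive comparison against a (deliberately truncated) list of state names:

```python
# State-mention sketch: accent-insensitive matching against state names.
# Only a few states are listed here for brevity.
import re
import unicodedata

STATES = ["Veracruz", "San Luis Potosí", "Chiapas", "Oaxaca"]

def strip_accents(s: str) -> str:
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def count_state_mentions(text: str) -> dict[str, int]:
    normalized = strip_accents(text).lower()
    return {
        state: len(re.findall(
            rf"\b{re.escape(strip_accents(state).lower())}\b", normalized))
        for state in STATES
    }
```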
Behavioral metrics
Beyond text content, Morning Pulse also summarizes how the conversation behaves: who speaks, for how long, and how the dialogue flows.
- Conference length: The total number of words spoken in a conference, aggregating all interventions.
- Turn-taking structure: The number of interventions from the President vs. journalists, both overall and after the first journalist question, to focus on the interactive portion of the session.
- Words by role: Total words spoken per day for each speaker group (President/Official vs. Journalist), used to understand how much space each side occupies.
- Timing within the conference: Interventions are indexed in order, which makes it possible to plot sentiment over the course of the conference, highlighting shifts in tone across the interaction.
All these metrics are computed in Python and then stored in a compact JSON file, explore_data.json, which feeds the Explore tab.
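A simplified sketch of how these metrics could be assembled and exported; the column names and output structure follow the hypothetical schema used in the earlier sketches, not the exact contents of explore_data.json.

```python
# Behavioral-metrics sketch; column names follow the hypothetical schema
# used in the earlier sketches, and the output format is illustrative.
import json
import pandas as pd

def conference_metrics(df: pd.DataFrame) -> dict:
    df = df.sort_values("order").copy()
    df["n_words"] = df["text"].str.split().str.len()

    # Interactive portion: everything from the first journalist turn onward.
    is_journalist = (df["speaker_group"] == "Journalist").to_numpy()
    interactive = df.iloc[is_journalist.argmax():] if is_journalist.any() else df.iloc[0:0]

    return {
        "total_words": int(df["n_words"].sum()),
        "words_by_role": {k: int(v) for k, v in
                          df.groupby("speaker_group")["n_words"].sum().items()},
        "turns_by_role": {k: int(v) for k, v in
                          df["speaker_group"].value_counts().items()},
        "turns_after_first_question": {k: int(v) for k, v in
                                       interactive["speaker_group"].value_counts().items()},
    }

# all_metrics = {str(d): conference_metrics(g) for d, g in interventions.groupby("date")}
# with open("explore_data.json", "w", encoding="utf-8") as f:
#     json.dump(all_metrics, f, ensure_ascii=False)
```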
Narrative & visualization design
The site combines two complementary interfaces built with web technologies:
- Narrative tab: A scroll-driven story based on a single conference. It uses D3 and Scrollama to synchronize text steps with a “pulse line” and speaker silhouettes. As the user scrolls, the visualization highlights key moments, changes in tone, and shifts between speakers.
- Explore tab: An interactive dashboard where users can select any conference and inspect smoothed sentiment over time, states mentioned, and most frequent topics. The page reads from explore_data.json, which centralizes the pre-computed metrics.
The visual design emphasizes a minimal layout, a limited color palette, and clear labels over heavy ornamentation, so that users can focus on the conversation and its structure.
Key definitions
- Intervention: one paragraph or turn in the transcript, spoken by a single speaker.
- Speaker group: a broad category that groups individual speakers into President/Official or Journalist, based on normalized labels.
- Sentiment label: the discrete output of the sentiment model: Positive, Neutral, or Negative.
- Sentiment score: a continuous value between −1 and 1, computed as P(POS) − P(NEG), representing how positive or negative an intervention is.
- Rolling average: a smoothed series where each point is the average of several neighboring interventions, used to reduce noise in the sentiment line.
- Topic: a cluster of semantically similar interventions discovered using BERTopic, or a dictionary-based theme defined by sets of keywords.
- Turn-taking ratio: the ratio of the number of interventions from the President to those from journalists, especially after the first journalist question, to capture the interactive part of the session.
- State mention: a reference in the transcript to a Mexican state, identified through normalized string matching.