Multimodal Political Ad Tone Detection using Audio and Text

Faculty Sponsors: Meiqing Zhang, Furkan Cakmak, Erika Franklin Fowler

Iris Chen

Iris is a rising junior ('26) from Shenzhen, China, double majoring in Psychology and Computer Science, and is also pursuing an Applied Data Science Certificate. She is passionate about applying interdisciplinary knowledge to explore and solve complex problems, blending insights from psychology with data science to foster innovative solutions and deeper understanding. Outside of academics, Iris enjoys traveling, running, listening to music, and watching TV shows.

Abstract: This research conducts multimodal political ad tone analysis on Meta platforms during the 2022 US election. The tone of campaign advertising reveals valuable information about the persuasive strategies of political candidates, including the extent of negative campaigning. The primary purpose of this research is to compare audio, text, and multimodal approaches to political ad tone detection. Because we are interested in the power of audio features for tone analysis, we focused on video ads from the 2022 election and obtained, via the Wesleyan Media Project, a dataset of 2,394 audio files extracted from video political ads placed on Meta platforms, along with their corresponding transcriptions and ad tone labels. The labels have three classes based on whether an ad focuses on promoting the candidate, contrasting the candidate with their opponent, or attacking the opponent, indicating increasing levels of negativity. Three supervised machine learning models were developed for ad tone detection: one using only audio features, one using only text features, and a late-fusion model combining audio and textual inputs. The audio features extracted are Mel-Frequency Cepstral Coefficients (MFCCs). Text features are represented by Term Frequency-Inverse Document Frequency (TF-IDF) vectors. A Convolutional Neural Network (CNN) was trained for the audio-based model, and a Multinomial Naive Bayes (MNB) classifier was used for the text-based model. The multimodal model performed a late fusion of both models, incorporating information from both text and audio features. The experiments show that the audio-based model has the best performance, followed by the multimodal model. This study illustrates the utility of audio features, especially MFCCs, in tone and sentiment analysis.
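
To make the pipeline concrete, the sketch below shows one plausible way to assemble the three approaches described in the abstract: MFCC extraction feeding a small audio CNN, a TF-IDF + Multinomial Naive Bayes text classifier, and a late fusion of the two models' class probabilities. The library choices (librosa, scikit-learn, TensorFlow/Keras), the hyperparameters, and the probability-averaging fusion rule are illustrative assumptions, not details taken from the study.

```python
# Minimal sketch of the audio, text, and late-fusion models.
# Libraries, architecture, and the averaging fusion rule are assumptions
# for illustration; the study's exact settings are not given in the abstract.

import numpy as np
import librosa
import tensorflow as tf
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

N_MFCC = 13       # number of MFCC coefficients (assumed)
N_CLASSES = 3     # promote / contrast / attack

def audio_features(path, n_frames=200):
    """Load one audio file and return a fixed-size matrix of MFCC frames."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)   # (N_MFCC, frames)
    mfcc = mfcc[:, :n_frames]                                 # truncate long clips
    if mfcc.shape[1] < n_frames:                              # zero-pad short clips
        mfcc = np.pad(mfcc, ((0, 0), (0, n_frames - mfcc.shape[1])))
    return mfcc.T                                             # (n_frames, N_MFCC)

def build_audio_cnn(n_frames=200):
    """1-D CNN over MFCC frames; the architecture is a placeholder."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_frames, N_MFCC)),
        tf.keras.layers.Conv1D(32, 5, activation="relu"),
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Conv1D(64, 5, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
    ])

# Text model: TF-IDF vectors fed to a Multinomial Naive Bayes classifier.
text_model = make_pipeline(TfidfVectorizer(), MultinomialNB())

def late_fusion(p_audio, p_text, w_audio=0.5):
    """Combine class-probability outputs by weighted averaging (assumed rule)."""
    return w_audio * p_audio + (1 - w_audio) * p_text
```

In a complete pipeline, each model would be trained on its own modality (the CNN on MFCC matrices, the MNB classifier on transcriptions), and late_fusion would then be applied to their predicted class probabilities at inference time to produce the multimodal tone prediction.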

[Poster: Iris_Chen_Poster-1]