TikTok’s Impact on Migrant Journeys: Unmasking Misinformation and Its Challenges
The rise of social media platforms like TikTok has profoundly impacted information dissemination, including crucial details about migration journeys. Documented, a non-profit newsroom dedicated to serving immigrant communities in New York City, conducted an extensive investigation into the spread of misinformation on TikTok regarding US immigration, focusing on the experiences of migrants arriving in New York. This investigation unveiled the complexities of tackling misinformation and the innovative approaches Documented developed to address this growing challenge.
One of the primary hurdles in investigating online misinformation, particularly across multiple languages, is the ephemeral nature of the content. Those spreading false information frequently delete accounts, necessitating swift archiving of relevant content. Further complicating matters is the prevalence of audiovisual content on platforms like TikTok, making it difficult to conduct traditional text-based analysis. The sheer volume of videos uploaded daily presents another significant obstacle, rendering manual review impractical and requiring sophisticated methods for analysis.
Documented took a multi-faceted approach, collaborating with community correspondents and developing technical tools to manage the project. To identify key accounts, the team spoke directly with migrants, learned how they use TikTok, and consulted experts to pinpoint recurring issues such as predatory scams. Once target accounts were identified, a custom-built Python scraper extracted video URLs from downloaded HTML pages of the TikTok profiles. These URLs were then fed into yt-dlp, an open-source command-line downloader, to save the videos and their associated metadata for preservation and subsequent analysis.
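The link-extraction and download steps can be sketched roughly as follows. This is a minimal illustration, not Documented's actual scraper: the URL pattern, the function names, and the output folder layout are assumptions. The yt-dlp options shown (`--batch-file`, `--write-info-json`, `--output`) are standard ones, and the command is only built here, not executed.

```python
import re
from pathlib import Path

# Assumed shape of a TikTok video link: /@<username>/video/<numeric id>
VIDEO_URL = re.compile(r"https://www\.tiktok\.com/@[\w.\-]+/video/\d+")


def extract_video_urls(html: str) -> list[str]:
    """Return unique video URLs in the order they first appear in the page."""
    seen: dict[str, None] = {}
    for url in VIDEO_URL.findall(html):
        seen.setdefault(url)  # dict keys keep insertion order and drop repeats
    return list(seen)


def build_download_command(url_file: Path, out_dir: Path) -> list[str]:
    """Build a yt-dlp invocation that saves each video plus a metadata file."""
    return [
        "yt-dlp",
        "--batch-file", str(url_file),  # read one URL per line from this file
        "--write-info-json",            # write metadata as JSON beside each video
        "--output", str(out_dir / "%(uploader)s" / "%(id)s.%(ext)s"),
    ]
```

In practice the extracted URLs would be written to a text file, one per line, and the returned command passed to `subprocess.run`. Archiving promptly matters here because accounts may be deleted before a second pass is possible.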
Processing the large collection of downloaded videos required automated transcription. Documented used Whisper, an open-source speech recognition model developed by OpenAI, to convert the audio into text. Whisper’s accuracy varied across languages: Vietnamese transcriptions proved unusable, while Spanish transcriptions were adequate. Even so, the transcripts gave researchers a preliminary understanding of the video content, enabling them to identify key themes and problematic videos for further scrutiny. Aware of the limitations and potential biases of machine learning models, Documented treated the imperfect transcripts as a starting point for deeper analysis rather than a finished record.
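A transcription step of this kind might look like the sketch below, which assumes the `openai-whisper` Python package and ffmpeg are installed. The model size, the default language, and the review threshold are illustrative assumptions, not settings reported by Documented. The second function shows one possible way to flag weak transcripts for human review, using the per-segment `avg_logprob` value that Whisper returns.

```python
from statistics import mean


def transcribe(path: str, language: str = "es", model_name: str = "small") -> dict:
    """Transcribe one media file with Whisper and return its result dict."""
    import whisper  # imported lazily so the helper below works without it

    model = whisper.load_model(model_name)
    return model.transcribe(path, language=language)


def needs_manual_review(result: dict, min_avg_logprob: float = -1.0) -> bool:
    """Flag a transcript whose segments have a low average log-probability.

    The -1.0 threshold is an illustrative assumption, not a tuned value.
    """
    segments = result.get("segments", [])
    if not segments:
        return True  # nothing was transcribed, so a person should look
    return mean(s["avg_logprob"] for s in segments) < min_avg_logprob
```

A low score does not prove a transcript is wrong, and a high one does not prove it is right, so a check like this can only prioritize human review, not replace it.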
To refine the analysis and manage the extensive dataset, Documented used natural language processing (NLP) and topic modeling. NLP, a field that applies computational methods (including machine learning) to human language, turned the transcripts into analyzable data, revealing word frequencies and patterns. Building on that, topic modeling, an unsupervised machine learning technique, grouped statistically related words, surfacing potential themes and connections within the video transcripts. This process highlighted prominent topics such as religious references, immigration procedures, and the CBP One app, allowing researchers to focus their attention on relevant subsets of videos.
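The word-frequency part of this step can be illustrated with the standard library alone. The sketch below is an assumption-laden example, not Documented's code: the stopword list is deliberately tiny and the function names are hypothetical. It covers only counting; a topic model such as latent Dirichlet allocation would take counts like these as input, and the article does not say which implementation was used.

```python
import re
from collections import Counter

# A tiny illustrative Spanish stopword list; a real project needs a fuller one.
STOPWORDS = {"el", "la", "los", "las", "de", "que", "y", "en", "a",
             "para", "un", "una", "es", "se"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # letters only, accents kept
    return [w for w in words if w not in STOPWORDS]


def word_frequencies(transcripts: list[str]) -> Counter:
    """Count how often each token occurs across all transcripts."""
    counts: Counter = Counter()
    for text in transcripts:
        counts.update(tokenize(text))
    return counts


def document_frequencies(transcripts: list[str]) -> Counter:
    """Count how many transcripts contain each token at least once."""
    counts: Counter = Counter()
    for text in transcripts:
        counts.update(set(tokenize(text)))
    return counts
```

Calling `word_frequencies(transcripts).most_common(20)` gives a quick view of dominant vocabulary, while document frequencies show whether a term is widespread or concentrated in a few videos.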
Beyond automated analysis, Documented also recognized the importance of qualitative assessment. Detailed descriptions of highly viewed videos provided valuable insights into popular narratives and misinformation trends. Additionally, to ensure a representative overview of the vast content pool, a random sample of 1,000 videos was meticulously reviewed. This combination of macro-level analysis, driven by machine learning, and micro-level examination of individual videos provided a comprehensive understanding of the misinformation landscape on TikTok related to migrant experiences.
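Drawing a review sample like this is straightforward to make reproducible. The following is a minimal sketch, assuming each archived video has a unique string ID; the function name and the fixed seed are illustrative choices, not details from Documented's pipeline.

```python
import random


def sample_for_review(video_ids: list[str], k: int = 1000, seed: int = 0) -> list[str]:
    """Draw a reproducible simple random sample of up to k video IDs."""
    rng = random.Random(seed)      # fixed seed so the same sample can be redrawn
    k = min(k, len(video_ids))     # avoid ValueError when the pool is small
    # Sorting first makes the result independent of the input order.
    return rng.sample(sorted(video_ids), k)
```

Fixing the seed and sorting the IDs means a colleague can regenerate the identical sample later, which helps when reviewers need to audit or extend the manual coding.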
Documented’s approach, which combines technical tools, community engagement, and established research methods, offers a model for investigating misinformation within online communities. The Python-based pipeline developed for this project covers video link extraction, downloading, transcription, and topic modeling, and the full code is available on GitHub as an open-source resource for researchers and journalists facing similar challenges in other contexts and languages. The project shows how existing technologies can be adapted to counter misinformation and its potential impact on vulnerable populations. It also raises ethical considerations about using AI tools: their accuracy and potential biases need careful evaluation, particularly across diverse linguistic and cultural contexts. Finally, it underscores the continuing need for transparency and collaboration within the journalistic community, so that shared knowledge and best practices keep pace with online misinformation.