Saturday, September 6, 2025

I Tested How Well AI Tools Work for Journalism

Some tools were sufficient for summarizing meetings. For research, the results were a disaster.

August 19, 2025

By Hilke Schellmann, associate professor of journalism at New York University

Although most newsrooms have AI policies in place, and most permit reporters to use AI within certain guardrails, there is very little guidance on which tools can be used for which purpose. Most journalists simply tinker with the tools, but more evidence is needed: which tools are good at which task? Hilke Schellmann set out to test different LLMs and research AI tools for journalists.

Journalists now have access to an abundance of AI tools on the market that promise to assist with tasks such as transcription, note-taking, summarization, research, and data analysis. Are these tools trustworthy enough for use in the newsroom? There is not yet a clear answer to that question. While most news organizations have AI policies, the guidelines are typically abstract and broad, and do not address a journalist’s daily workflow. In the absence of precise standards, which should be developed as a community, journalists have largely been left to figure things out for themselves.

Many reporters have defaulted to what Cynthia Tu, a data reporter and AI specialist at the nonprofit newsroom Sahan Journal, calls “vibe checks,” or playing around with tools to get a feel for whether they are useful or not. Jeremy Merrill, a journalist at the Washington Post, used to spot-check AI tools to see which ones might work best for his data projects. But he realized his spot-check method was inadequate. “Vibes are not enough,” he said. “You’re not taking a good enough look at your real data. Is it 60 percent accurate? Seventy? Ninety-five? You just don’t know.”

Florent Daudens, a press lead at Hugging Face, a platform for open-source AI tools, agrees that “vibe checks” of competing tools are not editorially rigorous. “You’re really only evaluating a stylistic preference,” Daudens said. “Do you prefer the style in which chatbot A answers rather than chatbot B? But you will not be able to evaluate if the summarization of this news article is better with model A than model B in terms of accuracy.”

Journalists need more rigorous model assessments. So I developed exactly this kind of test with a team of academics, journalists, and research assistants from NYU Journalism, the Sloane Lab at the University of Virginia, and MuckRock. As a starting point, we decided to look at two categories of AI tools that feel immediately useful for core journalistic work: chatbots for summarizing meetings, and AI models for scientific research. Our research was conducted with support from the Patrick J. McGovern Foundation.

Testing AI Tools for Summarization

To report on the inner workings of governance and business, journalists spend a lot of their time reading reports and sifting through transcripts of long meetings. An AI tool that could summarize what happened and note what the most relevant speakers said would be a massive time-saver. To test how well large language models (LLMs) perform at this, my team and I asked four tools to summarize meeting transcripts and minutes from local government meetings in Clayton County, Georgia; Cleveland; and Long Beach, New York. The four chatbots we compared were ChatGPT-4o ($200/month), Claude Opus 4 ($100/month), Perplexity Pro ($20/month), and Gemini 2.5 Pro (free trial).
(This month, OpenAI released the newer model ChatGPT-5, but past models, including ChatGPT-4o, are still available.)

We based our evaluation on the journalistic values of accuracy and truth. We checked that all facts were correctly recounted and noted any hallucinations, for which LLMs are notorious. Another vexing problem with AI tools is that they often generate slightly different responses to the same prompt, so we ran our queries through every tool five times and compared the outcomes. We also checked what personal or confidential data each company collects on the user through the software and how that information might be used. This is a critical question for reporters dealing with sensitive information and anonymous sources.

To test the tools’ performance, we asked each to generate three short summaries (about two hundred words) and three long summaries (about five hundred words) for each city council meeting. The first short summary was prompted with simple and straightforward language: “Give me a short summary of this document.” The second was prompted with more detail: “Write a summary of this document in 200 words in plain language that provides the key details of the meeting.” The third was prompted with six questions, including: “What was the purpose of the meeting? Who spoke? What did they speak about/cover? What items were approved or denied?” We did this to assess how prompt engineering would affect the results. We then repeated the process and asked for long, detailed summaries instead of short ones.

We ran all six prompts through all the AI tools five times and compared the results with our own human-generated summaries. We judged each result on how clear and concise it was, how much information from the original document was retained, how factually accurate it was, whether there were any hallucinations, how consistent the output was, and how easy the tool was to use.

Overall, we found that for the short summaries, every model except Gemini 2.5 Pro outperformed the human-written ones. The machine-generated short summaries included more facts and almost no hallucinations. Prompt A, “Give me a short summary of this document,” elicited the highest accuracy ratings overall compared with the more detailed prompts, which could be due to the specificity of the demand as well as the word limits we set.

For the long summaries, though, the results were surprisingly poor. Only about half the facts included in the human-generated long summaries were found in the AI-generated ones. The AI-generated long summaries also had more hallucinations than the short summaries. Notably, the human-generated summaries took three to four hours each to complete, while the AI tools produced each summary in about a minute.

Ultimately, ChatGPT-4o delivered the most reliable and accurate summaries for local government transcripts, making it the top-performing tool for journalists among the four tested. The facts it hallucinated or got wrong were consistently below 1 percent. Both ChatGPT-4o and Claude Opus 4 performed well at keeping facts accurate and consistent across different tests. ChatGPT-4o and Perplexity Pro were rated highest in user experience and were the most intuitive to use. However, all the AI tools underperformed against the human benchmark in generating accurate long summaries.
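A check like this does not require much code. Below is a minimal sketch of how the repeated-prompt step could be scripted for a newsroom’s own transcripts, assuming the openai Python package and an API key in the environment; the model name, file names, and CSV output are placeholders, not the exact setup used in the study, and judging the outputs against a human-written summary still has to be done by hand.

```python
"""Sketch: run each summary prompt several times over a meeting transcript
and save every output so the runs can be fact-checked against a
human-written summary. Assumes the `openai` package and OPENAI_API_KEY;
file names and the model are placeholders."""
import csv
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The three short-summary prompts described in the article.
PROMPTS = {
    "A_simple": "Give me a short summary of this document.",
    "B_detailed": (
        "Write a summary of this document in 200 words in plain language "
        "that provides the key details of the meeting."
    ),
    "C_questions": (
        "What was the purpose of the meeting? Who spoke? What did they "
        "speak about/cover? What items were approved or denied?"
    ),
}

RUNS_PER_PROMPT = 5  # LLMs often answer the same prompt differently each time

transcript = Path("council_meeting_transcript.txt").read_text()  # placeholder file

with open("summary_runs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt_id", "run", "summary"])
    for prompt_id, prompt in PROMPTS.items():
        for run in range(1, RUNS_PER_PROMPT + 1):
            response = client.chat.completions.create(
                model="gpt-4o",  # placeholder; swap in the model under test
                messages=[{"role": "user", "content": f"{prompt}\n\n{transcript}"}],
            )
            # Save every run; consistency across runs is part of the evaluation.
            writer.writerow([prompt_id, run, response.choices[0].message.content])
```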
(I reached out to OpenAI, Anthropic, Perplexity, and Google for comment on the performance of their AI products. Only Perplexity responded, stating: “Perplexity’s core technology is accurate, trustworthy AI. We don’t claim to be 100% accurate, but we do claim to be the only company that’s relentlessly focused on it every day.”)

For now, we recommend that journalists stick to using these tools to generate short summaries. Longer summaries of around five hundred words might help a reporter understand the gist of what went on in a three-hour meeting, but journalists should be aware that the summaries may lack important facts. We recommend generating long summaries for background research only, perhaps in cases when a reporter does not have time to read the meeting transcript in full. AI-generated long summaries should not be used for publication. In general, we recommend having humans write any summary longer than a couple hundred words, and always verifying the facts.

Testing AI Tools for Research

Our second test looked at software of potential use to science journalists. Science reporters, including me, often get press releases and pitches concerning “groundbreaking” new studies. Are the findings truly newsworthy? What does the rest of the field think about this work? Who might disagree? Getting the necessary context requires extensive reading, deep sourcing in a narrow field of research, and dives into Google Scholar. When AI tools popped up on the market purporting to automate the finding of related papers (what scientists often call literature reviews) and to highlight the most important papers in scientific disciplines, my team and I were intrigued.

One AI company, Consensus, suggests on its website that its tool can be used for a journalistic task: “Write a blog on evidence-based tips to avoid injuries while exercising.” Another company, ResearchRabbit, advertises its software as an “AI Tool for Smarter, Faster Literature Reviews.” Semantic Scholar says that its tool can find relevant research in more than 214 million papers. On Elicit’s website, author Torben Riise states: “I use Elicit almost every day for researching medical issues. It gets better and better. It’s simply the best tool to stay well informed.”

We evaluated five AI research tools (Elicit, Semantic Scholar, ResearchRabbit, Inciteful, and Consensus) by asking them to generate literature reviews. Literature reviews are helpful to journalists because they aim to comprehensively explain scholarly work on a specific topic by putting that research in the larger context of a scientific field. We compared the AI-generated reviews against human-authored literature reviews found in four award-winning academic papers from the diverse scientific domains of social sciences, computer science, chemistry, and medicine. The AI research tools we tested ranged from free to $120 per year.

The literature reviews taken from the four academic studies had between thirty-one and seventy-nine citations each. Those served as our benchmark. We gave each tool the four academic papers and asked it to generate a list of related papers. We then compared the output with the papers’ actual citations.

The results were underwhelming and in some cases alarming. None of the tools produced literature reviews with significant overlap with the benchmark papers, except for one test with Semantic Scholar, which matched about 50 percent of the citations. Across all four tests, most tools identified less than 6 percent of the papers cited in the human-authored reviews, and often 0 percent.
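The overlap measurement itself is straightforward once both lists are in hand. Here is a rough sketch of the comparison, assuming each tool’s suggestions and a paper’s own reference list have been exported as plain-text files with one title per line; the file names and the crude title normalization are illustrative, not the exact matching method used in the study.

```python
"""Sketch: how many papers an AI research tool surfaces that also appear
in a paper's human-written reference list. File names are placeholders."""
import re
from pathlib import Path


def normalize(title: str) -> str:
    """Lowercase and strip punctuation so near-identical titles match."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()


def load_titles(path: str) -> set[str]:
    """Read one title per line and return the normalized set."""
    return {normalize(line) for line in Path(path).read_text().splitlines() if line.strip()}


benchmark = load_titles("benchmark_citations.txt")  # the paper's own references
suggested = load_titles("tool_suggestions.txt")     # what the AI tool returned

overlap = benchmark & suggested
recall = len(overlap) / len(benchmark) if benchmark else 0.0

print(f"Tool suggested {len(suggested)} papers; "
      f"{len(overlap)} of {len(benchmark)} benchmark citations matched "
      f"({recall:.0%} overlap).")
```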
The tools also disagreed wildly with one another. They didn’t just miss the citations in the human-authored literature reviews; they also missed each other’s. Some of the AI tools generated hundreds of seemingly related papers, only a few of which overlapped with the papers the other AI tools had pulled. In some cases there was no overlap at all. We didn’t pick up on any discernible patterns. It seemed the tools could not even agree among themselves on a scientific consensus.

We also noticed that most of the tools generated inconsistent results when we ran the test again a few days later. We had expected the same results, since scientific consensus does not usually change overnight, but many of the tools returned the same results in a different order, along with dozens of new papers. This inconsistency raises concerns about how these tools define relevance or importance in a scientific field.

A poorly sourced list of related papers isn’t just incomplete; it’s misleading. If a journalist relies on these tools to understand the context surrounding new research, they risk misunderstanding and misrepresenting scientific breakthroughs, omitting published critiques, and overlooking prior work that challenges the findings.

Our research tool experiment had limits. We didn’t test every tool on the market, and we ran only four evaluations with academic studies across the five tools. But across those, the performance was too inconsistent and the stakes too high to recommend using these tools as journalistic shortcuts.

I reached out to all five research tool providers for comment. Four didn’t get back to me. Eric Olson, CEO of Consensus, stated, “Our goal with Consensus is to help researchers and students do literature reviews faster and we would not claim nor expect to be outperforming any work done by scientists in award-winning papers.”

Should Journalists Use AI Tools at All?

It’s important to note that we wouldn’t have arrived at our conclusions by simply playing around with these tools. We were surprised to see, for example, that while the LLM chatbots were able to produce fast and reliable short summaries of meeting transcripts, longer summaries of those same transcripts included only about 50 percent of the relevant facts. The companies that make these tools market them as versatile and all-purpose. But our research showed that they excel at some tasks and perform unreliably at others.

We think using LLMs to produce short summaries may be immensely helpful for background research, though we still recommend a final fact-check by humans. As for the AI research tools for scientific literature that are currently on the market: they may save time, but right now they lack the depth and consistency journalists need. For now, they are more hype than help. We will be watching to see whether the next wave of tools can do better.

Additional research by Sophia Juco, Sandy Berrocal, Nneka Chile, Julia Kieserman, Jiayue Fan, Emilia Ruzicka, Mona Sloane, and Michael Morisy.