NTTVblog
Your vision will become clear only when you look into your heart.... Who looks outside, dreams. Who looks inside, awakens. Carl Jung
Saturday, June 13, 2026
Software Engineering, Data Science, Measuring Factual Quality in the Age of AI
https://www.normaltech.ai/p/why-ai-hasnt-replaced-software-engineers
Why AI hasn’t replaced software engineers, and won’t
Coding agents as normal technology
Arvind Narayanan and Sayash Kapoor
Jun 10, 2026
There is great anxiety and uncertainty about AI replacing jobs. How can we move past vague warnings and bombastic predictions and bring data to bear on this question? One good way is to look at the profession where AI capabilities are furthest along and adoption has been exceptionally rapid: software engineering.
In this essay, we argue that there is enough evidence to reject the narrative that once AI capabilities reach a certain threshold, it will cause mass layoffs. Given that this is true even in a sector with very few regulatory barriers, most other professions are likely to be even more cushioned.
We also have a good understanding of why this is the case. We can think of many kinds of knowledge work, including software development, as a “decide-execute-deliver sandwich”. AI compresses the “execute” layer — the middle of the sandwich — but the other two layers resist automation in a way that will not be overcome by capability improvements alone.
We conclude on a note of cautious optimism about the future trajectory of demand for software engineering. This essay is the first in a series, and the next one will look at reasons why individual software engineers’ careers might be rocky even if overall demand is healthy. The series is based on the published literature in economics and software engineering, our own evaluations and observations of AI agents, and many software engineers’ reflections on the present and future of AI impacts on their profession, gleaned both from published writings and our interactions with the community.
The stories of AI-driven mass layoffs in software seem to be classic “AI washing”
Consider three stories that made the headlines and how they contrasted with reality:
• In February, fintech company Block (maker of Cash App, Square, Afterpay, and other such apps) announced layoffs of 4,000 employees because, according to founder Jack Dorsey, AI is “enabling a new way of working” with “smaller and flatter teams”, specifically citing late-2025 improvements in model capabilities.
But subsequent reporting revealed a radically different picture. After growing headcount more than threefold during the pandemic, the company was under massive financial pressure. A data scientist on the Cash App team, Naoko Takeda, posted that Block “shoved AI down everyone’s throats” yet she saw “very limited gains in productivity.” She refused a 75% retention raise and quit. Other employees interviewed had a sharply different understanding of what AI was capable of at Block and whether Dorsey had a competent understanding of the issues.
As Aaron Levie has pointed out, CEOs are uniquely prone to delusions about AI’s usefulness because they can build quick prototypes but can’t see the 90% of work it takes to turn one into a finished product. Dorsey’s public statements about AI seem to fit exactly this pattern.
• In April, Snap laid off about 1,000 people, with CEO Evan Spiegel primarily citing AI as the reason in his layoff memo. He also said that AI generated 65% of new code. In reality, the layoffs followed a campaign by an activist investor demanding cost cuts. (Snap has posted a net loss every full year since its 2017 IPO and shares were down over 30% in 2026.) Tellingly, the nature of the cuts, such as 150 jobs spanning various roles in the augmented reality division, doesn’t match what we would expect to see if they were driven by AI (i.e., programming and other “AI-exposed” jobs across the board, not concentrated in any unit).
• In May, Intuit announced 3,000 cuts, alongside deals with Anthropic and OpenAI. The press connected the two, framing the layoffs as AI-driven restructuring. For once, the CEO actually pushed back on this easy narrative, saying that “none of it had to do with AI” and that the cuts targeted “coordination-heavy roles” and too many management layers.
We did not cherry-pick these examples. In every story about AI-driven software engineering layoffs that we examined, the same narrative violation emerged. It turns out that “AI washing” of job cuts is an economy-wide phenomenon, evidenced by many surveys:
• 59% of U.S. hiring managers admitted they emphasize AI when explaining hiring freezes or layoffs because it plays better with stakeholders than citing financial constraints.
• Forrester principal analyst J. P. Gownder says of companies preparing supposedly AI-driven layoffs: “When we ask if they have a mature, vetted AI app ready to fill in those jobs, nine out of 10 times, the answer is no—and they haven’t even started.”
• In an HBR survey of over 1,000 global executives, 21% had made large headcount reductions “in anticipation of” AI, with another 39% having made low or moderate anticipatory headcount reductions. In contrast, only 2% had already made large reductions in headcount related to actual AI implementation. The 10x gap suggests that executives, like everyone else, are highly prone to succumbing to the misleading narratives about AI replacing jobs.
Another interesting data point comes from the WARN Act, which requires certain disclosures of plant closings and mass layoffs affecting over 100 workers. In March 2025, New York became the first U.S. state to add an AI disclosure checkbox to WARN Act filings. In the full first year, more than 160 companies filed WARN notices. Not a single one checked the AI box.1 We reached out to the NY Department of Labor who confirmed that as of late May, only one company, Nespresso, checked the box.2 If these filings are accurate, only 46 out of about 25,000 laid off workers in New York State in the relevant period, or about two-tenths of a percent, were affected by AI.
Even more damning for the AI-driven-mass-layoffs narrative: layoffs are the wrong signal of AI’s potential productivity benefits in the first place! The research is clear that the effect operates through “slower hiring rather than increased separations”. Firing existing workers results in the loss of precisely the tacit knowledge and organizational capital that allows workers to operate AI effectively. Besides, it is expensive in terms of severance, damage to morale, and rehiring risk. It is also largely unnecessary, given that natural turnover achieves the same result in a few years.
So what does the data tell us when we look beyond layoffs to overall employment trends? An important paper from Federal Reserve economists compiles the evidence in the U.S. context. Software engineer employment is still growing, but they find that it is growing more slowly post-ChatGPT than in a no-AI counterfactual, by about 3 percentage points per year. One important limitation of this study is that the methodology can’t capture self-employment, so it is possible that some of the slowdown in growth is being absorbed by entrepreneurship instead. We do have evidence from other studies that AI makes entrepreneurship easier. So the real picture is probably even healthier than the Federal Reserve study suggests.3
Finally, it is worth acknowledging two kinds of indirectly-AI-driven job losses in software engineering that are real, but different from AI replacing software engineers. First, AI sometimes decimates demand for the product, in cases like Chegg (homework help) or Stack Overflow (technical help), both of which have laid off workers. AI doesn’t directly do the job that these workers did, but rather obviates the need for it. The historical parallel is strong: Among the 270 jobs in the 1950 U.S. census, only one job was automated away — elevator operator. But many others were rendered obsolete by new technology, like the job of telegraph operator.
Another credible AI-driven layoffs story is among companies that sell AI, rather than buy it. So when companies like IBM or SAP announce layoffs because of AI, a more accurate framing is “we reallocated headcount from legacy functions to our fastest-growing product line.” That’s ordinary corporate restructuring around a revenue opportunity, not technology displacing workers.
Why coding agents haven’t led to labor displacement: the decide-execute-deliver sandwich
Many tech leaders, like the Snap CEO above, report the percentage of code written by AI alongside reports of layoffs or predictions of future job losses. This feeds into the simplistic mental model that once AI writes all the code, there is no need for coders. Fortunately, this mental model is wrong. This AI-written-code metric is almost completely disconnected from what matters for labor displacement. Here’s why.
Writing code isn’t, and never was, the bottleneck. For example, a 2019 paper summarized existing studies with the conclusion that “developers spend surprisingly little time with coding, 9% to 61% depending on the study”. This finding was consistent with the paper’s own data from 6,000 developers at Microsoft. As coding agents began to be taken up, there was an explosion of blog posts in late 2025 pointing out that writing code isn’t the bottleneck, as developers realized that using agents to write most of the code led to little impact on overall productivity [1, 2, 3, 4, 5, 6, 7, 8].
If writing code isn’t the bottleneck, what is? The task-breakdown surveys point at things like meetings or debugging. This just leads to more questions: what are developers doing in those meetings and why can’t it be done by AI? Won’t debugging get automated as capabilities improve? To understand the real bottlenecks, we have to get qualitative, and dig into software engineers’ own understanding of what it is they do that resists automation.
When we did this analysis, it revealed three things as the real bottlenecks: (1) deciding and specifying what to build, (2) verifying and being accountable for what is delivered, and (3) the deep human understanding (of the codebase, the business, and the environment) required to carry out both of these.
In other words, software engineers’ work consists of a “decide-execute-deliver” sandwich (with understanding being a prerequisite for all three). AI has compressed the middle of the sandwich, but has left the two ends largely unchanged. As long as software development teams are in charge of decision making and accountable for what they deliver, engineers still need to spend time building up a deep understanding of the system. These are the three bottlenecks.
Figure: Software development consists of three layers: (1) Decision making — problem framing, specification, planning (2) execution — design and implementation (3) delivery — testing, verification, integration, maintenance, etc. Note that these are conceptual layers, not temporal phases. It is common to switch back and forth in the course of a project.
Evidence for the sandwich model of AI’s productivity effects comes from a recent paper on “Writing Code vs. Shipping Code”. Across 100,000 developers on GitHub, the researchers found that AI agents led to an eight-fold increase in the number of lines of code written, consistent with the idea that AI almost completely compresses the Execute layer of the sandwich. But this led to only 30% more releases, strongly suggesting that human bottlenecks (the Decide and Deliver layers) remain in place.4
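The gap between eight times more code and 30% more releases can be made concrete with a back-of-the-envelope, Amdahl's-law-style calculation. This is a rough sketch rather than the paper's method. It assumes, loosely, that the Execute layer is a fixed share of total effort, that the eight-fold increase in lines written can be read as an eight-fold speedup of that layer, and that releases track overall throughput. The function names are hypothetical.

```python
def overall_speedup(execute_fraction: float, execute_speedup: float) -> float:
    """Overall speedup when only the Execute share of the work gets faster.

    Assumption: total effort = (Decide + Deliver share) + Execute share,
    and only the Execute share is divided by `execute_speedup`.
    """
    if not 0.0 <= execute_fraction <= 1.0:
        raise ValueError("execute_fraction must be between 0 and 1")
    if execute_speedup <= 0:
        raise ValueError("execute_speedup must be positive")
    remaining = (1.0 - execute_fraction) + execute_fraction / execute_speedup
    return 1.0 / remaining


def implied_execute_fraction(overall: float, execute_speedup: float) -> float:
    """Invert the formula: which Execute share would explain an observed overall speedup?"""
    if overall < 1.0 or execute_speedup <= 1.0:
        raise ValueError("expects overall >= 1 and execute_speedup > 1")
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / execute_speedup)


if __name__ == "__main__":
    # Coding-time shares quoted earlier in the essay: 9% to 61% depending on the study.
    for share in (0.09, 0.61):
        print(f"execute share {share:.0%}: overall speedup "
              f"{overall_speedup(share, 8.0):.2f}x at 8x faster execution")

    # Share of effort that would reconcile 8x more code with 1.3x more releases.
    print(f"implied execute share: {implied_execute_fraction(1.3, 8.0):.1%}")
```

Under these assumptions, even an infinitely fast Execute layer cannot push overall throughput beyond 1 / (1 - execute share), which is why the thickness of the other two layers matters more than the speed of the middle one.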
Can the sandwich be further compressed? We don’t think so. At one end of the pipeline, development teams need to decide what to build. One of the most important lessons junior software engineers learn is that requirements specification (the profession’s lingo for this layer) takes surprisingly long, and if it is compressed, it leads to much more pain down the line. This layer is hard to automate because it requires thinking about user needs, market signals, organizational priorities, and in some cases regulatory constraints.
As AI capabilities improve, the kinds of decisions that can be delegated to AI increase over time. But this does not make the “decide” layer thinner — once a decision can be delegated to AI, it is no longer a source of competitive advantage, and the value of human decision-making migrates upward. Software increases in complexity over time, so there is no ceiling to this process.
At the other end of the sandwich, human teams need to be accountable for what they deliver. It is possible that some day in the future teams will ship mission-critical code without fully testing and understanding it, but today’s AI is so unreliable that such haphazard practices would represent an existential threat to software teams and their customers.
Even if the technical barriers go away in the future, we don’t have to cede control to AI. A central insight of AI as Normal Technology is that we can collectively choose to keep humans accountable through shared norms, law, and policy. This is a much more resilient way to control the speed of AI impacts and improve safety than trying to slow the development of technical capabilities. These speed barriers are already largely in place due to liability laws and sector-specific regulation, but can be further strengthened. (For a longer version of this argument, see the original essay.)
In this vision, as more and more of the execution layer gets delegated to AI, the software engineer’s role in the future becomes analogous to that of a crane operator. AI agents will do most of the cognitive heavy lifting; supervising the agent and keeping it in control becomes most of the human’s job.
Some commentators argue that a future with humans staying in control is unlikely because it is too costly to pay people to do so. There have already been a few viral stories of poorly-supervised coding agents deleting production databases or causing other types of damage. But we view these as “man bites dog” stories rather than an emerging norm. They go viral precisely because they represent such irresponsible and unusual behavior that they have shock value, and serve as regular reminders and learning moments helping the community guard itself against over-reliance on AI. As the aphorism goes, “if it’s in the news, don’t worry about it”. Still, being able to detect whether there is an uptick in poorly-supervised use of AI for high-stakes tasks — across the economy, not just in software engineering — remains one of the most critical data gaps we have today.
By the way, the sandwich getting squished is not a new trend, and it is not uniquely due to AI. Over two decades ago, the Bureau of Labor Statistics started tracking programming separately from software engineering. Roughly speaking, programmers are responsible only for execution while software engineers manage a bigger part of the sandwich. Not only has programming been shrinking, it also pays much less because it is seen as grunt work. AI merely accelerates this long-existing trend, further devaluing purely technical skills.
Software engineering versus programmer employment. Chart by The Washington Post.
This pattern, where humans remain heavily involved at both ends of the decide-execute-deliver sandwich even as AI increasingly automates the middle layer, seems to be broadly applicable to most knowledge work, though it is farthest along in software. After all, complex decision making and accountability are common to most fields. A lack of recognition of this phenomenon has led to many overconfident predictions about imminent job losses, such as among radiologists.
Vibe coding is not agentic engineering
One reason for confusion about the extent to which software engineering is changing is the sloppy use of the term “vibe coding” to refer to a wide spectrum of practices, the ends of which are conceptually distinct and more dissimilar than similar.
In true vibe coding the user simply tells the agent what to do, doesn’t supervise it when it’s running, doesn’t review the code — might not even have the skills to do so — and doesn’t evaluate the output, beyond perhaps noticing when things are visibly broken.
This is in contrast to how most software engineers are actually using agents — as a tool, with the human remaining in control and accountable for the output. Fortunately, the term agentic engineering is gaining currency as a descriptor of this practice.
As agentic engineering has become the norm, engineers are discovering that supervising coding agents is surprisingly time consuming. For example, Simon Willison, a prominent developer and chronicler of the AI transition, has noted how he is mentally exhausted by 11am from supervising agents. This is consistent with our experience as well.
More quantitative evidence comes from SWE-chat, a dataset of coding agent interactions from open-source developers who opted into a logging tool. The study found that only 44% of agent-produced code survives into user commits, that vibe-coded commits introduce vulnerabilities at nine times the human-only rate, and that the most common user intent is understanding existing code, not generating new code (19% vs 13%). The self-selected nature of the dataset means that we can’t draw strong conclusions based on this study alone, but it does reinforce many other lines of evidence that vibe-coding and agentic engineering patterns are quite different.
Table: Agentic engineering is not vibe coding.
To reiterate, these are not two distinct categories. They are two ends of a spectrum, and there is a blurry middle. Not every project is either a throwaway or mission-critical. Not every workflow fits precisely in the left column or the right column of the table. But the key implication for the jobs question remains solid: companies can’t ship production software by hiring unqualified vibe coders instead of software engineers.
What does the future hold?
AI boosters might claim that mass layoffs are coming; they just haven’t happened yet because human-level software engineering abilities are very recent (or haven’t been achieved yet). But if the sandwich model is correct, these predictions won’t come true. AI has already largely compressed the middle of the sandwich (and the compression actually started decades ago). So even making the execution layer instant and perfect will only be a small change from the status quo. The reason the other two layers have resisted AI is not capability limitations.
In fact, not only are software engineering jobs not going away due to AI, there might even be an increase in demand for software engineers. When software (or anything else) gets cheaper to create due to technological productivity improvements, people will buy a lot more software (in econ jargon, software is highly “price elastic”). And as we have argued, AI doesn’t replace software engineers (the “elasticity of substitution” is low), so the demand for more software results in a derived demand for more software engineers. A loosely related but flashier economics term, “Jevons’ paradox”, is often thrown around in the AI discourse to describe this concept.
Historically, this has been the pattern — programmer employment in the U.S. has grown from near-zero around 1950 to millions today. This is sharply different from occupations such as agriculture in which labor demand was famously decimated due to mechanization and automation. The difference is that the amount of calories people consume is relatively fixed — even a 25% increase led to the obesity epidemic — whereas the amount of software produced has grown a millionfold. Modern cars have something like a hundred million lines of code running on their various on-board computers.
If there is a ceiling to the demand for code, we are nowhere near it. Virtually all cognitive work benefits from software. As AI makes coding cheaper, people are creating all kinds of one-off utilities — whether for work or personal use — that it never made sense to create until now.
To be clear, while we think there will be a lot more software in the future, and likely more software engineers, this doesn’t mean big tech companies will get even bigger. The majority of software engineers today already work in-house in non-software firms, and that share might grow in the future. Then there’s the idea of “AI rollups”, which refers to venture capital or private equity firms buying “Main Street” businesses (dentistry practices, accounting firms, and whatnot) and rebuilding them from the ground up to be “AI-native” by embedding software engineers or AI engineers into those businesses. Of course, it might end up being nothing more than hype. It’s too early to tell.
Some people predict that demand for software engineering skills will fall because of democratization. They acknowledge that there will be more software produced than ever before, and also that more human time will be spent producing software than ever before, but that this work will be done by people who are not software engineers. The idea is that AI will democratize software engineering to the extent that legal software, for instance, can be more easily created by those with training in law than in software engineering.
Maybe. But we’ll bet against it. In our view, this falls into the same trap of conflating vibe coding with agentic engineering, and the execution layer with the whole decide-execute-deliver sandwich. In fact, when we look at the history of programming, there have always been claims that we are at the threshold of democratization: old languages such as FORTRAN, COBOL, and SQL were all accompanied by such prominent hopes at the time of their introduction. It never happened. The barrier isn’t actually learning the syntax. It’s having enough skilled judgment to make good decisions while maintaining accountability.
Ultimately the distinction may be semantic. It seems clear that the amount of time people spend on getting computers to do new things will increase over time. This might take the form of building software, or managing complex workflows using agents, or something else. It will require a mix of software skills, AI skills, and domain expertise. Whether it is today’s software engineers who will best adapt to fill these new roles remains to be seen.
That last point about the need for adaptation sets up the next essay in this series. The fact that aggregate labor demand in software is likely to remain strong doesn’t mean that most individual workers won’t be affected. We will argue that AI will create massive structural shifts in how software is produced, which will have big impacts on which software engineers stand to gain or lose, based on the types of firms they work in, their geography, their seniority, and the pace at which they can adapt.
Further reading
Deena Mousa points out the superficiality of broad, economy-wide analyses of AI impacts based on metrics like “AI exposure”, and instead calls for “careful, occupation-specific work”. We hope that this series of essays will play a role in establishing a nuanced understanding of AI’s transformation of software engineering. We’ve previously coauthored, with Justin Curl, a paper analyzing AI in legal services that seriously engages with regulatory and other bottlenecks that make that occupation unique. We plan to do more occupation-specific deep dives in the future.
In a remarkable essay called No Silver Bullet 40 years ago, Fred Brooks distinguished between the “essential complexity” and “accidental complexity” of software. He argued that some of the complexity of software is accidental, arising from limitations of present technology such as the clunkiness of programming languages, and can be alleviated over time as tooling improves. But some of it is essential, because specifying the correct behavior of software is itself hard. He presents a forceful articulation of why the “decide” layer of the sandwich is thick and resists automation. Interestingly, hopes of boosting programmer productivity through AI were already prominent back then! Brooks argues that because AI or any other technology only reduces accidental complexity, it won’t result in an order-of-magnitude productivity improvement. (Brooks is the author of The Mythical Man Month, an essay collection that is almost certainly the best known and most influential writing on software engineering of all time. No Silver Bullet later became part of the collection.)
We are grateful to Felix Chen for feedback on a draft.
1
The checkbox is actually labeled “technological innovation or automation”. If checked, there is a second menu to disclose the specific technology, such as AI or robotics.
The current WARN Act data have various limitations — it is New York only, and it is possible that companies are under-reporting AI as a reason for layoffs because of ambiguity or asymmetric risks from checking versus not checking the box (though we have no specific reason to think this). Stronger transparency requirements are in the works at both the federal and state levels; closing this data gap is urgent.
2
We are grateful to our colleague Mihir Kshirsagar for connecting us to the New York State Department of Labor and Elena Grovenger from the department for a prompt response.
3
The paper uses the term coder, but it defines the term based on skills rather than roles, resulting in a broad sweep of jobs that is much broader than “coding”. Measurements based on industry, title, and skills cannot be easily compared to one another.
4
Interestingly, in a sub-study looking at mobile apps, the paper found that the usage of the resulting apps did not go up at all. This gets at one important difference between consumer and enterprise software. The former competes for a relatively fixed pool of attention; more apps published doesn’t mean more hours of app usage. But in enterprise software there is a lot of room for growth, as previously human processes can be software-mediated or automated.
https://blog.citp.princeton.edu/2026/06/11/ai-is-already-giving-medical-conclusions-are-they-any-good/
AI Is Already Giving Medical Conclusions. Are They Any Good?
June 11, 2026
by Center for Information Technology Policy
Artificial Intelligence, Data Science & Society
Authored by: Hayoung Jung
Recently, I was talking with some family members from South Korea who mentioned their back pain. My immediate question: “What did the doctor say?” Healthcare is highly accessible and affordable in South Korea, so I assumed they had already seen one.
Nope. They asked ChatGPT.
In all honesty, this was not truly surprising given how useful these models are. But the moment captures a growing social phenomenon happening everywhere. AI systems are becoming the first stop for health and scientific questions, even in countries where professional care is available and accessible.
And people are not just asking these systems to retrieve webpages or list sources, as they might in traditional search engines. Agentic systems, such as Google AI Overview, OpenEvidence, and OpenAI Deep Research, synthesize information from multiple sources and present immediate conclusions to users’ questions in real time. Increasingly, users are directly asking, What is my diagnosis? What are the best treatment options? What should I do next?
Reports suggest this is happening across audiences. Laypeople ask AI systems about symptoms, treatments, and scientific claims, while more than 80% of U.S. physicians use them in their professional workflows, including to explore medical questions and support decision-making. If AI systems are becoming the first (or even the only) stop for health and scientific questions, are they reliable at synthesizing scientific evidence into conclusions that people may actually act on?
A Benchmark for Scientific Synthesis
To answer this, I worked with my amazing PhD advisors Manoel Horta Ribeiro and Aleksandra Korolova (who also have their own Substacks) to create a benchmark for evaluating how well current AI agents synthesize scientific conclusions from the open web.
Scientific conclusion synthesis requires several steps. An agent must retrieve relevant evidence from the open web, filter out irrelevant or low-quality sources, reason across multiple studies, weigh conflicting findings, preserve uncertainty, and synthesize a long-form conclusion. Importantly, these kinds of tasks are long-horizon and open-ended, as expert scientists often spend months searching the literature on the open web, evaluating studies, and synthesizing careful conclusions about what the evidence in the field actually supports.
To evaluate this, we built SciConBench, a large-scale benchmark of 9.11K scientific questions paired with expert-written conclusions from Cochrane systematic reviews, a gold standard in evidence-based medicine. Each SciConBench task asks an AI agent to use web tools to answer a scientific question with a paragraph-length conclusion, which we compare against the corresponding expert-written Cochrane conclusion. Importantly, SciConBench is a live benchmark: it is continuously updated as new Cochrane reviews are published, enabling timely evaluations and reducing benchmark leakage as new models are trained on recent web data.
Overview of SciConBench. We evaluate whether AI agents can use tools to synthesize scientific conclusions from the open web, without simply retrieving the expert-written answer online. We compare AI-generated conclusions against expert-written Cochrane conclusions by measuring how factually accurate and complete they are. Even under this controlled setup, frontier AI agents struggle to synthesize reliable scientific conclusions.
The Leakage Problem
While running SciConBench, we noticed something surprising in our agent logs: AI agents were explicitly looking for the benchmark answers in Cochrane review articles, even when the system prompt instructed them not to. Anthropic recently released a neat blog post on this phenomenon, called “evaluation awareness,” in which models recognize that they are being evaluated and explicitly look for answers online.
As models become increasingly capable, a major challenge in evaluating web-enabled agents is that they can often find the answer directly. If a benchmark question comes from a published systematic review, an agent with web access may simply retrieve the review itself, or another webpage that covers its conclusion (e.g., news coverage). At that point, the task is no longer about synthesizing the scientific evidence from scratch, but rather merely retrieving the ground-truth answer (a much easier task!). The model may look impressive, but we would not be measuring the capability we actually care about.
To address this, we built SciConHarness, a clean-room evaluation harness. This evaluation harness enforces the clean-room protocol, ensuring agents have controlled access to web search, browsing, and paper search tools, while filtering out ground-truth artifacts such as Cochrane pages and review articles that could leak the answer. This lets us evaluate whether the agent can synthesize the conclusion from the open-web evidence, rather than shortcutting to the already-written expert answer.
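To make the idea of filtering ground-truth artifacts concrete, here is a minimal sketch of what such a filter could look like. It is not SciConHarness itself, whose implementation is not described here. The blocked domains, the phrase list, the result format (dictionaries with `url` and `snippet` keys), and the function names are all assumptions for illustration.

```python
from urllib.parse import urlparse

# Assumed block list: domains that host the expert-written reference reviews.
BLOCKED_DOMAINS = ("cochranelibrary.com", "cochrane.org")

# Assumed phrases that suggest a page restates a review's conclusion.
BLOCKED_PHRASES = ("cochrane review", "cochrane database of systematic reviews")


def is_ground_truth_artifact(url: str, snippet: str = "") -> bool:
    """Return True if a search result looks like it could leak the reference answer."""
    host = (urlparse(url).hostname or "").lower()
    # Match the domain itself and any subdomain (e.g. www.cochranelibrary.com).
    if any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS):
        return True
    text = snippet.lower()
    return any(phrase in text for phrase in BLOCKED_PHRASES)


def filter_results(results: list[dict]) -> list[dict]:
    """Drop suspected ground-truth artifacts before the agent sees the results."""
    return [
        r for r in results
        if not is_ground_truth_artifact(r["url"], r.get("snippet", ""))
    ]


if __name__ == "__main__":
    sample = [
        {"url": "https://www.cochranelibrary.com/cdsr/example", "snippet": "Authors' conclusions"},
        {"url": "https://news.example.com/story", "snippet": "A new Cochrane review finds that..."},
        {"url": "https://journal.example.org/trial", "snippet": "A randomized controlled trial of..."},
    ]
    for result in filter_results(sample):
        print(result["url"])
```

A simple keyword filter like this would both miss paraphrased coverage and over-block legitimate sources that merely mention Cochrane, so a real harness would presumably need something more careful.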
Measuring factual quality
In our study, we work with doctors to validate every component of our benchmark creation and evaluation pipeline. After an AI agent synthesizes a conclusion from the open web, we evaluate it using our expert-validated factual evaluation pipeline. Instead of judging the whole paragraph at once, we decompose both the AI-generated conclusion and the expert-written reference conclusion into a series of facts, i.e., statements containing a single piece of information. Then, we measure two things:
• Factual precision (correctness): Are the facts in the AI-generated conclusion supported by the reference, or do they contradict it?
• Factual recall (coverage): Does the AI-generated conclusion cover the key facts from the reference conclusion needed to answer the question?
We use these two metrics because a scientific conclusion can fail in different ways. A conclusion may contain incorrect claims – for example, by overstating weak evidence or flipping the direction of a treatment effect. Alternatively, it may be mostly true but incomplete, omitting key facts or caveats that matter for decision-making. To capture both correctness and completeness, we also report Factual F1, the harmonic mean of factual precision and factual recall. In other words, a system can only score highly on F1 if it performs well on both dimensions: it must avoid making unsupported or contradictory claims, while also covering the key facts needed to answer the question. All metrics range from 0 to 1, with higher being better.
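As a rough illustration of how the three scores relate, the sketch below computes them from per-fact labels. It assumes a judge has already labeled each generated fact as "supported", "contradicted", or "unsupported", and marked each reference fact as covered or not. The function name, the label strings, and the choice to count only "supported" facts toward precision are assumptions for this example, not the paper's exact definitions.

```python
def factual_scores(generated_labels: list[str],
                   reference_covered: list[bool]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from per-fact judgments.

    generated_labels: one label per fact in the AI-generated conclusion.
    reference_covered: one flag per fact in the reference conclusion,
        True if the generated conclusion covers that fact.
    """
    # Precision: share of generated facts that the reference supports.
    precision = (generated_labels.count("supported") / len(generated_labels)
                 if generated_labels else 0.0)
    # Recall: share of reference facts that the generated conclusion covers.
    recall = (sum(reference_covered) / len(reference_covered)
              if reference_covered else 0.0)
    # F1: harmonic mean, defined as 0 when both components are 0.
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Three of four generated facts are supported; two of four reference facts are covered.
p, r, f1 = factual_scores(
    ["supported", "supported", "supported", "contradicted"],
    [True, True, False, False],
)
```

Here precision is 0.75 and recall is 0.5, so F1 is 2 × 0.75 × 0.5 / 1.25 = 0.6. The harmonic mean sits closer to the weaker of the two scores, which is why a conclusion cannot score well by being correct but incomplete, or complete but wrong.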
So how do these AI agents perform?
Our benchmark results. Each metric ranges from 0 to 1, with higher being better. We test frontier models and deep research (DR) agents using SciConHarness; the best factual F1 score under the clean-room protocol was 0.337. As the δ_Clean F1 values show, models and deep research agents consistently perform worse when the clean-room protocol is applied.
Let’s look at the benchmark results above. Across frontier models and deep research agents, synthesizing scientific conclusions remains far from solved. Under clean-room evaluation, which better isolates true synthesis capability, the best-performing agent (OpenAI’s o3-deep-research) achieved a factual F1 of only 0.337. In other words, even the strongest systems struggled to produce conclusions that were both correct and comprehensive with respect to the expert-written Cochrane reviews.
We also found that clean-room evaluation consistently reduced performance. When agents had unrestricted web access (i.e., no clean-room filtering), they performed better. However, when we filtered out ground-truth leakage with our clean-room protocol, their scores consistently dropped. This suggests that some apparent performance in open-web evaluations comes from retrieving benchmark artifacts, not from genuinely synthesizing conclusions from evidence.
This leakage issue is important beyond our benchmark. If we evaluate AI agents in environments where they can shortcut and find the answer directly, we may overestimate their real capabilities, especially for high-stakes tasks in health and science.
The deployed agents were also unreliable.
We audit consumer-facing agents, like Google AI Overview and OpenEvidence, using our benchmark. Given that these tools are used millions of times in real-world health decision-making, their errors could result in substantial amounts of incorrect advice being given to both clinicians and laypeople.
We also audited consumer-facing agents, including Google AI Overview, Google AI Mode, and OpenEvidence. These agents are already being used by laypeople and clinicians to synthesize health information. OpenEvidence, in particular, is marketed as a “clinical AI copilot for doctors” for “high-stakes decisions” and is used hundreds of millions of times in the medical context.
Looking more closely at the table above, even when these agents had access to the ground-truth review, their conclusions were often incomplete and sometimes contradictory. OpenEvidence performed best among the audited agents, but still covered only about half of the reference facts and produced contradictory claims: in fact, 50.8% of its generated conclusions contained at least one claim that contradicted the Cochrane review.
Google AI Overview and Google AI Mode performed worse, with lower coverage and similarly concerning contradiction rates: 56.3% and 59.0% of their conclusions, respectively, contained at least one contradiction. In many cases, the ground-truth answer was already available online, meaning the models should have been able to identify, retrieve, and prioritize such high-quality sources. This suggests that the failure likely occurred somewhere in the synthesis process, such as evaluating the quality of evidence, integrating high-quality sources, and communicating the evidence correctly.
So what?
Scientific conclusions are compressed decision-making tools. The optimistic view of AI agents is that they will help democratize expertise by synthesizing these scientific conclusions at scale in real time. A clinician could quickly get up to speed on an unfamiliar condition. A patient, including someone like my own family member with back pain, could determine whether a treatment seems promising. A scientist could accelerate literature review and understand the frontiers of science. A policymaker could synthesize scientific conclusions before making a decision. The vision is compelling.
However, our results suggest that current systems are not yet reliable enough to synthesize scientific conclusions, especially in high-stakes settings like health where even a single misleading answer can deeply impact stakeholders.
These agents can generate seemingly competent conclusions that omit key information, include unsupported claims, or contradict expert reviews, creating the risk of patients, clinicians, scientists, and policymakers relying on conclusions that do not faithfully reflect the underlying evidence.
Given that these tools are used hundreds of millions of times in health contexts, even modest error rates could translate into a substantial amount of misleading advice or unsafe answers in practice. Our findings suggest that these systems and their use in clinical settings deserve much greater public scrutiny.
While AI agents provide real utility in health and science, we need to be much more precise about what they can and cannot do. With SciConBench, we hope to push agentic evaluation closer to an important real-world task we expect these systems to perform: synthesizing careful scientific conclusions from the open web.
More broadly, we see this work as part of the measurement infrastructure needed for AI systems in high-stakes domains. If these systems are going to be used in medicine and science, we need stronger evaluations of the tasks people actually delegate to them, along with greater transparency from AI providers, including usage data and post-deployment monitoring. Without that transparency, it is difficult to know how often these errors happen in the real world, who is affected, and when they lead to harm.
For now, our results suggest that we should treat these systems less like expert reviewers and more like fallible assistants: useful in some contexts, but requiring careful expert oversight, independent verification, and much stronger evaluation before they are trusted in high-stakes decisions. AI may one day help democratize expertise. But until then, ask a doctor or a scientist before letting the chatbot make the call.
Interested in reading more? Check out our paper!
Hayoung Jung is a Ph.D. student in computer science at Princeton University, co-advised by Manoel Horta Ribeiro and Aleksandra Korolova. His research broadly focuses on advancing inclusive AI technologies and online platforms to better serve society and communities often overlooked in system development. Drawing on an interdisciplinary background, Hayoung develops technical frameworks and methods grounded in social science theories, with two main goals: auditing AI systems and online platforms, and studying social phenomena such as community norms through language and online behavior. He completed his undergraduate degrees in computer science and political science, and his M.S. in computer science, at the University of Washington.
https://arxiv.org/pdf/2606.11337
Sunday, May 17, 2026
The Shurangama Sutra Explained - Seven Passages in Which the Buddha Asks About the Mind - Lê Sỹ Minh Tùng
Friday, May 15, 2026
Frauds in HealthCare, MediCare and Medicaid
1/ Medicare, Home Care... and Frauds
https://smpresource.org/medicare-fraud/fraud-schemes/home-health-care-fraud/
Medicare Parts A and B cover intermittent or short-term home health services. These services must be provided by a Medicare-approved home health agency that works with your doctor to manage your care. To be eligible for Medicare coverage:
• Your doctor must determine it’s medically necessary for you to receive skilled care services at home. Skilled care services at home could include part-time or “intermittent” nurse and nurse aide visits (personal, hands-on care) and rehabilitation services, which include speech-language pathology, physical and occupational therapy, and medical social services.
• Your condition must be expected to improve in a reasonable amount of time, or you must require skilled therapy to maintain your current condition or to prevent or slow further deterioration.
• You must be considered “homebound.” This means that you are unable to leave your home without assistance, that leaving requires considerable and major effort, or that leaving is considered dangerous due to your current health condition. You may leave home for medical care and some short or infrequent outings (for example, worship services) as long as you meet these conditions.
o Note: Even if you do not qualify for home health services, you may still be eligible to receive outpatient therapy services in a doctor’s office, outpatient hospital setting, rehabilitation agency, Comprehensive Outpatient Rehabilitation Facility (CORF), public health agency, or your home. Outpatient therapy services are covered by Medicare Part B and subject to the 20% copayment.
Report potential home health care fraud, errors, or abuse if:
• You see on your Medicare Summary Notice (MSN) or Explanation of Benefits (EOB) charges for:
o Home health services when you did not meet Medicare’s “homebound” criteria
o Services that were not deemed medically necessary by your doctor
o Home health services like skilled nursing care and/or therapy services that were not provided
• You were:
o Enrolled in home health services by a doctor you do not know
o Offered things such as “free” groceries or a “free” ride from a home health agency in exchange for your Medicare number or to switch to a different home health agency
o Charged a copayment for home health services
o Asked to sign forms verifying that home health services were provided even though you did not receive any services
• Someone came to your home and provided housekeeping or medication services, but you see on your Medicare Summary Notice (MSN) or Explanation of Benefits (EOB) that Medicare was billed for a covered service like skilled nursing or other therapy instead.
• You accept cash or gifts in exchange for going along with a home health scam.
To learn more about tips related to home health care fraud, click here.
To learn how to read your Medicare Summary Notice (MSN) and Explanation of Benefits (EOB), click here.
Report Suspected Fraud
To report suspected fraud, click here.
SMP Resources
• Home Health Care Fraud Tip Sheet
(English) (Arabic) (Chinese Simplified) (French) (German) (Korean) (Russian) (Spanish) (Tagalog) (Vietnamese)
• Home Health Care Fraud Infographic
(English) (Arabic) (Chinese Simplified) (French) (German) (Korean) (Russian) (Spanish) (Tagalog) (Vietnamese)
• Home Health Care Fraud Video
2/
https://www.npr.org/sections/health-shots/2020/01/21/789958067/patients-want-to-die-at-home-but-home-hospice-care-can-be-tough-on-families
...Usually, hospice care is offered in the home, or sometimes in a nursing home.
Since the mid-1990s, Medicare has allowed the hospice benefit to cover more types of diagnoses, and therefore more people. As acceptance grows among physicians and patients, the numbers continue to balloon — from 1.27 million patients in 2012 to 1.49 million in 2017.
According to the National Hospice and Palliative Care Association, hospice is now a $19 billion industry, almost entirely funded by taxpayers. But as the business has grown, so has the burden on families, who are often the ones providing most of the care.
For example, one intimate task in particular changed Joy Johnston's view of what hospice really means — trying to get her mom's bowels moving. Constipation plagues many dying patients.
"It's ironically called the 'comfort care kit' that you get with home hospice. They include suppositories, and so I had to do that," she says. "That was the lowest point. And I'm sure it was the lowest point for my mother as well. And it didn't work."
Hospice agencies primarily serve in an advisory role and from a distance, even in the final, intense days when family caregivers, or home nurses they've hired, must continually adjust morphine doses or deal with typical end-of-life symptoms, such as bleeding or breathing trouble. Those decisive moments can be scary for the family, says Dr. Joan Teno, a physician and leading hospice researcher at Oregon Health and Science University.
"Imagine if you're the caregiver, and that you're in the house," Teno says. "It's in the middle of the night, 2 o'clock in the morning, and all of a sudden, your family member has a grand mal seizure."
That's exactly what happened with Teno's mother.
"While it was difficult for me to witness, I knew what to do," she says.
In contrast, Teno says, in her father's final hours, he was admitted to a hospice residence. Such residences often resemble a nursing home, with private rooms where family and friends can come and go and with round-the-clock medical attention just down the hall.
Teno called the residence experience of hospice a "godsend." But an inpatient facility is rarely an option, she says. Patients have to be in bad shape for Medicare to pay the higher inpatient rate that hospice residences charge. And by the time such patients reach their final days, it's often too much trouble for them and the family to move.
Hospice care is a lucrative business. It is now the most profitable type of health care service that Medicare pays for. According to Medicare data, for-profit hospice agencies now outnumber the nonprofits that pioneered the service in the 1970s. But agencies that need to generate profits for investors aren't building dedicated hospice units or residences, in general, mostly because such facilities aren't profitable enough.
Joe Shega, chief medical officer at for-profit Vitas, the largest hospice company in the U.S., insists it's the patients' wishes, not a corporate desire to make more money, that drives his firm's business model. "Our focus is on what patients want, and 85 to 90 percent want to be at home," Shega says. "So, our focus is building programs that help them be there."
For many families, making hospice work at home means hiring extra help....
This experience of family caregivers is typical, but often unexpected.
'It's a burden I lovingly did'
"It does take a toll" on families, says Katherine Ornstein, an associate professor of geriatrics and palliative medicine at Mount Sinai Hospital in New York, who studies what typically happens in the last years of patients' lives. The increasing burden on loved ones — especially spouses — is reaching a breaking point for many people, her research shows. This particular type of stress has even been given a name: caregiver syndrome.
"Our long-term-care system in this country is really using families — unpaid family members," she says. "That's our situation."
A few high-profile advocates have even started questioning whether hospice is right for everybody. For some who have gone through home hospice with a loved one, the difficult experience has led them to choose otherwise for themselves.
Social worker Coneigh Sea has a portrait of her husband that sits in the entryway of her home in Murfreesboro, Tenn. He died of prostate cancer in their bedroom in 1993.
Enough time has passed since then that the mental fog she experienced while managing his medication and bodily fluids — mostly by herself — has cleared, she says. But it was a burden.
"For me to say that — there's that guilt," she says, then adds, "but I know better. It was a burden that I lovingly did."
She doesn't regret the experience but says it is not one she wishes for her own grown children. She recently sat them down, she says, to make sure they handle her death differently.
"I told my family, if there is such a thing, I will come back and I will haunt you," she says with a laugh. "Don't you do that."
Sea's family may have limited options. Sidestepping home hospice typically means paying for a pricey nursing home or passing away with the cost and potential chaos of a hospital — which is precisely what hospice care was set up to avoid.
As researchers in the field look to the future, they are calling for more palliative care, not less — even as they also advocate for more support of the spouses, family members and friends who are tasked with caring for the patient.
"We really have to expand — in general — our approach to supporting caregivers," Ornstein says, noting that some countries outside the U.S. pay for a wider range and longer duration of home health services.
"I think what we really need to do is be broadening the support that individuals and families can have as they're caring for individuals throughout the course of serious illness," Ornstein says. "And I think that probably speaks to the expansion of palliative care in general."
Blake Farmer's reporting on end-of-life care is part of a reporting fellowship on health care performance, sponsored by the Association of Health Care Journalists and supported by the Commonwealth Fund.
3/
https://www.kff.org/medicaid/understanding-medicaid-home-care-amid-cms-focus-on-potential-fraud-and-abuse/
Understanding Medicaid Home Care Amid CMS Focus on Potential Fraud and Abuse
Authors: Alice Burns, Abby Wolk, and Robin Rudowitz
Published: Feb 24, 2026
Potential fraud in state Medicaid programs is getting renewed attention, with a recent emphasis on home care, also known as personal care or in-home supportive services. Home care helps with self-care activities such as bathing, dressing, and eating for older adults and people with disabilities. KFF estimates that over 5 million people use Medicaid home care, which allows individuals to receive long-term care without moving into an institution. The Trump administration has recently pointed to Medicaid home care as a source of fraud. Medicaid home care is susceptible to fraud because services are provided in people’s homes to vulnerable individuals who may be less able to advocate for themselves, including some with Alzheimer’s and other dementias. However, there are also additional safeguards against fraud in Medicaid home care compared to other types of Medicaid services. This issue brief describes how Medicaid home care operates, including who is eligible, the various systems in place to promote program integrity in its delivery, and challenges using data newly released by the Centers for Medicare and Medicaid Services (CMS). Key takeaways include the following.
• All states provide optional home care services to people whose needs are sufficient to warrant institutionalization. An institutional level of care is generally beyond what family members are capable of providing.
• Recognizing the higher risk of fraud in Medicaid home care, federal and state governments have implemented additional tools to identify and detect home care fraud. States, along with the federal government, use provider credentialing and enrollment and data analytics to help prevent fraud. There has been new attention on fraud in Minnesota’s Medicaid program recently, but the fraud, and the state’s work to root it out, date back at least 18 months.
• On February 14, 2026, CMS released a dataset with provider-level spending data that the agency suggests could be used to identify unusual billing patterns for specific services, states, or providers, but the limited data could result in mistaken conclusions. Home care is a major emphasis of the new dataset, which reflects the fact that long-term care is the second-largest source of Medicaid spending, after hospital care. Although Medicaid long-term care was historically provided primarily in nursing facilities, most enrollees who use long-term care now receive home care.
Why does Medicaid cover home care and who is eligible for services?
All states provide optional home care services. Under Medicaid, states are required to cover long-term care provided in nursing facilities, but not home care, which has been referred to as the “institutional bias” in Medicaid. States may only provide home care if they can demonstrate that providing the services would cost no more than institutional care would cost for an individual. All states choose to provide optional home care to people who would otherwise require institutionalization. The increased availability of home care reflects people’s preferences to remain in their homes. Expansions of Medicaid home care services also followed the 1999 Supreme Court ruling in Olmstead v. L.C., which declared that unjustified institutionalization of people with disabilities by a public entity (including Medicaid) is a form of discrimination and not permissible under the 1990 Americans with Disabilities Act. Even though nearly all of the benefits are optional for states to provide, the majority of people who use long-term care now do so at home.
Medicaid home care use is limited by eligibility criteria that generally make it only available to people whose needs are sufficient to warrant institutionalization. To be eligible for Medicaid home care, applicants must meet both financial and “functional” eligibility criteria. Functional eligibility for Medicaid home care, which is evaluated by assessment tools developed by states, generally requires individuals to demonstrate that they need an institutional level of care. There are no recent data available about states’ specific definitions for an institutional level of care, but it generally indicates that people would require 24-hour services and assistance with multiple activities of daily living (ADLs), which include bathing, dressing, eating, toileting, continence, and transferring between bed and other settings.
An institutional level of care is generally beyond what family members are capable of providing. People who require an institutional level of care generally have complex needs that require both skilled and unskilled services and often require services to be provided around the clock. In some cases, family caregivers may not have the medical expertise to provide services, but there are also challenges related to the physical demands of the job and having time to provide such intensive services. Helping family members to bathe, dress, and toilet themselves often requires the strength to lift them, which not all family members have. The time required to provide such intensive services also makes it difficult for family caregivers to provide this level of care and maintain employment or take care of their own health needs. KFF’s focus groups with paid and unpaid family caregivers provide detail that caregiving is physically, mentally, and emotionally challenging; and that family caregivers cannot provide an institutional level of care without supports. To help people requiring an institutional level of care remain at home, Medicaid supports family caregivers by providing supplemental paid care and with direct supports, such as respite care, training, and in some cases payments to the family caregivers to reflect the fact that caregiving makes it impossible to maintain outside employment.
What program integrity tools for Medicaid home care exist?
Recognizing the higher risk of fraud in Medicaid home care, federal and state governments have implemented additional tools to identify and detect home care fraud. In 2016, Congress passed the 21st Century Cures Act, which requires states to implement electronic visit verification for all Medicaid personal care and home health services if a visit is made to a person in the home. States’ electronic visit verification systems must capture six data elements: the member receiving the services, the caregiver providing the service, the type of service, the location of the service delivery, the date of the service, and the time the service begins and ends. Electronic visit verification was established to help promote fiscal integrity for Medicaid home care, and states had until 2023 to fully implement the requirements. The Health and Human Services Office of Inspector General (HHS OIG) has an active project underway to evaluate the availability and completeness of the electronic visit verification data and how states are using the data to promote program integrity.
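For readers who think in data structures, the six required elements can be pictured as one record per visit. The sketch below is purely illustrative: the `VisitRecord` class, its field names, and the `problems` check are hypothetical and do not reflect any state's actual electronic visit verification system or data format.

```python
from dataclasses import dataclass
from datetime import date, time
from typing import Optional


@dataclass(frozen=True)
class VisitRecord:
    """One home care visit, holding the six required data elements."""
    member_id: Optional[str]         # member receiving the services
    caregiver_id: Optional[str]      # caregiver providing the service
    service_type: Optional[str]      # type of service
    location: Optional[str]          # location of the service delivery
    service_date: Optional[date]     # date of the service
    start_time: Optional[time]       # time the service begins
    end_time: Optional[time]         # time the service ends


def problems(record: VisitRecord) -> list[str]:
    """List basic completeness and consistency issues in a visit record."""
    # Flag any element that is absent or an empty string.
    issues = [f"missing {name}"
              for name, value in vars(record).items()
              if value in ("", None)]
    # Simplifying assumption: a visit starts and ends on the same day.
    if (record.start_time is not None and record.end_time is not None
            and record.end_time <= record.start_time):
        issues.append("end_time is not after start_time")
    return issues
```

The start and end times are stored as two fields because the sixth element covers both. The same-day assumption would need to be relaxed for overnight visits.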
An HHS OIG report finds that in fiscal year 2024, there were 298 fraud convictions....
AI and Research in Medicine and Other Fields
https://www.cbsnews.com/news/ai-hallucinate-citations-medial-research/?intcid=CNR-02-0623
AI is fabricating citations in biomedical studies, researchers find
By Megan Cerullo
Updated on: May 13, 2026 / 5:09 PM EDT / CBS News
Artificial intelligence is fabricating references to medical research that does not exist, according to recent findings.
An audit of millions of biomedical papers found more than 4,000 citations to bogus studies, the researchers said in a recent article published in The Lancet.
Fabricated citations are dangerous because they influence clinical guidelines, which are based on public research that health care professionals follow in providing care, Maxim Topaz, an associate professor at the Columbia School of Nursing and the study's lead author, told CBS News.
"When those fake references are making it into the literature, they will end up in those guidelines, and that's how doctors decide how to provide care for you," he said. "Your doctor could be making decisions around treatment based on studies that never existed."
Growing problem
Also troubling is that none of the mistakes Topaz and his team identified have been corrected or retracted, and could still be influencing patient care, he said.
"The rate of fake references showing up in published medical literature is growing," Topaz added, noting that the number of such erroneous citations has grown 12-fold over the last three years. The fabricated references spanned nearly 3,000 academic papers.
Topaz's own experience spurred him to investigate the issue. An AI app he was using to help polish one of his own scientific papers inserted a fake citation, he told CBS News. It then slipped through several layers of peer reviews before one sharp-eyed editor caught the phony reference.
"I was mortified, because I've been studying AI for the past 15 years, so if it can happen to me, it can happen to anyone," he said.
Such mishaps arise when an author asserts a statement of fact and asks AI for a citation, Topaz explained. "In some cases, AI would slip those in, inadvertently," he said. "You would hope the facts are accurate, but if they are supported by fabricated citations, you don't know if the 'facts' are accurate."
In some cases, an AI tool will also cite a real author while inventing research and attributing it to that person. Other times, citations were completely fabricated, Topaz said.
"This is just the tip of the iceberg," he said, noting that research across other fields could also be subject to the same issues.
Meanwhile, faux AI-generated scientific citations can "look perfectly real," added Topaz, who emphasized the importance of researchers rigorously fact-checking their work.
Sunday, May 10, 2026
AI Literacy Across the United States Workforce
https://blog.citp.princeton.edu/2026/05/05/make-america-ai-ready-strengths-weaknesses-and-recommendations/
What Does It Do Well?
It’s accessible. The choice of SMS for delivery maximizes reach. It meets people where they are, requiring no app installation, account creation, or navigating unfamiliar web platforms. The 10-minute-a-day pacing is practical.
It emphasizes verification of AI outputs. The course consistently emphasizes that AI output must be checked, not blindly trusted. The example of looking up a restaurant only to find out that a nail salon has opened in its place is memorable (Lesson 6, below). The course also thoughtfully extends this skepticism to AI-generated images, video, and audio.
It centers human responsibility. The quiz question about a coworker submitting an AI-generated report with fabricated statistics (Lesson 2, below) returns a sensible response: the human is responsible. This is repeated throughout the course and is one of its most important messages.
It’s honest about AI’s limitations. The course doesn’t shy away from the fact that AI can be confidently wrong. The term “hallucination” is introduced clearly, the concept of training data cutoffs is explained, and the course repeatedly emphasizes that AI predicts rather than knows or understands. For a 101-level course, this is appropriately calibrated.
What could be fixed in AI 101?
There are some things we’d recommend fixing about the course.
The course repeatedly contradicts its own privacy and security advice.
The course contains a serious inconsistency when it comes to data privacy and security. On the last day of the course it offers common-sense advice, stating “PROTECT your private info. Never share passwords, Social Security numbers, medical records, or confidential work data with AI tools,” later adding not to share “income data.” But some of the advice and exercises leading up to that point had already prompted users to input some of these “never share” types of data.
• On Day 3, the course urges the user to input a photo, PDF or recording of their own voice.
• On Day 4, it says that a “power move” is for users to “give AI your own data to work with,” including instructions to “paste your resume” and “share your monthly expenses.”
• On Day 5, the course says that a good use case for AI is putting “medical symptoms” in to learn medical terms and prepare questions for a doctor.
• On Day 6, it tells the user to share their address to find a restaurant near them.
These self-contradictions expose a central tension: AI tools can be more useful when they know more about you, so a blanket prohibition against sharing private information will limit their usefulness. Unfortunately, there is no simple answer to the question of how to protect your privacy when using AI, and there is no single approach that will work for everyone. It requires critical thinking based on an understanding of different threat models, including prompt injection risks, traditional cybersecurity risks, legal risks, AI companies’ eagerness to train on user data, and workplace policies that of course vary between organizations.
We recognize that this level of nuance would be too much for an introductory course. We would recommend that the privacy protection lesson come earlier in the course, and include information about privacy settings that AI tools offer, such as temporary or incognito chats. Instead of the “never share” language, giving people at least a rudimentary understanding of what could go wrong would be more helpful, along with links to resources where they can learn more.
The quizzes adopt a right-wrong dichotomy
The quiz questions often ask the user for an explanation of AI’s failure modes and social effects. While it is important to face these head-on, the questions consistently have one “obviously correct” answer that maps to the course’s framing. Several wrong answers are absurd strawmen (“AI likes making things up to test you,” “AI’s internet connection was slow”). This limits the potential to build genuine understanding or critical thinking about AI’s functioning and societal implications.
We would recommend an approach that highlights known issues without pretending that the explanations are simple. Flexibility in how issues are framed will allow course participants to grapple with them in a manner that is relevant to the skills they are building. More open-ended quiz questions might include: “Your employer starts mandating that all workers use AI. This may enable your employer to monitor your productivity. What are your options?” or “You are about to apply for a loan. How can you find out whether and how AI will be used in evaluating your application?”
What could DOL build upon in AI 201?
Expanding upon the introductory materials in the 101 course, there are several opportunities for content development that we would recommend.
The course misses how AI is reshaping work
For a course that is offered by the Department of Labor, there is very little content on the subject of work — the course frames AI solely as a productivity tool workers can use. The Department of Labor exists to protect workers, their wages, their safety, and their rights, yet the course largely skips over the ways AI is already reshaping hiring, performance monitoring, and layoffs of workers across many sectors.
An AI 201 course could provide more information on these, and inform citizens of legitimate reasons they may have to call for regulation. It could also go into more depth on the privacy question. Finally, AI 201 could reckon with the broader societal consequences of this technology: for instance, bias, surveillance, and the concentration of power in the hands of a few large technology companies. Workers who understand these dynamics are not just AI-literate; they are better equipped to advocate for themselves.
Deepening Technical Explanations
The 101 course keeps its terminology simple, which is important. But sometimes it oversimplifies. An AI 201 could deepen the explanation of how models are trained, make inferences, and deliver human-interpretable results.
The course’s technical explanation (AI finds patterns and makes predictions) serves as the entire mental model. This framing makes AI sound more mechanistic and less opaque than it actually is. On Day 3, the language of pattern and prediction drops out, replaced by the language of “instruction” and “results” for the human input and predicted output of AI. The current course also equates predicting with guessing and AI training with “studying”; these analogies might be a useful starting point, but they are quite limiting.
For an AI 201 course, the connections between AI learning, model weights and predictions – as well as the connections between all of these things and the results generated from instructions – could be deepened. Indeed, how AI can be biased, can hallucinate, and otherwise can make errors is easier to comprehend when one understands a bit of the math behind machine learning.
More Active Learning Engagement
The quizzes in AI 101 are based on reputable learning science. Often the quiz will introduce a new concept or ask the user to stretch what they just learned to cover a new situation. There’s good evidence to think that this sort of “pre-assessment,” followed quickly by lessons teaching the correct answer, does improve retention in general.
But as we said, the AI 101 quiz questions consistently have one “obviously correct” answer that maps to the course’s framing, limiting the potential to challenge the user’s understanding. Additionally, we found minimal tailoring of text-message responses to the user’s quiz answers, despite the affordances of the interactive platform. If one user selects what is considered a right answer while another selects a wrong one (we tested this), the course responds with similar if not identical information. Better quizzes in AI 201 could perhaps be assessed by an LLM, with adaptive responses that meet the user where they are, and stretch their understanding when they’ve acquired a solid base.
The daily challenges in AI 101 (Quick Draw, Udio music generation, fridge photo recipes) are well-designed to get people past the intimidation barrier. They’re low-stakes, fun, and demonstrate AI capabilities concretely. But for AI 201 they could be more effectively leveraged to actually show people how AI can be useful in their work and daily lives, and can (as promised by AI 101) “save them 5 hours per week”.
Who created the course, and how?
The DOL’s press release announcing the course points to a collaboration with a private partner called Arist. Arist’s website at the time of writing states that “Arist is the #1 enablement AI. Arist’s agents orchestrate creation, delivery, and analytics, end-to-end.”
While the DOL announcement gives little detail as to the nature of the collaboration, if the company co-developed actual course content using generative AI, this fact should be disclosed. One of us ran selected course content through Pangram, a tool which purports to detect AI content, and the results came back suggesting it was 100% AI-generated. Without putting too much stock in that, we began to suspect that some of the faults in the course could be explained this way. The simplistic framing of how AI generates results (patterns/predictions, instructions/results) could come from AI: since LLMs are trained on old explanations of how LLMs work, they may reach for framings that are not up-to-date. Also, if each module/quiz was generated separately, that could explain abrupt changes in terminology and the contradictions we identified regarding the sharing/not sharing of private information. The use of AI for content creation isn’t a problem per se, but the failure to disclose it was a missed opportunity for a teachable moment on the utility and risks associated with generative content. Also, the contradictions with regard to security and privacy, which we discussed earlier, should have been caught by human oversight.
Additionally, going forward, transparency about how commercial partners are involved can lend itself to wider adoption and trust of course materials and DOL initiatives. The final lesson of the course refers users to an Arist-sponsored AI summit featuring Tony Robbins and Dean Graziosi. While the Summit appeared to be free, it raises the question of what other paid AI-enablement sessions or products these well-known coaches might offer. Graziosi has drawn attention for his role in other problematic training programs. Users deserve to know who benefits from pursuing the recommendations made by a Federal agency.
Conclusion
Make America AI Ready offers significant insight into the priorities the Federal government holds in reaching widespread AI-literacy across the United States workforce. Although we suggested several areas for development, the course content and manner in which it was released are a useful start in achieving this aim.
War and Love
Friday, May 8, 2026
Myths about Sleep
https://www.npr.org/2024/01/09/1196978496/debunking-popular-myths-about-sleep
To help educate the public about healthy sleep, Robbins and her colleagues identified popular myths about sleep and debunked them in a 2019 paper published in the journal Sleep Health. They looked at statements such as "many adults need only 5 or less hours of sleep" and "it does not matter what time of day you sleep." And they found that these claims had "a limited or questionable evidence base."
Robbins walks through some of these myths with Life Kit and shares some much-needed tips on how to get better sleep.
MYTH 1: It doesn't matter what time of day you sleep
"Unfortunately, the time of day does matter," says Robbins. Our circadian rhythm, the internal circuitry that guides the secretion of the essential sleep hormone melatonin, is "significantly influenced by natural sunlight in our environment."
When the sun comes up and we go outside, that sunshine "stops the floodgates of melatonin and switches the 'on' phase of our circadian rhythm," she says.
"Conversely, going into a dark environment is what allows for the secretion of melatonin," she adds.
MYTH 2: One night of sleep deprivation will have lasting effects
If you had a bad night of sleep, don't stress — just get back to your normal sleep routine as soon as possible, says Robbins.
But those effects likely resolve with recovery sleep. So if you have an off night, don't beat yourself up about it, says Robbins. Instead, try to get back on track with your normal sleep schedule as soon as possible.
MYTH 3: Being able to fall asleep anytime, anywhere is a good thing
Being able to fall asleep in random places, like your desk, isn't a good thing. It takes a well-rested, healthy person about 15 to 20 minutes to fall asleep, says Robbins.
"It's a myth that a good sleeper would be able to hit the pillow and fall asleep right away," says Robbins. "This is because sleep is a process."
It takes a well-rested, healthy person about 15 to 20 minutes or maybe a little bit longer to fall asleep, she adds.
MYTH 4: You can survive on less than five hours of sleep
Some people brag about needing only a few hours of sleep at night. That may come from the notion in our high-performing society that "well-rested people are lazy," says Robbins — "which is a myth."
The reality is that adults need about seven to nine hours of sleep a night, she says. "That's where we see the most optimal health [outcomes]: improved heart health, longevity and brain health into our older years."
Sleeping less than seven hours a night can result in weight gain, obesity, diabetes and hypertension, according to a statement from the American Academy of Sleep Medicine and the Sleep Research Society. It's also associated with impaired immune function, impaired performance and increased errors — like "sending an email to the wrong person or entering incorrect numbers in a spreadsheet," says Robbins.
So if you can, try to hit that goal of sleeping seven to nine hours as many nights of the week as possible, she adds. You'll know that you've hit your sweet spot when you "wake up feeling refreshed, have energy throughout the day and are not reaching for coffee or energy drinks in the afternoon."
MYTH 5: Watching TV is a good way to relax before bedtime
Watching a show on a device that emits heat, like a laptop positioned on your stomach, can deter your ability to fall asleep, says Robbins.
MYTH 6: Exercising within four hours of bedtime will disturb your sleep
What the research does show is that exercise and sleep appear "mutually beneficial," wrote Robbins and her colleagues in their paper. One analysis of several research papers found that people who consistently exercised saw "small to moderate improvements in sleep."
"Exercise releases endorphins, which are mood elevators that can help with the No. 1 cause of sleep difficulties: stress," she says.
For that reason, Robbins encourages people to exercise — even if it's close to bedtime. "If that's the only time you can get a workout in, go for it."