Saturday, March 7, 2026

AI and Job Hunters

https://www.businessinsider.com/hiring-managers-arent-reading-resumes-slop-2026-3

RIP résumés: Slop is killing the résumé. Job hunters are scrambling for new ways to stand out
By Amanda Hoover, Mar 3, 2026, 3:17 AM CT

A decade ago, I walked into an office to interview for my first newsroom internship. Wearing a millennial-core business casual H&M pencil skirt and Steve Madden flats, I handed my résumé — neatly spaced Arial font, carefully considered, and kept crisp in its designated folder — to the editor. Without looking up from her computer, she said, "I don't read résumés," and flicked the paper to the floor. If you've ever assumed an automated applicant tracking system has thrown out your résumé, I can tell you it feels just as demoralizing to watch it happen IRL.

Today, more hiring managers and recruiters are following that approach. Now that anyone can spin up a buzzword-filled résumé and cover letter in seconds with ChatGPT, doctor a flawless headshot, or cheat a coding test, faked or embellished applications have become indiscernible from those of quality candidates. The résumé has been relegated.

"Resume not your thing? That's great, we don't really read them anyway!" reads a job post for an engineer at Expensify. "While we know you're awesome, it's actually really hard and time consuming to find you in the midst of literally hundreds of other applications we get from everyone else." The post goes on to list five questions applicants should answer to be considered.

"We don't require a résumé, and we don't expect one," notes a software engineering job at Automattic, which owns WordPress.com and Tumblr. Some employers are focusing more on a person's enthusiasm and skills than on shiny credentials. E-commerce platform Gumroad asks prospective software engineers to send an email detailing why they want to work there and what they've built, and, if selected, to participate in a paid four-to-six-week work trial.
Research has long shown that résumés alone, even those listing impressive companies and years of experience, aren't great predictors of success in a new job. Now, in the age of Gen AI slop, "the résumé is almost worthless because they all read the same," says Michelle Volberg, founder and CEO of Twill, a recruiting software company. She compares AI-edited résumés to going to a restaurant where "the menu looked really beautiful and had all these amazing ingredients and dishes, but there was no one there actually making the food."

Volberg tells me she's seen a shift just in the past three months: some companies she works with are opting to extend paid work trials for as long as a month to evaluate a candidate. Some are focused more on workers' real-time abilities than on whether they've worked at a Big Tech company or went to an Ivy League school. A new survey from the National Association of Colleges and Employers found that 70% of employers say they're using skills-based hiring, which prioritizes practical abilities and aptitudes over credentials like degrees and years of experience. A résumé might still be used to identify and track a candidate, Volberg says…..

https://arxiv.org/pdf/2602.18550

Measuring Validity in LLM-based Resume Screening
Jane Castleman, Zeyu Shen, Blossom Metevier, Max Springer, Aleksandra Korolova
Princeton University, Princeton, NJ

Abstract

Resume screening is perceived as a particularly suitable task for LLMs given their ability to analyze natural language; thus many entities rely on general-purpose LLMs without further adapting them to the task. While researchers have shown that some LLMs are biased in their selection rates of different demographics, studies measuring the validity of LLM decisions are limited. One of the difficulties in externally measuring validity stems from the lack of access to a large corpus of resumes for which the ground truth ranking is known and that has not already been used for LLM training.
In this work, we overcome this challenge by systematically constructing a large dataset of resumes tailored to particular jobs that are directly comparable, with a known ground truth of superiority. We then use the constructed dataset to measure the validity of ranking decisions made by various LLMs, finding that many models are unable to consistently select the resumes describing more qualified candidates. Furthermore, when measuring the validity of decisions, we find that models do not reliably abstain when ranking equally-qualified candidates, and select candidates from different demographic groups at different rates, occasionally prioritizing historically-marginalized candidates. Our proposed framework provides a principled approach to audit LLM resume screeners in the absence of ground truth, offering a crucial tool to independent auditors and developers to ensure the validity of these systems as they are deployed.

…..

Discussion & Conclusion

A central insight of this work is that validity and fairness are analytically separable: a model can be valid yet unfair, or consistent yet invalid. Many prior studies collapse these into a single moral question, whereas our framework treats them as distinct, measurable properties. This shifts the conversation from ethical aspiration to empirical reliability, which is precisely what high-stakes hiring systems require.

Our findings show that validity in LLM-based resume screening should not be assumed, even for frontier models. While performance improves with model scale and with clearer qualification differences, both criterion and discriminant validity vary across settings, illustrating the necessity for more controlled testing. Our work contributes a principled framework for evaluating validity under controlled conditions and varying levels of task difficulty. These template-based methods provide a crucial tool for on-demand, scalable assessment across downstream use cases.
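The two validity notions in the abstract can be made concrete with a small sketch. Assuming a hypothetical log of pairwise screening decisions with known ground truth (the field and function names below are illustrative, not the paper's actual code), criterion validity is the rate at which the model picks the more-qualified resume, and abstention behavior is measured only on the equally-qualified pairs:

```python
from dataclasses import dataclass

@dataclass
class PairResult:
    picked_stronger: bool  # model chose the more-qualified resume
    abstained: bool        # model declined to rank the pair
    is_tie: bool           # ground truth: candidates equally qualified

def criterion_validity(results):
    """Fraction of unequal pairs where the model picked the stronger resume."""
    unequal = [r for r in results if not r.is_tie]
    if not unequal:
        return None
    return sum(r.picked_stronger for r in unequal) / len(unequal)

def abstention_rate_on_ties(results):
    """Fraction of equally-qualified pairs where the model abstained."""
    ties = [r for r in results if r.is_tie]
    if not ties:
        return None
    return sum(r.abstained for r in ties) / len(ties)
```

A model with high criterion validity but a low abstention rate on ties matches the failure mode the abstract describes: it separates unequal candidates well, yet invents a preference where the ground truth says there is none.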
In the rest of this section, we examine implications of our findings, discuss how to extend our framework for top-k rankings and longitudinal evaluations, and end with limitations and future directions.

Validity and Over-Alignment. Our results reveal a complex tension between model validity and alignment. While we observe that validity generally scales with model size, our evaluations of discriminant validity uncover evidence of unequal selection rates that persists despite high criterion validity (Figure 3). In contrast with previous work [4], unequal selection rates favor Black and women candidates when candidates are equally qualified, suggesting that current post-training techniques designed to mitigate bias may be inducing a new form of invalidity where demographic signals override relevant qualifications [54]. Future evaluations must therefore treat validity and fairness not as orthogonal metrics, but as coupled objectives, ensuring that bias mitigation efforts do not compromise the fundamental reasoning capabilities required for accurate decision-making.

Pairwise to Global Rankings. While our framework evaluates pairwise comparisons, practical deployment often requires ranking larger pools to identify the top-k candidates. Pairwise validity is sufficient for this, as it ensures the model's preferences form a transitive structure (specifically a DAG), guaranteeing a coherent total ordering via topological sorting. Without pairwise validity, preference cycles make a top candidate mathematically undefined. Once pairwise validity is established, the LLM becomes a reliable comparator for efficient sorting algorithms, or can be integrated into continuous scoring systems like Elo ratings [55] or tournament structures [56].

Longitudinal Evaluations. Our framework supports longitudinal evaluations by enabling repeated testing under evolving job descriptions and model versions.
Because static benchmarks are vulnerable to rapid train-test contamination [57], they provide limited insight into how model behavior changes over time. By sourcing live job descriptions and constructing controlled counterfactual resume pairs, our approach mirrors metamorphic [26] and mutation testing [27] to generate novel, contamination-free evaluation sets. Furthermore, our framework explicitly controls task difficulty (via the number of qualification edits, k), incorporating principles from software and standardized testing [58, 27, 59].

Limitations. Our approach provides necessary but not sufficient conditions for validity; models that perform well under our metrics may still fail on subtler, real-world distinctions. Moreover, our study is limited to four demographic groups varying in race and binary gender. Studying broader demographics or attributes such as religion could reveal more nuanced effects on validity. Our ground truth captures discrete qualification differences under controlled conditions, which may not reflect the full complexity of real resumes or demographics. To maintain scalability and adaptability, we rely on LLMs to parse and generate resumes, which could introduce errors that propagate downstream. While this enables consistent comparisons across job descriptions and time, synthetic resumes may differ subtly from human-written ones, and results can vary across generation models. Consequently, synthetic evaluations should not be interpreted as lower bounds on real-world performance.

Future work should apply our evaluation to broader model setups, including those with confidence-threshold re-weighting, fine-tuning, or abstention enforcement under uncertainty, and measure their effects on validity and fairness. By doing so, our framework serves as a rigorous foundation for the automated compliance testing of real-world decision-making systems.
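The pairwise-to-global argument in the discussion above can be sketched directly: if a comparator's pairwise preferences are acyclic, Python's standard-library graphlib recovers a total order, and a preference cycle raises CycleError, which is exactly the "top candidate mathematically undefined" failure the paper describes. The comparator below is a hypothetical stand-in for an LLM judge, not the paper's implementation:

```python
from graphlib import TopologicalSorter, CycleError

def global_ranking(candidates, prefers):
    """
    Build a preference DAG from a pairwise comparator
    `prefers(a, b) -> True if a should rank above b`,
    then return a total order via topological sort.
    Raises CycleError if the preferences are cyclic.
    """
    # graph maps each node to the set of nodes that must come before it
    graph = {c: set() for c in candidates}
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            if prefers(a, b):
                graph[b].add(a)  # a precedes b in the final ranking
            else:
                graph[a].add(b)  # b precedes a
    return list(TopologicalSorter(graph).static_order())
```

With a toy numeric comparator, `global_ranking([1, 3, 2], lambda a, b: a > b)` yields the descending order `[3, 2, 1]`; swapping in an inconsistent comparator would surface a `CycleError` instead of a silently arbitrary ranking, which is the auditing property the section argues for.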