Copying and mimicking is perhaps their safest decision

AI productivity dramaaaaaa, LLM world-models, Global Flourishing, blockbusters and the nature of innovation

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

“we find that allowing AI actually increases completion time by 19%—AI tooling slowed developers down”

“We conduct a randomized controlled trial (RCT) to understand how AI tools at the February–June 2025 frontier affect the productivity of experienced open-source developers. 16 developers with moderate AI experience complete 246 tasks in mature projects on which they have an average of 5 years of prior experience. Each task is randomly assigned to allow or disallow usage of early-2025 AI tools. When AI tools are allowed, developers primarily use Cursor Pro, a popular code editor, and Claude 3.5/3.7 Sonnet. Before starting tasks, developers forecast that allowing AI will reduce completion time by 24%. After completing the study, developers estimate that allowing AI reduced completion time by 20%. Surprisingly, we find that allowing AI actually increases completion time by 19%—AI tooling slowed developers down.”

beckler-1.png
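
To make the gap between perception and measurement concrete, here is a minimal arithmetic sketch. It takes the three percentages in the abstract at face value, as changes in completion time relative to working without AI. The 100-minute baseline and the helper name `minutes_with_ai` are assumptions for illustration, not figures or methods from the study.

```python
# Illustrative arithmetic only: the 100-minute baseline is a made-up number,
# not a figure from the study.
BASELINE_MINUTES = 100.0

# Percent changes in completion time quoted in the abstract
# (negative = faster with AI, positive = slower with AI).
CHANGES = {
    "forecast before the study": -0.24,
    "estimate after the study": -0.20,
    "measured in the trial": 0.19,
}


def minutes_with_ai(baseline: float, change: float) -> float:
    """Apply a fractional change in completion time to a baseline."""
    return baseline * (1.0 + change)


for label, change in CHANGES.items():
    print(f"{label}: {minutes_with_ai(BASELINE_MINUTES, change):.0f} minutes")
```

On those assumptions, a task that takes 100 minutes unaided was forecast to take 76 minutes with AI, was remembered as taking about 80, and actually took about 119.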

“Our study primarily complements existing literature measuring the impact of AI on software development by:

  1. Testing AI models at the February–June 2025 frontier,
  2. Using unfiltered, “live” open-source repository tasks rather than synthetic or cherry-picked tasks,
  3. Using a fixed outcome measure (speedup on tasks defined before randomized treatment assignment),
  4. Recruiting experienced engineers with years of expertise in the target repositories, and
  5. Collecting rich data on time usage, AI code suggestions, and developers’ qualitative experiences.”
beckler-2.png

“Using entry and exit surveys, screen recordings, developer interviews, and subset analyses we find qualitative and quantitative evidence that 5 of the 20 factors contribute to slowdown, we find mixed/unclear/no evidence that 9 of the factors contribute to slowdown, and we find evidence against 6 of the factors contributing. However, we strongly caution against over-indexing on the basis of any individual pieces of evidence, as we are not powered for statistically significant multiple comparisons when subsetting our data. This analysis is intended to provide speculative, suggestive evidence about the mechanisms behind slowdown.”

https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/

Can large language models play text games well? current state-of-the-art and open questions

“Games are a microcosm of human life”

“In this technical report, we take an initiative to investigate their capacities of playing text games, in which a player has to understand the environment and respond to situations by having dialogues with the game world. Our experiments show that ChatGPT performs competitively compared to all the existing systems but still exhibits a low level of intelligence. Precisely, ChatGPT can not construct the world model by playing the game or even reading the game manual; it may fail to leverage the world knowledge that it already has; it cannot infer the goal of each step as the game progresses.”

tsai-1.png

“In this section, we feed the correct walkthrough to ChatGPT and test whether it can learn the world model of Zork by reading the walkthrough. This is important because world models are widely believed to be a key building block of the road towards human-level intelligence (Ha & Schmidhuber, 2018; Matsuo et al., 2022).”

tsai-2.png

“as the game progresses, we repeatedly ask ChatGPT what it thinks the current goal is and see if it can say anything meaningful at any point.”

“There is also evidence that ChatGPT’s inability to learn a world model also hinders its ability to infer goals. In particular, ChatGPT often wants to explore the locations and routes that it has already visited.”

“Maybe all LLMs will live with this problem: after all, they are pretrained to predict future tokens given contexts; when they have no clues, copying and mimicking is perhaps their safest decisions.”

Tsai, C. F., Zhou, X., Liu, S. S., Li, J., Yu, M., & Mei, H. (2023). Can large language models play text games well? current state-of-the-art and open questions. arXiv preprint arXiv:2304.02868.

https://arxiv.org/abs/2304.02868

The Global Flourishing Study: Study profile and initial results on flourishing

“Are we sufficiently investing in the future given the notable flourishing–age gradient, with the youngest groups often faring the most poorly?”

“The Global Flourishing Study is a longitudinal panel study of over 200,000 participants in 22 geographically and culturally diverse countries, spanning all six populated continents, with nationally representative sampling and intended annual survey data collection for 5 years to assess numerous aspects of flourishing and its possible determinants.”

gfs-1.png

“Flourishing is an expansive concept and the working definition underpinning the GFS has been ‘the relative attainment of a state in which all aspects of a person’s life are good, including the contexts in which that person lives.’ Several aspects of this definition are important. First, flourishing is multidimensional—it concerns all aspects of a person’s life. One may be flourishing in certain ways but not in others. No assessment of flourishing will ever fully measure flourishing, only aspects of it. Second, flourishing may be conceived of as an ideal, but it also concerns the ‘relative attainment’ of that ideal. We are never perfectly flourishing in this life, and there is always room for improvement. Third, flourishing concerns both objective and subjective aspects of life, although subjective aspects are more amenable to survey research. Fourth, the understanding of what is ‘good’ will vary across cultures and contexts, but there is arguably a great deal of common ground as well, and such common ground is a reasonable starting point for measurement. Finally, flourishing includes the contexts in which a person lives; such contexts include one’s communities and environment. While the terms ‘flourishing’ and ‘well-being’ are often used interchangeably, flourishing arguably has a connotation of also having the environment itself being conducive to growth and being a part of one’s flourishing.”

gfs-2.png

“One of the more concerning results is the relation with age. On average, when pooled across the 22 countries, flourishing is essentially flat with age through ages 18–49 and then increases with age thereafter. This is in striking contrast to earlier work—focused mostly on life satisfaction/evaluation—which had suggested a more dramatically U-shaped pattern with age. Even with life satisfaction, pooled over the 22 GFS countries, this is now more J-shaped than U-shaped.”

gfs-3.png

“Young people are not doing as well as they used to be. While causes are likely diverse, mental health concerns with young adults are clearly on the rise. These patterns are not universal. As noted, in some countries the patterns concerning flourishing and age are still somewhat more U-shaped (India, Egypt, Kenya, Japan) and in others (Poland, Tanzania) decreasing with age. Nevertheless, the overall global pattern is troubling.”

gfs-4.png

“Are we sufficiently investing in the future given the notable flourishing–age gradient, with the youngest groups often faring the most poorly? Can we carry out economic development in ways that do not compromise meaning and purpose and relationships and character, given that many economically developed nations are not faring as well on these measures? With economic development and secularization, have we sometimes been neglecting, or even suppressing, powerful spiritual pathways to flourishing? The very word ‘flourishing’ can arguably be used either as an abstract noun to indicate a state (as per the composite flourishing index) or to suggest a dynamic process of growth. How can each nation grow and flourish? If society is to ultimately pursue flourishing, these questions of age, and of economic development, and of spiritual dynamics need to be taken into consideration.”

VanderWeele, T. J., Johnson, B. R., Bialowolski, P. T., Bonhag, R., Bradshaw, M., Breedlove, T., ... & Yancey, G. (2025). The Global Flourishing Study: Study profile and initial results on flourishing. Nature Mental Health, 1-18.

https://www.nature.com/articles/s44220-025-00423-5

https://techcrunch.com/2025/07/10/former-intel-ceo-launches-a-benchmark-to-measure-ai-alignment/

Blockbusters, Sequels and the Nature of Innovation

The Walking Dead

“Using detailed product and invention level data from the pharmaceutical industry, we demonstrate that firms with particularly high-selling “blockbuster” products concentrate their development efforts on new products that both target the same customer segments and are more likely to be technically similar to existing blockbuster products.”

“We argue that when a firm’s product achieves outsized sales (i.e., “blockbuster” status), the firm has an incentive to develop technically similar products targeting the same customer segment—distinguished by the buyer needs and preferences specific to the blockbuster product.

The logic behind our argument is simple. Consider a firm seeking to replace a product nearing the end of its life, a circumstance common to innovation-intensive industries. When such a product achieves high sales, we argue that the firm will tend to develop follow-on products that target the same customer segment in order to sustain the original product’s superior performance. The rationale for this behavior reflects a key assumption: that firms believe demand for their product is “sticky,” meaning that if a successor product appeals to the same segment (i.e., the needs and preferences of its current customers), it will likely achieve comparable sales. We further argue that, in order to serve the same customer segment, firms are more likely to develop products that are technologically similar.”

“We employ product-level sales data from IQVIA for all drug product sales in the U.S., U.K., France, and Germany from 1998 to 2010. We restrict our focus to the sales of branded therapeutic small molecule (chemical-based) and large molecule (biologic-based) drugs. We also collect drug development data from Cortellis, an extensive dataset covering global drug development activities. The raw data includes 79,309 drug development projects conducted by 13,050 companies from 1980 to 2020.”

wesley-1.png

“The empirical results support our core arguments. Specifically, we find that as product sales increase within a well-defined, narrow customer segment, investment in product development for that segment increases. In addition, we also find that, with the increasing sales of a drug, firms are also more likely to invest in the development of a technology that closely resembles that of the existing high-selling product. We also notably find evidence supporting the role of our featured mechanism, demand stickiness, in driving this behavior. Specifically, with a measure of a source of demand-stickiness in the pharmaceutical industry—the incidence of adverse side effects of a drug—we provide evidence supporting the role of demand stickiness in driving both customer segment and technological similarity between high-selling drugs and the drugs under development.”

Cohen, W. M., Higgins, M. J., Miles, W. D., & Shibuya, Y. (2025). Blockbusters, Sequels and the Nature of Innovation (No. w33957). National Bureau of Economic Research.

Reader Feedback

“The idea that LLM’s are limited by collective intelligence is kind of reassuring?”

Footnotes

I’m happy to have learned it this week. Some teams don’t engage in a Riskiest Assumption Test (RAT) because they won’t survive the outcome. Others don’t engage because they don’t know what they don’t know.

This may be a bit of James Burke redux, but briefly, most new inventions are made out of connections from prior knowledge. To pull off the printing press, Gutenberg needed to combine a grape press, ink, paper, and moveable type. And paper was important as the substrate given the concept of ink…of staining in a particular manner. What a mindshift. The riskiest assumption was technical, not market. Books were expensive. There was a demand for them, in particular amongst the few people who could read (the original whales of the publishing industry!). Those who could read had a lot of money to pay for them. Demand was evident. Meeting that demand was not. There has got to be a better way.

To pull off yet another coupon app, one needs to combine front-end code with bar codes with APIs with a content management system, and because it’s 2025, an LLM for some reason. Sure. Why not.

The riskiest assumption isn’t technical. All of those components can be combined. It’s known technology. The riskiest assumption is demand. Who needs a coupon app? Who needs their coupon app? What differences would make a difference?

If the purpose of the team is to develop a coupon app, then testing the riskiest assumption is unwanted activity. What would it change?
