By Christopher Berry in newsletter — Sep 23, 2025

Overly influenced by the architecture of the prompt

This week: Digital twins of over 2,000 people, simulation of 1,000 people, world modeling, pig-butchering, org design, wandb

Twin-2K-500: A Data Set for Building Digital Twins of over 2,000 People Based on Their Answers to over 500 Questions

A reference benchmark we can trust

“Despite the promise and excitement surrounding digital twins, some uncertainty remains. For example, Brucks and Toubia (2025) show that the answers provided by LLMs may be overly influenced by the architecture of the prompt, such as the labeling or ordering of options in multiple choice questions. Gui and Toubia (2023) show that leveraging LLMs to simulate experiments may introduce unwanted confounding due to the difficulty of clearly instructing the LLM how to draw variables not specified in the prompt. Other research (Santurkar et al. 2023, Motoki et al. 2024, Li et al. 2025) suggests LLMs tend to express opinions that are not representative of the (human) population.”

“In sum, to the best of our knowledge, there is no publicly available data set that combines rich psychological profiles, behavioral data, and demographics from a large, representative sample for the development and testing of digital twin simulations. As a result, researchers often rely on synthetic or proprietary data, which undermines transparency, reliability, and replicability. To address this gap, we assemble and publicly share an extensive data set from a representative sample of N = 2,058 people who each answered more than 500 questions covering a wide range of demographic questions, psychological scales, cognitive performance questions, economic preferences questions, as well as replications of a wide range of within- and between-subject experiments on heuristics and biases taken from the behavioral economics literature. The data were collected across four waves of studies lasting on average 2.42 hours per participant in total.”

“In initial tests, these twins predict human behavior with out-of-sample accuracy reaching 88% of the test-retest benchmark. Replication of average treatment effects is generally good, although further research is needed to determine if digital twins can capture nonnormative behaviors and reflect the full diversity of political and domain-specific views.”

Toubia, O., Gui, G. Z., Peng, T., Merlau, D. J., Li, A., & Chen, H. (2025). Twin-2K-500: A Data Set for Building Digital Twins of over 2,000 People Based on Their Answers to over 500 Questions. Marketing Science.

https://pubsonline.informs.org/doi/pdf/10.1287/mksc.2025.0262

Generative agent simulations of 1,000 people

Interesting use of interviews to build the simulations

“We contracted with the recruitment firm Bovitz (41) to obtain a U.S. sample of 1,000 individuals, stratified by age, census division, education, ethnicity, gender, income, neighborhood, political ideology, and sexual orientation. Participants completed interviews with the AI interviewer, along with Qualtrics versions of the General Social Survey (GSS), Big Five personality inventory, economic games, and selected experimental studies. For the GSS, we focused on 177 questions for the “core” module, excluding non-categorical questions, questions with more than 25 response options, and conditional questions.”

“For the GSS, the generative agents predicted participants’ responses with an average normalized accuracy of 0.85 (std = 0.11), calculated from a raw accuracy of 68.85% (std = 6.01) divided by participants’ replication accuracy of 81.25% (std = 8.11). These interview-based agents significantly outperformed both demographic-based and persona-based agents (Figure 2), with a margin of 14-15 normalized points. The demographic-based generative agents achieved a normalized accuracy of 0.71 (std = 0.11), while persona-based agents reached 0.70 (std = 0.11). An ANOVA of the accuracy rates rejected the null hypothesis of no significant difference (F(2, 3153) = 989.62, p < 0.001), and post-hoc pairwise Tukey tests confirmed that the interview-based agents outperformed the other two groups.”

Park, J. S., Zou, C. Q., Shaw, A., Hill, B. M., Cai, C., Morris, M. R., ... & Bernstein, M. S. (2024). Generative agent simulations of 1,000 people. arXiv preprint arXiv:2411.10109.

https://arxiv.org/abs/2411.10109
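
The normalization in the GSS result quoted above is simple arithmetic: the agents' raw accuracy divided by how consistently participants replicate their own answers. Here is a minimal Python sketch using only the figures quoted above. The function name is my own, and dividing the two reported means only approximates the paper's number if the authors averaged per-participant ratios instead.

```python
def normalized_accuracy(raw_accuracy: float, replication_accuracy: float) -> float:
    """Agent accuracy as a fraction of participants' own replication accuracy."""
    if replication_accuracy <= 0:
        raise ValueError("replication accuracy must be positive")
    return raw_accuracy / replication_accuracy


# Figures quoted from Park et al. (2024) for the GSS, in percent.
interview = normalized_accuracy(68.85, 81.25)
print(round(interview, 2))  # 0.85

# Gap between interview-based agents (0.85) and the demographic-based (0.71)
# and persona-based (0.70) baselines, in normalized points.
margins = [round((0.85 - baseline) * 100) for baseline in (0.71, 0.70)]
print(margins)  # [14, 15]
```

The same ceiling logic sits behind the "88% of the test-retest benchmark" figure in the Twin-2K-500 item: people do not perfectly agree with their own earlier answers, so that self-agreement is the practical upper bound.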

World Modeling with Probabilistic Structure Integration

Visionary

“In this work, we propose Probabilistic Structure Integration (PSI), a generic process for building richly controllable world models with the right interfaces for true interactivity – even when starting from low-level nonlinguistic raw input data. PSI consists of three steps that form a virtuous cycle. Step 1 involves the training of a Probabilistic predictor Ψ that answers flexible visual queries at a local, low level of detail–close to raw data. Step 2 involves extracting Structures from the predictor Ψ, corresponding to natural intermediate features for understanding the world – effectively, new token types that transcend the low-level raw inputs. Step 3 involves Integrating those new token types back into the predictor Ψ, using them as both new conditioning sources and prediction targets. The PSI process can be iterated, with each cycle enlarging the model’s control surfaces, enabling the construction of further new structures, and steadily increasing the predictive fidelity of the model itself. A key feature of PSI is that the model interface remains the same throughout the process, with no need for expanded architectures of complex new types as the model self-enriches.”

“But where should these intermediate quantities come from? Of course, it might be possible to use third-party supervised extractors to estimate them as needed. However, it turns out that a more conceptually satisfying and robustly performant approach is available. A true world model can be prompted to perform tasks it was not explicitly trained for. In Section 2 we explored various atomic conditioning prompts that allow us to control the generation of the model. Now we will show how we can compose several of those prompts to extract intermediate structures in a zero-shot fashion from Ψ.”

“In this work we show that spatially global structures (e.g. motion-derived object segments) can be extracted in a natural fashion using the PSI techniques. However, we only provide evidence of “closing the cycle” by integration gains for a local quantity (flow). We do not yet take advantage of global structures to create “object-centric” predictors. It is hoped that doing so would lead to both prediction performance and control precision gains. Ultimately, it will be useful to simultaneously integrate all the various intermediates we extract (e.g. flow, depth, segments).”

“Does PSI have anything to do with human neuroscience or cognition? Frankly, we don’t know yet. However, this is an obvious target for future work, both because it is an interesting class of questions in itself, and because we as a team are, in other lines of work, often engaged in comparisons of AI models to human behaviors and brains.”

Kotar, K., Lee, W., Venkatesh, R., Chen, H., Bear, D., Watrous, J., ... & Yamins, D. (2025). World Modeling with Probabilistic Structure Integration. arXiv preprint arXiv:2509.09737.

https://arxiv.org/pdf/2509.09737
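
To make the three-step cycle quoted above concrete, here is a toy Python sketch of its control flow only. Everything in it is a hypothetical stand-in: `ToyPredictor`, `train`, `extract_structure`, `integrate`, and `run_psi` are names I made up, the training and extraction steps are placeholders, and none of it is the authors' implementation. Note also that the paper reports closing the cycle only for flow so far.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPredictor:
    """Hypothetical stand-in for the predictor; it only tracks known token types."""
    token_types: set = field(default_factory=lambda: {"rgb"})

    def query(self, condition_on: str, predict: str) -> str:
        # The interface stays the same every cycle: any known token type
        # can be a conditioning source or a prediction target.
        unknown = {condition_on, predict} - self.token_types
        if unknown:
            raise KeyError(f"unknown token types: {sorted(unknown)}")
        return f"p({predict} | {condition_on})"


def train(psi: ToyPredictor) -> ToyPredictor:
    """Step 1 (placeholder): fit the predictor on its current token types."""
    return psi


def extract_structure(psi: ToyPredictor, name: str) -> str:
    """Step 2 (placeholder): derive a new structure by prompting the predictor."""
    assert "rgb" in psi.token_types  # extraction starts from raw input
    return name


def integrate(psi: ToyPredictor, structure: str) -> ToyPredictor:
    """Step 3: add the structure as a new conditioning source and target."""
    psi.token_types.add(structure)
    return psi


def run_psi(structures: list) -> ToyPredictor:
    psi = ToyPredictor()
    for name in structures:  # one pass through the loop is one PSI cycle
        psi = train(psi)
        psi = integrate(psi, extract_structure(psi, name))
    return psi


psi = run_psi(["flow", "depth", "segments"])
print(sorted(psi.token_types))  # ['depth', 'flow', 'rgb', 'segments']
print(psi.query(condition_on="flow", predict="rgb"))  # p(rgb | flow)
```

The point of the sketch is the last property in the quote: `query` never changes shape; only the set of token types it accepts grows.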

"Hello, is this Anna?": Unpacking the Lifecycle of Pig-Butchering Scams

LongCon rebranded

“Pig-butchering scams have emerged as a complex form of fraud that combines elements of romance, investment fraud, and advanced social engineering tactics to systematically exploit victims. In this paper, we present the first qualitative analysis of pig-butchering scams, informed by in-depth semistructured interviews with N = 26 victims.”

“Pig-Butchering scams begin with the scammer establishing contact with the victim through various online platforms. Scammers create convincing personas and use unsolicited messages or interactions on dating apps and social media to engage the target. This phase is characterized by the construction of trust through seemingly incidental or calculated interactions.”

“After a bond has been established, the scammer introduces a fraudulent investment opportunity, typically framed as a low-risk, high-reward endeavor, such as cryptocurrency trading. The scammer uses fabricated success stories and social proof to persuade the victim of the legitimacy of the offer. This phase is designed to exploit the trust and rapport built in earlier stages, positioning the scammer as a trusted financial advisor or insider.”

“Scammers persuade victims to invest in fraudulent platforms that mimic legitimate financial services. Initial investments are typically small, with fabricated returns used to encourage further contributions. The scammer maintains ongoing communication, reinforcing the illusion of success and increasing the victim’s financial commitment by encouraging reinvestment of perceived gains.”

“Once the victim is fully engaged, scammers escalate pressure by introducing time-sensitive, high-return investment opportunities, often requiring large additional investments. Emotional manipulation intensifies, with scammers leveraging the victim’s trust and fear of missing out (FOMO) to secure further financial commitments. This phase aims to extract as much financial value as possible from the victim.”

“After squeezing the victim as much as possible, scammers abscond with the victim’s funds, typically after the victim attempts to withdraw their investments. Scammers employ stalling tactics, such as claiming technical issues or additional fees, to delay withdrawal. Once the victim realizes they have been defrauded, the scammer cuts all communication, disappearing from both the financial platform and social media.”

“In some cases, scammers re-engage victims by posing as law enforcement or recovery agents, offering assistance in recovering lost funds for a fee. This phase exploits the victim’s emotional vulnerability and desperation, further compounding their financial losses.”

“Replacing the term "pig-butchering" with more neutral, non-stigmatizing terminology could encourage victims to come forward and report their experiences. Adopting terminology that accurately describes the manipulative tactics without demeaning victims can reduce the stigma associated with reporting and discussing the scam. For example, terms like Grooming Investment Scam, LongCon Investment Fraud or Groom-and-Swindle Scheme capture the essence of the scam in a non-derogatory way.”

Oak, R., & Shafiq, Z. (2025). "Hello, is this Anna?": Unpacking the Lifecycle of Pig-Butchering Scams. arXiv preprint arXiv:2503.20821.

https://arxiv.org/abs/2503.20821

Organization design: Current insights and future research directions

“an important source of understanding of what it means to organize”

“Finally, contemporary studies point out that organizations may need to employ either structural or cognitive integration mechanisms to compensate for or to integrate divergent representations. For example, hierarchy can aid in information processing, promoting search (Gavetti, 2005), creating collective (shared) interpretations of a changing environment, and coming up with effective solutions (Lee & Csaszar, 2020). Other integration mechanisms that organizations may employ include: standards of action (Dougherty, 2001), which are vivid, simple representations of value creation that frame work and that are reenacted in practice; normative goal frames (Lindenberg & Foss, 2011), which are focal goals that reflect desired improvements for an individual’s group or unit;”

“Given its forward-looking and normative approach, it is only natural that the field of organization design should continue to provide a platform for a variety of perspectives, methods, and theories and for the expression of design as an area of scientific inquiry that provides an important source of understanding of what it means to organize.”

Joseph, J., & Sengul, M. (2025). Organization design: Current insights and future research directions. Journal of Management, 51(1), 249-308.

https://journals.sagepub.com/doi/full/10.1177/01492063241271242

Weights and Biases

Build better models faster

Use W&B to build better models faster. Track and visualize all the pieces of your machine learning pipeline, from datasets to production machine learning models.

https://github.com/wandb/wandb
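
For a sense of what "track and visualize" looks like in practice, here is a minimal sketch of logging metrics with the `wandb` Python package. The metric values are made up, and the project name `newsletter-demo` is a placeholder of mine. It assumes `pip install wandb` and uses offline mode so that no account is needed.

```python
import math


def toy_metrics(epochs: int):
    """Yield made-up loss/accuracy values so there is something to log."""
    for epoch in range(epochs):
        loss = math.exp(-0.5 * epoch)
        yield {"epoch": epoch, "loss": loss, "accuracy": 1.0 - loss / 2}


def log_toy_run(epochs: int = 5) -> None:
    """Log the toy metrics to a W&B run (requires the wandb package)."""
    import wandb  # imported here so toy_metrics works without wandb installed

    # mode="offline" writes the run locally instead of syncing to a server.
    run = wandb.init(
        project="newsletter-demo",  # hypothetical project name
        config={"epochs": epochs},
        mode="offline",
    )
    for metrics in toy_metrics(epochs):
        run.log(metrics)  # one call per step; W&B charts each key over time
    run.finish()
```

Calling `log_toy_run()` creates a local run directory; dropping `mode="offline"` (after `wandb login`) sends the same run to the hosted dashboard.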

Reader Feedback

“That ‘why models hallucinate’ paper read like a press release.”

Footnotes

The data quality problem hasn't been solved.

Undoped, LLMs top out around 88% accuracy. This week's digital twin study from Toubia et al. in Marketing Science is excellent. And look at the error bars! Those sure look like social science error bars to me, not physics error bars.

Remind you of another accuracy asymptote?

Text classification classic, right? Didn't we just spend the last 17 years wrestling with this?

Plus ça change, plus c'est la même chose.

That's right, all the mêmes.
