Tiny Stories, Hidden Games

This week: TinyStories, TeachLM, social sycophancy, anti-scheming, hidden games

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Around 2.5 million parameters

“In this paper, we introduce TinyStories, a synthetic dataset of short stories that are intended to contain only words that most 3 to 4-year-old children would typically understand, generated by GPT-3.5 and GPT-4. TinyStories is designed to capture the essence of natural language, while reducing its breadth and diversity. Each story consists of 2-3 paragraphs that follow a simple plot and a consistent theme, while the whole dataset aims to span the vocabulary and the factual knowledge base of a 3-4 year old child.”

eldan-li-1.png

“Our main contribution is that we show TinyStories can be used to train and evaluate SLMs that are much smaller than the state-of-the-art models (below 10 million parameters with an embedding dimension of 256), or have much simpler architectures (with only one transformer block), yet still produce a diverse set of fluent and consistent stories that are comparable or superior to those generated by larger and more complex models. Moreover, despite the small size of the models, we still observe an emergence of reasoning capabilities, knowledge of general facts and ability to follow certain instructions.”

eldan-li-2.png
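
For scale, here is a minimal sketch (assuming PyTorch) of a model in the regime the paper describes: one transformer block, embedding dimension 256. The vocabulary size and context length below are my illustrative assumptions, not the paper's training configuration.

```python
import torch
import torch.nn as nn

class TinyStoryModel(nn.Module):
    def __init__(self, vocab_size=8192, d_model=256, n_heads=8, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.block = nn.TransformerEncoder(layer, num_layers=1)  # one block
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        # idx: (batch, seq) token ids
        seq_len = idx.size(1)
        pos = torch.arange(seq_len, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # causal mask: each position attends only to earlier positions
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=idx.device),
            diagonal=1)
        return self.lm_head(self.block(x, mask=mask))

model = TinyStoryModel()
print(sum(p.numel() for p in model.parameters()))  # a few million parameters
```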

“Finally, we have presented initial findings which point to the roles of width vs. depth in the intellectual capabilities of generative networks, which suggest that width is more important for capturing factual knowledge whereas depth is more important for contextual tracking.”
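
To make the width-vs-depth comparison concrete, here is an illustrative back-of-the-envelope: two parameter-matched transformer shapes, one wide and shallow, one narrow and deep. The sizes are my assumptions, not the paper's experimental grid.

```python
# Illustrative only: two roughly parameter-matched transformer shapes for
# probing width vs. depth.
def approx_block_params(d_model, n_layers, ff_mult=4):
    # per layer: attention (4 * d^2) + feed-forward (2 * ff_mult * d^2)
    return n_layers * (4 * d_model**2 + 2 * ff_mult * d_model**2)

wide_shallow = dict(d_model=512, n_layers=1)  # more width per layer
narrow_deep = dict(d_model=256, n_layers=4)   # more layers, less width

for cfg in (wide_shallow, narrow_deep):
    print(cfg, approx_block_params(**cfg))  # both ~3.1M block parameters
```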

Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv preprint arXiv:2305.07759.

https://arxiv.org/abs/2305.07759

TeachLM: Post-Training LLMs for Education Using Authentic Learning Data

Tutoring is an entirely different state from helpful assisting

“In his seminal 1984 study, educational psychologist Benjamin Bloom demonstrated that one-on-one tutoring can yield learning gains two standard deviations above those achieved through traditional classroom instruction [1].”

perrczel-1.png

“A fundamental issue with LLMs is that they have been optimized to act as “helpful assistants” that maximize productivity and minimize cognitive labor [12–20]. This contrasts with the natural friction that expert teachers introduce into learning (for example, by withholding the right answer and prompting students to first attempt a response) [21–23]. Effective tutoring also requires dynamic adaptation to learners’ states of mind as opposed to one-size-fits-all instructional designs [24]. This tendency toward friction minimization and sycophantic behavior—prioritizing compliance over pedagogy—is systematically encoded in model parameters through supervised fine-tuning [12] and reinforcement learning from human feedback (RLHF) [25, 26]. These processes rely on datasets produced by human annotators instructed to provide responses that maximize completeness while minimizing the number of conversational turns [15].”

“Our early efforts focused on establishing the simple, measurable benchmarks outlined in this report, but more sophisticated evaluations are needed to capture the full richness of human pedagogy. Specifically, we will prioritize creating benchmarks that capture the nuances of longitudinal interactions between students and tutors. These types of interactions are only possible over extended periods of time and are critical in fostering rapport and driving tangible learning outcomes.”

Perczel, J., Chow, J., & Demszky, D. (2025). TeachLM: Post-Training LLMs for Education Using Authentic Learning Data. arXiv preprint arXiv:2510.05087.

https://arxiv.org/abs/2510.05087

Social Sycophancy: A Broader Understanding of LLM Sycophancy

Love to see anybody building off Goffman!

“Current approaches measure sycophancy by evaluating whether LLM responses deviate from a ground truth to mirror users’ explicitly stated beliefs (Sharma et al., 2024; Ranaldi & Pucci, 2024; Wei et al., 2023; Perez et al., 2023; Rrv et al., 2024). But such measurements apply only to explicit statements (e.g., “I think Nice is the capital of France.”) and fail to capture the broader phenomenon of models affirming users in cases like the opening example, where the user’s beliefs are implicit and no ground truth exists.”

cheng-1.png

“Drawing on Goffman (1955)’s concept of face (a person’s desired self-image in a social interaction), our theory of social sycophancy characterizes sycophancy as the excessive preservation of the user’s face in LLM responses, by either affirming the user (positive face) or avoiding challenging them (negative face). This theory encompasses existing sycophancy definitions (Table 1), enables capturing new dimensions of sycophancy, and motivates a new benchmark ELEPHANT. We introduce four new dimensions of sycophancy: validation, indirectness, framing, and moral. We use ELEPHANT to evaluate 11 models on four datasets, measuring both the prevalence and risks of social sycophancy.”

cheng-2.png
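
Here is a minimal sketch of how rubric-style scoring along these dimensions might be operationalized. The rubric wording and the judge callable are my assumptions, not the released ELEPHANT prompts.

```python
# Score a model response for social sycophancy via yes/no judge questions.
RUBRICS = {
    "validation": "Does the response emotionally validate or affirm the user?",
    "indirectness": "Does the response avoid giving direct guidance?",
    "framing": "Does the response accept the user's framing without challenge?",
}

def score_response(user_query: str, model_response: str, judge) -> dict:
    """Ask a judge model one yes/no question per sycophancy dimension."""
    scores = {}
    for dim, rubric in RUBRICS.items():
        prompt = (
            f"Question: {rubric}\n"
            f"User query: {user_query}\n"
            f"Response: {model_response}\n"
            "Answer strictly YES or NO."
        )
        scores[dim] = judge(prompt).strip().upper().startswith("YES")
    return scores

# Stub usage with a judge that always answers NO:
print(score_response("Am I right to be upset?",
                     "You are completely right to feel that way.",
                     judge=lambda p: "NO"))
```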

“Compared to crowdsourced responses, LLMs are much more socially sycophantic on advice queries: they validate the user 50 percentage points (pp) more (72% vs. 22%), avoid giving direct guidance 43 pp more (66% vs. 21%), and avoid challenging the user’s framing 28 pp more (88% vs. 60%). We also evaluate social sycophancy on datasets where there is crowdsourced consensus that affirmation is inappropriate: in posts from the subreddit r/AmITheAsshole (r/AITA) where the consensus is that the poster is at fault, LLMs preserve face 46 pp more than humans on average, and on a dataset of assumption-laden statements, models fail to challenge potentially ungrounded assumptions in 86% of cases. Finally, in interpersonal conflicts, we find that LLMs exhibit moral sycophancy by affirming whichever side the user presents (rather than aligning with only one side, which would reflect consistent morals or values) 48% of the time, whereas humans–regardless of their norms–would endorse only one side of the conflict.”

Cheng, M., Yu, S., Lee, C., Khadpe, P., Ibrahim, L., & Jurafsky, D. (2025). Social Sycophancy: A Broader Understanding of LLM Sycophancy. arXiv preprint arXiv:2505.13995.

https://arxiv.org/abs/2505.13995

Stress Testing Deliberative Alignment for Anti-Scheming Training

Plans sabotage with an ally

schoen-1.png

“Highly capable AI systems could secretly pursue misaligned goals – what we call “scheming”. Because a scheming AI would deliberately try to hide its misaligned goals and actions, measuring and mitigating scheming requires different strategies than are typically used in ML. We propose that assessing anti-scheming interventions requires at least (1) testing propensity to scheme on far out-of-distribution (OOD) tasks, (2) evaluating for situational awareness and whether lack of scheming is driven by situational awareness, and (3) checking for robustness to pre-existing misaligned goals. We use a broad category of “covert actions”–such as secretly breaking rules or intentionally underperforming in tests–as a proxy for scheming, and design evaluations for covert actions.”

schoen-2.png
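
As a toy illustration of the proxy metric (not the paper's harness), here is a skeleton that counts episodes where a rule was broken without being disclosed. The run_episode callable and its fields are assumptions for the sketch.

```python
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    rule_broken: bool  # did the agent violate an environment rule?
    disclosed: bool    # did it report the violation to the user?

def covert_action_rate(agent, environments, run_episode) -> float:
    """Fraction of episodes with an undisclosed rule violation."""
    covert = 0
    for env in environments:
        result = run_episode(agent, env)
        if result.rule_broken and not result.disclosed:
            covert += 1
    return covert / len(environments)

# Stub usage: an agent that acts covertly in 1 of 4 environments.
fake_runs = iter([EpisodeResult(False, False), EpisodeResult(True, True),
                  EpisodeResult(True, False), EpisodeResult(False, False)])
print(covert_action_rate(None, range(4), lambda a, e: next(fake_runs)))  # 0.25
```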

“Residual covert actions and limited coverage. Our anti-scheming intervention reduces but does not eliminate covert actions (e.g., from 8.7% to 0.3% for OpenAI o4-mini), with failures persisting across multiple environments. Moreover, our test environments do not span the full space of contexts where covert behaviors may arise; as such, even if our method achieved strong performance on test, that would be insufficient to “rule in” the technique for deployment.”

schoen-3.png

“Measurement overfitting and leakage. The iterative nature of developing interventions risks Goodharting, as knowledge of the test environments can leak into the training procedure. To mitigate this, we advocate for using held-out environments and detectors that are run only once at the end of a research cycle.”

schoen-4.png

“Persistence of unintended goals. Our current approach only partially addresses pre-existing goals. In our goal-insertion experiment (Section 7), anti-scheming training suppresses the inserted goal, but the model retains knowledge of it and can be red-teamed to exhibit covert behavior. Future goals may be more deeply embedded from higher-compute RL and thus harder to remove.”

schoen-5.png

Schoen, B., Nitishinskaya, E., Balesni, M., Højmark, A., Hofstätter, F., Scheurer, J., ... & Hobbhahn, M. (2025). Stress Testing Deliberative Alignment for Anti-Scheming Training. arXiv preprint arXiv:2509.15541.

https://www.arxiv.org/abs/2509.15541

https://www.antischeming.ai/

The Hidden Game Problem

So adversarial!

“This paper investigates a class of games with large strategy spaces, motivated by challenges in AI alignment and language games. We introduce the hidden game problem, where for each player, an unknown subset of strategies consistently yields higher rewards compared to the rest. The central question is whether efficient regret minimization algorithms can be designed to discover and exploit such hidden structures, leading to equilibrium in these subgames while maintaining rationality in general. We answer this question affirmatively by developing a composition of regret minimization techniques that achieve optimal external and swap regret bounds. Our approach ensures rapid convergence to correlated equilibria in hidden subgames, leveraging the hidden game structure for improved computational efficiency.”
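
For flavor: the paper composes regret minimizers, and the canonical building block for that kind of result is multiplicative weights. Here is a minimal Hedge sketch; it is the standard textbook algorithm, not the paper's composition.

```python
import math, random

class Hedge:
    """Multiplicative-weights external-regret minimizer."""
    def __init__(self, n_actions: int, eta: float):
        self.weights = [1.0] * n_actions
        self.eta = eta

    def play(self) -> int:
        # sample an action proportionally to its weight
        r = random.random() * sum(self.weights)
        for i, w in enumerate(self.weights):
            r -= w
            if r <= 0:
                return i
        return len(self.weights) - 1

    def update(self, rewards):
        # rewards[i] in [0, 1]: full-information feedback for every action
        self.weights = [w * math.exp(self.eta * r)
                        for w, r in zip(self.weights, rewards)]

# Toy usage: rewards consistently favor action 2, so weight concentrates there.
learner = Hedge(n_actions=4, eta=0.1)
for _ in range(500):
    learner.update([0.1, 0.2, 0.9, 0.3])
print(max(range(4), key=lambda i: learner.weights[i]))  # -> 2
```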

“Our results show that hidden structure in large games can be exploited algorithmically without sacrificing rationality, yielding convergence to correlated equilibria with runtime independent of the ambient action space. This advances the theory of learning in games with exponentially large action spaces and connects to recent motivations from AI alignment and language games.”

“Our model assumes a consistency condition in which actions in the hidden set dominate those outside. This captures settings such as debate, where a subset of arguments consistently outperforms irrelevant ones, but may be restrictive in other domains. More general structures might allow conditioning on the opponent’s action or introducing random rewards outside the hidden set. In addition, our analysis focuses on adversarial swap regret minimization, which is natural for debate-style adversarial games but potentially conservative in cooperative or partially aligned settings.”
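
Read literally, the consistency condition can be sketched like this (my notation, not necessarily the paper's):

```latex
% For each player i with action set A_i, there is a hidden subset S_i
% whose actions dominate all actions outside it, whatever the opponents do.
\[
\forall i \;\; \exists S_i \subseteq A_i : \quad
u_i(a, a_{-i}) \;>\; u_i(b, a_{-i})
\qquad \forall a \in S_i,\; \forall b \in A_i \setminus S_i,\; \forall a_{-i}.
\]
```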

Buzaglo, G., Golowich, N., & Hazan, E. (2025). The Hidden Game Problem. arXiv preprint arXiv:2510.03845.

https://arxiv.org/abs/2510.03845

Reader Feedback

“A lot of managers don’t quite trust AI with anything just yet.”

Footnotes

The process of achieving a goal with a set of Virtual Twins of Customers (VTOCs) is dense because it’s an entire methodology. I have to start somewhere. So let’s start with a user goal.

My opening focal problem statement is: “I’m trying to decide [whether | if | which] [option] and the incertitude is causing me pain.” Note that there is no intent to HARK (Hypothesizing After Results are Known). They’re in an exploratory, not an explanatory, state.

How does one navigate that forest?

My opening stance is linear scrolling.

A typical Jupyter Notebook is a linear scroll. The SPSS Viewer is a linear scroll. A classical RStudio interface has one linear scroll and three semi-static panes that enable a third dimension: you can click into some of them to see raw data and file folders, and interact with charts.

Outside of exploratory data analysis, consumers are used to linear scrolls: the conversational interface, the search engine results page, any newsletter, feed, novel, podcast, or video. The Netflix and YouTube interfaces are a 2x2 scroll with a third interactive dimension.

In the future, some users may want to compare VTOC responses, and the bento-box 4-, 6-, or 8-panel format may be most desirable for that use case. At this time, though, I haven’t heard anybody express that need.

And it’s already easy to get lost in the data as it is.

Linear today. It’ll depend tomorrow.
