Pandora Opens It

This week: continual learning trajectories, Pandora’s Regret, DFlash, co-packaged optics supply chains, and some PyTorch

Continual Learning Requires Evaluating Trajectories

Good idea!

“AI systems increasingly incorporate continual learning mechanisms allowing their behaviour to adapt after deployment, from (1) in-context learning and (2) memory features already in wide use to (3) post-deployment weight modification under research. We argue that, by treating AI systems as frozen artefacts whose performance and safety are assessed at release, current evaluation practices structurally ignore the behavioural trajectory of a system that continues to learn from experience. Our position is that evaluation of continual learning systems should be centred on behavioural trajectories, with the complementary goals of characterising the landscape of possible behaviours and forecasting how behaviour will evolve from a given set of experiences. This can be operationalised through trajectory elicitation sandboxes and predictive monitors that forecast behavioural evolution, but may face fundamental obstacles analogous to those seen in dynamical systems. These are best addressed by (1) applying trajectory-centred evaluation to today's continual learning systems and (2) relying on the resulting evidence to design systems amenable to it, yielding a virtuous cycle in which systems and their evaluations co-evolve.”
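To make “forecasting how behaviour will evolve” and the dynamical-systems obstacle a bit more concrete, here is a toy sketch. It is not the paper’s method. It assumes a single scalar “behaviour” w that is updated online from scalar experiences, and it compares a contractive update rule with an expansive one. All names (`run_trajectory`, `sandbox`, `sensitivity`) and constants (step size, gain, horizon) are hypothetical.

```python
import random
import statistics
from typing import Callable, List

Update = Callable[[float, float], float]  # (state, experience) -> new state


def run_trajectory(update: Update, experiences: List[float], w0: float = 0.0) -> List[float]:
    """Return the behaviour trajectory [w0, w1, ..., wT] of a toy scalar learner."""
    traj = [w0]
    for x in experiences:
        traj.append(update(traj[-1], x))
    return traj


def contractive(w: float, x: float, eta: float = 0.2) -> float:
    # Blend the old state with the new experience; sensitivity to w is (1 - eta) < 1.
    return (1 - eta) * w + eta * x


def expansive(w: float, x: float, gain: float = 1.1) -> float:
    # Amplify the old state; sensitivity to w is gain > 1.
    return gain * w + x


def sandbox(update: Update, n_runs: int = 200, horizon: int = 50, seed: int = 0) -> float:
    """Sample many experience sequences; return the spread (std) of final behaviours."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_runs):
        experiences = [rng.uniform(-1, 1) for _ in range(horizon)]
        finals.append(run_trajectory(update, experiences)[-1])
    return statistics.pstdev(finals)


def sensitivity(update: Update, horizon: int = 50, seed: int = 0, delta: float = 1e-3) -> float:
    """Same experiences, slightly perturbed start: how far apart do final behaviours end up?"""
    rng = random.Random(seed)
    experiences = [rng.uniform(-1, 1) for _ in range(horizon)]
    a = run_trajectory(update, experiences, w0=0.0)[-1]
    b = run_trajectory(update, experiences, w0=delta)[-1]
    return abs(a - b)


print(sensitivity(contractive), sensitivity(expansive))
print(sandbox(contractive), sandbox(expansive))
```

Because both toy rules are linear, the starting gap is simply multiplied by 0.8 or 1.1 per step: after 50 steps it shrinks to roughly 1.4e-8 under the contractive rule and grows to roughly 0.12 under the expansive one, even though both systems saw identical experiences.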

Pacciardi-1.png

“Evaluators should start trajectory-centred evaluation on today's systems, extending current CL1 and CL2 benchmarks into full sandbox-and-monitor suites that yield evidence on where chaos and multi-attractor regimes actually arise. Based on these findings, developers should design CL mechanisms amenable to evaluation, through directions such as contractive update rules, choice of intrinsic objectives (e.g., curiosity or novelty), gated adaptation, and circuit-breakers that pause or roll back learning.”
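A circuit-breaker that rolls back learning fits in a few lines. This is a hypothetical toy, not a mechanism from the paper: `GuardedLearner`, the drift threshold, and treating the scalar weight itself as the “behaviour” are all assumptions for illustration. A real monitor would score the system on a probe set rather than read off a single number.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GuardedLearner:
    """Toy continual learner with a hypothetical drift circuit-breaker."""
    w: float = 0.0
    eta: float = 0.2          # step size of a contractive update
    max_drift: float = 0.5    # allowed |behaviour - reference| before tripping
    reference: float = 0.0    # behaviour recorded at release time
    paused: bool = False
    checkpoints: List[float] = field(default_factory=list)

    def behaviour(self) -> float:
        # Stand-in for a probe-set evaluation; here behaviour is simply w.
        return self.w

    def learn(self, x: float) -> bool:
        """Apply one update; return True if kept, False if paused or rolled back."""
        if self.paused:
            return False                     # breaker already tripped: no more learning
        self.checkpoints.append(self.w)      # snapshot before updating
        self.w = (1 - self.eta) * self.w + self.eta * x
        if abs(self.behaviour() - self.reference) > self.max_drift:
            self.w = self.checkpoints.pop()  # roll back the offending update
            self.paused = True               # pause until someone reviews it
            return False
        return True


learner = GuardedLearner()
results = [learner.learn(x) for x in [0.5, 0.5, 5.0, 0.5]]
print(results, round(learner.w, 2), learner.paused)  # [True, True, False, False] 0.18 True
```

The outlier experience (5.0) would push behaviour to about 1.14, past the 0.5 drift budget, so the update is undone and learning stops; the later, harmless 0.5 is ignored until the pause is lifted.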

Picciardi-2.png

Pacchiardi et al. (2026). Continual Learning Requires Evaluating Trajectories.

https://cl-eval.github.io/

Pandora’s Regret

If only she didn’t know

“When a patient arrives at the emergency room with abdominal pain, a physician does not test for every condition simultaneously. Instead, diagnosis proceeds as a sequential search: conditions are evaluated in an order determined by their likelihood and the cost of testing.”
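The ordering in the emergency-room example has a classic textbook form, which is background rather than the paper’s contribution. Assume testing class k costs c_k, the true class is k with probability p_k, and a test always reveals whether k is the true class. Then testing in decreasing order of p_k / c_k minimises expected cost: swapping two adjacent tests shows that i should go before j exactly when p_i / c_i ≥ p_j / c_j. A minimal sketch with made-up numbers, checked against brute force:

```python
from itertools import permutations
from typing import List, Sequence


def expected_search_cost(order: Sequence[int], probs: Sequence[float],
                         costs: Sequence[float]) -> float:
    """Expected total testing cost when classes are tested in `order` until the true one is found."""
    total, spent = 0.0, 0.0
    for k in order:
        spent += costs[k]            # pay for testing class k
        total += probs[k] * spent    # with probability probs[k], the search ends here
    return total


def ratio_order(probs: Sequence[float], costs: Sequence[float]) -> List[int]:
    """Test classes in decreasing order of probability per unit cost."""
    return sorted(range(len(probs)), key=lambda k: probs[k] / costs[k], reverse=True)


# Hypothetical patient: condition 0 is most likely but expensive to test.
probs = [0.5, 0.3, 0.2]
costs = [10.0, 1.0, 1.0]

by_ratio = ratio_order(probs, costs)                        # [1, 2, 0]
by_likelihood = sorted(range(3), key=lambda k: -probs[k])   # [0, 1, 2]
best = min(permutations(range(3)), key=lambda o: expected_search_cost(o, probs, costs))

print(by_ratio, expected_search_cost(by_ratio, probs, costs))
print(by_likelihood, expected_search_cost(by_likelihood, probs, costs))
```

Here ordering by likelihood alone costs 10.7 in expectation versus 6.7 for the ratio order, which is also the brute-force optimum.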

“In sequential search, alternatives are tested until the true class is found. Standard proper scoring rules like log loss are local, ignoring the ranking of competitors and misaligning model evaluation with search utility. We show that sequential search induces a pairwise structure that overcomes this. By analyzing the expected cost of optimal search under varying testing costs, we derive Pandora’s Regret: a closed-form, pairwise-additive, and strictly proper scoring rule. Pandora’s Regret both elicits true probabilities and penalizes rank-reversing miscalibrations where distractors outrank the true class.”

“Our contribution is not to advance search theory, but to use the simplifying assumptions of the Pandora problem to give a decision theory-based scoring rule for sequential search with a simple closed-form pairwise structure.”

flores-1.png

“The probabilistic forecast derived from a classifier allows the sequencing of tests to be optimized for a particular patient, not just general population guidelines. Standard machine learning metrics do not model this process. They allow decision-theoretic interpretations, but correspond to expected costs from decision models where actions are taken simultaneously and independently. As a result, they do not reward good within-example ranking of classes and implicitly assume that decision thresholds can be chosen independently across labels. This ignores the difference between two forecasts that assign the same probability to the true class but distribute the remaining mass differently across competing classes: one may induce an efficient search order, while the other wastes far more resources testing the wrong alternatives first.”

“We provide focused empirical evidence on MedMNIST (Yang et al., 2021, 2023) that ranking by Pandora’s Regret selects better models than standard alternatives, as measured by downstream diagnostic costs. Despite using a uniform, i.i.d. cost model, Pandora’s Regret still correlates more closely with the task-specific model of diagnostic costs than standard alternatives.”

Flores, G. A., Deshpande, Y., Brea, J. R., & Wilson, A. C. (2026). Pandora's Regret: A Proper Scoring Rule for Evaluating Sequential Search. arXiv preprint arXiv:2605.01936.

https://arxiv.org/abs/2605.01936

DFlash: Block Diffusion for Flash Speculative Decoding

What is really going on here?

“In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. We show that speculative decoding provides a natural and effective setting for diffusion models. By generating draft tokens in a single forward pass, DFlash enables efficient drafting, and by conditioning the draft model on context features extracted from the target model, it achieves high-quality drafts with higher acceptance rates. Experiments show that DFlash achieves over 6× lossless acceleration across a range of models and tasks, delivering up to 2.5× higher speedup than the state-of-the-art speculative decoding method EAGLE-3.”
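The draft-then-verify loop behind speculative decoding can be sketched with toy deterministic “models”. This is an assumption-laden illustration, not DFlash. The block drafter is a hypothetical function standing in for the block diffusion model. Verification uses the simple greedy rule: keep the longest prefix the target agrees with, then take one token from the target. That rule is lossless only for greedy decoding; sampling-based speculative decoding uses a rejection-sampling acceptance rule instead.

```python
from typing import Callable, List, Tuple

Tokens = List[int]
Target = Callable[[Tokens], int]           # greedy target model: context -> next token
Drafter = Callable[[Tokens, int], Tokens]  # drafts a whole block in one call


def greedy_decode(target: Target, prompt: Tokens, n_new: int) -> Tokens:
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_decode(target: Target, drafter: Drafter, prompt: Tokens,
                       n_new: int, block_size: int) -> Tuple[Tokens, int]:
    """Return (tokens, verification steps). Each step counts as one target forward
    pass, since a real verifier scores all drafted positions in parallel; this toy
    simply calls target() position by position."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_new:
        draft = drafter(out, block_size)
        steps += 1
        accepted: Tokens = []
        for tok in draft:
            expected = target(out + accepted)
            accepted.append(expected)  # the target's token is always what we keep
            if tok != expected:
                break                  # first disagreement ends this block
        else:
            accepted.append(target(out + accepted))  # whole block accepted: bonus token
        out.extend(accepted)
    return out[: len(prompt) + n_new], steps


def count_up(ctx: Tokens) -> int:
    return (ctx[-1] + 1) % 10  # toy target model


def perfect_drafter(ctx: Tokens, k: int) -> Tokens:
    return [(ctx[-1] + i + 1) % 10 for i in range(k)]


def useless_drafter(ctx: Tokens, k: int) -> Tokens:
    return [0] * k


reference = greedy_decode(count_up, [3], 10)
fast, fast_steps = speculative_decode(count_up, perfect_drafter, [3], 10, block_size=4)
slow, slow_steps = speculative_decode(count_up, useless_drafter, [3], 10, block_size=4)
print(fast == reference, fast_steps)  # True 2
print(slow == reference, slow_steps)
```

The output matches plain greedy decoding whatever the drafter does; a good drafter only reduces the number of target passes, which is where the speedup comes from.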

D-Flash-1.png

“Overall, DFlash models trained with larger block sizes generalize well to smaller inference-time block sizes. This property enables dynamic block-size scheduling during inference to improve end-to-end efficiency. In practical serving scenarios, large blocks can increase verification cost under compute-bound settings (e.g., large batch sizes); reducing the block size in such cases can therefore yield better overall speedup. We leave adaptive block-size scheduling to future work.”
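The block-size trade-off can be illustrated with the textbook speculative-decoding cost model, under loudly stated assumptions. First, each drafted token is accepted independently with probability alpha, so a step with k drafted tokens yields (1 - alpha^(k+1)) / (1 - alpha) tokens on average. Second, drafting costs a fixed fraction of a target pass regardless of k, since the block is drafted in one forward pass. Third, verification cost grows linearly with k only when the target is compute-bound. The function names and all constants are hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Mean tokens emitted per verify step (accepted draft prefix plus one target token),
    assuming i.i.d. per-token acceptance with probability alpha, 0 <= alpha < 1."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha: float, k: int, draft_cost: float = 0.1,
            verify_cost_per_token: float = 0.0) -> float:
    """Tokens per unit cost, where 1.0 is one plain target forward pass (one token)."""
    # draft_cost: one parallel drafting pass, assumed independent of k (hypothetical value).
    # verify_cost_per_token: ~0 when memory-bound; > 0 when compute-bound (large batches).
    step_cost = draft_cost + 1.0 + verify_cost_per_token * k
    return expected_tokens_per_step(alpha, k) / step_cost


def best_block_size(alpha: float, verify_cost_per_token: float, max_k: int = 16) -> int:
    return max(range(1, max_k + 1),
               key=lambda k: speedup(alpha, k, verify_cost_per_token=verify_cost_per_token))


print(best_block_size(0.8, 0.0), best_block_size(0.8, 0.1))  # 16 6
```

Under these assumptions the best block size drops from the largest allowed (16) to 6 once verification stops being free, which matches the intuition for shrinking blocks in compute-bound serving. Real acceptance is neither i.i.d. nor position-independent, so treat this only as intuition.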

D-Flash-2.png

Chen, J., Liang, Y., & Liu, Z. (2026). DFlash: Block Diffusion for Flash Speculative Decoding. arXiv preprint arXiv:2602.06036.

https://arxiv.org/abs/2602.06036

Co-Packaged Optics Supply Chain

Isn’t it incredible that it works?

co-packaged-optics-1.png

https://leonardo-boquillon.com/photonic-cop-supply-chain

PyTorch Landscape

Useful directory

pytorch.png

https://pytorch.landscape2.io/

Reader Feedback

“Sometimes you just invent, combine, because you want to find out if something works. And if it does, then you wonder who would find it helpful in their lives.”

Footnotes

It’s Toronto Tech Week.

My best memory from last year’s edition was starting the morning listening to novel problems at an extremely polite breakfast, taking in a Hinton lecture, then a lawn party, and ending up at an event exclusively for Prince Edward Islanders that we just wandered into. I learned a lot.

The market for audiences this year may be a bit strange.

Outside of Tech Week, audiences using Luma, at the best of times, have a 50% show rate. Papers In The Park can be a bit more extreme at 20%. This creates an impression of demand that doesn’t truly exist. It’s ghost demand.

And yet some organizers, simply overwhelmed by the interest, have cancelled the physical event and switched to a Zoom call. Perhaps that misses the point of attending an event physically? Or perhaps there really is so much interest in the content that it has to be on Zoom and there will be a massive audience? I don’t know.

Some organizers are leaving it extremely late to shift people off pending status. Uncertain if they’re in or not, many attendees have over-subscribed to events in a given time slot. They may have signalled that they’re attending four events in the same hour.

Some events, like today’s U of T lecture and lawn party, are pretty certain to be full, while some of the counter-programming may struggle for an audience in spite of the overwhelming interest.

It’s likely more acute for the evening events.

There isn’t a better way.

Organizers are behaving rationally by keeping large lists on pending, so that they can let more people in as they anticipate a wave of courtesy cancellations. Some organizers only want specific types of audiences. Audiences only want to network with certain types of audiences. Out-of-town audiences are optimizing their time.

Everybody is an adult and they’re free to choose.

And that’s the organized anarchy of Toronto Tech Week.

Never miss a single issue

Be the first to know. Subscribe now to get the gatodo newsletter delivered straight to your inbox
