CLUSTER · TIER 2
Nous Research releases Token Superposition Training for 2-3x LLM pretraining speedup
Nous Research released Token Superposition Training (TST), a modification to the standard LLM pretraining loop that achieves a 2–3× wall-clock speedup at matched FLOPs without changing model architecture, optimizer, tokenizer, or training data. During the first third of training, the model reads and predicts contiguous bags of tokens using averaged embeddings, then switches to standard next-token prediction for the remainder of the run.
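The two-phase scheme described above can be sketched as follows. This is a hypothetical illustration, not Nous Research's implementation: the function names, the bag size, and the multi-hot target construction are all assumptions; the source specifies only "contiguous bags of tokens", "averaged embeddings", and the one-third phase switch.

```python
import numpy as np

def superposed_inputs(token_ids, embed, bag_size):
    """Phase 1 input: average the embeddings of each contiguous bag of tokens.

    Instead of one embedding per position, the model sees one averaged
    ("superposed") embedding per bag of `bag_size` consecutive tokens.
    """
    n = len(token_ids) // bag_size * bag_size          # drop the ragged tail
    ids = np.asarray(token_ids[:n]).reshape(-1, bag_size)
    return embed[ids].mean(axis=1)                     # (num_bags, d_model)

def bag_targets(token_ids, bag_size, vocab_size):
    """Phase 1 target (assumed): multi-hot vector over each bag's tokens,
    since the model predicts a bag rather than a single next token."""
    n = len(token_ids) // bag_size * bag_size
    ids = np.asarray(token_ids[:n]).reshape(-1, bag_size)
    targets = np.zeros((ids.shape[0], vocab_size))
    np.put_along_axis(targets, ids, 1.0, axis=1)
    return targets

def training_phase(step, total_steps):
    """Superposition for the first third of training, then standard NTP."""
    return "superposition" if step < total_steps // 3 else "next_token"

# Toy example: vocabulary of 10 tokens, 4-dim embeddings, bags of 3.
rng = np.random.default_rng(0)
embed = rng.standard_normal((10, 4))
seq = [1, 2, 3, 4, 5, 6, 7]                            # last token dropped
x = superposed_inputs(seq, embed, bag_size=3)
y = bag_targets(seq, bag_size=3, vocab_size=10)
```

With `bag_size=3`, a sequence of 7 tokens yields two superposed inputs (bags `1,2,3` and `4,5,6`), so the model processes roughly a third as many positions during phase 1, which is where the claimed wall-clock savings would come from.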
Sources: 1
X mentions: 11k ▲
First seen: 4d ago
Velocity: +2%/6h