Attention is all you need — but do you need all of it?

Neural Networks Nov 2024

The original transformer paper made attention mechanisms the center of everything. And it was right — attention is remarkably general. But “all you need” turned into “use as much as you can afford,” and that’s a different claim.

Working on identity resolution at scale, I’ve spent a lot of time thinking about where attention actually helps versus where it’s just expensive. For structured tabular data with known field semantics, a well-designed field fusion layer often outperforms full self-attention at a fraction of the compute. You already know what the fields mean. You don’t need the model to figure that out.

Attention is powerful because it lets the model discover relationships. But when you already know the relationships — when the schema is your prior — sometimes the right move is to encode that knowledge explicitly and save the attention budget for the parts you don’t know.

This isn’t an argument against transformers. It’s an argument for thinking about what you’re actually asking the model to learn.