
"Match" is the wrong word; they don't match, they are duals. The residual stream, aka the identity mapping, needs to be the identity of the attention mechanism as the attention mechanism learns.

But this is the same for all residual streams, not just those in transformers.

Join my discord to discuss this further https://discord.gg/mr9TAhpyBW



Wait-- the residual stream makes the attention mechanism learn the difference from the identity! Are you sure you're not thinking about auto-encoders?

Edit: ok, Discord it is.
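
To make that concrete, a minimal sketch (plain PyTorch, all names mine): because the skip path adds the input back unchanged, the attention sublayer only has to model the deviation from the identity; if it outputs zeros, the block is exactly the identity.

    import torch
    import torch.nn as nn

    class ResidualAttentionBlock(nn.Module):
        # output = x + attn(x): the skip path carries the identity,
        # so attention only learns the difference from it
        def __init__(self, dim, num_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, x):
            h = self.norm(x)
            delta, _ = self.attn(h, h, h)  # learned residual
            return x + delta               # identity path + correction

    x = torch.randn(2, 10, 64)  # (batch, seq, dim)
    y = ResidualAttentionBlock(64)(x)
    print(y.shape)  # torch.Size([2, 10, 64])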


I don’t believe autodiff is finding the difference in that sense. It’s finding derivatives.


Well, the paper uses gradient descent to minimize that difference, like auto-encoders do.


Gradient descent is just how neural networks (including auto-encoders) optimize parameters to minimize the loss function: they use derivatives to descend the slope of that function. Autodiff is one way to compute the derivatives. Maybe we're saying the same thing.
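
Concretely (PyTorch; the squared-error loss here is just an illustration, not the paper's): autodiff computes the derivative, and gradient descent uses it to take a step down the slope.

    import torch

    w = torch.tensor(3.0, requires_grad=True)  # toy parameter
    target = torch.tensor(1.0)

    loss = (w - target) ** 2  # illustrative loss, not the paper's
    loss.backward()           # autodiff: d(loss)/dw = 2*(w - target)
    print(w.grad)             # tensor(4.)

    with torch.no_grad():     # gradient descent: one step down the slope
        w -= 0.1 * w.grad     # 3.0 -> 2.6
    print(w)                  # tensor(2.6000, requires_grad=True)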


Yep, I was just asking Adam* to justify his loss function.

*pun intended :)


Do you see a similarity between the residual stream and the Dirac delta function?
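
If the analogy is the one I'm guessing at (my reading, not necessarily yours): the Dirac delta is the identity element of convolution, the way the skip path is the identity of a residual block:

    (f * \delta)(t) = \int f(\tau)\, \delta(t - \tau)\, d\tau = f(t)
    y = x + F(x) \implies y = x \text{ when } F \equiv 0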



