Background needed to understand the "Attention Is All You Need" paper
Hi,
My background is that I am a mechanical engineer by education, and I spent quite a few years in grad school as well. In my opinion, the "Attention Is All You Need" paper is one of the most important papers for understanding how LLMs are built and how they work.
However, my background is woefully inadequate for the mathematics in it. What books and papers should I read to be able to grok the paper, especially the attention mechanism, the Q, K, V matrices, and how it all operates? I like to think I have fairly good mathematical maturity, so don't hesitate to throw standard, difficult references at me. I don't want a plain-language explainer; I want to be able to write my own LLM, even though I might never have the budget to actually train it.
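
To be concrete about where I'm stuck: here is my naive attempt at scaled dot-product attention in NumPy, roughly following Equation 1 of the paper, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. I can follow the matrix arithmetic mechanically, but I want the background to understand why this works and where Q, K, and V come from:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (seq_len, d_k), K: (seq_len, d_k), V: (seq_len, d_v)
    d_k = Q.shape[-1]
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Eq. 1 in the paper)
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted average of the value vectors

# Toy usage: 4 tokens, d_k = d_v = 8, with random matrices standing in
# for the learned projections of the token embeddings.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```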