Sinusoidal positional embeddings have always seemed a bit mysterious - even more so since papers don't tend to delve much into the intuition behind them. For example, from Vaswani et al., 2017:
> That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).
Inspired largely by the RoFormer paper (https://arxiv.org/abs/2104.09864), I thought I'd write a post that dives a bit into how intuitive considerations around linearity and relative positions can lead to the idea of using sinusoidal functions to encode positions.
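To make the quoted claim concrete, here is a minimal sketch (assuming the standard interleaved sin/cos layout from Vaswani et al., with toy values for `d_model` and the offset `k`, and a hypothetical helper name `sinusoidal_pe`). It builds the encoding and checks numerically that PE(pos+k) equals a fixed block-diagonal rotation applied to PE(pos), with the matrix depending only on the offset k, not on the position:

```python
import numpy as np

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Sinusoidal positional encoding for one position (Vaswani et al., 2017 layout)."""
    i = np.arange(d_model // 2)            # frequency index
    freqs = base ** (-2 * i / d_model)     # omega_i = 1 / base^(2i/d)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * freqs)         # even dims: sin
    pe[1::2] = np.cos(pos * freqs)         # odd dims:  cos
    return pe

d_model, k = 8, 3                          # toy sizes, chosen only for illustration
i = np.arange(d_model // 2)
freqs = 10000.0 ** (-2 * i / d_model)

# Linear map M_k: one 2x2 rotation by omega_i * k per (sin, cos) pair.
# It depends only on the offset k, never on pos.
M = np.zeros((d_model, d_model))
for j, w in enumerate(freqs):
    c, s = np.cos(w * k), np.sin(w * k)
    M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]

for pos in [0, 1, 17, 250]:                # the same M works for every position
    assert np.allclose(sinusoidal_pe(pos + k, d_model), M @ sinusoidal_pe(pos, d_model))
print("PE(pos + k) == M_k @ PE(pos) for all tested pos")
```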
Would appreciate any thoughts or feedback!