Olivier Grisel

Interesting developments in subquadratic alternatives to self-attention-based transformers for long sequence modeling (32k tokens and more).

Hyena Hierarchy: Towards Larger Convolutional Language Models
https://arxiv.org/abs/2302.10866

They propose to replace the quadratic self-attention layers with an operator built from implicitly parametrized long-kernel 1D convolutions.

#DeepLearning #LLMs #PaperThread

1/4
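To make the idea concrete, here is a minimal NumPy sketch (not the authors' code) of the two ingredients: a filter whose values are generated by a small function of the time index instead of being stored explicitly, so its parameter count does not grow with sequence length, and a long convolution applied in O(L log L) via FFT. The sinusoidal feature map, the decay term, and all function names are illustrative assumptions, not the Hyena parametrization.

```python
import numpy as np

def implicit_kernel(seq_len, d_model, rng, n_freqs=8):
    """Toy implicitly parametrized long convolution filter: h[t] is
    produced by a small function of the position t (random sinusoidal
    features, a linear map, and a decay window), so the number of
    parameters is independent of seq_len."""
    t = np.linspace(0.0, 1.0, seq_len)[:, None]            # (L, 1) positions
    freqs = rng.standard_normal((1, n_freqs))              # hypothetical feature map
    feats = np.concatenate([np.sin(t * freqs), np.cos(t * freqs)], axis=-1)
    w = rng.standard_normal((2 * n_freqs, d_model)) / n_freqs
    decay = np.exp(-5.0 * t)                                # bias toward short-range structure
    return (feats @ w) * decay                              # (L, d_model) filter

def fft_long_conv(x, h):
    """Long convolution y = x * h in O(L log L) via FFT, applied
    channel-wise; x and h are both (L, d_model)."""
    L = x.shape[0]
    n = 2 * L                                               # zero-pad to avoid circular wrap-around
    y = np.fft.irfft(np.fft.rfft(x, n=n, axis=0) * np.fft.rfft(h, n=n, axis=0), n=n, axis=0)
    return y[:L]

rng = np.random.default_rng(0)
L, d = 32_768, 64
x = rng.standard_normal((L, d))
y = fft_long_conv(x, implicit_kernel(L, d, rng))
print(y.shape)  # (32768, 64)
```

The FFT path is what makes the operator subquadratic in sequence length; the full Hyena operator composes such convolutions with element-wise gating, which this sketch omits.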