DeepSeek releases the paper "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention Mechanism"

PANews | 02-18

PANews reported on February 18 that the DeepSeek team has released a technical paper titled "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention Mechanism", introducing their proposed NSA (Native Sparse Attention) mechanism. NSA combines algorithmic innovation with hardware-aware optimization to achieve efficient long-context modeling. Its core innovations include:

1. A dynamic hierarchical sparsity strategy that combines coarse-grained token compression with fine-grained token selection, preserving both global context and local precision (a minimal sketch follows this list);

2. Substantial computational speedups through an algorithm design that balances arithmetic intensity with optimizations for modern hardware;

3. Support for end-to-end training, reducing the computational cost of pre-training while maintaining model performance.
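To make the two-stage idea in point 1 concrete, below is a minimal NumPy sketch of coarse-grained block compression followed by fine-grained block selection for a single query. The block size, top-k value, mean-pooling compression, and function name are illustrative assumptions, not the paper's actual kernel, which also incorporates a sliding local window and is trained end to end.

```python
# Minimal sketch of hierarchical sparse attention (assumed parameters, not NSA's exact design).
import numpy as np

def hierarchical_sparse_attention(q, K, V, block_size=64, top_k=4):
    """Attend one query vector to only the most relevant key/value blocks.

    q: (d,) query vector
    K, V: (T, d) key and value matrices for a long sequence
    """
    T, d = K.shape
    n_blocks = (T + block_size - 1) // block_size

    # Coarse stage: compress each block of keys into a single summary vector.
    block_keys = np.stack([
        K[i * block_size:(i + 1) * block_size].mean(axis=0)
        for i in range(n_blocks)
    ])

    # Score blocks against the query and keep only the top-k most relevant ones.
    block_scores = block_keys @ q
    selected = np.argsort(block_scores)[-top_k:]

    # Fine stage: gather the tokens of the selected blocks and run dense
    # attention over this small subset instead of all T tokens.
    idx = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, T))
        for b in sorted(selected)
    ])
    scores = (K[idx] @ q) / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V[idx]

# Example: a 64k-token sequence where each query touches only 4 blocks.
rng = np.random.default_rng(0)
T, d = 65536, 128
K, V = rng.standard_normal((T, d)), rng.standard_normal((T, d))
q = rng.standard_normal(d)
out = hierarchical_sparse_attention(q, K, V)
print(out.shape)  # (128,)
```

The point of the two stages is that the query only pays a cheap, block-level cost over the full sequence, and full token-level attention is reserved for the few blocks that matter.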

Experimental results show that NSA performs strongly on long-context tasks and instruction-based reasoning, and when processing 64k-token sequences it achieves significant speedups in decoding, forward propagation, and backpropagation.
