[Paper Analysis] Attention Is All You Need (Translation)


대장장ㅇi 2024. 2. 9. 17:53

Paper: Attention Is All You Need

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.

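The dominant sequence transduction models are built on complex recurrent or convolutional neural networks consisting of an encoder and a decoder. The best-performing models additionally connect the encoder and decoder through an attention mechanism. We propose the Transformer, a new and simple network architecture that relies solely on attention mechanisms and dispenses with recurrence and convolutions entirely. Experiments on two machine translation tasks show that these models deliver superior quality while being more parallelizable and requiring significantly less training time. On the WMT 2014 English-to-German translation task, our model reaches 28.4 BLEU, improving on the best existing results, including ensembles, by more than 2 BLEU. On the WMT 2014 English-to-French translation task, our model sets a new single-model state-of-the-art BLEU score of 41.0 after 3.5 days of training on eight GPUs, a small fraction of the training cost of the best models in the literature.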

Introduction

Recurrent neural networks, long short-term memory [12] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [29, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [31, 21, 13]. Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states \(h_t\), as a function of the previous hidden state \(h_{t-1}\) and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [18] and conditional computation [26], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains. Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 16]. In all but a few cases [22], however, such attention mechanisms are used in conjunction with a recurrent network. In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

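Recurrent neural networks, in particular long short-term memory [12] and gated recurrent [7] networks, have been firmly established as state-of-the-art approaches to sequence modeling and transduction problems such as language modeling and machine translation [29, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [31, 21, 13]. Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning these positions to steps in computation time, they generate a sequence of hidden states \(h_t\) as a function of the previous hidden state \(h_{t-1}\) and the input at position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths because memory constraints limit batching across examples. Recent work has achieved significant gains in computational efficiency through factorization tricks [18] and conditional computation [26], with the latter also improving model performance. The fundamental constraint of sequential computation nevertheless remains. Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, because they allow dependencies to be modeled regardless of their distance in the input or output sequences [2, 16]. In all but a few cases [22], however, such attention mechanisms are used together with a recurrent network. In this work we propose the Transformer, a model architecture that eschews recurrence and relies entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows significantly more parallelization and can reach a new state of the art in translation quality after as little as twelve hours of training on eight P100 GPUs.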

Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [20], ByteNet [15] and ConvS2S [8], all of which use convolutional neural networks as their basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [11]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2. Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 22, 23, 19]. End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [28]. To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [14, 15] and [8].

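The goal of reducing sequential computation also underlies the Extended Neural GPU [20], ByteNet [15], and ConvS2S [8], all of which use convolutional neural networks as their basic building block and compute hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between the positions: linearly for ConvS2S and logarithmically for ByteNet. This makes it harder to learn dependencies between distant positions [11]. The Transformer reduces this to a constant number of operations, albeit at the cost of reduced effective resolution caused by averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.

Self-attention, sometimes called intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. Self-attention has been used successfully in a variety of tasks, including reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations [4, 22, 23, 19]. End-to-end memory networks are based on a recurrent attention mechanism rather than sequence-aligned recurrence, and have been shown to perform well on simple-language question answering and language modeling tasks [28]. To the best of our knowledge, however, the Transformer is the first transduction model that relies entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution. In the following sections we describe the Transformer, motivate self-attention, and discuss its advantages over models such as [14, 15] and [8].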

Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 29]. Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive [9], consuming the previously generated symbols as additional input when generating the next. The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

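Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 29]. The encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder generates an output sequence (y1, ..., ym) one symbol at a time. At each step the model is auto-regressive [9]: it consumes the previously generated symbols as additional input when generating the next one. The Transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder, shown in the left and right halves of Figure 1, respectively.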

3.1 Encoder and Decoder Stacks

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [10] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension \(d_{\text{model}} = 512\).

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

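Encoder: The encoder consists of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. A residual connection [10] is applied around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension \(d_{\text{model}} = 512\).

Decoder: The decoder also consists of a stack of N = 6 identical layers. In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer that performs multi-head attention over the output of the encoder stack. As in the encoder, a residual connection is applied around each sub-layer, followed by layer normalization. The self-attention sub-layer in the decoder stack is also modified so that a position cannot attend to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the prediction for position i can depend only on the known outputs at positions less than i.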
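The sub-layer wrapper LayerNorm(x + Sublayer(x)) described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `layer_norm` and `residual_sublayer` are hypothetical, the learned gain and bias of layer normalization are omitted, and a random linear map stands in for the real attention or feed-forward sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance.
    The learned gain and bias of full layer normalization are left out here."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_sublayer(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for an arbitrary sub-layer function."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
d_model = 512
x = rng.normal(size=(10, d_model))               # 10 positions, d_model features each
W = rng.normal(size=(d_model, d_model)) * 0.01   # stand-in weights (assumption)
out = residual_sublayer(x, lambda h: h @ W)      # stand-in for attention or the FFN
print(out.shape)  # (10, 512)
```

Because the residual sum requires `x` and `Sublayer(x)` to have the same shape, every sub-layer has to keep the output width at \(d_{\text{model}}\), which is the reason the text gives for fixing that dimension throughout the model.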

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

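An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, and the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.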

3.2.1 Scaled Dot-Product Attention

We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension \(d_k\), and values of dimension \(d_v\). We compute the dot products of the query with all keys, divide each by \(\sqrt{d_k}\), and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of \(1/\sqrt{d_k}\). Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

While for small values of \(d_k\) the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of \(d_k\) [3]. We suspect that for large values of \(d_k\), the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by \(1/\sqrt{d_k}\).

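We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension \(d_k\) and values of dimension \(d_v\). We compute the dot products of the query with all keys, divide each by \(\sqrt{d_k}\), and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are likewise packed into matrices K and V. We compute the matrix of outputs as:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

The two most commonly used attention functions are additive attention [2] and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm except for the scaling factor of \(1/\sqrt{d_k}\). Additive attention computes the compatibility function using a feed-forward network with a single hidden layer.

The two mechanisms are similar in theoretical complexity, but dot-product attention is much faster and more space-efficient in practice because it can be implemented with highly optimized matrix multiplication code. For small values of \(d_k\) the two mechanisms perform similarly, but for larger values of \(d_k\) additive attention outperforms dot-product attention without scaling [3]. We suspect that for large \(d_k\) the dot products grow large in magnitude and push the softmax function into regions where its gradients are extremely small. To counteract this effect, we scale the dot products by \(1/\sqrt{d_k}\).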
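The computation described in this subsection, softmax(QKᵀ / √d_k) V, can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the function names, the optional boolean `mask` argument, and the random test inputs are all hypothetical choices made for the example.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Return (output, weights) for queries Q (n_q, d_k), keys K (n_k, d_k), values V (n_k, d_v).
    mask, if given, is a boolean (n_q, n_k) array; False entries are excluded.
    Every row of the mask must allow at least one key."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # compatibility of each query with each key
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)      # masked positions get zero weight
    weights = softmax(scores)
    return weights @ V, weights                       # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))    # 3 queries, d_k = 64
K = rng.normal(size=(5, 64))    # 5 keys
V = rng.normal(size=(5, 32))    # 5 values, d_v = 32
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))  # (3, 32) and one weight total per query
```

Each row of `weights` is a probability distribution over the keys, so every output row is a convex combination of the value vectors.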

3.2.2 Multi-Head Attention

Instead of performing a single attention function with \(d_{\text{model}}\)-dimensional keys, values, and queries, we found it beneficial to linearly project the queries, keys, and values \(h\) times with different, learned linear projections to \(d_k\), \(d_k\), and \(d_v\) dimensions, respectively. On each of these projected versions of queries, keys, and values, we then perform the attention function in parallel, yielding \(d_v\)-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W_O \]

where \(\text{head}_i = \text{Attention}(Q W_{Qi}, K W_{Ki}, V W_{Vi})\).

Here the projections are parameter matrices \(W_{Qi} \in \mathbb{R}^{d_{\text{model}} \times d_k}\), \(W_{Ki} \in \mathbb{R}^{d_{\text{model}} \times d_k}\), \(W_{Vi} \in \mathbb{R}^{d_{\text{model}} \times d_v}\), and \(W_O \in \mathbb{R}^{h d_v \times d_{\text{model}}}\).

In this work, we employ \(h = 8\) parallel attention layers, or heads. For each of these, we use \(d_k = d_v = \frac{d_{\text{model}}}{h} = 64\). Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

Instead of performing a single attention function with \(d_{\text{model}}\)-dimensional keys, values, and queries, we found it beneficial to linearly project the queries, keys, and values \(h\) times with different, learned linear projections to \(d_k\), \(d_k\), and \(d_v\) dimensions, respectively. On each of these projected versions of the queries, keys, and values we then perform the attention function in parallel, yielding \(d_v\)-dimensional output values. These are concatenated and projected once more to produce the final values, as depicted in Figure 2.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W_O \]

where \(\text{head}_i = \text{Attention}(Q W_{Qi}, K W_{Ki}, V W_{Vi})\).

The projections are parameter matrices \(W_{Qi} \in \mathbb{R}^{d_{\text{model}} \times d_k}\), \(W_{Ki} \in \mathbb{R}^{d_{\text{model}} \times d_k}\), \(W_{Vi} \in \mathbb{R}^{d_{\text{model}} \times d_v}\), and \(W_O \in \mathbb{R}^{h d_v \times d_{\text{model}}}\).

In this work we use \(h = 8\) parallel attention layers, or heads, each with \(d_k = d_v = \frac{d_{\text{model}}}{h} = 64\). Because each head has a reduced dimension, the total computational cost is similar to that of single-head attention with full dimensionality.
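To make the shapes in the multi-head formula concrete, here is a minimal NumPy sketch that projects the input once per head, runs attention on each projection, concatenates the heads, and applies the output projection. The per-head loop, the variable names, and the randomly initialized weights are assumptions made for illustration; practical implementations usually batch the heads into a single tensor operation.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X_q, X_kv, W_q, W_k, W_v, W_o):
    """W_q, W_k: (h, d_model, d_k); W_v: (h, d_model, d_v); W_o: (h * d_v, d_model)."""
    heads = [
        attention(X_q @ W_q[i], X_kv @ W_k[i], X_kv @ W_v[i])
        for i in range(W_q.shape[0])
    ]
    return np.concatenate(heads, axis=-1) @ W_o   # Concat(head_1, ..., head_h) W_O

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_v = d_model // h                          # 64, as in the text
W_q = rng.normal(size=(h, d_model, d_k)) * 0.02   # random stand-in weights (assumption)
W_k = rng.normal(size=(h, d_model, d_k)) * 0.02
W_v = rng.normal(size=(h, d_model, d_v)) * 0.02
W_o = rng.normal(size=(h * d_v, d_model)) * 0.02
X = rng.normal(size=(10, d_model))                # 10 positions
out = multi_head_attention(X, X, W_q, W_k, W_v, W_o)  # self-attention: queries, keys, values from X
print(d_k, out.shape)  # 64 (10, 512)
```

The concatenated heads have width \(h \cdot d_v = 512\), which is why \(W_O\) has \(h d_v\) rows and maps back to \(d_{\text{model}}\).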

3.2.3 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:

• In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [31, 2, 8].
• The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2.

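The Transformer uses multi-head attention in three different ways:

• In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This lets every position in the decoder attend over all positions in the input sequence, mimicking the typical encoder-decoder attention mechanisms of sequence-to-sequence models such as [31, 2, 8].
• The encoder contains self-attention layers. In a self-attention layer all of the keys, values, and queries come from the same place, in this case the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
• Similarly, self-attention layers in the decoder let each position in the decoder attend to all positions in the decoder up to and including that position. Leftward information flow in the decoder must be prevented in order to preserve the auto-regressive property. This is implemented inside scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax that correspond to illegal connections. See Figure 2.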
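The decoder masking described in the third bullet can be illustrated with a small NumPy sketch. It assumes a lower-triangular boolean mask (position i may see positions j ≤ i) and sets the disallowed softmax inputs to −∞; the helper names `causal_mask` and `masked_softmax` are hypothetical, and the all-zero scores are chosen only so that the effect of the mask is easy to read.

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) matrix: True where query position i may attend to key position j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax over the last axis after setting disallowed entries to -inf."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)                      # exp(-inf) is exactly 0
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                     # equal compatibility everywhere
weights = masked_softmax(scores, causal_mask(4))
print(weights)
# Row i spreads its weight evenly over positions 0..i and gives 0 to later positions.
```

With the mask in place, no attention weight ever lands on a later position, so the prediction at position i cannot depend on outputs that have not been generated yet.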

3.3 Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

\[ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \]

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is \(d_{\text{model}} = 512\), and the inner-layer has dimensionality \(d_{\text{ff}} = 2048\).

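In addition to the attention sub-layers, each layer of the encoder and decoder contains a fully connected feed-forward network that is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between.

\[ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \]

The linear transformations are the same across different positions, but they use different parameters from layer to layer. Another way to describe this is as two convolutions with kernel size 1. The dimensionality of the input and output is \(d_{\text{model}} = 512\), and the inner layer has dimensionality \(d_{\text{ff}} = 2048\).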
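The formula FFN(x) = max(0, xW₁ + b₁)W₂ + b₂ translates almost directly into NumPy. The sketch below uses randomly initialized parameters (an assumption made for illustration) and checks the "position-wise" property: running the network on the whole sequence gives the same result as running it on one position alone.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: two linear maps with a ReLU in between."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1 = rng.normal(size=(d_model, d_ff)) * 0.02   # random stand-in parameters (assumption)
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.normal(size=(10, d_model))             # 10 positions
y = ffn(x, W1, b1, W2, b2)

# The same weights are applied independently at every position.
same = np.allclose(y[3], ffn(x[3:4], W1, b1, W2, b2)[0])
n_params = W1.size + b1.size + W2.size + b2.size
print(y.shape, same, n_params)  # (10, 512) True 2099712
```

Applying the same two matrices at every position is also why the text can describe the network as two convolutions with kernel size 1.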

3.4 Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension \(d_{\text{model}}\). We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [24]. In the embedding layers, we multiply those weights by \(\sqrt{d_{\text{model}}}\).

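As in other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension \(d_{\text{model}}\). We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, the two embedding layers and the pre-softmax linear transformation share the same weight matrix, similar to [24]. In the embedding layers, we multiply those weights by \(\sqrt{d_{\text{model}}}\).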

3.5 Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension \(d_{\text{model}}\) as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [8].

In this work, we use sine and cosine functions of different frequencies:

\[ \text{PE}(pos,2i) = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \]

\[ \text{PE}(pos,2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \]

where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from \(2\pi\) to \(10000 \cdot 2\pi\). We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset \(k\), \(\text{PE}_{\text{pos}+k}\) can be represented as a linear function of \(\text{PE}_{\text{pos}}\).

We also experimented with using learned positional embeddings [8] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

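Because our model contains no recurrence and no convolution, the model can make use of the order of the sequence only if we inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension \(d_{\text{model}}\) as the embeddings, so the two can be summed. There are many choices of positional encodings, learned and fixed [8].

In this work, we use sine and cosine functions of different frequencies:

\[ \text{PE}(pos,2i) = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \]

\[ \text{PE}(pos,2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \]

where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from \(2\pi\) to \(10000 \cdot 2\pi\). We chose this function because we hypothesized that it would allow the model to easily learn to attend by relative positions, since for any fixed offset \(k\), \(\text{PE}_{\text{pos}+k}\) can be represented as a linear function of \(\text{PE}_{\text{pos}}\).

We also experimented with learned positional embeddings [8] and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than those encountered during training.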
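The sinusoidal encoding and the claim that \(\text{PE}_{\text{pos}+k}\) is a linear function of \(\text{PE}_{\text{pos}}\) can both be checked numerically. The sketch below is illustrative: the helper `shift_matrix` is a hypothetical name for the block-diagonal rotation that follows from the angle-addition identities, and an even `d_model` is assumed.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def shift_matrix(k, d_model):
    """Matrix M, independent of pos, such that PE[pos + k] = M @ PE[pos]."""
    freqs = 1.0 / np.power(10000.0, 2 * np.arange(d_model // 2) / d_model)
    M = np.zeros((d_model, d_model))
    for j, w in enumerate(freqs):
        c, s = np.cos(w * k), np.sin(w * k)
        M[2 * j, 2 * j], M[2 * j, 2 * j + 1] = c, s           # sin(a+b) = sin a cos b + cos a sin b
        M[2 * j + 1, 2 * j], M[2 * j + 1, 2 * j + 1] = -s, c  # cos(a+b) = cos a cos b - sin a sin b
    return M

pe = positional_encoding(max_len=50, d_model=16)
M = shift_matrix(k=5, d_model=16)
print(np.allclose(pe[12], M @ pe[7]))   # True
print(np.allclose(pe[40], M @ pe[35]))  # True: the same M works at every position
```

The matrix depends only on the offset k, not on the position, which is the property the authors hypothesized would make attending by relative position easy to learn.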

4. Why Self-Attention

In this section, we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations \((x_1, ..., x_n)\) to another sequence of equal length \((z_1, ..., z_n)\), with \(x_i, z_i \in \mathbb{R}^d\), such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata.

One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.

The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [11]. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.

As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d, which is most often the case with sentence representations used by state-of-the-art models in machine translations, such as word-piece [31] and byte-pair [25] representations. To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position. This would increase the maximum path length to O(n/r). We plan to investigate this approach further in future work.

A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions. Doing so requires a stack of \(O(n/k)\) convolutional layers in the case of contiguous kernels, or \(O(\log_k(n))\) in the case of dilated convolutions [15], increasing the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of k. Separable convolutions [6], however, decrease the complexity considerably, to \(O(k \cdot n \cdot d + n \cdot d^2)\). Even with k = n, however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.

As a side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.

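In this section we compare various aspects of self-attention layers with the recurrent and convolutional layers commonly used to map one variable-length sequence of symbol representations \((x_1, ..., x_n)\) to another sequence of equal length \((z_1, ..., z_n)\), with \(x_i, z_i \in \mathbb{R}^d\), such as a hidden layer in a typical sequence transduction encoder or decoder. To motivate our use of self-attention we consider three desiderata.

One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.

The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths that forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [11]. We therefore also compare the maximum path length between any two input and output positions in networks composed of the different layer types.

As Table 1 shows, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d, which is most often the case with the sentence representations used by state-of-the-art machine translation models, such as word-piece [31] and byte-pair [25] representations. To improve computational performance on tasks involving very long sequences, self-attention could be restricted to a neighborhood of size r in the input sequence centered around the respective output position. This would increase the maximum path length to O(n/r). We plan to investigate this approach further in future work.

A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions. Doing so requires a stack of \(O(n/k)\) convolutional layers in the case of contiguous kernels, or \(O(\log_k(n))\) in the case of dilated convolutions [15], which increases the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of k. Separable convolutions [6], however, reduce the complexity considerably, to \(O(k \cdot n \cdot d + n \cdot d^2)\). Even with k = n, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.

As a side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Individual attention heads clearly learn to perform different tasks, and many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.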
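The complexity comparison in this section can be made tangible with a few lines of arithmetic. The sketch below assumes the per-layer costs listed in the paper's Table 1, which is not reproduced in this post: n²·d for self-attention, n·d² for a recurrent layer, and k·n·d² for a convolution. The separable-convolution cost k·n·d + n·d² is taken from the text above. The function names and the example sizes are hypothetical, and constant factors are ignored.

```python
def self_attention_ops(n, d):
    return n * n * d          # every position attends to every position

def recurrent_ops(n, d):
    return n * d * d          # one d x d matrix product per position

def conv_ops(n, d, k):
    return k * n * d * d      # k times the recurrent cost

def separable_conv_ops(n, d, k):
    return k * n * d + n * d * d

def pointwise_ffn_ops(n, d):
    return n * d * d          # order of magnitude of a position-wise layer

n, d = 50, 512                # a 50-token sentence with d_model = 512
print(self_attention_ops(n, d), recurrent_ops(n, d))  # 1280000 13107200
print(self_attention_ops(n, d) < recurrent_ops(n, d))  # True, since n < d
# With k = n, a separable convolution costs as much as self-attention plus a point-wise layer.
print(separable_conv_ops(n, d, k=n) == self_attention_ops(n, d) + pointwise_ffn_ops(n, d))  # True
```

The first comparison flips once n exceeds d, which is the regime where the restricted-neighborhood variant mentioned above becomes attractive.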

5. Training

This section describes the training regime for our models.

This section describes the training regime for our models.

5.1 Training Data and Batching

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [31]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.

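We trained on the standard WMT 2014 English-German dataset, which consists of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], with a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset, which consists of 36M sentences, and split tokens into a 32000 word-piece vocabulary [31]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs with approximately 25000 source tokens and 25000 target tokens.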

5.2 Hardware and Schedule

We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models (described on the bottom line of Table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).

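We trained our models on one machine with 8 NVIDIA P100 GPUs. For the base models, using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps, or 12 hours. For the big models (described on the bottom line of Table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).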

5.3 Optimizer

We used the Adam optimizer [17] with β1 = 0.9, β2 = 0.98, and ε = 10^-9. We varied the learning rate over the course of training, according to the formula:

\[ \text{lrate} = d_{\text{model}}^{-0.5} \cdot \min\left(\text{step\_num}^{-0.5}, \text{step\_num} \cdot \text{warmup\_steps}^{-1.5}\right) \]

This corresponds to increasing the learning rate linearly for the first \(\text{warmup\_steps}\) training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used \(\text{warmup\_steps} = 4000\).

We used the Adam optimizer [17] with β1 = 0.9, β2 = 0.98, and ε = 10^-9. We varied the learning rate over the course of training according to the formula:

\[ \text{lrate} = d_{\text{model}}^{-0.5} \cdot \min\left(\text{step\_num}^{-0.5}, \text{step\_num} \cdot \text{warmup\_steps}^{-1.5}\right) \]

This corresponds to increasing the learning rate linearly for the first \(\text{warmup\_steps}\) training steps and decreasing it thereafter in proportion to the inverse square root of the step number. We used \(\text{warmup\_steps} = 4000\).
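The warmup-then-decay schedule is easy to tabulate. The sketch below uses the scaling factor \(d_{\text{model}}^{-0.5}\) from the published paper together with \(\text{warmup\_steps} = 4000\); the function name `lrate` and the sampled step numbers are choices made for this example. Steps are assumed to start at 1, because step 0 would raise 0 to a negative power.

```python
def lrate(step_num, d_model=512, warmup_steps=4000):
    """Learning rate: d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000, 100000):
    print(f"step {step:>6}: lrate = {lrate(step):.2e}")

# The two arguments of min() are equal at step_num == warmup_steps, so the peak
# learning rate is (d_model * warmup_steps)^-0.5, about 7.0e-4 for these values.
peak = lrate(4000)
```

During warmup the second argument of `min` is the smaller one and grows linearly with the step number; afterwards the first argument takes over and decays with the inverse square root of the step.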

5.4 Regularization

We employ three types of regularization during training:

Residual Dropout:
We apply dropout [27] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of \(P_{\text{drop}} = 0.1\).
Label Smoothing:
During training, we employed label smoothing of value \(\epsilon_{\text{ls}} = 0.1\) [30]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.

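We employ three types of regularization during training:

Residual Dropout:
We apply dropout [27] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of \(P_{\text{drop}} = 0.1\).
Label Smoothing:
During training, we employed label smoothing of value \(\epsilon_{\text{ls}} = 0.1\) [30]. This hurts perplexity, because the model learns to be more unsure, but it improves accuracy and BLEU score.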
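The text gives only the smoothing value \(\epsilon_{\text{ls}} = 0.1\), so the sketch below assumes one common formulation of label smoothing: the one-hot target is mixed with a uniform distribution over the vocabulary, giving the correct token probability 1 − ε + ε/V and every other token ε/V. Whether the paper used exactly this variant is an assumption, and the function names and the toy vocabulary size are hypothetical.

```python
import numpy as np

def smooth_labels(target_ids, vocab_size, eps=0.1):
    """Mix one-hot targets with a uniform distribution over the vocabulary."""
    one_hot = np.eye(vocab_size)[target_ids]
    return one_hot * (1.0 - eps) + eps / vocab_size

def cross_entropy(log_probs, target_dist):
    """Mean cross-entropy between a target distribution and predicted log-probabilities."""
    return -(target_dist * log_probs).sum(axis=-1).mean()

targets = np.array([2, 0])                    # two target tokens from a toy vocabulary of 5
smoothed = smooth_labels(targets, vocab_size=5, eps=0.1)
print(smoothed[0])                            # 0.92 at index 2, 0.02 elsewhere

uniform_log_probs = np.log(np.full((2, 5), 0.2))
print(cross_entropy(uniform_log_probs, smoothed))  # log(5), since the prediction is uniform
```

Because the smoothed target never puts all of its mass on one token, a model trained against it cannot reach zero loss by being fully confident, which matches the remark that smoothing hurts perplexity while helping accuracy and BLEU.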

6. Results

6.1 Machine Translation

On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.

On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used dropout rate P_drop = 0.1, instead of 0.3.

For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We used beam search with a beam size of 4 and length penalty α = 0.6 [31]. These hyperparameters were chosen after experimentation on the development set. We set the maximum output length during inference to input length + 50, but terminate early when possible [31].

Table 2 summarizes our results and compares our translation quality and training costs to other model architectures from the literature. We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU [5].

WMT 2014 영어-독일어 번역 작업에서 큰 트랜스포머 모델(표 2의 Transformer (big))은 이전에 보고된 최고의 모델(앙상블 포함)보다 2.0 BLEU 이상 높은 성과를 거두어 28.4의 새로운 최고 성능 BLEU 점수를 기록했습니다. 이 모델의 설정은 표 3의 하단 행에 나와 있습니다. 8개의 P100 GPU에서 3.5일 동안 훈련되었습니다. 심지어 기본 모델도 이전에 게시된 모든 모델 및 앙상블을 능가하는데, 경쟁 모델 중 일부의 훈련 비용만 소모했습니다.

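On the WMT 2014 English-to-French translation task, the big model reaches a BLEU score of 41.0, outperforming all previously published single models at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used a dropout rate of P_drop = 0.1 instead of 0.3.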

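For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We used beam search with a beam size of 4 and a length penalty of α = 0.6 [31]. These hyperparameters were chosen after experiments on the development set. During inference we capped the output length at input length + 50, terminating early whenever possible [31].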
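Checkpoint averaging amounts to an element-wise mean over the parameters of the last few saved checkpoints. The sketch below shows this with plain Python lists standing in for tensors. The dictionary layout and the parameter name `w` are hypothetical; a real checkpoint would hold framework tensors.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean over a list of checkpoints.

    Each checkpoint is assumed to be a dict mapping a parameter name to a
    flat list of floats, all with identical names and shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    count = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / count for values in columns]
    return averaged


# Seven toy checkpoints, e.g. one written every 10 minutes.
saved = [{"w": [float(step), 2.0 * step]} for step in range(1, 8)]

# Base-model setting from the text: average only the last 5.
last_five = saved[-5:]
final_model = average_checkpoints(last_five)
print(final_model)
```

For the big models the same function would be applied to the last 20 checkpoints instead of the last 5.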

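Table 2 summarizes our results and compares translation quality and training cost against other model architectures from the literature. We estimate the number of floating-point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of each GPU's sustained single-precision floating-point capacity [5].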
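The cost estimate is a simple product, so it is easy to reproduce. The sketch below assumes a sustained 9.5 TFLOPS for a P100, which is the figure I recall from the paper's footnote and is not stated in this section, and it takes the 12-hour base-model training time from the introduction.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: sustained single-precision throughput of one P100, in TFLOPS.
P100_SUSTAINED_TFLOPS = 9.5


def training_flops(days, num_gpus, sustained_tflops):
    """Estimated training FLOPs = time x number of GPUs x sustained FLOP/s."""
    seconds = days * SECONDS_PER_DAY
    return seconds * num_gpus * sustained_tflops * 1e12


# Big model: 3.5 days on 8 P100 GPUs (stated above).
big_flops = training_flops(3.5, 8, P100_SUSTAINED_TFLOPS)

# Base model: 12 hours on 8 P100 GPUs (from the introduction).
base_flops = training_flops(0.5, 8, P100_SUSTAINED_TFLOPS)

print(f"big:  {big_flops:.1e}")
print(f"base: {base_flops:.1e}")
```

Under these assumptions the big model comes out at roughly 2.3e19 FLOPs and the base model at roughly 3.3e18, the same order of magnitude as the training-cost column in Table 2.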

6.2 Model Variations

To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3.

In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.

In Table 3 rows (B), we observe that reducing the attention key size d_k hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [8], and observe nearly identical results to the base model.

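To evaluate the importance of the Transformer's different components, we varied the base model in several ways and measured the change in English-to-German translation performance on the development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. The results are presented in Table 3.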

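In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions while keeping the amount of computation constant, as described in Section 3.2.2. Single-head attention is 0.9 BLEU worse than the best setting, and quality also drops off with too many heads.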
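Keeping computation constant while changing the number of heads means shrinking each head as heads are added. A small sketch, assuming the base model width d_model = 512 and the rule d_k = d_v = d_model / h from Section 3.2.2; the helper name is hypothetical.

```python
D_MODEL = 512  # assumed base model width, as given in Section 3.2.2


def per_head_dim(num_heads, d_model=D_MODEL):
    """Key/value size per head when the total width is held fixed."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    return d_model // num_heads


# Head counts to compare; 8 is the base setting.
head_counts = (1, 4, 8, 16, 32)
settings = {h: per_head_dim(h) for h in head_counts}

for h, d_k in settings.items():
    # h * d_k stays at d_model, so the projection cost does not change.
    print(f"h={h:2d}  d_k=d_v={d_k:3d}  h*d_k={h * d_k}")
```

Every row has the same total width h * d_k, so the BLEU differences in rows (A) come from how attention is split across heads rather than from extra parameters or compute.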

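In Table 3 rows (B), we observe that reducing the attention key size d_k hurts model quality. This suggests that determining compatibility is not easy and that a compatibility function more sophisticated than the dot product may be beneficial. Rows (C) and (D) further show that, as expected, bigger models are better and that dropout is very helpful in avoiding over-fitting. In row (E) we replace the sinusoidal positional encoding with learned positional embeddings [8] and observe results nearly identical to the base model.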
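For reference, the sinusoidal encoding that row (E) swaps out can be sketched in a few lines. This follows the sine/cosine definition from the positional encoding section of the paper, with even dimensions using sine and odd dimensions using cosine. The function name and the tiny sizes are only for illustration.

```python
import math


def sinusoidal_encoding(max_len, d_model):
    """Fixed positional encoding table of shape [max_len][d_model]."""
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for even in range(0, d_model, 2):
            # even = 2i, so the exponent 2i / d_model is even / d_model.
            angle = pos / (10000 ** (even / d_model))
            row[even] = math.sin(angle)
            if even + 1 < d_model:
                row[even + 1] = math.cos(angle)
        table.append(row)
    return table


encoding = sinusoidal_encoding(max_len=4, d_model=6)
for position, row in enumerate(encoding):
    print(position, [round(value, 3) for value in row])
```

The learned alternative in row (E) would replace this fixed table with a trainable embedding matrix of the same shape, indexed by position.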

7. Conclusion

In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles. We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours. The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.

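In this work we presented the Transformer, the first sequence transduction model based entirely on attention, which replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained much faster than architectures built on recurrent or convolutional layers. It reaches a new state of the art on both the WMT 2014 English-to-German and English-to-French translation tasks, and on English-to-German the best model outperforms even all previously reported ensembles. We are excited about the future of attention-based models and plan to apply them to other tasks: extending the Transformer to input and output modalities other than text, and investigating local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another of our research goals. The code used to train and evaluate the models is available at https://github.com/tensorflow/tensor2tensor.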