GPT from Scratch

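Skills learned and applied: PyTorch, encoder architecture, text processing, Transformer architecture, self-attention and multi-head attention layers, feed-forward neural networks.

Idea

This is a basic version of a GPT (Generative Pre-trained Transformer) model. It is based on the same Transformer architecture, which consists of self-attention layers and feed-forward neural networks. It is autoregressive: it generates one token at a time, conditioned on the tokens generated so far.

Important note

This is not a project in which we take an already available LLM API and fine-tune it to create our own local chatbot. Instead, we write from scratch the algorithm that is used to build an LLM in the first place.

The project was done on Google Colab in order to use the GPU resources available on Colab.

Installing and importing libraries

First, we need to install and import the necessary libraries.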

 pip install python-docx

To read the .docx file we need to install the python-docx library and then import docx. We also need to import torch, since we will use the PyTorch library for this project.

 import docx
import torch

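We need to import the torch.nn package, which provides classes and functions for creating and managing neural networks. From torch.nn we also need the functional module, which provides functions commonly used in neural network operations.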

 import torch.nn as nn
from torch.nn import functional as F

Defining hyperparameters

 batch_size = 64 
block_size = 256 
max_iters = 5000
eval_interval = 100
learning_rate = 1e-4

device = 'cuda' if torch.cuda.is_available() else 'cpu'

eval_iters = 200
n_embd = 512
n_head = 8
n_layer = 8
dropout = 0.1

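Hyperparameter values matter because they let us tune the model for the best results. batch_size is the number of samples passed through the network in one go before the weights are updated. block_size is the length, in tokens, of the input sequence that the model processes, i.e. the context length. max_iters sets the maximum number of training iterations (steps); each iteration processes one batch of data and updates the model's parameters. eval_interval specifies after how many iterations the model is evaluated. learning_rate controls how much the model's parameters are adjusted with respect to the loss gradient. A smaller learning rate gives stable, precise updates but slower convergence.

We need to set a few more hyperparameters. eval_iters specifies how many batches are evaluated each time we estimate the model's performance on the training and validation sets; averaging over that many iterations gives a reliable estimate. n_embd is the embedding dimension: it defines the size of the vector space in which each word/token is represented. A larger n_embd can capture subtler relationships between tokens but also increases the computational cost. In a Transformer, n_head is the number of attention heads in the multi-head attention mechanism. Having several heads lets the model attend to several parts of the input sequence at once, which improves its ability to capture relationships between tokens. n_layer is the number of Transformer blocks stacked in the model. dropout is a regularization technique used to prevent overfitting and help the model generalize better. It works by randomly setting a fraction of the input units (here 10%) to zero at each update during training.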
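
As a quick sanity check on the values above, the short sketch below derives two quantities that follow from them: the number of tokens the model sees per training step and the width of each attention head. The names tokens_per_step and head_size are introduced here for illustration only.

```python
# Values copied from the hyperparameter block above.
batch_size, block_size = 64, 256
n_embd, n_head = 512, 8

# Each step feeds batch_size sequences of block_size tokens.
tokens_per_step = batch_size * block_size

# The embedding is split evenly across the heads, so n_embd must be
# divisible by n_head.
assert n_embd % n_head == 0
head_size = n_embd // n_head

print(tokens_per_step)  # 16384
print(head_size)        # 64
```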

 torch.manual_seed(108)

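We then seed the random number generator to ensure reproducibility. The seed can be any number we like, as long as it stays the same, so that the random processes in the program produce the same output on every run. With a fixed seed, random processes such as weight initialization, data shuffling and dropout behave identically each time. We need this so that, when we tune hyperparameters, differences in results come from the hyperparameters alone and not from random variation.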
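
The effect of seeding can be seen in a small CPU-only sketch: re-seeding with the same value replays the same random numbers, while a different seed gives a different stream. The variable names are illustrative, and on a GPU some operations may still not be fully deterministic even with a fixed seed.

```python
import torch

# Seed, draw, then re-seed with the same value and draw again.
torch.manual_seed(108)
first = torch.rand(3)
torch.manual_seed(108)
second = torch.rand(3)

# A different seed produces a different stream of numbers.
torch.manual_seed(109)
third = torch.rand(3)

print(torch.equal(first, second))  # same seed, same numbers
print(torch.equal(first, third))   # different seed, different numbers
```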

Loading the document and sorting unique characters

 doc = docx.Document('/content/Mahabharat annotated .docx')

text = ''
for paragraph in doc.paragraphs:
    text += paragraph.text + '\n'

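We upload the document, which in my case is a .docx file of the ancient Indian epic, the Mahabharata. I chose this epic because it is the longest epic poem in the world, so we get a large amount of data to train our model on. We then initialize an empty string and iterate over the document's paragraphs, taking the text (characters) of each paragraph and concatenating it onto our string text. Paragraphs are separated by the newline character '\n'.

We can check the total number of characters by printing the length of the text string, along with the first few characters.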

 print("length of dataset in characters: ", len(text))
print(text[:1232]) # Let's look at the first 1232 characters

Output

 length of dataset in characters:  14111937

The Mahabharata
of
Krishna-Dwaipayana Vyasa

BOOK 1
ADI PARVA


THE MAHABHARATA
ADI PARVA

SECTION I

Om! Having bowed down to Narayana and Nara, the most exalted male being, and also to the goddess Saraswati, must the word Jaya be uttered.

Ugrasrava, the son of Lomaharshana, surnamed Sauti, well-versed in the Puranas, bending with humility, one day approached the great sages of rigid vows, sitting at their ease, who had attended the twelve years' sacrifice of Saunaka, surnamed Kulapati, in the forest of Naimisha. Those ascetics, wishing to hear his wonderful narrations, presently began to address him who had thus arrived at that recluse abode of the inhabitants of the forest of Naimisha. Having been entertained with due respect by those holy men, he saluted those Munis (sages) with joined palms, even all of them, and inquired about the progress of their asceticism. Then all the ascetics being again seated, the son of Lomaharshana humbly occupied the seat that was assigned to him. Seeing that he was comfortably seated, and recovered from fatigue, one of the Rishis beginning the conversation, asked him, 'Whence comest thou, O lotus-eyed Sauti, and where hast thou spent the time? Tell me, who ask thee, in detail.'

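We need to look at the original file so that, when our model prints its learned output, we can check whether the writing style resembles the original text. Next, we separate out and sort all the unique characters in the document.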

 chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)

Output

 !"&'(),-.0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]_`abcdefghijklmnopqrstuvwxyz—
82

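So the full document contains 82 unique characters, all of which are shown above (the first two are the newline and the space).

Mapping characters and creating the encoder and decoder

From here on we dive into the important part of the code. First we map characters to integers and back, which is useful in tasks such as text processing.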

 stoi = { ch:i for i,ch in enumerate(chars) }

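stoi stands for "string to integer". The dictionary comprehension iterates over chars using enumerate, which yields an index (i) and a character (ch), and maps each character (the key) to its index (the value).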

 itos = { i:ch for i,ch in enumerate(chars) }

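itos stands for "integer to string" and is the exact reverse of the stoi dictionary. Here the keys are integers and the values are characters.

Our encoder is a function that takes a string as input and outputs a list of integers, and the decoder is a function that takes a list of integers and outputs a string.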

 encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

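encode is a lambda function that takes a string s and uses a list comprehension to convert each character to its integer via the stoi dictionary. decode is also a lambda function; it takes a list of integers l, converts each integer i to its character via itos, and joins the characters into a string. Let's see them in action: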

 print(encode("Nara Narayana"))
print(decode(encode("Nara Narayana")))

Output

 [37, 55, 72, 55, 1, 37, 55, 72, 55, 79, 55, 68, 55]
Nara Narayana
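
The same mapping can be tried on a tiny string without loading the document. This is a minimal sketch; toy_text and the toy_ helpers are names made up for the example. It also shows a limitation worth keeping in mind: a character that never appeared in the training text has no entry in the dictionary, so encoding it would raise a KeyError.

```python
# Build the vocabulary from a toy string instead of the full document.
toy_text = "hello world"
toy_chars = sorted(set(toy_text))   # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """Map each character of s to its integer id."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(toy_itos[i] for i in ids)

print(toy_encode("hello"))                   # [3, 2, 4, 4, 5]
print(toy_decode(toy_encode("hello world"))) # hello world
print("z" in toy_stoi)                       # False: unseen characters cannot be encoded
```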

Since they work correctly, let's encode the whole dataset held in the text variable and store it in a torch tensor.

 data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:10])

Output

 torch.Size([14111937]) torch.int64 
tensor([ 0,  0, 43, 62, 59,  1, 36, 55, 62, 55])

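The size of the tensor exactly equals the length of the text variable we checked earlier, so everything has been encoded correctly.

Training and validation split

We split our data into training and validation sets, with 90% of the data used for training and 10% for validation.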

 n = int(0.9*len(data))
train_data = data[:n] 
val_data = data[n:]
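
On a toy tensor the split looks like this. The sketch assumes nothing beyond the slicing shown above; toy, train_part and val_part are illustrative names. Note that the split is positional rather than shuffled, so the validation set is simply the tail of the text.

```python
import torch

toy = torch.arange(20)            # stands in for the encoded dataset
n = int(0.9 * len(toy))           # index where validation data starts
train_part, val_part = toy[:n], toy[n:]

print(len(train_part), len(val_part))  # 18 2
print(val_part.tolist())               # [18, 19]
```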

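Generating data batches

Now that we have our encoder and decoder, we need a function that cuts data into small batches and returns, for each batch, an input chunk together with a target chunk holding the integer that follows each input position. The model needs this to learn which integer comes after which. These integers are simply the characters of our data, converted through the stoi dictionary, so by learning which integer follows which, the model is really learning which character follows which. This lets the model work autoregressively, i.e. generate each token based on the tokens generated before it.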

 def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

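In the snippet above, we first use the split argument of get_batch to choose between train_data and val_data. We then generate batch_size random starting indices. x stacks the block_size-long chunks that start at those indices, and y stacks the same chunks shifted one position to the right, so y[b, t] is the token that follows x[b, t]. Finally both tensors are moved to device.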
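
The one-position shift between inputs and targets is easiest to see on toy data. The sketch below repeats the indexing logic of get_batch on a tensor of consecutive integers, with small sizes chosen for the example (toy_data, toy_block_size and toy_batch_size are illustrative names), so every target is its input plus one.

```python
import torch

torch.manual_seed(0)
toy_data = torch.arange(100)            # token ids 0..99, in order
toy_block_size, toy_batch_size = 8, 4

# Random start positions, leaving room for the shifted target chunk.
ix = torch.randint(len(toy_data) - toy_block_size, (toy_batch_size,))
x = torch.stack([toy_data[i:i + toy_block_size] for i in ix])
y = torch.stack([toy_data[i + 1:i + toy_block_size + 1] for i in ix])

print(x.shape, y.shape)                   # both (4, 8)
print(torch.equal(y, x + 1))              # True for consecutive integers
print(torch.equal(x[:, 1:], y[:, :-1]))   # y is x shifted left by one
```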

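Function to compute the average model loss on the training and validation datasets

We create a function estimate_loss to evaluate the model's performance on the training and validation datasets. We need it to monitor the model during training, and we do so by averaging the loss over many iterations.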

 @torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

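@torch.no_grad() is a decorator that disables gradient computation, which reduces memory use and speeds up computation; we use it because evaluation does not need gradients. We then initialize the output dictionary out and put the model in eval mode, which changes the behaviour of layers that act differently during training and evaluation, such as dropout and batch normalization. We loop over the training and validation splits and initialize a losses tensor to hold the loss of each evaluation iteration. In each iteration we fetch X, Y from get_batch, compute the model's predictions (logits) and the loss, and store the loss value in losses. After the inner loop we compute the mean loss over all iterations and save it in out. Finally we set the model back to train mode.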
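
Why the switch between eval and train mode matters can be shown with a standalone dropout layer. This is a small sketch, separate from the model; drop and inp are illustrative names, and p=0.5 is chosen only to make the effect obvious. In train mode roughly half the entries are zeroed and the survivors are scaled by 1/(1-p); in eval mode the layer passes its input through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
inp = torch.ones(1000)

drop.train()
train_out = drop(inp)   # entries are either 0.0 or 1/(1-0.5) = 2.0

drop.eval()
eval_out = drop(inp)    # identity in eval mode

print(sorted(train_out.unique().tolist()))
print(torch.equal(eval_out, inp))
```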

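Attention is all you need

A self-attention layer lets the model weigh the importance of the different words in a sequence relative to each other. This is how the model decides which word to emit next out of the many words that could follow. Weighing each word against the others helps GPT understand the context of every word in the sequence. It does this by turning each word in the sequence into three vectors: query (q), key (k) and value (v). They are obtained by first converting each word into an embedding vector and then multiplying that embedding by three different learned weight matrices, W_Q, W_K and W_V. The query represents what the current word is looking for, the key represents the features of each word that a query can be matched against, and the value carries the actual information of each word that is used to compute the final output. The dot product of the query and the key is taken and divided by the square root of the key dimension, which keeps the gradients stable. A softmax is then applied to the scaled scores to turn them into probabilities; they sum to 1, which makes them easy to interpret as weights. These weights are what we call attention scores, and they give the attention layer its name. Finally the attention weights are used to compute a weighted sum of the value vectors, which is the final output for each word.

The figure above shows a single self-attention head.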
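
The query, key and value computation described above can be sketched on a single short sequence. This is a minimal illustration, not the model's code: the sizes are toy values, W_q, W_k and W_v are random stand-ins for the learned matrices, and no causal mask is applied yet.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, emb_dim, head_size = 4, 8, 2      # toy sizes: 4 tokens, 8-dim embeddings

x = torch.randn(T, emb_dim)          # one embedding vector per token
W_q = torch.randn(emb_dim, head_size)
W_k = torch.randn(emb_dim, head_size)
W_v = torch.randn(emb_dim, head_size)

q, k, v = x @ W_q, x @ W_k, x @ W_v  # each (T, head_size)

# Similarity of every query with every key, scaled by sqrt(head_size).
scores = q @ k.T / head_size ** 0.5  # (T, T)
weights = F.softmax(scores, dim=-1)  # each row sums to 1
out = weights @ v                    # weighted sum of values, (T, head_size)

print(weights.sum(dim=-1))           # all ones, up to rounding
print(out.shape)
```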

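Self-attention head

We now create a self-attention layer, also called a head because it is one attention mechanism within a multi-head mechanism that contains several such heads. It is called self-attention because the keys and values come from the same source as the queries. We write a class that projects the input embeddings into key, query and value vectors, computes the attention scores, and uses the attention weights to combine the value vectors.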

 class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular causal mask; a buffer, not a trainable parameter
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs)
        q = self.query(x) # (B,T,hs)
        # compute attention scores, scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

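The class Head inherits from nn.Module, PyTorch's base class for neural network modules. In it we create three linear layers (self.key, self.query, self.value) that project the input embeddings into key, query and value vectors. tril is a lower-triangular matrix that ensures the model cannot see future tokens when predicting the current one; it does so by masking out future positions in the sequence. Because this is a decoder attention block, used in an autoregressive setting, we could turn it into an encoder attention block by deleting the single line that applies the tril mask, which would let all tokens communicate. The dropout layer helps prevent overfitting by randomly zeroing some attention weights during training. The input x has shape (B, T, C), where B is the batch size, T is the sequence length and C is the number of features, i.e. the embedding dimension. Using the linear layers we project x into query, key and value vectors. The attention scores are computed by taking the dot product of queries and keys and scaling it, which gives a matrix of shape (B, T, T) in which each element is the attention score between two positions in the sequence. We then mask the scores at future positions by setting them to -inf, so the model attends only to the current and previous positions. The masked scores go through softmax to obtain the weights, dropout is applied, and the result multiplies the value vectors to give the final output.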
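
The effect of the tril mask can be checked in isolation. In the sketch below all raw scores are set to zero (an assumption made only to keep the arithmetic obvious), so after masking and softmax each position spreads its attention evenly over itself and the positions before it, and gives exactly zero weight to later positions.

```python
import torch
import torch.nn.functional as F

T = 3
scores = torch.zeros(T, T)                    # pretend all raw scores are equal
tril = torch.tril(torch.ones(T, T))           # 1 on and below the diagonal
masked = scores.masked_fill(tril == 0, float("-inf"))
weights = F.softmax(masked, dim=-1)

print(weights)
# row 0 attends only to position 0: [1, 0, 0]
# row 1 splits evenly over 0 and 1: [0.5, 0.5, 0]
# row 2 splits evenly over 0, 1, 2: [1/3, 1/3, 1/3]
```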

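Multi-head attention

Now that we have one attention head, let's create several such heads that work in parallel to improve performance.

Multi-head attention layer

To do this we create another class that inherits from PyTorch's nn.Module and represents the multi-head attention mechanism used in Transformer models.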

 class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

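self.heads is a list of Head instances, each representing one self-attention head. self.proj is a linear layer that projects the concatenated output of all heads back to the original embedding dimension (n_embd). In forward, a list comprehension passes the input x through every head in self.heads. The head outputs are concatenated along the last dimension (dim=-1) and then mapped by self.proj to n_embd, followed by dropout.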
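
The shape bookkeeping of the concatenation can be verified without the attention code itself. In this sketch the per-head outputs are random tensors standing in for the results of h(x), and the batch and sequence sizes are toy values; it only shows that n_head outputs of width head_size concatenate back to n_embd, which is why the projection layer can be square.

```python
import torch
import torch.nn as nn

B, T = 2, 5                      # toy batch size and sequence length
n_embd, n_head = 512, 8
head_size = n_embd // n_head     # 64

# Stand-ins for the outputs of the individual attention heads.
head_outputs = [torch.randn(B, T, head_size) for _ in range(n_head)]

concat = torch.cat(head_outputs, dim=-1)   # (B, T, n_head * head_size)
proj = nn.Linear(n_embd, n_embd)
out = proj(concat)                         # (B, T, n_embd)

print(concat.shape, out.shape)
```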

 class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

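The classes defined above are a simple feed-forward network and a single Transformer block, which contains multi-head attention followed by the feed-forward network. self.ln1 and self.ln2 are layer normalization layers, used to stabilize and speed up training. In Block, forward adds the output of the attention sublayer back to its input, forming a residual connection, which helps train deeper networks and mitigates the vanishing gradient problem. The same is done again for the feed-forward network.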
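
The two ingredients of the block, layer normalization and the residual connection, can be looked at on their own. This is a small sketch with toy sizes; sublayer is a plain linear layer standing in for attention or the feed-forward network. It shows that LayerNorm brings each token's features to roughly zero mean and unit variance, and that the residual pattern x + sublayer(ln(x)) keeps the shape unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5, 16) * 10 + 3     # (B, T, C) with a large scale and offset

ln = nn.LayerNorm(16)
normed = ln(x)                         # normalized over the last dimension

print(normed.mean(dim=-1).abs().max())            # close to 0
print(normed.var(dim=-1, unbiased=False).mean())  # close to 1

# Pre-norm residual connection, as used in Block.forward.
sublayer = nn.Linear(16, 16)           # stand-in for attention / feed-forward
out = x + sublayer(ln(x))
print(out.shape)
```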

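Bigram language model

This is the last class we have to write: a simple language model that predicts the next token in a sequence based on the current context. This class, too, inherits from PyTorch's nn.Module. Despite its name, the class below is the full Transformer: it conditions on up to block_size previous tokens rather than only the last one.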

 class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

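The model maps each token in the vocabulary to an embedding vector (token_embedding_table) and supplies positional information by mapping each position in the sequence to an embedding vector as well (position_embedding_table). self.blocks is a sequence of Block instances, each containing a multi-head self-attention mechanism and a feed-forward network. A final layer normalization (ln_f) is followed by a linear layer (lm_head) that maps to logits of size vocab_size, one per token in the vocabulary. In forward, the input idx has shape (B, T), where B is the batch size and T is the sequence length. The token and positional embeddings are computed as tok_emb and pos_emb, summed to form the input, and passed through the Transformer blocks. The result is normalized and passed through the linear layer to produce logits (the raw outputs of the final layer, not yet converted to probabilities). When targets are given, the logits and targets are flattened and the cross-entropy loss is computed. Finally, the generate method produces new tokens from the current context by repeatedly predicting and sampling the next token.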
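
Two steps of the class are worth trying in isolation: the flattening before cross-entropy and the sampling in generate. The sketch below uses all-zero logits (an assumption standing in for a model that knows nothing), in which case the loss equals ln(vocab_size), about 4.41 for 82 characters. That number is a useful reference point: a loss near it early in training means the model is still guessing uniformly.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, vocab = 2, 4, 82

logits = torch.zeros(B, T, vocab)            # uniform prediction over 82 characters
targets = torch.randint(vocab, (B, T))

# cross_entropy expects (N, C) logits and (N,) targets, hence the flattening.
loss = F.cross_entropy(logits.view(B * T, vocab), targets.view(B * T))
print(loss.item(), math.log(vocab))          # both about 4.4067

# Sampling the next token from the last time step, as in generate().
probs = F.softmax(logits[:, -1, :], dim=-1)  # (B, vocab)
idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
print(idx_next.shape)
```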

 model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch(\'train\')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

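The snippet above starts training the BigramLanguageModel with a standard training loop, evaluating the loss periodically. It samples a batch of data, computes the gradients and updates the model parameters.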
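
A single training step can be exercised on a tiny model to confirm that the parameters really move. The sketch uses a small linear layer and random data as stand-ins (layer, opt, inputs and labels are illustrative names). It also illustrates an easy slip: writing the step call without parentheses only references the method and silently updates nothing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
layer = nn.Linear(4, 1)
opt = torch.optim.AdamW(layer.parameters(), lr=1e-2)
inputs, labels = torch.randn(8, 4), torch.randn(8, 1)

before = layer.weight.detach().clone()

loss = F.mse_loss(layer(inputs), labels)
opt.zero_grad(set_to_none=True)
loss.backward()

opt.step          # no parentheses: nothing is called, weights stay the same
unchanged = torch.equal(before, layer.weight.detach())

opt.step()        # the actual parameter update
changed = not torch.equal(before, layer.weight.detach())

print(unchanged, changed)
```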

Finally, we generate text from the model by calling the model's generate method and the decode function.

 # generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=10000)[0].tolist()))

This prints the generated output, as many tokens as specified by max_new_tokens.

Below is the output obtained after training for 5000 steps.

Output

 SECTION LIII

Vaipaudeva said, "Indeed, then, knew by What, O Duryodhana, plaughtere foes with both nisted injuge and profit of the whole objects. In what the Suta's son of the earth whetter is he is Citrana.'"

SECTION XVI

"Hearing many wifal unto one that abound with irresible at these the gods. O bull thou canting interrofal and be incovered by me. What be what what I know are got and the grant and complexion. Griet shill, that bulls thou shall ever constly, Opinerce one, forwarth, when now blossed by the means of my advancing Ashtavathama, Thas are very rend roars of gold closse, immovable proceeds, the king to Dusasana have (been by the rays of the bodies), and other, with sharp be combated and righteousness of arrows. Armed with strength and effects in retror fire in the track race. And these excellent cannot fear from his enembled its wrongs and children sended thou shouts me he influence with even he hard. Those deity to be, heard these weapons do now, in one limps of husbulilary. The mighty bestowing those thus kingdom from onten the fiery delight of his deer-imper disragrection. Do thou a tattribute to doing addressed with weapons forsament so behold and swalled by the man-bodies and monkey and homage, of her snake, (ex-remeply) dog. The all about of duni, let thy slaughter me. Give, have cut off hard bour performing savat, tell me!'

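We can compare the output with the original file and see that the writing style and the way section numbers are mentioned resemble the original, which means the model has learned the style and structure of the document. The output is not meaningful, however: to grasp general grammar and the complexity of the language, and to form meaningful sentences, the model needs more training and even more data. We have thus achieved our goal of building a Transformer decoder architecture, training it on our own data, and generating output that follows the style of the original epic.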