Decentralized Unsupervised Pre-Training (DUPT)
Decentralized Unsupervised Pre-Training (DUPT) trains language models on data that is distributed across multiple sources, or Oracles. The main objective is to maximize the likelihood of predicting the next token in a sequence given its preceding context, enabling the model to learn the underlying structure of the language.
Oracle Definition
Let $O_j$ be the $j$-th Oracle, each with its own unsupervised corpus of tokens.
Unsupervised Corpus of Tokens
$U(O_j) = \{ u_1(O_j), \ldots, u_n(O_j) \}$
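As a minimal illustration of this setup, each Oracle's corpus can be held as its own sequence of token ids. The oracle names and token values below are hypothetical, not part of any fixed interface.

```python
# Hypothetical sketch: each oracle O_j holds its own unsupervised corpus of token ids.
oracle_corpora = {
    "oracle_1": [12, 7, 301, 48, 5, 99],   # U(O_1) = {u_1(O_1), ..., u_n(O_1)}
    "oracle_2": [88, 3, 412, 19],          # U(O_2)
    "oracle_3": [7, 7, 250, 61, 33],       # U(O_3)
}

for name, tokens in oracle_corpora.items():
    print(f"{name}: {len(tokens)} tokens")
```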
Likelihood Maximization
The goal is to maximize the likelihood of the sequence of tokens.
Formula: Likelihood (L1)
$L_1(U(O_j)) = \sum_i \log P\big(u_i(O_j) \mid u_{i-k}(O_j), \ldots, u_{i-1}(O_j); \Theta\big)$
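The sketch below shows how this sum of log-probabilities could be computed for one Oracle's corpus. The `model(context)` interface, which is assumed to return a log-probability vector over the vocabulary for the next token, is an illustrative assumption rather than a fixed API.

```python
import torch

def l1_log_likelihood(tokens, model, k):
    """Sketch of L1(U(O_j)): sum over i of log P(u_i | u_{i-k}, ..., u_{i-1}; Theta).

    tokens: 1-D LongTensor of token ids for one oracle's corpus.
    model:  assumed to map a (1, context_len) tensor to (1, vocab_size) log-probs.
    k:      context window size.
    """
    total = 0.0
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - k):i]          # at most k preceding tokens
        log_probs = model(context.unsqueeze(0))    # shape: (1, vocab_size)
        total += log_probs[0, tokens[i]].item()    # log P(u_i | context; Theta)
    return total
```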
Context Window Size (k)
The parameter $k$ represents the size of the context window used to predict each token.
Conditional Probability and Neural Network Parameters
The conditional probability $P$ is modeled using a neural network with parameters $\Theta$. These parameters are optimized using stochastic gradient descent.
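A minimal sketch of one stochastic gradient descent update is given below, assuming the model returns unnormalized logits over the vocabulary; maximizing $L_1$ is implemented as minimizing the negative log-likelihood (cross entropy). The batch shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def sgd_step(model, optimizer, context_batch, target_batch):
    """One SGD update on Theta (sketch).

    context_batch: (batch, k) token ids of preceding contexts.
    target_batch:  (batch,) ids of the tokens to predict.
    """
    optimizer.zero_grad()
    logits = model(context_batch)                 # (batch, vocab_size)
    loss = F.cross_entropy(logits, target_batch)  # -log P(u_i | context; Theta)
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
```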
Multi-layer Transformer Decoder
The model employs a multi-layer Transformer decoder architecture.
Multi-headed Self-Attention Operation
This operation allows the model to focus on different parts of the input context.
Position-wise Feedforward Layers
These layers apply the same transformation to each position in the sequence independently.
Output Distribution Over Target Tokens
The model produces an output distribution over the possible target tokens.
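To make these components concrete, here is a minimal sketch of a single Transformer decoder block combining multi-headed self-attention (with a causal mask so each position only sees its preceding context) and position-wise feed-forward layers. The dimensions and layer choices are illustrative assumptions, not the definitive architecture.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Minimal sketch of one Transformer decoder block (sizes are illustrative)."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(                  # position-wise feed-forward layers
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, h):
        # Causal mask: True entries are blocked, so each token attends only backwards.
        T = h.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), diagonal=1)
        a, _ = self.attn(h, h, h, attn_mask=mask)  # multi-headed self-attention
        h = self.ln1(h + a)
        return self.ln2(h + self.ff(h))
```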
Detailed Equations and Explanations
Context Vector of Tokens (U)
$U = (u_{-k}, \ldots, u_{-1})$
Token Embedding Matrix (W_e)
$h_0 = U W_e + W_p$
Position Embedding Matrix (W_p)
Position embeddings are added to the token embeddings to incorporate positional information.
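A small sketch of this embedding step, with illustrative vocabulary and dimension sizes, might look as follows; both $W_e$ and $W_p$ are represented as learned embedding tables.

```python
import torch
import torch.nn as nn

vocab_size, k, d_model = 50000, 512, 768   # illustrative sizes

W_e = nn.Embedding(vocab_size, d_model)    # token embedding matrix W_e
W_p = nn.Embedding(k, d_model)             # position embedding matrix W_p

def embed(U):
    """h_0 = U W_e + W_p for a batch of context windows U of shape (batch, k)."""
    positions = torch.arange(U.size(1), device=U.device)
    return W_e(U) + W_p(positions)          # (batch, k, d_model)
```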
Transformer Block Operations
$h_l = \text{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n]$, where $n$ is the number of decoder layers.
Softmax Operation for Output Distribution
$P(u) = \text{softmax}(h_n W_e^{T})$
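Putting the pieces together, the sketch below stacks $n$ decoder blocks and projects the final hidden states back through the token embedding matrix $W_e$ to obtain a distribution over target tokens. It reuses the hypothetical DecoderBlock from the sketch above; all sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DUPTDecoder(nn.Module):
    """Sketch of the full decoder: n stacked blocks, output projection tied to W_e."""

    def __init__(self, vocab_size=50000, k=512, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        self.W_e = nn.Embedding(vocab_size, d_model)   # token embedding matrix W_e
        self.W_p = nn.Embedding(k, d_model)            # position embedding matrix W_p
        self.blocks = nn.ModuleList(
            [DecoderBlock(d_model, n_heads) for _ in range(n_layers)]
        )

    def forward(self, U):
        # h_0 = U W_e + W_p
        pos = torch.arange(U.size(1), device=U.device)
        h = self.W_e(U) + self.W_p(pos)
        # h_l = transformer_block(h_{l-1}) for l = 1..n
        for block in self.blocks:
            h = block(h)
        # P(u) = softmax(h_n W_e^T); logits are returned, softmax is applied by the loss.
        return h @ self.W_e.weight.T
```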