Decentralized Unsupervised Pre-Training (DUPT)
Decentralized Unsupervised Pre-Training (DUPT) trains language models on data that is distributed across multiple sources, or Oracles. The main objective is to maximize the likelihood of predicting the next token in a sequence given its preceding context, enabling the model to learn the underlying structure of the language.
Oracle Definition
Let $O_j$ be the $j$-th Oracle, each with its own unsupervised corpus of tokens.
Unsupervised Corpus of Tokens
$U(O_j) = \{ u_1(O_j), \ldots, u_n(O_j) \}$
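As a minimal illustration of this setup, each Oracle's corpus can be held as its own sequence of token ids. The oracle names and token values below are hypothetical, not part of any fixed interface.

```python
# Hypothetical sketch: each oracle O_j holds its own unsupervised corpus of token ids.
oracle_corpora = {
    "oracle_1": [12, 7, 301, 48, 5, 99],   # U(O_1) = {u_1(O_1), ..., u_n(O_1)}
    "oracle_2": [88, 3, 412, 19],          # U(O_2)
    "oracle_3": [7, 7, 250, 61, 33],       # U(O_3)
}

for name, tokens in oracle_corpora.items():
    print(f"{name}: {len(tokens)} tokens")
```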
Likelihood Maximization
The goal is to maximize the likelihood of the sequence of tokens.
Formula: Likelihood (L1)
$L_1(U(O_j)) = \sum_i \log P\big(u_i(O_j) \mid u_{i-k}(O_j), \ldots, u_{i-1}(O_j); \Theta\big)$
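The sketch below shows how this sum of log-probabilities could be computed for one Oracle's corpus. The `model(context)` interface, which is assumed to return a log-probability vector over the vocabulary for the next token, is an illustrative assumption rather than a fixed API.

```python
import torch

def l1_log_likelihood(tokens, model, k):
    """Sketch of L1(U(O_j)): sum over i of log P(u_i | u_{i-k}, ..., u_{i-1}; Theta).

    tokens: 1-D LongTensor of token ids for one oracle's corpus.
    model:  assumed to map a (1, context_len) tensor to (1, vocab_size) log-probs.
    k:      context window size.
    """
    total = 0.0
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - k):i]          # at most k preceding tokens
        log_probs = model(context.unsqueeze(0))    # shape: (1, vocab_size)
        total += log_probs[0, tokens[i]].item()    # log P(u_i | context; Theta)
    return total
```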
Context Window Size (k)
The parameter $k$ represents the size of the context window used to predict each token.
Conditional Probability and Neural Network Parameters
The conditional probability $P$ is modeled using a neural network with parameters $\Theta$. These parameters are optimized using stochastic gradient descent.
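A minimal sketch of one stochastic gradient descent update is given below, assuming the model returns unnormalized logits over the vocabulary; maximizing $L_1$ is implemented as minimizing the negative log-likelihood (cross entropy). The batch shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def sgd_step(model, optimizer, context_batch, target_batch):
    """One SGD update on Theta (sketch).

    context_batch: (batch, k) token ids of preceding contexts.
    target_batch:  (batch,) ids of the tokens to predict.
    """
    optimizer.zero_grad()
    logits = model(context_batch)                 # (batch, vocab_size)
    loss = F.cross_entropy(logits, target_batch)  # -log P(u_i | context; Theta)
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
```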
Multi-layer Transformer Decoder
The model employs a multi-layer Transformer decoder architecture.
Multi-headed Self-Attention Operation
This operation allows the model to focus on different parts of the input context.
Position-wise Feedforward Layers
These layers apply the same transformation to each position in the sequence independently.
Output Distribution Over Target Tokens
The model produces an output distribution over the possible target tokens.
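To make these components concrete, here is a minimal sketch of a single Transformer decoder block combining multi-headed self-attention (with a causal mask so each position only sees its preceding context) and position-wise feed-forward layers. The dimensions and layer choices are illustrative assumptions, not the definitive architecture.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Minimal sketch of one Transformer decoder block (sizes are illustrative)."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(                  # position-wise feed-forward layers
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, h):
        # Causal mask: True entries are blocked, so each token attends only backwards.
        T = h.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), diagonal=1)
        a, _ = self.attn(h, h, h, attn_mask=mask)  # multi-headed self-attention
        h = self.ln1(h + a)
        return self.ln2(h + self.ff(h))
```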
Detailed Equations and Explanations
Context Vector of Tokens (U)
$U = (u_{-k}, \ldots, u_{-1})$
Token Embedding Matrix (W_e)
$h_0 = U W_e + W_p$
Position Embedding Matrix (W_p)
Position embeddings are added to the token embeddings to incorporate positional information.
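A small sketch of this embedding step, with illustrative vocabulary and dimension sizes, might look as follows; both $W_e$ and $W_p$ are represented as learned embedding tables.

```python
import torch
import torch.nn as nn

vocab_size, k, d_model = 50000, 512, 768   # illustrative sizes

W_e = nn.Embedding(vocab_size, d_model)    # token embedding matrix W_e
W_p = nn.Embedding(k, d_model)             # position embedding matrix W_p

def embed(U):
    """h_0 = U W_e + W_p for a batch of context windows U of shape (batch, k)."""
    positions = torch.arange(U.size(1), device=U.device)
    return W_e(U) + W_p(positions)          # (batch, k, d_model)
```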
Transformer Block Operations
$h_l = \text{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n]$, where $n$ is the number of decoder layers.
Softmax Operation for Output Distribution
$P(u) = \text{softmax}(h_n W_e^{T})$
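Putting the pieces together, the sketch below stacks $n$ decoder blocks and projects the final hidden states back through the token embedding matrix $W_e$ to obtain a distribution over target tokens. It reuses the hypothetical DecoderBlock from the sketch above; all sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DUPTDecoder(nn.Module):
    """Sketch of the full decoder: n stacked blocks, output projection tied to W_e."""

    def __init__(self, vocab_size=50000, k=512, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        self.W_e = nn.Embedding(vocab_size, d_model)   # token embedding matrix W_e
        self.W_p = nn.Embedding(k, d_model)            # position embedding matrix W_p
        self.blocks = nn.ModuleList(
            [DecoderBlock(d_model, n_heads) for _ in range(n_layers)]
        )

    def forward(self, U):
        # h_0 = U W_e + W_p
        pos = torch.arange(U.size(1), device=U.device)
        h = self.W_e(U) + self.W_p(pos)
        # h_l = transformer_block(h_{l-1}) for l = 1..n
        for block in self.blocks:
            h = block(h)
        # P(u) = softmax(h_n W_e^T); logits are returned, softmax is applied by the loss.
        return h @ self.W_e.weight.T
```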