👥 Decentralized Supervised Fine-Tuning (DSFT)

Transition from DUPT to DSFT

After the Decentralized Unsupervised Pre-Training (DUPT) phase is complete, the pre-trained model parameters are adapted to a supervised target task using labeled data.

Labeled Dataset Assumptions

Oracle-Labeled Dataset

Each Oracle $O_j$ has a labeled dataset $C(O_j)$ used for fine-tuning.

Sequence of Input Tokens and Labels

Each training instance in $C(O_j)$ consists of a sequence of input tokens $x_1(O_j), \ldots, x_m(O_j)$ together with a label $y(O_j)$.
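To make the notation concrete, here is a minimal sketch of how one Oracle's labeled set might be represented in code; the token IDs and label values are hypothetical placeholders, not part of the protocol.

```python
# One Oracle's labeled fine-tuning set C(O_j): each example pairs a token
# sequence x_1(O_j), ..., x_m(O_j) with a single label y(O_j).
labeled_dataset = [
    {"tokens": [12, 408, 91, 7], "label": 2},  # x_1..x_m with m = 4, and y
    {"tokens": [55, 3, 860, 14], "label": 0},
]
```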

Supervised Target Task

Activation of the Final Transformer Block

The input tokens are processed through the pre-trained model to obtain the final transformer block's activation at the last token $m$, denoted $h_l^m(O_j)$.

Linear Output Layer with Parameters $W_y$

A linear output layer with parameters $W_y(O_j)$ is added on top of this activation to predict the target label.
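The sketch below illustrates these two steps in PyTorch, including the softmax prediction defined formally in the next subsection. It is a minimal sketch, not the reference implementation: it assumes the pre-trained DUPT transformer exposes hidden states of shape (batch, seq_len, d_model), and `d_model` and `num_labels` are hypothetical values.

```python
import torch
import torch.nn as nn

d_model, num_labels = 768, 3                      # hypothetical sizes
W_y = nn.Linear(d_model, num_labels, bias=False)  # added output layer W_y(O_j)

def predict(hidden_states: torch.Tensor) -> torch.Tensor:
    """Map final-block hidden states to class probabilities."""
    h_l_m = hidden_states[:, -1, :]       # h_l^m(O_j): activation at token m
    logits = W_y(h_l_m)                   # h_l^m(O_j) W_y(O_j)
    return torch.softmax(logits, dim=-1)  # P(y(O_j) | x_1(O_j), ..., x_m(O_j))
```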

Prediction and Objective Maximization

Softmax Operation for Prediction

$P(y(O_j) \mid x_1(O_j), \ldots, x_m(O_j)) = \text{softmax}\left(h_l^m(O_j)\, W_y(O_j)\right)$

Formula: Likelihood ($L_2$)

$L_2(C(O_j)) = \sum_{(x,y)} \log P(y(O_j) \mid x_1(O_j), \ldots, x_m(O_j))$
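In code, maximizing $L_2$ is equivalent to minimizing the summed negative log-likelihood, which cross-entropy on raw logits computes directly. A minimal sketch, assuming `logits` comes from the output layer above:

```python
import torch
import torch.nn.functional as F

def neg_l2(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Negative L2: summed NLL of the true labels over C(O_j)."""
    # F.cross_entropy applies log-softmax internally, so it takes raw logits;
    # reduction="sum" mirrors the sum over (x, y) in the formula.
    return F.cross_entropy(logits, labels, reduction="sum")
```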

Auxiliary Objectives and Their Benefits

Improving Generalization

Including the language modeling objective from pre-training as an auxiliary objective during fine-tuning helps the model generalize better.

Accelerating Convergence

The auxiliary objective also accelerates convergence during training.

Combined Objective Formula ($L_3$)

$L_3(C(O_j)) = L_2(C(O_j)) + \lambda \cdot L_1(C(O_j))$

Here $L_1(C(O_j))$ is the language modeling objective from the DUPT phase applied to the fine-tuning corpus, and $\lambda$ weights the auxiliary term.
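A minimal sketch of the combined objective expressed as a loss to minimize, i.e. $-L_3 = -L_2 - \lambda L_1$. The names `clf_logits`, `lm_logits`, `next_tokens`, and the default value of `lam` ($\lambda$) are hypothetical.

```python
import torch.nn.functional as F

def neg_l3(clf_logits, labels, lm_logits, next_tokens, lam=0.5):
    """Combined fine-tuning loss: minimizing it maximizes L3 = L2 + λ·L1."""
    neg_l2 = F.cross_entropy(clf_logits, labels, reduction="sum")
    # Auxiliary LM objective L1: predict each next token of the sequence;
    # lm_logits is (batch, seq_len, vocab), next_tokens the shifted targets.
    neg_l1 = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),
        next_tokens.reshape(-1),
        reduction="sum",
    )
    return neg_l2 + lam * neg_l1
```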
