👥 Decentralized Supervised Fine-Tuning (DSFT)

Transition from DUPT to DSFT
After the pre-training phase with DUPT is complete, the model parameters are adapted to a supervised target task using labeled data.
Labeled Dataset Assumptions
Oracle-Labeled Dataset
Each Oracle $O_j$ has a labeled dataset $C(O_j)$ used for fine-tuning.
Sequence of Input Tokens and Labels
Each example in $C(O_j)$ consists of a sequence of input tokens $x_1(O_j), \ldots, x_m(O_j)$ together with a label $y(O_j)$.
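As a minimal illustration of these assumptions, an Oracle's shard can be represented with plain data structures. The Python class and field names below are illustrative, not from the source:

```python
from dataclasses import dataclass

@dataclass
class LabeledExample:
    tokens: list[int]  # input token ids x_1(O_j), ..., x_m(O_j)
    label: int         # target label y(O_j)

@dataclass
class OracleDataset:
    oracle_id: str                  # identifier of the Oracle O_j
    examples: list[LabeledExample]  # the labeled dataset C(O_j)
```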
Supervised Target Task
Activation of the Final Transformer Block
The input tokens are processed through the pre-trained model to obtain the final transformer block's activation $h_l^m(O_j)$.
Linear Output Layer with Parameters $W_y$
A linear output layer with parameters $W_y(O_j)$ is added on top of this activation to predict the target label.
Prediction and Objective Maximization
Softmax Operation for Prediction
$P(y(O_j) \mid x_1(O_j), \ldots, x_m(O_j)) = \text{softmax}(h_l^m(O_j)\, W_y(O_j))$
Formula: Likelihood (L2)
$L_2(C(O_j)) = \sum_{(x,y)} \log P(y(O_j) \mid x_1(O_j), \ldots, x_m(O_j))$
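A minimal PyTorch sketch of this head and objective is shown below. It assumes the pre-trained model already yields the final-block activation $h_l^m(O_j)$; the names `ClassificationHead` and `l2_objective` are illustrative, not part of the source:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Linear output layer W_y applied to the final activation h_l^m."""
    def __init__(self, hidden_dim: int, num_labels: int):
        super().__init__()
        self.w_y = nn.Linear(hidden_dim, num_labels, bias=False)

    def forward(self, h_lm: torch.Tensor) -> torch.Tensor:
        # softmax(h_l^m W_y), returned as log-probabilities for the loss
        return torch.log_softmax(self.w_y(h_lm), dim=-1)

def l2_objective(head: ClassificationHead,
                 activations: torch.Tensor,
                 labels: torch.Tensor) -> torch.Tensor:
    """L2(C(O_j)) = sum over (x, y) of log P(y | x_1, ..., x_m).

    activations: (batch, hidden_dim) final-block activations h_l^m(O_j)
    labels:      (batch,) target labels y(O_j)
    """
    log_probs = head(activations)  # (batch, num_labels)
    # Select log P(y | x) for each example and sum over the dataset
    return log_probs.gather(1, labels.unsqueeze(1)).sum()
```

In practice the objective is maximized, so the training loop would minimize its negation.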
Auxiliary Objectives and Their Benefits
Improving Generalization
Including language modeling as an auxiliary objective helps the model generalize better.
Accelerating Convergence
The auxiliary objective also speeds up convergence during training.
Combined Objective Formula (L3)
$L_3(C(O_j)) = L_2(C(O_j)) + \lambda \cdot L_1(C(O_j))$
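Continuing the sketch above, the combined objective is a weighted sum of the two log-likelihoods. The function name and the default value of `lam` are illustrative assumptions; the source only fixes the form $L_3 = L_2 + \lambda \cdot L_1$:

```python
import torch

def l3_objective(l2_value: torch.Tensor,
                 l1_value: torch.Tensor,
                 lam: float = 0.5) -> torch.Tensor:
    """L3(C(O_j)) = L2(C(O_j)) + lambda * L1(C(O_j)).

    l2_value: supervised classification log-likelihood L2 (from above)
    l1_value: auxiliary language-modeling log-likelihood L1 from DUPT,
              computed over the same fine-tuning corpus C(O_j)
    lam:      weighting factor lambda (the 0.5 default is an assumption)
    """
    return l2_value + lam * l1_value
```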