P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks

Paper Info

  • Date: 2023-04-13
  • Reviewer: 건우 김
  • Property: Parameter Efficiency

1. Introduction

  • Prompting: freeze all params of the PLM and use a natural-language prompt to query the LM

    • ex) Sentiment analysis: “Amazing movie!” → “Amazing movie!” + “This movie is [MASK]”, choose between (good/bad) (a minimal fill-mask sketch follows below)

    • Discrete prompting leads to suboptimal performance
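To make the cloze-style setup above concrete, here is a minimal sketch using a frozen masked LM via the HuggingFace fill-mask pipeline; the model name and the good/bad verbalizer words are illustrative choices, not taken from the paper.

```python
# Minimal sketch of discrete prompting with a frozen masked LM.
# Model choice and verbalizer words are illustrative, not from the paper.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

review = "Amazing movie!"
prompt = review + " This movie is [MASK]."

# Restrict the prediction to the verbalizer tokens "good" / "bad"
for pred in fill_mask(prompt, targets=["good", "bad"]):
    print(pred["token_str"], pred["score"])
```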

  • Prompt tuning: tuning only the continuous prompts. (ex. P-tuning)

    • Two problems: existing works lack universality of prompt tuning for NLU

      1. Model size: P-tuning (Liu et al. 2021) underperforms fine-tuning on models with fewer than 10B params

      2. Hard tasks: poor performance on extractive QA and NER (sequence tagging) tasks

  • Contributions

    • Properly optimized prompt tuning can be universally effective across model scales and tasks

      • works across model scales from 300M to 10B params

      • achieves performance comparable to fine-tuning on extractive QA and NER tasks

2. Preliminaries

(omitted)

3. P-Tuning v2

  • Lack of universality

    • scale: prompt tuning (Lester et al. 2021) (준원) → poor performance when the model size is 100M~1B

    • tasks: (Liu et al. 2021) P-tuning → performs poorly on typical sequence tagging tasks

  • Deep prompt tuning

  • Existing problem → figure (a)

    1. number of tunable params is limited due to the constraints of sequence length

    2. input embeddings have relatively indirect impact on model prediction

    3. (speculation) presumably because the input-layer prompts are too far from the LM head that makes the prediction

  • Proposed method → figure (b)

    • Prompts in different layers are added as prefix tokens

      • Can have more tunable params

      • Have a more direct impact on model predictions (a rough parameter-count sketch follows below)
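A back-of-the-envelope comparison of how many parameters become tunable, assuming BERT-large-like dimensions (24 layers, hidden size 1024) and a 100-token prompt; the numbers are only illustrative, not from the paper.

```python
# Rough tunable-parameter count: shallow (input-only) vs. deep (every-layer) prompts.
# BERT-large-like dimensions assumed; purely illustrative.
hidden, n_layers, prompt_len = 1024, 24, 100

shallow = prompt_len * hidden                 # prompts only at the input embedding layer
deep = prompt_len * n_layers * 2 * hidden     # key and value prefixes at every layer

print(f"shallow prompt tuning: {shallow:,} tunable params")  # ~0.1M
print(f"deep prompt tuning:    {deep:,} tunable params")     # ~4.9M
```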

  • Optimization and implementation

    • Reparameterization

      • Prior works usually use a reparameterization encoder (an MLP) to form the trainable embeddings → but its benefit depends on the task/dataset

        • RTE/CoNLL04: MLP brings a consistent improvement

        • BoolQ/CoNLL12: MLP leads to negative effects

→ Unlike what the authors claim, the results do not look entirely consistent, but they do show that using the MLP is not always beneficial. (Maybe that is why the graph was moved to the appendix..)
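A minimal PrefixEncoder-style sketch in PyTorch showing the reparameterization option; the class name, the two-layer MLP, and the use_mlp flag are modeled on the description above, not copied from the official repository.

```python
import torch
import torch.nn as nn

class PrefixEncoder(nn.Module):
    """Produces per-layer prompt vectors, optionally through an MLP reparameterization."""

    def __init__(self, prefix_len, hidden, n_layers, use_mlp=True, mlp_hidden=512):
        super().__init__()
        out_dim = n_layers * 2 * hidden  # a key prefix and a value prefix for every layer
        self.use_mlp = use_mlp
        if use_mlp:
            # Reparameterization: small embedding -> MLP -> per-layer prompts
            self.embedding = nn.Embedding(prefix_len, hidden)
            self.mlp = nn.Sequential(
                nn.Linear(hidden, mlp_hidden),
                nn.Tanh(),
                nn.Linear(mlp_hidden, out_dim),
            )
        else:
            # No reparameterization: train the per-layer prompts directly
            self.embedding = nn.Embedding(prefix_len, out_dim)

    def forward(self, prefix_ids):  # prefix_ids: (batch, prefix_len) index tensor
        x = self.embedding(prefix_ids)
        return self.mlp(x) if self.use_mlp else x  # (batch, prefix_len, n_layers*2*hidden)
```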

  • Prompt Length

    • Length affects the performance

      • Simple classification task → prefer shorter prompts (less than 20)

      • Harder sequence labeling task → prefer longer prompts (around 100)

  • Multi-task Learning

    • Optimize on multiple tasks jointly (a pretraining stage) before fine-tuning for individual tasks

      • The “pretraining” mentioned here is a training stage immediately before fine-tuning

      • “Multiple tasks” does not mean different tasks; it refers to multiple datasets of the same task

        • ex) SRL task: combine CoNLL05, CoNLL12
    • Boosts performance (a sketch of this two-stage setup follows below)
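A sketch of the two-stage setup described above (shared prompt training on pooled datasets of the same task, then per-dataset tuning); SRLDataset, train_prompts, and model are hypothetical placeholders, not the authors' code.

```python
from torch.utils.data import ConcatDataset, DataLoader

def train_prompts(model, loader):
    """Placeholder for one prompt-tuning pass (frozen LM + trainable prompts)."""
    for batch in loader:
        ...  # forward/backward updating only the prompt parameters

# Hypothetical dataset objects for the same task (SRL)
conll05, conll12 = SRLDataset("conll05"), SRLDataset("conll12")

# Stage 1: shared "pretraining" of the prompts on the pooled SRL data
pooled = DataLoader(ConcatDataset([conll05, conll12]), batch_size=32, shuffle=True)
train_prompts(model, pooled)

# Stage 2: continue from the shared prompts on each individual dataset
for ds in (conll05, conll12):
    train_prompts(model, DataLoader(ds, batch_size=32, shuffle=True))
```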

  • Classification Head

    • Prior works usually use the LM head to predict verbalizers → unnecessary in the full-data setting and incompatible with sequence labeling; P-tuning v2 instead applies a randomly initialized classification head on top of the tokens, as in fine-tuning (see the sketch after this list)

  • Compared to other methods
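A minimal sketch of the [CLS] + linear classification head alternative (as opposed to a verbalizer + LM head); the class name and backbone are my choices for illustration, not the authors' code.

```python
import torch.nn as nn
from transformers import AutoModel

class ClsHeadClassifier(nn.Module):
    """Classifies with a randomly initialized linear head on [CLS]; no verbalizer needed."""

    def __init__(self, backbone="bert-base-uncased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # representation of the [CLS] token
        return self.classifier(cls)        # (batch, num_labels) logits
```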

4. Experiments

  • Across scales
  • Small models (BERT/RoBERTa)

    • P-tuning at smaller scales shows poor performance

    • P-tuning v2 matches the finetuning performance in all tasks at a smaller scale

  • Large models (GLM)

    • The gap between fine-tuning and PT/PT2 gradually narrows as the model scales up

    • P-tuning v2 is always comparable to fine-tuning at all scales while using only 0.1% task-specific params

  • Across Tasks

  • MTL boosts P-tuning v2

  • QA task: P-tuning shows extremely poor performance (BAD!)

  • Ablation studies

    • [CLS] + linear head vs. verbalizer + LM head

→ Results indicate there is no significant difference (the verbalizer is a hassle, so just use [CLS])

  • Prompt depth

    • Deeper is better

    • Order (which layers receive prompts)

      • Ascending: RTE and BoolQ show a similar trend

      • Descending: on RTE, prompts only on layers 17-24 already give nearly the same performance (a small layer-selection sketch follows below)
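A small sketch of the layer-selection idea behind the prompt-depth ablation; the function and the 1-based layer indexing are assumptions for illustration only.

```python
# Choose which transformer layers receive prompts, counting either from the
# input side (ascending) or the output side (descending). Illustrative only.
n_layers = 24

def prompted_layers(k, order="descend"):
    """Return 1-based indices of the k layers that get a prompt prefix."""
    if order == "ascend":
        return list(range(1, k + 1))                       # e.g. k=8 -> layers 1-8
    return list(range(n_layers - k + 1, n_layers + 1))     # e.g. k=8 -> layers 17-24

print(prompted_layers(8, "ascend"))   # [1, 2, ..., 8]
print(prompted_layers(8, "descend"))  # [17, 18, ..., 24]
```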

5. Code & Results

  • Code

    • The PrefixEncoder is used in the same way

  • Deep prompt tuning (applied to all layers) is used when training on downstream tasks

    • self.prefix_encoder generates the prompts for all layers at once

    • The generated prompts are passed to BERT via past_key_values

      • The past_key_values declared there are used to store the self-attention key/value tensors computed from the preceding chunk (the generated prompts)

→ I thought the reason the ‘n_layer*2’ dimension of the past_key_values shape is split by 2 was to separate the key and value tensors, but the author seems to say otherwise, so I would like to ask about this.. (a sketch of my reading of the shapes follows below)
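My reading of the reshape-and-split step, following the usual HuggingFace past_key_values convention (one (key, value) pair per layer); this is a sketch of that interpretation, not a confirmed answer to the question above. Dimensions follow bert-base, and the embedding stands in for self.prefix_encoder.

```python
import torch
import torch.nn as nn
from transformers import BertModel  # assuming a version that accepts the legacy tuple format

batch, prefix_len, seq_len = 2, 20, 16
n_layers, n_heads, head_dim = 12, 12, 64
hidden = n_heads * head_dim

# Stand-in for self.prefix_encoder: one long vector per prefix position covering all layers
prefix_encoder = nn.Embedding(prefix_len, n_layers * 2 * hidden)
prefix_ids = torch.arange(prefix_len).unsqueeze(0).expand(batch, -1)

past = prefix_encoder(prefix_ids)                    # (batch, prefix_len, n_layers*2*hidden)
past = past.view(batch, prefix_len, n_layers * 2, n_heads, head_dim)
past = past.permute(2, 0, 3, 1, 4).split(2)          # n_layers chunks of (2, batch, n_heads, prefix_len, head_dim)
# Under the HF convention, chunk[0] would be the key prefix and chunk[1] the value prefix.

# The frozen encoder treats the prefixes as cached keys/values in every layer,
# so the attention mask must also cover the prefix positions.
bert = BertModel.from_pretrained("bert-base-uncased")
input_ids = torch.randint(1000, 2000, (batch, seq_len))
attention_mask = torch.ones(batch, prefix_len + seq_len)
outputs = bert(input_ids=input_ids, attention_mask=attention_mask, past_key_values=past)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden)
```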

  • CoLA results

Results come out similar (to those reported)