Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection

Paper Info

  • Date: 2025-01-02
  • Reviewer: 조영재
  • Property: LM

Many benchmark datasets used for evaluation come in a multiple-choice format. But small models still struggle with instruction following, so answer parsing has to be tailored and prompts have to be adjusted per model, or the model's free-form answer has to be mapped back to a choice using GPT. → Isn't it unfair for each model to get a different prompt? Is there a method that unifies all of this? Could we always force the mapping via logits?

Uncertainty might be a bit of a problem, though...? Also, a high logit on "A" may not mean answer choice A: the model could have been starting a word like "As" or "An apple". Is the model's actual answer really correlated with the first-step logits? Scoring from logits would certainly be convenient, but is it meaningful? In the end, what the user sees is text, so do we have to insist on a text output format anyway? And does this evaluation style overlook CoT?


1. Introduction

  • Motivation: While LLMs typically rely on autoregressive decoding (token-by-token generation), many real-world tasks involve selecting an answer from a candidate pool—such as multiple-choice QA or clinical decision-making.

  • Problem: Full decoding is slow and non-differentiable (sampling breaks gradient flow); decoding-free methods, which use only the initial (first-step) logits, are therefore increasingly used.

  • Contribution:

    • Provides the first formal definition of decoding-free generative candidate selection.

    • Performs a comprehensive empirical evaluation across diverse tasks (QA + clinical).

    • Compares 5 decoding-free methods, full decoding, and dense retrieval.


2. Problem Formulation

2.1 Generative LMs

  • Traditional full decoding
  • Full decoding with candidate selection

2.4 Decoding-Free Generative Candidate Selection

  • Goal: Estimate the candidate probabilities directly from the first-step logits without generating tokens.
  • Uses only the raw logits and the candidate token representations (see the sketch below).
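A minimal sketch of this setup, assuming a HuggingFace causal LM (the checkpoint name is illustrative; LLaMA3-8B is one of the paper's base models):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint name is illustrative; any causal LM works here.
MODEL = "meta-llama/Meta-Llama-3-8B"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def first_step_logits(prompt: str) -> torch.Tensor:
    """One forward pass, zero generated tokens: return the logits the model
    assigns to the first output position, a (vocab_size,) vector."""
    ids = tok(prompt, return_tensors="pt").input_ids
    return model(input_ids=ids).logits[0, -1]  # next-token (first-step) logits
```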

3. Candidate Selection Methods

3.1 Estimation via Logits

  • First-token logit: use the logit of the candidate's first token

  • Last-token logit: use the logit of the candidate's last token

  • Average logits: Mean over all candidate token logits

  • Sum logits: Sum over all candidate token logits

  • Sample Avg.: Average logits of sampled tokens (used for long candidates)

For a candidate $c$ tokenized as $(t_1, \dots, t_{|c|})$ with first-step logits $\ell$, the scores are: logit of the $k$-th token, $s_k(c) = \ell_{t_k}$ (4); averaged token logits, $s_{\text{avg}}(c) = \frac{1}{|c|}\sum_{k=1}^{|c|} \ell_{t_k}$ (5); and summed token logits, $s_{\text{sum}}(c) = \sum_{k=1}^{|c|} \ell_{t_k}$ (6).
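A minimal Python sketch of these heuristics (the function name and the sampling size `sample_k` are my own; the paper only specifies the scoring rules themselves):

```python
import torch

def candidate_scores(first_step_logits: torch.Tensor,
                     candidate_token_ids: list[list[int]],
                     method: str = "avg",
                     sample_k: int = 8) -> torch.Tensor:
    """Score every candidate from a single (vocab_size,) logit vector."""
    scores = []
    for ids in candidate_token_ids:
        tok_logits = first_step_logits[ids]    # logits of this candidate's tokens
        if method == "first":                  # first-token logit
            s = tok_logits[0]
        elif method == "last":                 # last-token logit
            s = tok_logits[-1]
        elif method == "avg":                  # mean over candidate tokens, Eq. (5)
            s = tok_logits.mean()
        elif method == "sum":                  # sum over candidate tokens, Eq. (6)
            s = tok_logits.sum()
        elif method == "sample_avg":           # mean over sampled tokens (long candidates)
            idx = torch.randperm(len(ids))[:sample_k]
            s = tok_logits[idx].mean()
        else:
            raise ValueError(f"unknown method: {method}")
        scores.append(s)
    return torch.stack(scores)

# Usage: pick the argmax candidate, e.g.
# logits = first_step_logits(prompt)   # from the sketch in §2.4 above
# best = candidate_scores(logits, cand_token_ids, method="sum").argmax().item()
```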

3.2 Baselines

  • Full decoding: generate the answer text, then map the output back to a candidate

  • Dense retrieval: Facebook DPR embeddings scored with cosine similarity (see the sketch below)
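A minimal sketch of the dense-retrieval baseline, assuming the public single-nq DPR checkpoints (the paper only says "Facebook DPR", so the exact checkpoint names are an assumption):

```python
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

# Standard public DPR checkpoints; an assumption, the paper does not name them.
Q_CKPT = "facebook/dpr-question_encoder-single-nq-base"
C_CKPT = "facebook/dpr-ctx_encoder-single-nq-base"
q_tok = DPRQuestionEncoderTokenizer.from_pretrained(Q_CKPT)
q_enc = DPRQuestionEncoder.from_pretrained(Q_CKPT)
c_tok = DPRContextEncoderTokenizer.from_pretrained(C_CKPT)
c_enc = DPRContextEncoder.from_pretrained(C_CKPT)

@torch.no_grad()
def retrieve(question: str, candidates: list[str]) -> int:
    """Return the index of the candidate most cosine-similar to the question."""
    q = q_enc(**q_tok(question, return_tensors="pt")).pooler_output      # (1, 768)
    c = c_enc(**c_tok(candidates, return_tensors="pt",
                      padding=True, truncation=True)).pooler_output      # (N, 768)
    return torch.nn.functional.cosine_similarity(q, c).argmax().item()
```

The full-decoding baseline instead generates text first and then maps it to the closest candidate, e.g., by string match or embedding similarity.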


4. Evaluation Settings

4.1 Tasks

  • Limited-candidate tasks (3–5 options):

    • CommonsenseQA, MMLU, GPQA, BIG-Bench, ARC
  • Massive-candidate tasks (1K–94K options):

    • Diagnosis (ICD-10), Procedures (ICD-10-PCS), Lab Orders (LOINC), Prescriptions (ATC)

4.2 Base LMs

  • Decoder-only: LLaMA3 (8B), Mistral (7.3B)

  • Encoder-decoder: Flan-T5 XXL (11B)

  • Variants with and without instruction tuning are used


5. Experimental Results

Key Insights

  • Insight 1: Estimation methods are strong when full decoding fails (e.g., weak models or hard tasks like GPQA)

    • On low-scoring tasks like GPQA, estimation performs about the same as full decoding
  • Insight 2: When full decoding works well, estimation methods fall short

    • On high-scoring tasks, estimation lags behind
  • Insight 3: Instruction tuning helps full decoding but not decoding-free estimation

    • With or without instruction tuning, the estimation scores come out similar (31.83 + 9.11 vs 70.70 - 38.34)
  • Insight 4: Method effectiveness varies with model and dataset

    • Results vary widely by model and by dataset.
  • Insight 5: Decoding-free estimation is far more efficient (up to 57.6× speedup)

Detailed Analyses

  • 5.2 Output step: (a)

    • First-step logits are the most informative and performant

  • 5.3 Candidate token selection: (b)

    • Uses GPT to keep only the important tokens of each candidate

    • Using the entire sequence is better than selecting keywords

  • 5.4 Sensitivity:

    • Decoder-only models benefit from increased size

    • Longer candidate sequences reduce performance


6. Related Work

  • Classification-based methods require retraining and don’t generalize to dynamic candidates.

  • Retrieval models struggle with reasoning tasks.

  • This is the first systematic evaluation of decoding-free methods.


7. Conclusion & Future Work

  • Main findings:

    • Estimation from logits is a viable alternative, especially when full decoding is brittle.

    • First-step logits are optimal.

    • Simple heuristics like token averaging or summing provide robust approximations.

  • Future directions:

    • Use multiple output steps for estimation.

    • Compress candidate sequences into key tokens using LLMs.

    • Improve efficiency using techniques like PagedAttention.

Appendix

MMLU math performance table: logit estimation (with and without CoT) vs. full decoding.