attention

an archive of posts with this tag

Aug 12, 2025 On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective
Aug 05, 2025 Impact of Fine-Tuning Methods on Memorization in Large Language Models
Aug 05, 2025 Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Jul 15, 2025 Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
Jul 15, 2025 Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models
Jun 24, 2025 See What You Are Told: Visual Attention Sink in Large Multimodal Models
Jun 10, 2025 Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction
Apr 22, 2025 Fine-tuning Vision-Language-Action Models: Optimizing Speed and Success
Apr 08, 2025 On the Biology of a Large Language Model
Mar 11, 2025 When Is Task Vector Provably Effective for Model Editing? A Generalization Analysis of Nonlinear Transformers
Mar 04, 2025 Contextual Document Embeddings
Feb 18, 2025 DeepSeek v3
Feb 04, 2025 Titans: Learning to Memorize at Test Time
Feb 04, 2025 SSM → HiPPO → LSSL → S4 → Mamba → Mamba2
Jan 02, 2025 Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection
Jan 02, 2025 d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
Sep 02, 2024 LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
Aug 13, 2024 Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process
Aug 13, 2024 Knowledge Conflict Survey
Jul 30, 2024 In-Context Retrieval-Augmented Language Models
Jun 11, 2024 Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Jun 11, 2024 Contextual Position Encoding: Learning to Count What’s Important
Jun 04, 2024 Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
May 21, 2024 LLAMA PRO: Progressive LLaMA with Block Expansion
May 07, 2024 How to Train an LLM? - From Data Parallel to Fully Sharded Data Parallel
May 07, 2024 How to Run Inference on a Big LLM? - Using the Accelerate Library
Mar 11, 2024 BitNet: Scaling 1-bit Transformers for Large Language Models
Feb 27, 2024 Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Feb 13, 2024 LLM Augmented LLMs: Expanding Capabilities through Composition
Jan 23, 2024 Overthinking the Truth: Understanding How Language Models Process False Demonstrations
Jan 16, 2024 Mistral 7B & Mixtral (Mixtral of Experts)
Jan 16, 2024 Benchmarking Cognitive Biases in Large Language Models as Evaluators
Jan 09, 2024 Making Large Language Models A Better Foundation For Dense Retrieval
Jan 03, 2024 vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
Dec 19, 2023 Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
Dec 12, 2023 Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning
Oct 31, 2023 Efficient Streaming Language Models with Attention Sinks
Oct 10, 2023 LongLoRA: Efficient Fine-Tuning of Long-Context Large Language Models
Oct 03, 2023 DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
Sep 12, 2023 A Systematic Study of Knowledge Distillation for Natural Language Generation with Pseudo-Target Training
Aug 29, 2023 Code Llama: Open Foundation Models for Code
Apr 13, 2023 P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
Apr 13, 2023 AdapterDrop: On the Efficiency of Adapters in Transformers
Mar 16, 2023 Calibrating Factual Knowledge in Pretrained Language Models