| Date | Title |
| --- | --- |
| Aug 12, 2025 | On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective |
| Aug 05, 2025 | Impact of Fine-Tuning Methods on Memorization in Large Language Models |
| Aug 05, 2025 | Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models |
| Jul 15, 2025 | Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models |
| Jul 15, 2025 | Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models |
| Jun 24, 2025 | See What You Are Told: Visual Attention Sink in Large Multimodal Models |
| Jun 10, 2025 | Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction |
| Apr 22, 2025 | Fine-tuning Vision-Language-Action Models: Optimizing Speed and Success |
| Apr 08, 2025 | On the Biology of a Large Language Model |
| Mar 11, 2025 | When Is Task Vector Provably Effective for Model Editing? A Generalization Analysis of Nonlinear Transformers |
| Mar 04, 2025 | Contextual Document Embeddings |
| Feb 18, 2025 | DeepSeek-V3 |
| Feb 04, 2025 | Titans: Learning to Memorize at Test Time |
| Feb 04, 2025 | SSM → HiPPO → LSSL → S4 → Mamba → Mamba-2 |
| Jan 02, 2025 | Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection |
| Jan 02, 2025 | d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning |
| Sep 02, 2024 | LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders |
| Aug 13, 2024 | Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process |
| Aug 13, 2024 | Knowledge conflict survey |
| Jul 30, 2024 | In-Context Retrieval-Augmented Language Models |
| Jun 11, 2024 | Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet |
| Jun 11, 2024 | Contextual Position Encoding: Learning to Count What’s Important |
| Jun 04, 2024 | Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training |
| May 21, 2024 | LLAMA PRO: Progressive LLaMA with Block Expansion |
| May 07, 2024 | How to Train LLMs? - From Data Parallel to Fully Sharded Data Parallel |
| May 07, 2024 | How to Run Inference on Big LLMs? - Using the Accelerate Library |
| Mar 11, 2024 | BitNet: Scaling 1-bit Transformers for Large Language Models |
| Feb 27, 2024 | Self-RAG: Learning to Retrieve, Generate, and Critique Through Self-Reflection |
| Feb 13, 2024 | LLM Augmented LLMs: Expanding Capabilities Through Composition |
| Jan 23, 2024 | Overthinking the Truth: Understanding How Language Models Process False Demonstrations |
| Jan 16, 2024 | Mistral 7B & Mixtral (Mixtral of Experts) |
| Jan 16, 2024 | Benchmarking Cognitive Biases in Large Language Models as Evaluators |
| Jan 09, 2024 | Making Large Language Models A Better Foundation For Dense Retrieval |
| Jan 03, 2024 | vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention |
| Dec 19, 2023 | Break the Sequential Dependency of LLM Inference Using Lookahead Decoding |
| Dec 12, 2023 | Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning |
| Oct 31, 2023 | Efficient Streaming Language Models with Attention Sinks |
| Oct 10, 2023 | LongLoRA: Efficient Fine-Tuning of Long-Context Large Language Models |
| Oct 03, 2023 | DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models |
| Sep 12, 2023 | A Systematic Study of Knowledge Distillation for Natural Language Generation with Pseudo-Target Training |
| Aug 29, 2023 | Code Llama: Open Foundation Models for Code |
| Apr 13, 2023 | P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks |
| Apr 13, 2023 | AdapterDrop: On the Efficiency of Adapters in Transformers |
| Mar 16, 2023 | Calibrating Factual Knowledge in Pretrained Language Models |