Feb 18, 2025 DeepSeek v3
Jun 04, 2024 Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
May 07, 2024 How to Train LLM? - From Data Parallel To Fully Sharded Data Parallel