The Llama 3 Herd of Models

Contents
Introduction
General Overview
Pre-Training
Pre-Training Data
Web Data Curation
Determining the Data Mix
Annealing Data
Model Architecture
Scaling Laws
Infrastructure, Scaling, and Efficiency
Training Infrastructure
Parallelism for Model Scaling
Collective Communication
Reliability and Operational Challenges
Training Recipe
Initial Pre-Training
Long Context Pre-Training
Annealing
Post-Training
Modeling
Chat Dialog Format
Reward Modeling
Supervised Finetuning
Direct Preference Optimization
Model Averaging
Iterative Rounds
Post-training Data
Preference Data
SFT Data
Data Processing and Quality Control
Capabilities
Code
Multilinguality
Math and Reasoning
Long Context
Tool Use
Factuality
Steerability
Results
Pre-trained Language Model
Standard Benchmarks
Model Robustness
Adversarial Benchmarks
Contamination Analysis
Post-trained Language Model
General Knowledge and Instruction-Following Benchmarks
Proficiency Exams
Coding Benchmarks
Multilingual Benchmarks
Math and Reasoning Benchmarks
Long Context Benchmarks
Tool Use Performance
Human Evaluations
Safety
Benchmark Construction
Safety Pre-training
Safety Finetuning
Safety Results
Cybersecurity and Chemical/Biological Weapons Safety
Red Teaming
System Level Safety
Limitations
Inference
Pipeline Parallelism
FP8 Quantization
Vision Experiments
Data
Image Data
Video Data
Model Architecture
Model Scaling
Pre-training
Post-Training
Supervised Finetuning Data
Supervised Finetuning Recipe
Preference Data
Reward Modeling
Direct Preference Optimization
Rejection Sampling
Quality Tuning
Image Recognition Results
Video Recognition Results
Speech Experiments
Data
Speech Understanding
Speech Generation
Model Architecture
Speech Understanding
Speech Generation
Training Recipe
Speech Understanding
Speech Generation
Speech Understanding Results
Speech Generation Results
Related Work
Language
Multimodality
Conclusion
The Llama 3 Herd of Models

Llama Team, AI @ Meta*

*A detailed contributor list can be found in the appendix of this paper.

Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.

Date: July 23, 2024
Website: https://llama.meta.com/

1 Introduction

Foundation models are general models of language, vision, speech, and/or other modalities that are designed to support a large variety of AI tasks. They form the basis of many modern AI systems. The development of modern foundation models consists of two main stages: (1) a pre-training stage in which the model is trained at massive scale using straightforward tasks such as next-word prediction or captioning, and (2) a post-training stage in which the model is tuned to follow instructions, align with human preferences, and improve specific capabilities (for example, coding and reasoning).

In this paper, we present a new set of foundation models for language, called Llama 3. The Llama 3 Herd of models natively supports multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters, processing information in a context window of up to 128K tokens. Each member of the herd is listed in Table 1. All the results presented in this paper are for the Llama 3.1 models, which we will refer to as Llama 3 throughout for brevity.

We believe there are three key levers in the development of high-quality foundation models: data, scale, and managing complexity. We seek to optimize for these three levers in our development process:

• Data. Compared to prior versions of Llama (Touvron et al., 2023a,b), we improved both the quantity and quality of the data we use for pre-training and post-training. These improvements include the development of more careful pre-processing and curation pipelines for pre-training data and the development of more rigorous quality assurance and filtering approaches for post-training data. We pre-train Llama 3 on a corpus of about 15T multilingual tokens, compared to 1.8T tokens for Llama 2.

• Scale. We train a model at far larger scale than previous Llama models: our flagship language model was pre-trained using 3.8 × 10²⁵ FLOPs, almost 50× more than the largest version of Llama 2. Specifically, we pre-trained a flagship model with 405B trainable parameters on 15.6T text tokens. As expected per
scaling laws for foundation models, our flagship model outperforms smaller models trained using the same procedure. While our scaling laws suggest our flagship model is an approximately compute-optimal size for our training budget, we also train our smaller models for much longer than is compute-optimal. The resulting models perform better than compute-optimal models at the same inference budget. We use the flagship model to further improve the quality of those smaller models during post-training.

• Managing complexity. We make design choices that seek to maximize our ability to scale the model development process. For example, we opt for a standard dense Transformer model architecture (Vaswani et al., 2017) with minor adaptations, rather than a mixture-of-experts model (Shazeer et al., 2017), to maximize training stability. Similarly, we adopt a relatively simple post-training procedure based on supervised finetuning (SFT), rejection sampling (RS), and direct preference optimization (DPO; Rafailov et al. (2023)), as opposed to more complex reinforcement learning algorithms (Ouyang et al., 2022; Schulman et al., 2017) that tend to be less stable and harder to scale.

Table 1  Overview of the Llama 3 Herd of models. The original table lists, for each model, whether it is finetuned, and whether it supports multilingual use, long contexts, and tool use, together with its release date. All results in this paper are for the Llama 3.1 models.

  Model                      Release
  Llama 3 8B                 April 2024
  Llama 3 8B Instruct        April 2024
  Llama 3 70B                April 2024
  Llama 3 70B Instruct       April 2024
  Llama 3.1 8B               July 2024
  Llama 3.1 8B Instruct      July 2024
  Llama 3.1 70B              July 2024
  Llama 3.1 70B Instruct     July 2024
  Llama 3.1 405B             July 2024
  Llama 3.1 405B Instruct    July 2024

The result of our work is Llama 3: a herd of three multilingual¹ language models with 8B, 70B, and 405B parameters. We evaluate the performance of Llama 3 on a plethora of benchmark datasets that span a wide range of language understanding tasks. In addition, we perform extensive human evaluations that compare Llama 3 with competing models. An overview of the performance of the flagship Llama 3 model on key benchmarks is presented in Table 2. Our experimental evaluation suggests that our flagship model performs on par with leading language models such as GPT-4 (OpenAI, 2023a) across a variety of tasks, and is close to matching the state-of-the-art. Our smaller models are best-in-class, outperforming alternative models with similar numbers of parameters (Bai et al., 2023; Jiang et al., 2023). Llama 3 also delivers a much better balance between helpfulness and harmlessness than its predecessor (Touvron et al., 2023b). We present a detailed analysis of the safety of Llama 3 in Section 5.4.

We are publicly releasing all three Llama 3 models under an updated version of the Llama 3 Community License; see https://llama.meta.com. This includes pre-trained and post-trained versions of our 405B parameter language model and a new version of our Llama Guard model (Inan et al., 2023) for input and output safety. We hope that the open release of a flagship model will spur a wave of innovation in the research community, and accelerate a responsible path towards the development of artificial general intelligence (AGI).

As part of the Llama 3 development process we also develop multimodal extensions to the models, enabling image recognition, video recognition, and speech understanding capabilities. These models are still under active development and not yet ready for release.
In addition to our language modeling results, the paper presents results of our initial experiments with those multimodal models.

¹The Llama 3 8B and 70B were pre-trained on multilingual data but were intended for use in English at the time.
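The pre-training compute quoted in the Scale bullet above (3.8 × 10²⁵ FLOPs for 405B parameters trained on 15.6T tokens) can be sanity-checked with the common C ≈ 6ND rule of thumb for dense Transformers (Kaplan et al., 2020). The short sketch below is only an illustrative back-of-the-envelope check; the paper does not state how its own FLOP accounting was done.

```python
# Back-of-the-envelope check of the flagship pre-training compute using the common
# approximation C ~= 6 * N * D for dense Transformers (N = trainable parameters,
# D = training tokens). Illustrative only; the paper does not describe its exact
# FLOP accounting.
N = 405e9      # 405B trainable parameters
D = 15.6e12    # 15.6T training tokens

C = 6 * N * D
print(f"Approximate pre-training compute: {C:.2e} FLOPs")  # ~3.79e25, close to the quoted 3.8e25
```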
Table 2  Performance of finetuned Llama 3 models on key benchmark evaluations. The table compares the performance of the 8B, 70B, and 405B versions of Llama 3 with that of competing models (Gemma 2 9B, Mistral 7B, Mixtral 8x22B, GPT-3.5 Turbo, Nemotron 4 340B, GPT-4 (0125), GPT-4o, and Claude 3.5 Sonnet). The benchmarks span general knowledge (MMLU 5-shot, MMLU 0-shot CoT, MMLU-Pro, IFEval), code (HumanEval, MBPP EvalPlus), math (GSM8K, MATH), reasoning (ARC Challenge, GPQA), tool use (BFCL, Nexus), long context (ZeroSCROLLS/QuALITY, InfiniteBench/En.MC, NIH/Multi-needle), and multilinguality (MGSM). We boldface the best-performing model in each of three model-size equivalence classes. Results marked in the original table were obtained using 5-shot prompting (no CoT), without CoT, or (♦) using zero-shot prompting.

2 General Overview

Figure 1  Illustration of the overall architecture and training of Llama 3. Llama 3 is a Transformer language model trained to predict the next token of a textual sequence. See text for details.

The model architecture of Llama 3 is illustrated in Figure 1. The development of our Llama 3 language models comprises two main stages:

• Language model pre-training. We start by converting a large, multilingual text corpus to discrete tokens and pre-training a large language model (LLM) on the resulting data to perform next-token prediction. In the language model pre-training stage, the model learns the structure of language and obtains large amounts of knowledge about the world from the text it is “reading”. To do this effectively, pre-training is performed at massive scale: we pre-train a model with 405B parameters on 15.6T tokens using a context window of 8K tokens. This standard pre-training stage is followed by a continued pre-training stage that increases the supported context window to 128K tokens. See Section 3 for details.

• Language model post-training. The pre-trained language model has a rich understanding of language but it does not yet follow instructions or behave in the way we would expect an assistant to. We align the model with human feedback in several rounds, each of which involves supervised finetuning (SFT) on instruction tuning data and Direct Preference Optimization (DPO; Rafailov et al., 2024). At this post-training² stage, we also integrate new capabilities, such as tool use, and observe strong improvements in other areas, such as coding and reasoning. See Section 4 for details. Finally, safety mitigations are also incorporated into the model at the post-training stage, the details of which are described in Section 5.4.

²In this paper, we use the term “post-training” to refer to any model training that happens outside of pre-training.
The resulting models have a rich set of capabilities. They can answer questions in at least eight languages, write high-quality code, solve complex reasoning problems, and use tools out-of-the-box or in a zero-shot way.

We also perform experiments in which we add image, video, and speech capabilities to Llama 3 using a compositional approach. The approach we study comprises the three additional stages illustrated in Figure 28:

• Multi-modal encoder pre-training. We train separate encoders for images and speech. We train our image encoder on large amounts of image-text pairs. This teaches the model the relation between visual content and the description of that content in natural language. Our speech encoder is trained using a self-supervised approach that masks out parts of the speech inputs and tries to reconstruct the masked-out parts via a discrete-token representation. As a result, the model learns the structure of speech signals. See Section 7 for details on the image encoder and Section 8 for details on the speech encoder.
• Vision adapter training. We train an adapter that integrates the pre-trained image encoder into the pre-trained language model. The adapter consists of a series of cross-attention layers that feed image-encoder representations into the language model. The adapter is trained on text-image pairs. This aligns the image representations with the language representations. During adapter training, we also update the parameters of the image encoder, but we intentionally do not update the language-model parameters. We also train a video adapter on top of the image adapter on paired video-text data. This enables the model to aggregate information across frames. See Section 7 for details.

• Speech adapter training. Finally, we integrate the speech encoder into the model via an adapter that converts speech encodings into token representations that can be fed directly into the finetuned language model. The parameters of the adapter and encoder are jointly updated in a supervised finetuning stage to enable high-quality speech understanding. We do not change the language model during speech adapter training. We also integrate a text-to-speech system. See Section 8 for details.

Our multimodal experiments lead to models that can recognize the content of images and videos, and support interaction via a speech interface. These models are still under development and not yet ready for release.

3 Pre-Training

Language model pre-training involves: (1) the curation and filtering of a large-scale training corpus, (2) the development of a model architecture and corresponding scaling laws for determining model size, (3) the development of techniques for efficient pre-training at large scale, and (4) the development of a pre-training recipe. We present each of these components separately below.

3.1 Pre-Training Data

We create our dataset for language model pre-training from a variety of data sources containing knowledge until the end of 2023. We apply several de-duplication methods and data cleaning mechanisms to each data source to obtain high-quality tokens. We remove domains that contain large amounts of personally identifiable information (PII), and domains with known adult content.

3.1.1 Web Data Curation

Much of the data we utilize is obtained from the web; we describe our cleaning process below.

PII and safety filtering. Among other mitigations, we implement filters designed to remove data from websites that are likely to contain unsafe content or high volumes of PII, from domains that have been ranked as harmful according to a variety of Meta safety standards, and from domains that are known to contain adult content.
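The domain-level removal described above amounts to checking each document's source URL against curated blocklists. A minimal sketch of that kind of filter is shown below; the blocklist entries and helper function are illustrative placeholders, not the paper's actual implementation or lists.

```python
from urllib.parse import urlparse

# Illustrative domain-level safety filter: drop documents whose source domain appears
# on (or is a subdomain of) a curated blocklist. The blocklist contents here are
# placeholders; the paper's actual lists and ranking signals are not public.
BLOCKED_DOMAINS = {"example-adult-site.com", "example-harmful-site.net"}

def is_blocked(url: str) -> bool:
    host = urlparse(url).netloc.lower().split(":")[0]  # strip any port
    # Match the domain itself or any of its subdomains.
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)

docs = [
    {"url": "https://en.wikipedia.org/wiki/Llama", "text": "..."},
    {"url": "https://forum.example-adult-site.com/post/1", "text": "..."},
]
kept = [d for d in docs if not is_blocked(d["url"])]
print(len(kept))  # 1
```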
Text extraction and cleaning. We process the raw HTML content for non-truncated web documents to extract high-quality, diverse text. To do so, we build a custom parser that extracts the HTML content and optimizes for precision in boilerplate removal and content recall. We evaluate our parser’s quality in human evaluations, comparing it with popular third-party HTML parsers that optimize for article-like content, and find it to perform favorably. We carefully process HTML pages with mathematics and code content to preserve the structure of that content. We maintain the image alt attribute text, since mathematical content is often represented as pre-rendered images where the math is also provided in the alt attribute. We experimentally evaluate different cleaning configurations. We find markdown is harmful to the performance of a model that is primarily trained on web data compared to plain text, so we remove all markdown markers.

De-duplication. We apply several rounds of de-duplication at the URL, document, and line level:

• URL-level de-duplication. We perform URL-level de-duplication across the entire dataset. We keep the most recent version of the page corresponding to each URL.

• Document-level de-duplication. We perform global MinHash (Broder, 1997) de-duplication across the entire dataset to remove near-duplicate documents.

• Line-level de-duplication. We perform aggressive line-level de-duplication similar to ccNet (Wenzek et al., 2019). We remove lines that appear more than 6 times in each bucket of 30M documents. Although our manual qualitative analysis showed that line-level de-duplication removes not only leftover boilerplate from various websites (such as navigation menus and cookie warnings) but also frequent high-quality text, our empirical evaluations showed strong improvements.
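The document-level step above relies on MinHash signatures to find near-duplicate documents at corpus scale. The sketch below illustrates the general idea with the open-source datasketch library; the library choice, shingle size, and similarity threshold are assumptions for illustration, not details given in the paper.

```python
from datasketch import MinHash, MinHashLSH

# Illustrative near-duplicate detection with MinHash + LSH. The paper's actual
# implementation, shingling scheme, and threshold are not specified.
NUM_PERM = 128

def minhash(text: str, ngram: int = 5) -> MinHash:
    """Build a MinHash signature over word 5-gram shingles of a document."""
    m = MinHash(num_perm=NUM_PERM)
    words = text.split()
    for i in range(max(len(words) - ngram + 1, 1)):
        shingle = " ".join(words[i:i + ngram])
        m.update(shingle.encode("utf8"))
    return m

docs = {
    "doc1": "the quick brown fox jumps over the lazy dog near the river bank",
    "doc2": "the quick brown fox jumps over the lazy dog near the river bank today",
    "doc3": "completely different content about transformer language models and data curation",
}

lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)  # estimated Jaccard-similarity threshold
sigs = {key: minhash(text) for key, text in docs.items()}
for key, sig in sigs.items():
    lsh.insert(key, sig)

# Keys of documents whose estimated Jaccard similarity with doc1 exceeds the threshold.
print(lsh.query(sigs["doc1"]))  # expected to include 'doc1' and, very likely, 'doc2'
```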
Heuristic filtering. We develop heuristics to remove additional low-quality documents, outliers, and documents with excessive repetitions. Some examples of these heuristics include:

• We use the duplicated n-gram coverage ratio (Rae et al., 2021) to remove lines that consist of repeated content, such as logging or error messages. Those lines can be very long and unique, and hence cannot be filtered out by line-level de-duplication.

• We use “dirty word” counting (Raffel et al., 2020) to filter out adult websites that are not covered by domain block lists.

• We use a token-distribution Kullback-Leibler divergence to filter out documents containing excessive numbers of outlier tokens compared to the training corpus distribution.

Model-based quality filtering. Further, we experiment with applying various model-based quality classifiers to sub-select high-quality tokens. These include fast classifiers such as fasttext (Joulin et al., 2017) trained to recognize whether a given text would be referenced by Wikipedia (Touvron et al., 2023a), as well as more compute-intensive Roberta-based classifiers (Liu et al., 2019a) trained on Llama 2 predictions. To train a quality classifier based on Llama 2, we create a training set of cleaned web documents, describe the quality requirements, and instruct Llama 2’s chat model to determine if the documents meet these requirements. For efficiency reasons, we use DistilRoberta (Sanh et al., 2019) to generate quality scores for each document. We experimentally evaluate the efficacy of various quality filtering configurations.

Code and reasoning data. Similar to DeepSeek-AI et al. (2024), we build domain-specific pipelines that extract code and math-relevant web pages. Specifically, both the code and reasoning classifiers are DistilRoberta models trained on web data annotated by Llama 2. Unlike the general quality classifier mentioned above, we conduct prompt tuning to target web pages containing math deduction, reasoning in STEM areas, and code interleaved with natural language. Since the token distribution of code and math is substantially different from that of natural language, these pipelines implement domain-specific HTML extraction, customized text features, and heuristics for filtering.

Multilingual data. Similar to our processing pipelines for English described above, we implement filters to remove data from websites that are likely to contain PII or unsafe content. Our multilingual text processing pipeline has several unique features:

• We use a fasttext-based language identification model to categorize documents into 176 languages.

• We perform document-level and line-level de-duplication within the data for each language.

• We apply language-specific heuristics and model-based filters to remove low-quality documents.

In addition, we perform quality ranking of multilingual documents using a multilingual Llama 2-based classifier to ensure that high-quality content is prioritized. We determine the amount of multilingual tokens used in pre-training experimentally, balancing model performance on English and multilingual benchmarks.
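For the fasttext-based language identification step above, a minimal sketch using the publicly released fasttext lid.176 model is shown below. The paper only says it uses "a fasttext-based language identification model", so the specific model file and the confidence threshold are assumptions.

```python
import fasttext

# Illustrative language identification with the public fasttext LID model covering
# 176 languages (https://fasttext.cc/docs/en/language-identification.html). The model
# file and confidence threshold are assumptions, not details given in the paper.
model = fasttext.load_model("lid.176.bin")

def detect_language(text: str, min_conf: float = 0.5):
    """Return (language_code, confidence), or (None, confidence) below the threshold."""
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    lang = labels[0].replace("__label__", "")
    return (lang, float(probs[0])) if probs[0] >= min_conf else (None, float(probs[0]))

print(detect_language("Les modèles de fondation sont entraînés sur de grandes quantités de texte."))
# e.g. ('fr', 0.99)
```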
3.1.2 Determining the Data Mix

To obtain a high-quality language model, it is essential to carefully determine the proportion of different data sources in the pre-training data mix. Our main tools in determining this data mix are knowledge classification and scaling law experiments.

Knowledge classification. We develop a classifier to categorize the types of information contained in our web data so that we can more effectively determine a data mix. We use this classifier to downsample data categories that are over-represented on the web, for example, arts and entertainment.

Scaling laws for data mix. To determine the best data mix, we perform scaling law experiments in which we train several small models on a data mix and use them to predict the performance of a large model on that mix (see Section 3.2.1). We repeat this process multiple times for different data mixes to select a new data mix candidate. Subsequently, we train a larger model on this candidate data mix and evaluate the performance of that model on several key benchmarks.

Data mix summary. Our final data mix contains roughly 50% of tokens corresponding to general knowledge, 25% mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens.

3.1.3 Annealing Data

Empirically, we find that annealing (see Section 3.4.3) on small amounts of high-quality code and mathematical data can boost the performance of pre-trained models on key benchmarks. Akin to Li et al. (2024b), we perform annealing with a data mix that upsamples high-quality data in select domains. We do not include any training sets from commonly used benchmarks in our annealing data. This enables us to assess the true few-shot learning capabilities and out-of-domain generalization of Llama 3.

Following OpenAI (2023a), we evaluate the efficacy of annealing on the GSM8k (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021b) training sets. We find that annealing improved the performance of a pre-trained Llama 3 8B model on the GSM8k and MATH validation sets by 24.0% and 6.4%, respectively. However, the improvements on the 405B model are negligible, suggesting that our flagship model has strong in-context learning and reasoning capabilities and does not require specific in-domain training samples to obtain strong performance.

Using annealing to assess data quality. Similar to Blakeney et al. (2024), we find that annealing enables us to judge the value of small domain-specific datasets. We measure the value of such datasets by annealing the learning rate of a 50% trained Llama 3 8B model linearly to 0 on 40B tokens. In these experiments, we assign 30% weight to the new dataset and the remaining 70% weight to the default data mix. Using annealing to evaluate new data sources is more efficient than performing scaling law experiments for every small dataset.
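The data-quality probe described above boils down to two ingredients: a learning rate that decays linearly to zero over a fixed 40B-token budget, and a sampling mixture that gives 30% weight to the candidate dataset. A minimal sketch of both is given below; the function names, per-step token count, and starting learning rate are illustrative assumptions, not values taken from the paper.

```python
import random

# Illustrative sketch of the annealing probe described above: decay the learning rate
# linearly to 0 over a 40B-token budget while sampling 30% of batches from the
# candidate dataset and 70% from the default mix. The per-step token count and
# start_lr (the learning rate at the 50% checkpoint) are assumptions.
TOKEN_BUDGET = 40e9          # 40B tokens of annealing
TOKENS_PER_STEP = 4e6        # assumed global batch size in tokens

def annealed_lr(step: int, start_lr: float, total_steps: int) -> float:
    """Learning rate decayed linearly from start_lr to 0 over the annealing run."""
    return start_lr * max(0.0, 1.0 - step / total_steps)

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next batch: 30% candidate set, 70% default mix."""
    return "candidate_dataset" if rng.random() < 0.30 else "default_mix"

total_steps = int(TOKEN_BUDGET / TOKENS_PER_STEP)
rng = random.Random(0)
for step in range(3):  # show the first few steps only
    lr = annealed_lr(step, start_lr=3e-4, total_steps=total_steps)
    print(step, sample_source(rng), f"{lr:.2e}")
```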
3.2 Model Architecture

Llama 3 uses a standard, dense Transformer architecture (Vaswani et al., 2017). It does not deviate significantly from Llama and Llama 2 (Touvron et al., 2023a,b) in terms of model architecture; our performance gains are primarily driven by improvements in data quality and diversity as well as by increased training scale. We do make a few small modifications compared to Llama 2:

• We use grouped query attention (GQA; Ainslie et al. (2023)) with 8 key-value heads to improve inference speed and to reduce the size of key-value caches during decoding.

• We use an attention mask that prevents self-attention between different documents within the same sequence (a sketch of such a mask follows this list). We find that this change had limited impact during standard pre-training, but find it to be important in continued pre-training on very long sequences.

• We use a vocabulary with 128K tokens. Our token vocabulary combines 100K tokens from the tiktoken³ tokenizer with 28K additional tokens to better support non-English languages. Compared to the Llama 2 tokenizer, our new tokenizer improves compression rates on a sample of English data from 3.17 to 3.94 characters per token. This enables the model to “read” more text for the same amount of training compute. We also found that adding 28K tokens from select non-English languages improved both compression ratios and downstream performance, with no impact on English tokenization.

• We increase the RoPE base frequency hyperparameter to 500,000. This enables us to better support longer contexts; Xiong et al. (2023) showed this value to be effective for context lengths up to 32,768.
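As referenced in the attention-mask bullet above, the idea is to combine the usual causal mask with a block-diagonal document mask so that tokens never attend across document boundaries inside a packed sequence. The sketch below is a generic illustration of that construction, not Meta's training code.

```python
import numpy as np

# Illustrative construction of a causal, intra-document attention mask for a packed
# sequence: positions may attend only to earlier positions within the same document.
# Generic sketch of the idea described above, not Meta's implementation.
def intra_document_causal_mask(doc_ids: np.ndarray) -> np.ndarray:
    """doc_ids[i] is the id of the document that token position i belongs to.
    Returns a boolean matrix where mask[q, k] is True if query q may attend to key k."""
    n = len(doc_ids)
    causal = np.tril(np.ones((n, n), dtype=bool))
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    return causal & same_doc

# A packed sequence of 7 tokens holding three documents.
doc_ids = np.array([0, 0, 0, 1, 1, 2, 2])
mask = intra_document_causal_mask(doc_ids)
print(mask.astype(int))
# Token 3 (the first token of document 1) cannot attend to tokens 0-2 from document 0.
```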
Llama 3 405B uses an architecture with 126 layers, a token representation dimension of 16,384, and 128 attention heads; see Table 3 for details. This leads to a model size that is approximately compute-optimal according to scaling laws on our data for our training budget of 3.8 × 10²⁵ FLOPs.

Table 3  Overview of the key hyperparameters of Llama 3. We display settings for the 8B, 70B, and 405B language models.

                             8B            70B           405B
  Layers                     32            80            126
  Model Dimension            4,096         8,192         16,384
  FFN Dimension              6,144         12,288        20,480
  Attention Heads            32            64            128
  Key/Value Heads            8             8             8
  Peak Learning Rate         3 × 10⁻⁴      1.5 × 10⁻⁴    8 × 10⁻⁵
  Activation Function        SwiGLU
  Vocabulary Size            128,000
  Positional Embeddings      RoPE (θ = 500,000)

3.2.1 Scaling Laws

We develop scaling laws (Hoffmann et al., 2022; Kaplan et al., 2020) to determine the optimal model size for our flagship model given our pre-training compute budget. In addition to determining the optimal model size, a major challenge is to forecast the flagship model’s performance on downstream benchmark tasks, due to a couple of issues: (1) existing scaling laws typically predict only next-token prediction loss rather than specific benchmark performance, and (2) scaling laws can be noisy and unreliable because they are developed based on pre-training runs conducted with small compute budgets (Wei et al., 2022b).

To address these challenges, we implement a two-stage methodology to develop scaling laws that accurately predict downstream benchmark performance:

1. We first establish a correlation between the compute-optimal model’s negative log-likelihood on downstream tasks and the training FLOPs.

2. Next, we correlate the negative log-likelihood on downstream tasks with task accuracy, utilizing both the scaling law models and older models trained with higher compute FLOPs. In this step, we specifically leverage the Llama 2 family of models.

This approach enables us to predict downstream task performance given a specific number of training FLOPs for compute-optimal models. We use a similar method to select our pre-training data mix (see Section 3.4).

Scaling law experiments. Concretely, we construct our scaling laws by pre-training models using compute budgets between 6 × 10¹⁸ FLOPs and 10²² FLOPs. At each compute budget, we pre-train models ranging in size between 40M and 16B parameters, using a subset of model sizes at each compute budget. In these training runs, we use a cosine learning rate schedule with a linear warmup for 2,000 training steps. The peak learning rate is set between 2 × 10⁻⁴ and 4 × 10⁻⁴ depending on the size of the model. We set the cosine decay to 0.1 of the peak value. The weight decay at each step is set to 0.1 times the learning rate at that step.
We use a fixed batch size for each compute scale, ranging between 250K and 4M tokens.

³https://github.com/openai/tiktoken/tree/main
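The learning rate schedule used in these scaling-law runs (a linear warmup followed by cosine decay to 10% of the peak value) can be written compactly as below. This is a generic sketch of the stated schedule; the total step count and peak learning rate are example values, not the paper's exact settings.

```python
import math

# Generic sketch of the schedule described above: linear warmup for 2,000 steps,
# then cosine decay from the peak learning rate down to 0.1x the peak.
# total_steps and peak_lr are example values, not the paper's exact settings.
def lr_at_step(step: int, peak_lr: float, warmup_steps: int = 2000,
               total_steps: int = 100_000, min_ratio: float = 0.1) -> float:
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)  # decays to 0.1 * peak

for s in (0, 1000, 2000, 50_000, 100_000):
    print(s, f"{lr_at_step(s, peak_lr=4e-4):.2e}")
```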
Figure 2  Scaling law IsoFLOPs curves between 6 × 10¹⁸ and 10²² FLOPs. The loss is the negative log-likelihood on a held-out validation set. We approximate measurements at each compute scale using a second-degree polynomial.

Figure 3  Number of training tokens in identified compute-optimal models as a function of pre-training compute budget. We include the fitted scaling-law prediction as well. The compute-optimal models correspond to the parabola minimums in Figure 2.

These experiments give rise to the IsoFLOPs curves in Figure 2. The loss in these curves is measured on a separate validation set. We fit the measured loss values using a second-degree polynomial and identify the minimum of each parabola. We refer to the minimum of a parabola as the compute-optimal model at the corresponding pre-training compute budget.

We use the compute-optimal models we identified this way to predict the optimal number of training tokens for a specific compute budget. To do so, we assume a power-law relation between the compute budget, C, and the optimal number of training tokens, N(C):

    N(C) = A C^α.

We fit A and α using the data from Figure 2. We find that (α, A) = (0.53, 0.29); the corresponding fit is shown in Figure 3. Extrapolation of the resulting scaling law to 3.8 × 10²⁵ FLOPs suggests training a 402B parameter model on 16.55T tokens.

An important observation is that the IsoFLOPs curves become flatter around the minimum as the compute budget increases. This implies that the performance of the flagship model is relatively robust to small changes in the trade-off between model size and training tokens. Based on this observation, we ultimately decided to train a flagship model with 405B parameters.

Predicting performance on downstream tasks. We use the resulting compute-optimal models to forecast the performance of the flagship Llama 3 model on benchmark data sets. First, we linearly correlate the (normalized) negative log-likelihood of the correct answer in the benchmark and the training FLOPs. In this analysis, we use only the scaling law models trained up to 10²² FLOPs on the data mix described above. Next, we establish a sigmoidal relation between the log-likelihood and accuracy using both the scaling law models and Llama 2 models, which were trained using the Llama 2 data mix and tokenizer. We show the results of this experiment on the ARC Challenge benchmark in Figure 4. We find this two-step scaling law prediction, which extrapolates over four orders of magnitude, to be quite accurate: it only slightly underestimates the final performance of the flagship Llama 3 model.

3.3 Infrastructure, Scaling, and Efficiency

We describe our hardware and infrastructure that powered Llama 3 405B pre-training at scale and discuss several optimizations that lead to improvements in training efficiency.

3.3.1 Training Infrastructure

The Llama 1 and 2 models were trained on Meta’s AI Research SuperCluster (Lee and Sengupta, 2022). As we scaled further, the training for Llama 3 was migrated to Meta’s production clusters (Lee et al., 2024).
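Returning to the token-count extrapolation in Section 3.2.1, the fitted power law can be evaluated directly at the flagship budget. The sketch below uses the fitted constants shown in the Figure 3 legend (A ≈ 0.299, α ≈ 0.537; the text rounds them to 0.29 and 0.53) together with the generic C ≈ 6ND approximation for the implied parameter count; both are illustrative and do not reproduce the paper's exact procedure, which reports 402B parameters on 16.55T tokens.

```python
# Illustrative evaluation of the fitted token-count power law N(C) = A * C**alpha
# at the flagship compute budget. A and alpha are the fitted constants shown in
# Figure 3; the parameter-count step uses the generic C ~= 6 * N_params * D
# approximation, which is not the paper's exact method.
A, alpha = 0.299, 0.537
C = 3.8e25  # flagship pre-training budget in FLOPs

tokens = A * C**alpha
params = C / (6 * tokens)

print(f"compute-optimal training tokens: {tokens:.3e}")  # ~1.6e13, i.e. roughly 16T tokens
print(f"implied parameter count:         {params:.3e}")  # a few hundred billion parameters
```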