The Llama 3 Herd of Models

Contents
Introduction
General Overview
Pre-Training
Pre-Training Data
Web Data Curation
Determining the Data Mix
Annealing Data
Model Architecture
Scaling Laws
Infrastructure, Scaling, and Efficiency
Training Infrastructure
Parallelism for Model Scaling
Collective Communication
Reliability and Operational Challenges
Training Recipe
Initial Pre-Training
Long Context Pre-Training
Annealing
Post-Training
Modeling
Chat Dialog Format
Reward Modeling
Supervised Finetuning
Direct Preference Optimization
Model Averaging
Iterative Rounds
Post-training Data
Preference Data
SFT Data
Data Processing and Quality Control
Capabilities
Code
Multilinguality
Math and Reasoning
Long Context
Tool Use
Factuality
Steerability
Results
Pre-trained Language Model
Standard Benchmarks
Model Robustness
Adversarial Benchmarks
Contamination Analysis
Post-trained Language Model
General Knowledge and Instruction-Following Benchmarks
Proficiency Exams
Coding Benchmarks
Multilingual Benchmarks
Math and Reasoning Benchmarks
Long Context Benchmarks
Tool Use Performance
Human Evaluations
Safety
Benchmark Construction
Safety Pre-training
Safety Finetuning
Safety Results
Cybersecurity and Chemical/Biological Weapons Safety
Red Teaming
System Level Safety
Limitations
Inference
Pipeline Parallelism
FP8 Quantization
Vision Experiments
Data
Image Data
Video Data
Model Architecture
Model Scaling
Pre-training
Post-Training
Supervised Finetuning Data
Supervised Finetuning Recipe
Preference Data
Reward Modeling
Direct Preference Optimization
Rejection Sampling
Quality Tuning
Image Recognition Results
Video Recognition Results
Speech Experiments
Data
Speech Understanding
Speech Generation
Model Architecture
Speech Understanding
Speech Generation
Training Recipe
Speech Understanding
Speech Generation
Speech Understanding Results
Speech Generation Results
Related Work
Language
Multimodality
Conclusion
The Llama 3 Herd of Models

Llama Team, AI @ Meta*

*A detailed contributor list can be found in the appendix of this paper.

Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.

Date: July 23, 2024
Website: https://llama.meta.com/

1 Introduction

Foundation models are general models of language, vision, speech, and/or other modalities that are designed to support a large variety of AI tasks. They form the basis of many modern AI systems. The development of modern foundation models consists of two main stages: (1) a pre-training stage in which the model is trained at massive scale using straightforward tasks such as next-word prediction or captioning, and (2) a post-training stage in which the model is tuned to follow instructions, align with human preferences, and improve specific capabilities (for example, coding and reasoning).

In this paper, we present a new set of foundation models for language, called Llama 3. The Llama 3 Herd of models natively supports multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters, processing information in a context window of up to 128K tokens. Each member of the herd is listed in Table 1. All the results presented in this paper are for the Llama 3.1 models, which we will refer to as Llama 3 throughout for brevity.

We believe there are three key levers in the development of high-quality foundation models: data, scale, and managing complexity. We seek to optimize for these three levers in our development process:

• Data. Compared to prior versions of Llama (Touvron et al., 2023a,b), we improved both the quantity and quality of the data we use for pre-training and post-training. These improvements include the development of more careful pre-processing and curation pipelines for pre-training data and the development of more rigorous quality assurance and filtering approaches for post-training data. We pre-train Llama 3 on a corpus of about 15T multilingual tokens, compared to 1.8T tokens for Llama 2.

• Scale. We train a model at far larger scale than previous Llama models: our flagship language model was pre-trained using 3.8 × 10²⁵ FLOPs, almost 50× more than the largest version of Llama 2. Specifically, we pre-trained a flagship model with 405B trainable parameters on 15.6T text tokens. As expected per
scaling laws for foundation models, our flagship model outperforms smaller models trained using the same procedure. While our scaling laws suggest our flagship model is an approximately compute-optimal size for our training budget, we also train our smaller models for much longer than is compute-optimal. The resulting models perform better than compute-optimal models at the same inference budget. We use the flagship model to further improve the quality of those smaller models during post-training.

• Managing complexity. We make design choices that seek to maximize our ability to scale the model development process. For example, we opt for a standard dense Transformer model architecture (Vaswani et al., 2017) with minor adaptations, rather than a mixture-of-experts model (Shazeer et al., 2017), to maximize training stability. Similarly, we adopt a relatively simple post-training procedure based on supervised finetuning (SFT), rejection sampling (RS), and direct preference optimization (DPO; Rafailov et al. (2023)), as opposed to more complex reinforcement learning algorithms (Ouyang et al., 2022; Schulman et al., 2017) that tend to be less stable and harder to scale.

Table 1  Overview of the Llama 3 Herd of models. The original table lists, for each model, whether it is finetuned, and whether it supports multilingual use, long contexts, and tool use, together with its release date. All results in this paper are for the Llama 3.1 models.

  Model                      Release
  Llama 3 8B                 April 2024
  Llama 3 8B Instruct        April 2024
  Llama 3 70B                April 2024
  Llama 3 70B Instruct       April 2024
  Llama 3.1 8B               July 2024
  Llama 3.1 8B Instruct      July 2024
  Llama 3.1 70B              July 2024
  Llama 3.1 70B Instruct     July 2024
  Llama 3.1 405B             July 2024
  Llama 3.1 405B Instruct    July 2024

The result of our work is Llama 3: a herd of three multilingual¹ language models with 8B, 70B, and 405B parameters. We evaluate the performance of Llama 3 on a plethora of benchmark datasets that span a wide range of language understanding tasks. In addition, we perform extensive human evaluations that compare Llama 3 with competing models. An overview of the performance of the flagship Llama 3 model on key benchmarks is presented in Table 2. Our experimental evaluation suggests that our flagship model performs on par with leading language models such as GPT-4 (OpenAI, 2023a) across a variety of tasks, and is close to matching the state-of-the-art. Our smaller models are best-in-class, outperforming alternative models with similar numbers of parameters (Bai et al., 2023; Jiang et al., 2023). Llama 3 also delivers a much better balance between helpfulness and harmlessness than its predecessor (Touvron et al., 2023b). We present a detailed analysis of the safety of Llama 3 in Section 5.4.

We are publicly releasing all three Llama 3 models under an updated version of the Llama 3 Community License; see https://llama.meta.com. This includes pre-trained and post-trained versions of our 405B parameter language model and a new version of our Llama Guard model (Inan et al., 2023) for input and output safety. We hope that the open release of a flagship model will spur a wave of innovation in the research community, and accelerate a responsible path towards the development of artificial general intelligence (AGI).

As part of the Llama 3 development process we also develop multimodal extensions to the models, enabling image recognition, video recognition, and speech understanding capabilities. These models are still under active development and not yet ready for release.
In addition to our language modeling results, the paper presents results of our initial experiments with those multimodal models.

¹The Llama 3 8B and 70B were pre-trained on multilingual data but were intended for use in English at the time.
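The pre-training compute quoted in the Scale bullet above (3.8 × 10²⁵ FLOPs for 405B parameters trained on 15.6T tokens) can be sanity-checked with the common C ≈ 6ND rule of thumb for dense Transformers (Kaplan et al., 2020). The short sketch below is only an illustrative back-of-the-envelope check; the paper does not state how its own FLOP accounting was done.

```python
# Back-of-the-envelope check of the flagship pre-training compute using the common
# approximation C ~= 6 * N * D for dense Transformers (N = trainable parameters,
# D = training tokens). Illustrative only; the paper does not describe its exact
# FLOP accounting.
N = 405e9      # 405B trainable parameters
D = 15.6e12    # 15.6T training tokens

C = 6 * N * D
print(f"Approximate pre-training compute: {C:.2e} FLOPs")  # ~3.79e25, close to the quoted 3.8e25
```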
Table 2  Performance of finetuned Llama 3 models on key benchmark evaluations. The table compares the performance of the 8B, 70B, and 405B versions of Llama 3 with that of competing models (Gemma 2 9B, Mistral 7B, Mixtral 8x22B, GPT-3.5 Turbo, Nemotron 4 340B, GPT-4 (0125), GPT-4o, and Claude 3.5 Sonnet). The benchmarks span general knowledge (MMLU 5-shot, MMLU 0-shot CoT, MMLU-Pro, IFEval), code (HumanEval, MBPP EvalPlus), math (GSM8K, MATH), reasoning (ARC Challenge, GPQA), tool use (BFCL, Nexus), long context (ZeroSCROLLS/QuALITY, InfiniteBench/En.MC, NIH/Multi-needle), and multilinguality (MGSM). We boldface the best-performing model in each of three model-size equivalence classes. Results marked in the original table were obtained using 5-shot prompting (no CoT), without CoT, or (♦) using zero-shot prompting.

2 General Overview

Figure 1  Illustration of the overall architecture and training of Llama 3. Llama 3 is a Transformer language model trained to predict the next token of a textual sequence. See text for details.

The model architecture of Llama 3 is illustrated in Figure 1. The development of our Llama 3 language models comprises two main stages:

• Language model pre-training. We start by converting a large, multilingual text corpus to discrete tokens and pre-training a large language model (LLM) on the resulting data to perform next-token prediction. In the language model pre-training stage, the model learns the structure of language and obtains large amounts of knowledge about the world from the text it is “reading”. To do this effectively, pre-training is performed at massive scale: we pre-train a model with 405B parameters on 15.6T tokens using a context window of 8K tokens. This standard pre-training stage is followed by a continued pre-training stage that increases the supported context window to 128K tokens. See Section 3 for details.

• Language model post-training. The pre-trained language model has a rich understanding of language but it does not yet follow instructions or behave in the way we would expect an assistant to. We align the model with human feedback in several rounds, each of which involves supervised finetuning (SFT) on instruction tuning data and Direct Preference Optimization (DPO; Rafailov et al., 2024). At this post-training² stage, we also integrate new capabilities, such as tool use, and observe strong improvements in other areas, such as coding and reasoning. See Section 4 for details. Finally, safety mitigations are also incorporated into the model at the post-training stage, the details of which are described in Section 5.4.

²In this paper, we use the term “post-training” to refer to any model training that happens outside of pre-training.
The resulting models have a rich set of capabilities. They can answer questions in at least eight languages, write high-quality code, solve complex reasoning problems, and use tools out-of-the-box or in a zero-shot way.

We also perform experiments in which we add image, video, and speech capabilities to Llama 3 using a compositional approach. The approach we study comprises the three additional stages illustrated in Figure 28:

• Multi-modal encoder pre-training. We train separate encoders for images and speech. We train our image encoder on large amounts of image-text pairs. This teaches the model the relation between visual content and the description of that content in natural language. Our speech encoder is trained using a self-supervised approach that masks out parts of the speech inputs and tries to reconstruct the masked-out parts via a discrete-token representation. As a result, the model learns the structure of speech signals. See Section 7 for details on the image encoder and Section 8 for details on the speech encoder.
• Vision adapter training. We train an adapter that integrates the pre-trained image encoder into the pre-trained language model. The adapter consists of a series of cross-attention layers that feed image-encoder representations into the language model. The adapter is trained on text-image pairs. This aligns the image representations with the language representations. During adapter training, we also update the parameters of the image encoder, but we intentionally do not update the language-model parameters. We also train a video adapter on top of the image adapter on paired video-text data. This enables the model to aggregate information across frames. See Section 7 for details.

• Speech adapter training. Finally, we integrate the speech encoder into the model via an adapter that converts speech encodings into token representations that can be fed directly into the finetuned language model. The parameters of the adapter and encoder are jointly updated in a supervised finetuning stage to enable high-quality speech understanding. We do not change the language model during speech adapter training. We also integrate a text-to-speech system. See Section 8 for details.

Our multimodal experiments lead to models that can recognize the content of images and videos, and support interaction via a speech interface. These models are still under development and not yet ready for release.

3 Pre-Training

Language model pre-training involves: (1) the curation and filtering of a large-scale training corpus, (2) the development of a model architecture and corresponding scaling laws for determining model size, (3) the development of techniques for efficient pre-training at large scale, and (4) the development of a pre-training recipe. We present each of these components separately below.

3.1 Pre-Training Data

We create our dataset for language model pre-training from a variety of data sources containing knowledge until the end of 2023. We apply several de-duplication methods and data cleaning mechanisms to each data source to obtain high-quality tokens. We remove domains that contain large amounts of personally identifiable information (PII), and domains with known adult content.

3.1.1 Web Data Curation

Much of the data we utilize is obtained from the web; we describe our cleaning process below.

PII and safety filtering. Among other mitigations, we implement filters designed to remove data from websites that are likely to contain unsafe content or high volumes of PII, from domains that have been ranked as harmful according to a variety of Meta safety standards, and from domains that are known to contain adult content.
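The domain-level removal described above amounts to checking each document's source URL against curated blocklists. A minimal sketch of that kind of filter is shown below; the blocklist entries and helper function are illustrative placeholders, not the paper's actual implementation or lists.

```python
from urllib.parse import urlparse

# Illustrative domain-level safety filter: drop documents whose source domain appears
# on (or is a subdomain of) a curated blocklist. The blocklist contents here are
# placeholders; the paper's actual lists and ranking signals are not public.
BLOCKED_DOMAINS = {"example-adult-site.com", "example-harmful-site.net"}

def is_blocked(url: str) -> bool:
    host = urlparse(url).netloc.lower().split(":")[0]  # strip any port
    # Match the domain itself or any of its subdomains.
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)

docs = [
    {"url": "https://en.wikipedia.org/wiki/Llama", "text": "..."},
    {"url": "https://forum.example-adult-site.com/post/1", "text": "..."},
]
kept = [d for d in docs if not is_blocked(d["url"])]
print(len(kept))  # 1
```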
Text extraction and cleaning. We process the raw HTML content for non-truncated web documents to extract high-quality, diverse text. To do so, we build a custom parser that extracts the HTML content and optimizes for precision in boilerplate removal and content recall. We evaluate our parser’s quality in human evaluations, comparing it with popular third-party HTML parsers that optimize for article-like content, and find it to perform favorably. We carefully process HTML pages with mathematics and code content to preserve the structure of that content. We maintain the image alt attribute text, since mathematical content is often represented as pre-rendered images where the math is also provided in the alt attribute. We experimentally evaluate different cleaning configurations. We find markdown is harmful to the performance of a model that is primarily trained on web data compared to plain text, so we remove all markdown markers.

De-duplication. We apply several rounds of de-duplication at the URL, document, and line level:

• URL-level de-duplication. We perform URL-level de-duplication across the entire dataset. We keep the most recent version of the page corresponding to each URL.

• Document-level de-duplication. We perform global MinHash (Broder, 1997) de-duplication across the entire dataset to remove near-duplicate documents.

• Line-level de-duplication. We perform aggressive line-level de-duplication similar to ccNet (Wenzek et al., 2019). We remove lines that appear more than 6 times in each bucket of 30M documents. Although our manual qualitative analysis showed that line-level de-duplication removes not only leftover boilerplate from various websites (such as navigation menus and cookie warnings) but also frequent high-quality text, our empirical evaluations showed strong improvements.
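The document-level step above relies on MinHash signatures to find near-duplicate documents at corpus scale. The sketch below illustrates the general idea with the open-source datasketch library; the library choice, shingle size, and similarity threshold are assumptions for illustration, not details given in the paper.

```python
from datasketch import MinHash, MinHashLSH

# Illustrative near-duplicate detection with MinHash + LSH. The paper's actual
# implementation, shingling scheme, and threshold are not specified.
NUM_PERM = 128

def minhash(text: str, ngram: int = 5) -> MinHash:
    """Build a MinHash signature over word 5-gram shingles of a document."""
    m = MinHash(num_perm=NUM_PERM)
    words = text.split()
    for i in range(max(len(words) - ngram + 1, 1)):
        shingle = " ".join(words[i:i + ngram])
        m.update(shingle.encode("utf8"))
    return m

docs = {
    "doc1": "the quick brown fox jumps over the lazy dog near the river bank",
    "doc2": "the quick brown fox jumps over the lazy dog near the river bank today",
    "doc3": "completely different content about transformer language models and data curation",
}

lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)  # estimated Jaccard-similarity threshold
sigs = {key: minhash(text) for key, text in docs.items()}
for key, sig in sigs.items():
    lsh.insert(key, sig)

# Keys of documents whose estimated Jaccard similarity with doc1 exceeds the threshold.
print(lsh.query(sigs["doc1"]))  # expected to include 'doc1' and, very likely, 'doc2'
```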
Heuristic filtering. We develop heuristics to remove additional low-quality documents, outliers, and documents with excessive repetitions. Some examples of these heuristics include:

• We use the duplicated n-gram coverage ratio (Rae et al., 2021) to remove lines that consist of repeated content, such as logging or error messages. Those lines can be very long and unique, and hence cannot be filtered out by line-level de-duplication.

• We use “dirty word” counting (Raffel et al., 2020) to filter out adult websites that are not covered by domain block lists.

• We use a token-distribution Kullback-Leibler divergence to filter out documents containing excessive numbers of outlier tokens compared to the training corpus distribution.

Model-based quality filtering. Further, we experiment with applying various model-based quality classifiers to sub-select high-quality tokens. These include fast classifiers such as fasttext (Joulin et al., 2017) trained to recognize whether a given text would be referenced by Wikipedia (Touvron et al., 2023a), as well as more compute-intensive Roberta-based classifiers (Liu et al., 2019a) trained on Llama 2 predictions. To train a quality classifier based on Llama 2, we create a training set of cleaned web documents, describe the quality requirements, and instruct Llama 2’s chat model to determine if the documents meet these requirements. For efficiency reasons, we use DistilRoberta (Sanh et al., 2019) to generate quality scores for each document. We experimentally evaluate the efficacy of various quality filtering configurations.

Code and reasoning data. Similar to DeepSeek-AI et al. (2024), we build domain-specific pipelines that extract code and math-relevant web pages. Specifically, both the code and reasoning classifiers are DistilRoberta models trained on web data annotated by Llama 2. Unlike the general quality classifier mentioned above, we conduct prompt tuning to target web pages containing math deduction, reasoning in STEM areas, and code interleaved with natural language. Since the token distribution of code and math is substantially different from that of natural language, these pipelines implement domain-specific HTML extraction, customized text features, and heuristics for filtering.

Multilingual data. Similar to our processing pipelines for English described above, we implement filters to remove data from websites that are likely to contain PII or unsafe content. Our multilingual text processing pipeline has several unique features:

• We use a fasttext-based language identification model to categorize documents into 176 languages.

• We perform document-level and line-level de-duplication within the data for each language.

• We apply language-specific heuristics and model-based filters to remove low-quality documents.

In addition, we perform quality ranking of multilingual documents using a multilingual Llama 2-based classifier to ensure that high-quality content is prioritized. We determine the amount of multilingual tokens used in pre-training experimentally, balancing model performance on English and multilingual benchmarks.
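For the fasttext-based language identification step above, a minimal sketch using the publicly released fasttext lid.176 model is shown below. The paper only says it uses "a fasttext-based language identification model", so the specific model file and the confidence threshold are assumptions.

```python
import fasttext

# Illustrative language identification with the public fasttext LID model covering
# 176 languages (https://fasttext.cc/docs/en/language-identification.html). The model
# file and confidence threshold are assumptions, not details given in the paper.
model = fasttext.load_model("lid.176.bin")

def detect_language(text: str, min_conf: float = 0.5):
    """Return (language_code, confidence), or (None, confidence) below the threshold."""
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    lang = labels[0].replace("__label__", "")
    return (lang, float(probs[0])) if probs[0] >= min_conf else (None, float(probs[0]))

print(detect_language("Les modèles de fondation sont entraînés sur de grandes quantités de texte."))
# e.g. ('fr', 0.99)
```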
3.1.2 Determining the Data Mix

To obtain a high-quality language model, it is essential to carefully determine the proportion of different data sources in the pre-training data mix. Our main tools in determining this data mix are knowledge classification and scaling law experiments.

Knowledge classification. We develop a classifier to categorize the types of information contained in our web data so that we can more effectively determine a data mix. We use this classifier to downsample data categories that are over-represented on the web, for example, arts and entertainment.

Scaling laws for data mix. To determine the best data mix, we perform scaling law experiments in which we train several small models on a data mix and use them to predict the performance of a large model on that mix (see Section 3.2.1). We repeat this process multiple times for different data mixes to select a new data mix candidate. Subsequently, we train a larger model on this candidate data mix and evaluate the performance of that model on several key benchmarks.

Data mix summary. Our final data mix contains roughly 50% of tokens corresponding to general knowledge, 25% mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens.

3.1.3 Annealing Data

Empirically, we find that annealing (see Section 3.4.3) on small amounts of high-quality code and mathematical data can boost the performance of pre-trained models on key benchmarks. Akin to Li et al. (2024b), we perform annealing with a data mix that upsamples high-quality data in select domains. We do not include any training sets from commonly used benchmarks in our annealing data. This enables us to assess the true few-shot learning capabilities and out-of-domain generalization of Llama 3.

Following OpenAI (2023a), we evaluate the efficacy of annealing on the GSM8k (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021b) training sets. We find that annealing improved the performance of a pre-trained Llama 3 8B model on the GSM8k and MATH validation sets by 24.0% and 6.4%, respectively. However, the improvements on the 405B model are negligible, suggesting that our flagship model has strong in-context learning and reasoning capabilities and does not require specific in-domain training samples to obtain strong performance.

Using annealing to assess data quality. Similar to Blakeney et al. (2024), we find that annealing enables us to judge the value of small domain-specific datasets. We measure the value of such datasets by annealing the learning rate of a 50% trained Llama 3 8B model linearly to 0 on 40B tokens. In these experiments, we assign 30% weight to the new dataset and the remaining 70% weight to the default data mix. Using annealing to evaluate new data sources is more efficient than performing scaling law experiments for every small dataset.
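The data-quality probe described above boils down to two ingredients: a learning rate that decays linearly to zero over a fixed 40B-token budget, and a sampling mixture that gives 30% weight to the candidate dataset. A minimal sketch of both is given below; the function names, per-step token count, and starting learning rate are illustrative assumptions, not values taken from the paper.

```python
import random

# Illustrative sketch of the annealing probe described above: decay the learning rate
# linearly to 0 over a 40B-token budget while sampling 30% of batches from the
# candidate dataset and 70% from the default mix. The per-step token count and
# start_lr (the learning rate at the 50% checkpoint) are assumptions.
TOKEN_BUDGET = 40e9          # 40B tokens of annealing
TOKENS_PER_STEP = 4e6        # assumed global batch size in tokens

def annealed_lr(step: int, start_lr: float, total_steps: int) -> float:
    """Learning rate decayed linearly from start_lr to 0 over the annealing run."""
    return start_lr * max(0.0, 1.0 - step / total_steps)

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next batch: 30% candidate set, 70% default mix."""
    return "candidate_dataset" if rng.random() < 0.30 else "default_mix"

total_steps = int(TOKEN_BUDGET / TOKENS_PER_STEP)
rng = random.Random(0)
for step in range(3):  # show the first few steps only
    lr = annealed_lr(step, start_lr=3e-4, total_steps=total_steps)
    print(step, sample_source(rng), f"{lr:.2e}")
```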
3.2 Model Architecture

Llama 3 uses a standard, dense Transformer architecture (Vaswani et al., 2017). It does not deviate significantly from Llama and Llama 2 (Touvron et al., 2023a,b) in terms of model architecture; our performance gains are primarily driven by improvements in data quality and diversity as well as by increased training scale. We do make a few small modifications compared to Llama 2:

• We use grouped query attention (GQA; Ainslie et al. (2023)) with 8 key-value heads to improve inference speed and to reduce the size of key-value caches during decoding.

• We use an attention mask that prevents self-attention between different documents within the same sequence (a sketch of such a mask follows this list). We find that this change had limited impact during standard pre-training, but find it to be important in continued pre-training on very long sequences.

• We use a vocabulary with 128K tokens. Our token vocabulary combines 100K tokens from the tiktoken³ tokenizer with 28K additional tokens to better support non-English languages. Compared to the Llama 2 tokenizer, our new tokenizer improves compression rates on a sample of English data from 3.17 to 3.94 characters per token. This enables the model to “read” more text for the same amount of training compute. We also found that adding 28K tokens from select non-English languages improved both compression ratios and downstream performance, with no impact on English tokenization.

• We increase the RoPE base frequency hyperparameter to 500,000. This enables us to better support longer contexts; Xiong et al. (2023) showed this value to be effective for context lengths up to 32,768.
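As referenced in the attention-mask bullet above, the idea is to combine the usual causal mask with a block-diagonal document mask so that tokens never attend across document boundaries inside a packed sequence. The sketch below is a generic illustration of that construction, not Meta's training code.

```python
import numpy as np

# Illustrative construction of a causal, intra-document attention mask for a packed
# sequence: positions may attend only to earlier positions within the same document.
# Generic sketch of the idea described above, not Meta's implementation.
def intra_document_causal_mask(doc_ids: np.ndarray) -> np.ndarray:
    """doc_ids[i] is the id of the document that token position i belongs to.
    Returns a boolean matrix where mask[q, k] is True if query q may attend to key k."""
    n = len(doc_ids)
    causal = np.tril(np.ones((n, n), dtype=bool))
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    return causal & same_doc

# A packed sequence of 7 tokens holding three documents.
doc_ids = np.array([0, 0, 0, 1, 1, 2, 2])
mask = intra_document_causal_mask(doc_ids)
print(mask.astype(int))
# Token 3 (the first token of document 1) cannot attend to tokens 0-2 from document 0.
```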
Llama 3 405B uses an architecture with 126 layers, a token representation dimension of 16,384, and 128 attention heads; see Table 3 for details. This leads to a model size that is approximately compute-optimal according to scaling laws on our data for our training budget of 3.8 × 10²⁵ FLOPs.

Table 3  Overview of the key hyperparameters of Llama 3. We display settings for the 8B, 70B, and 405B language models.

                             8B            70B           405B
  Layers                     32            80            126
  Model Dimension            4,096         8,192         16,384
  FFN Dimension              6,144         12,288        20,480
  Attention Heads            32            64            128
  Key/Value Heads            8             8             8
  Peak Learning Rate         3 × 10⁻⁴      1.5 × 10⁻⁴    8 × 10⁻⁵
  Activation Function        SwiGLU
  Vocabulary Size            128,000
  Positional Embeddings      RoPE (θ = 500,000)

3.2.1 Scaling Laws

We develop scaling laws (Hoffmann et al., 2022; Kaplan et al., 2020) to determine the optimal model size for our flagship model given our pre-training compute budget. In addition to determining the optimal model size, a major challenge is to forecast the flagship model’s performance on downstream benchmark tasks, due to a couple of issues: (1) existing scaling laws typically predict only next-token prediction loss rather than specific benchmark performance, and (2) scaling laws can be noisy and unreliable because they are developed based on pre-training runs conducted with small compute budgets (Wei et al., 2022b).

To address these challenges, we implement a two-stage methodology to develop scaling laws that accurately predict downstream benchmark performance:

1. We first establish a correlation between the compute-optimal model’s negative log-likelihood on downstream tasks and the training FLOPs.

2. Next, we correlate the negative log-likelihood on downstream tasks with task accuracy, utilizing both the scaling law models and older models trained with higher compute FLOPs. In this step, we specifically leverage the Llama 2 family of models.

This approach enables us to predict downstream task performance given a specific number of training FLOPs for compute-optimal models. We use a similar method to select our pre-training data mix (see Section 3.4).

Scaling law experiments. Concretely, we construct our scaling laws by pre-training models using compute budgets between 6 × 10¹⁸ FLOPs and 10²² FLOPs. At each compute budget, we pre-train models ranging in size between 40M and 16B parameters, using a subset of model sizes at each compute budget. In these training runs, we use a cosine learning rate schedule with a linear warmup for 2,000 training steps. The peak learning rate is set between 2 × 10⁻⁴ and 4 × 10⁻⁴ depending on the size of the model. We set the cosine decay to 0.1 of the peak value. The weight decay at each step is set to 0.1 times the learning rate at that step.
We use a fixed batch size for each compute scale, ranging between 250K and 4M tokens.

³https://github.com/openai/tiktoken/tree/main
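The learning rate schedule used in these scaling-law runs (a linear warmup followed by cosine decay to 10% of the peak value) can be written compactly as below. This is a generic sketch of the stated schedule; the total step count and peak learning rate are example values, not the paper's exact settings.

```python
import math

# Generic sketch of the schedule described above: linear warmup for 2,000 steps,
# then cosine decay from the peak learning rate down to 0.1x the peak.
# total_steps and peak_lr are example values, not the paper's exact settings.
def lr_at_step(step: int, peak_lr: float, warmup_steps: int = 2000,
               total_steps: int = 100_000, min_ratio: float = 0.1) -> float:
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)  # decays to 0.1 * peak

for s in (0, 1000, 2000, 50_000, 100_000):
    print(s, f"{lr_at_step(s, peak_lr=4e-4):.2e}")
```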
Figure 2  Scaling law IsoFLOPs curves between 6 × 10¹⁸ and 10²² FLOPs. The loss is the negative log-likelihood on a held-out validation set. We approximate measurements at each compute scale using a second-degree polynomial.

Figure 3  Number of training tokens in identified compute-optimal models as a function of pre-training compute budget. We include the fitted scaling-law prediction as well. The compute-optimal models correspond to the parabola minimums in Figure 2.

These experiments give rise to the IsoFLOPs curves in Figure 2. The loss in these curves is measured on a separate validation set. We fit the measured loss values using a second-degree polynomial and identify the minimum of each parabola. We refer to the minimum of a parabola as the compute-optimal model at the corresponding pre-training compute budget.

We use the compute-optimal models we identified this way to predict the optimal number of training tokens for a specific compute budget. To do so, we assume a power-law relation between the compute budget, C, and the optimal number of training tokens, N(C):

    N(C) = A C^α.

We fit A and α using the data from Figure 2. We find that (α, A) = (0.53, 0.29); the corresponding fit is shown in Figure 3. Extrapolation of the resulting scaling law to 3.8 × 10²⁵ FLOPs suggests training a 402B parameter model on 16.55T tokens.

An important observation is that the IsoFLOPs curves become flatter around the minimum as the compute budget increases. This implies that the performance of the flagship model is relatively robust to small changes in the trade-off between model size and training tokens. Based on this observation, we ultimately decided to train a flagship model with 405B parameters.

Predicting performance on downstream tasks. We use the resulting compute-optimal models to forecast the performance of the flagship Llama 3 model on benchmark data sets. First, we linearly correlate the (normalized) negative log-likelihood of the correct answer in the benchmark and the training FLOPs. In this analysis, we use only the scaling law models trained up to 10²² FLOPs on the data mix described above. Next, we establish a sigmoidal relation between the log-likelihood and accuracy using both the scaling law models and Llama 2 models, which were trained using the Llama 2 data mix and tokenizer. We show the results of this experiment on the ARC Challenge benchmark in Figure 4. We find this two-step scaling law prediction, which extrapolates over four orders of magnitude, to be quite accurate: it only slightly underestimates the final performance of the flagship Llama 3 model.

3.3 Infrastructure, Scaling, and Efficiency

We describe our hardware and infrastructure that powered Llama 3 405B pre-training at scale and discuss several optimizations that lead to improvements in training efficiency.

3.3.1 Training Infrastructure

The Llama 1 and 2 models were trained on Meta’s AI Research SuperCluster (Lee and Sengupta, 2022). As we scaled further, the training for Llama 3 was migrated to Meta’s production clusters (Lee et al., 2024).
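Returning to the token-count extrapolation in Section 3.2.1, the fitted power law can be evaluated directly at the flagship budget. The sketch below uses the fitted constants shown in the Figure 3 legend (A ≈ 0.299, α ≈ 0.537; the text rounds them to 0.29 and 0.53) together with the generic C ≈ 6ND approximation for the implied parameter count; both are illustrative and do not reproduce the paper's exact procedure, which reports 402B parameters on 16.55T tokens.

```python
# Illustrative evaluation of the fitted token-count power law N(C) = A * C**alpha
# at the flagship compute budget. A and alpha are the fitted constants shown in
# Figure 3; the parameter-count step uses the generic C ~= 6 * N_params * D
# approximation, which is not the paper's exact method.
A, alpha = 0.299, 0.537
C = 3.8e25  # flagship pre-training budget in FLOPs

tokens = A * C**alpha
params = C / (6 * tokens)

print(f"compute-optimal training tokens: {tokens:.3e}")  # ~1.6e13, i.e. roughly 16T tokens
print(f"implied parameter count:         {params:.3e}")  # a few hundred billion parameters
```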