
[BUG] Fix build train valid test datasets #8823

Merged

Conversation

JunnYu
Member

@JunnYu JunnYu commented Jul 29, 2024

PR types

Bug fixes

PR changes

APIs

Description

Syncs the updated code from /~https://github.com/NVIDIA/NeMo/blob/72f630d087d45655b1a069dc72debf01dfdbdb2d/nemo/collections/nlp/data/language_modeling/megatron/gpt_dataset.py#L74-L80.
When obtaining the number of samples, the original (un-expanded) count must be used rather than the expanded count.

import numpy as np  
  
def build_blending_indices(dataset_index, dataset_sample_index, weights, num_datasets, size, verbose):  
    """  
    Given multiple datasets and a weighting array, build samples such that it follows those weights.  
      
    Parameters:  
    - dataset_index: NumPy array to store the dataset index for each sample.  
    - dataset_sample_index: NumPy array to store the sample index within each dataset.  
    - weights: NumPy array of weights for each dataset.  
    - num_datasets: Integer, the number of datasets.  
    - size: Integer, the total number of samples to generate.  
    - verbose: Boolean, whether to print verbose output.  
    """  
    if verbose:  
        print("> building indices for blendable datasets ...")  
  
    # Initialize buffer for number of samples used for each dataset.  
    current_samples = np.zeros(num_datasets, dtype=np.int64)  
  
    # For each sample:  
    for sample_idx in range(size):  
        # Determine where the max error in sampling is happening.  
        sample_idx_double = max(sample_idx, 1)  
        max_error_index = 0  
        max_error = weights[0] * sample_idx_double - current_samples[0]  
        for dataset_idx in range(1, num_datasets):  
            error = weights[dataset_idx] * sample_idx_double - current_samples[dataset_idx]  
            if error > max_error:  
                max_error = error  
                max_error_index = dataset_idx  

        # Populate the indices.  
        dataset_index[sample_idx] = max_error_index  
        dataset_sample_index[sample_idx] = current_samples[max_error_index]  
        # Update the total samples.  
        current_samples[max_error_index] += 1  
  
    # Print info  
    if verbose:  
        print(" > sample ratios:")  
        for dataset_idx in range(num_datasets):  
            ratio = current_samples[dataset_idx] / size  
            print(f"   dataset {dataset_idx}, input: {weights[dataset_idx]}, achieved: {ratio}")  

weights = [6.76142772e-01, 4.65481872e-03, 1.50378956e-02, 1.98387035e-04,
    5.06176985e-03, 6.97636962e-04, 3.13866567e-03, 3.20165998e-02,
    3.60524667e-03, 6.90657464e-03, 2.26846735e-02, 7.34873296e-03,
    3.92887512e-05, 3.91225911e-03, 1.01479806e-02, 2.12045055e-03,
    7.26073523e-03, 1.52476576e-02, 4.77683574e-03, 6.46679117e-02,
    4.21797692e-02, 6.46304351e-02, 5.13122989e-03, 1.99474356e-03,
    5.01820438e-06, 3.14517551e-05, 7.83280489e-05, 7.54838022e-05,
    1.34179804e-04, 7.24675664e-05]

num_datasets = len(weights)

expanded_size = 4347
verbose = False  
dataset_index = np.zeros(expanded_size, dtype=np.uint8)
dataset_sample_index = np.zeros(expanded_size, dtype=np.int64)
build_blending_indices(dataset_index, dataset_sample_index, weights, num_datasets, expanded_size, verbose)
# dataset 0's total number of samples: 2548
print("After expansion:", dataset_sample_index[dataset_index == 0], "exceeds dataset 0's total number of samples (2548)")

raw_size = 3727
dataset_index = np.zeros(raw_size, dtype=np.uint8)
dataset_sample_index = np.zeros(raw_size, dtype=np.int64)
build_blending_indices(dataset_index, dataset_sample_index, weights, num_datasets, raw_size, verbose)
# dataset 0's total number of samples: 2548
print("Before expansion:", dataset_sample_index[dataset_index == 0], "stays within dataset 0's total number of samples (2548)")
# After expansion: [   0    1    2 ... 2936 2937 2938] exceeds dataset 0's total number of samples (2548)
# Before expansion: [   0    1    2 ... 2516 2517 2518] stays within dataset 0's total number of samples (2548)
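
As a quick sanity check, here is a minimal sketch (not part of the PR's actual changes) that reuses build_blending_indices and the weights defined above to count how many samples a given blending size draws from dataset 0; the 2548-sample total for dataset 0 is the figure quoted in the output above.

def samples_drawn_from_dataset_0(size):
    """Count how many of `size` blended samples come from dataset 0."""
    index = np.zeros(size, dtype=np.uint8)
    sample_index = np.zeros(size, dtype=np.int64)
    build_blending_indices(index, sample_index, weights, num_datasets, size, verbose=False)
    return int((index == 0).sum())

# Dataset 0 only holds 2548 samples (see above).
print(samples_drawn_from_dataset_0(4347))  # expanded size: draws more than 2548 samples
print(samples_drawn_from_dataset_0(3727))  # raw size: stays within 2548 samples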


paddle-bot bot commented Jul 29, 2024

Thanks for your contribution!

@JunnYu JunnYu requested a review from ZHUI July 29, 2024 08:22
Collaborator

@ZHUI ZHUI left a comment


LGTM

@JunnYu JunnYu merged commit 31c3b55 into PaddlePaddle:release/2.8 Jul 29, 2024
4 of 5 checks passed
DesmonDay pushed a commit to DesmonDay/PaddleNLP that referenced this pull request Sep 5, 2024
DesmonDay added a commit that referenced this pull request Sep 5, 2024
* quick fix from pretrained. (#8487)

* quick fix os.path.split (#8508)

* Cp/fix (#8569)

* [Safetensors] Fix fast safe open slice. (#8512)
* [FIX DDP] fix ddp (#8549)

* [BUG] Fix build train valid test datasets (#8823)

* Update causal_dataset.py

* Add twenty redundant data in post pretrain (#8777)

* Add 20 more samples to the dataset to prevent errors in the blend dataset

* Round num_samples down to prevent dataset overflow (#8691)

* update release_grads (#8834)

* update release_grads (#8834)

* [Trainer] Fix release_grads (#9085)

* fix pp release_grads

* add dataloader_drop_last to evaldataloader (#8773)

* bugfix

* Fix eval hang (#9052)

* fix pipeline eval

* fix eval dataloader_num_workers

---------

Co-authored-by: Zhong Hui <zhonghui.net@gmail.com>
Co-authored-by: yujun <50394665+JunnYu@users.noreply.github.com>
Co-authored-by: gongel <ainlp88@qq.com>