
[BUG] Fix build train valid test datasets #8823

Merged

Conversation

JunnYu
Member

@JunnYu JunnYu commented Jul 29, 2024

PR types

Bug fixes

PR changes

APIs

Description

Syncs the updated code from /~https://github.com/NVIDIA/NeMo/blob/72f630d087d45655b1a069dc72debf01dfdbdb2d/nemo/collections/nlp/data/language_modeling/megatron/gpt_dataset.py#L74-L80.
When obtaining the number of samples, the original (un-expanded) count must be used rather than the expanded count.

import numpy as np  
  
def build_blending_indices(dataset_index, dataset_sample_index, weights, num_datasets, size, verbose):  
    """  
    Given multiple datasets and a weighting array, build samples such that it follows those weights.  
      
    Parameters:  
    - dataset_index: NumPy array to store the dataset index for each sample.  
    - dataset_sample_index: NumPy array to store the sample index within each dataset.  
    - weights: NumPy array of weights for each dataset.  
    - num_datasets: Integer, the number of datasets.  
    - size: Integer, the total number of samples to generate.  
    - verbose: Boolean, whether to print verbose output.  
    """  
    if verbose:  
        print("> building indices for blendable datasets ...")  
  
    # Initialize buffer for number of samples used for each dataset.  
    current_samples = np.zeros(num_datasets, dtype=np.int64)  
  
    # For each sample:  
    for sample_idx in range(size):  
        # Determine where the max error in sampling is happening.  
        sample_idx_double = max(sample_idx, 1)  
        max_error_index = 0  
        max_error = weights[0] * sample_idx_double - current_samples[0]  
        for dataset_idx in range(1, num_datasets):  
            error = weights[dataset_idx] * sample_idx_double - current_samples[dataset_idx]  
            if error > max_error:  
                max_error = error  
                max_error_index = dataset_idx  

        # Populate the indices.  
        dataset_index[sample_idx] = max_error_index  
        dataset_sample_index[sample_idx] = current_samples[max_error_index]  
        # Update the total samples.  
        current_samples[max_error_index] += 1  
  
    # Print info  
    if verbose:  
        print(" > sample ratios:")  
        for dataset_idx in range(num_datasets):  
            ratio = current_samples[dataset_idx] / size  
            print(f"   dataset {dataset_idx}, input: {weights[dataset_idx]}, achieved: {ratio}")  

weights = [6.76142772e-01, 4.65481872e-03, 1.50378956e-02, 1.98387035e-04,
    5.06176985e-03, 6.97636962e-04, 3.13866567e-03, 3.20165998e-02,
    3.60524667e-03, 6.90657464e-03, 2.26846735e-02, 7.34873296e-03,
    3.92887512e-05, 3.91225911e-03, 1.01479806e-02, 2.12045055e-03,
    7.26073523e-03, 1.52476576e-02, 4.77683574e-03, 6.46679117e-02,
    4.21797692e-02, 6.46304351e-02, 5.13122989e-03, 1.99474356e-03,
    5.01820438e-06, 3.14517551e-05, 7.83280489e-05, 7.54838022e-05,
    1.34179804e-04, 7.24675664e-05]

num_datasets = len(weights)

expanded_size = 4347
verbose = False  
dataset_index = np.zeros(expanded_size, dtype=np.uint8)
dataset_sample_index = np.zeros(expanded_size, dtype=np.int64)
build_blending_indices(dataset_index, dataset_sample_index, weights, num_datasets, expanded_size, verbose)
# dataset 0's total number of samples: 2548
print("After expansion:", dataset_sample_index[dataset_index == 0], "exceeds dataset 0's total number of samples (2548)")

raw_size = 3727
dataset_index = np.zeros(raw_size, dtype=np.uint8)
dataset_sample_index = np.zeros(raw_size, dtype=np.int64)
build_blending_indices(dataset_index, dataset_sample_index, weights, num_datasets, raw_size, verbose)
# dataset 0's total number of samples: 2548
print("Before expansion:", dataset_sample_index[dataset_index == 0], "stays within dataset 0's total number of samples (2548)")
# After expansion: [   0    1    2 ... 2936 2937 2938] exceeds dataset 0's total number of samples (2548)
# Before expansion: [   0    1    2 ... 2516 2517 2518] stays within dataset 0's total number of samples (2548)
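
As a quick sanity check, here is a minimal sketch (not part of the PR's actual changes) that reuses build_blending_indices and the weights defined above to count how many samples a given blending size draws from dataset 0; the 2548-sample total for dataset 0 is the figure quoted in the output above.

def samples_drawn_from_dataset_0(size):
    """Count how many of `size` blended samples come from dataset 0."""
    index = np.zeros(size, dtype=np.uint8)
    sample_index = np.zeros(size, dtype=np.int64)
    build_blending_indices(index, sample_index, weights, num_datasets, size, verbose=False)
    return int((index == 0).sum())

# Dataset 0 only holds 2548 samples (see above).
print(samples_drawn_from_dataset_0(4347))  # expanded size: draws more than 2548 samples
print(samples_drawn_from_dataset_0(3727))  # raw size: stays within 2548 samples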


paddle-bot bot commented Jul 29, 2024

Thanks for your contribution!

@JunnYu JunnYu requested a review from ZHUI July 29, 2024 08:22
Collaborator

@ZHUI ZHUI left a comment


LGTM

@JunnYu JunnYu merged commit 31c3b55 into PaddlePaddle:release/2.8 Jul 29, 2024
4 of 5 checks passed
DesmonDay pushed a commit to DesmonDay/PaddleNLP that referenced this pull request Sep 5, 2024
DesmonDay added a commit that referenced this pull request Sep 5, 2024
* quick fix from pretrained. (#8487)

* quick fix os.path.split (#8508)

* Cp/fix (#8569)

* [Safetensors] Fix fast safe open slice. (#8512)
* [FIX DDP] fix ddp (#8549)

* [BUG] Fix build train valid test datasets (#8823)

* Update causal_dataset.py

* Add twenty redundant data in post pretrain (#8777)

* Add 20 more samples to the dataset to prevent errors in the blend dataset

* Round num_samples down to prevent dataset overflow (#8691)

* update release_grads (#8834)

* update release_grads (#8834)

* [Trainer] Fix release_grads (#9085)

* fix pp release_grads

* add dataloader_drop_last to evaldataloader (#8773)

* bugfix

* Fix eval hang (#9052)

* fix pipeline eval

* fix eval dataloader_num_workers

---------

Co-authored-by: Zhong Hui <zhonghui.net@gmail.com>
Co-authored-by: yujun <50394665+JunnYu@users.noreply.github.com>
Co-authored-by: gongel <ainlp88@qq.com>