Releases: PaddlePaddle/PaddleNLP
v3.0.0-beta3
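This update improves the core PaddleNLP experience: it adds the Llama-3.2 and DeepSeekV2 models, upgrades TokenizerFast, and refactors SFTTrainer.
In addition, PaddleNLP now supports offloading and reloading optimizer states and implements fine-grained recomputation, improving training performance by 7%. For Unified Checkpoint, the asynchronous save logic has been further optimized, and a new checkpoint compression feature can save 78.5% of storage space.
Finally, we made in-depth optimizations to large-model inference, auto parallel, multi-hardware support, and documentation usability.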
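Major Updates and Enhancements
- New models:
- Infrastructure improvements:
- Inference performance improvements:
- Hardware compatibility expansion:
- Auto parallel optimization:
- Documentation and test updates:
This update marks PaddleNLP's continued progress, giving users a more comprehensive, efficient, and stable NLP solution. We look forward to bringing more innovation and value in future releases.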
What's Changed
- [Unified Checkpoint] update async_save_info in develop by @DesmonDay in #9173
- add flashmask rm by @lugimzzz in #9154
- [LLM_INFER] Support quantized model from bos and fix docs by @yuanlehome in #9197
- fix ci not set no_proxy and modify tests in pir mode by @fightfat in #9205
- [Models] Add Llama-3.2 by @DrownFish19 in #9199
- move some auto_parallel args into class AutoTrainingArguments by @Wennie396 in #9155
- [Performance] Compatible with flashmask API rename upgrade by @GuoxiaWang in #9019
- [AutoParallel] add vpp align and pp amp test by @AndSonder in #9176
- fix auto ci return bug when run in v100 by @fightfat in #9216
- fix auto ci return bug when run in v100 by @AndSonder in #9228
- [LLM] Add tools for parameters by @Hanyonggong in #9137
- [AutoParallel] Add test for fuse_ffn and fuse_attention_qkv pass by @zhangbo9674 in #9203
- [CI] Fix ci import. by @ZHUI in #9239
- [Version] Update version info by @DrownFish19 in #9241
- [Auto Parallel] Adding align mode support by @zhangyuqin1998 in #9150
- [LLM INFER] top_p_sampling_reject support top_p=0 and custom seed by @gzy19990617 in #9202
- [INFER] update tune_cublaslt_gemm op and fix some bugs by @yuanlehome in #9222
- Reduce the time spent on git downloading third-party libraries by @vivienfanghuagood in #9246
- [PIR] fix pir open bugs by @yuanlehome in #9248
- Cherry-pick some PRs from incubate/paddlenlp-fleety by @sneaxiy in #9245
- [Unified Checkpoint] Support expert parallel by @DesmonDay in #9055
- [PIR] fix pir dt2st for chatglm_v2 by @yuanlehome in #9251
- Cherry-pick some PRs from incubate/paddlenlp-fleety by @LiYuRio in #9253
- [Unified Checkpoint] Fix generation config save by @DrownFish19 in #9223
- [AutoParallel] Fix tests for pass paddle AutoParallel CI by @liym27 in #9267
- change dataset by @lugimzzz in #9266
- [Unified Checkpoint] update async save logic by @DesmonDay in #9274
- add config file for model chatglm2,gemma,yuan by @Mangodadada in #9139
- Fix async hang by @DesmonDay in #9276
- [AutoParallel] Change llama test from sharding stage2 to stage1 by @zhangbo9674 in #9281
- [Tokenizer] Enable padding_side as call time kwargs by @DrownFish19 in #9258
- [Trainer] fix save_model by @DesmonDay in #9286
- [CI] Skip inference test cases by @DrownFish19 in #9270
- [LLM] Add deepseekv2 by @DrownFish19 in #9250
- [Tokenizer] Unify tokenizer _pad by @DrownFish19 in #9280
- [CI] Fix llm/alignment/rm/flashmask path by @DrownFish19 in #9289
- support attention mask using causal=True by @GuoxiaWang in #9268
- [FlashMask] Add FlashMask for Qwen2 by @DrownFish19 in #9264
- bug fix for xpu_parallel_matmul by @FeixLiu in #9297
- fix lora sharding v2 by @lugimzzz in #9300
- [LLM INFER] Append attn by @yuanlehome in #9244
- [Auto Parallel] fix bugs for split_batches_for_accumulation && fix bu… by @zhangyuqin1998 in #9217
- [Tokenizer] Fix TokenizerFast missing clean_up_tokenization_spaces by @dynamicheart in #9304
- clean llama static modeling file by @zhiqiu in #9301
- [Unified Checkpoint] Accelerate loading checkpoint by multi-thread by @Crystal-X-111 in #9034
- fix non-pipelinelayer to distributed by @gongel in #9310
- change the legacy to slm by @wawltor in #9311
- [TRL] Rename sft trainer. by @ZHUI in #9292
- [XPU] support unified ckpt function by @cqulilujia in #9312
- [LLM INFER] Fix some bugs and chatglm_v2 support block_attn by @yuanlehome in #9271
- [Readme] Add flash mask by @lugimzzz in #9219
- update llm infer docs by @yuanlehome in #9314
- [Unified Checkpoint] Add split param and refactor code by @DesmonDay in #9240
- [METAX] Support llama for MX C550 by @idontkonwher in #9186
- update QR code by @DrownFish19 in #9325
- add flash_attention on model chatglm_v2 by @Mangodadada in #9296
- fix readme by @Mangodadada in #9326
- [Unified Checkpoint] update non-merge checkpoint loading, move async_save_info.json location by @DesmonDay in #9321
- [paddle cpu inference]fix cpu doc by @bukejiyu in #9299
- [LLM INFER] add rope_theta for block_multihead_attention by @yuanlehome in #9334
- Fix pr 9334 by @yuanlehome in #9335
- fix parameter calculation in auto_parallel mode by @zhiqiu in #9327
- [Docs] Update flashmask by @DrownFish19 in #9330
- Update load_save_single_card.py by @DesmonDay in #9337
- Update README.md by @DrownFish19 in #9339
- [Tokenizer] Support reading Tiktoken tokenizer.model. by @lvdongyi in #9215
- align default custom black/white list for dygraph and static graph by @zhiqiu in #9340
- [intel_hpu] initial commit for intel_hpu support by @yanfeich in #9273
- Compatible with Tensor.to change to out_of_place. by @DrownFish19 in https://github.co...
v3.0.0-beta2
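This update strengthens PaddleNLP's infrastructure, adds the Qwen2.5 and Mixtral 8*22B models, upgrades the Tokenizer, and renames the data indexing tool.
It also fixes issues such as MoE model parameter saving and loading, improves text-processing accuracy, and updates documentation and test cases. Inference performance, hardware support, and auto parallel were optimized as well, including support for more models and parameter configurations, multi-GPU inference, enhanced support for Chinese domestic hardware, and an optimized distributed training workflow.
Core Changes and Enhancements
- Infrastructure enhancements:
- Bug fixes:
- Documentation and test updates:
- Other key changes:
- Inference performance optimization:
- Hardware support expansion:
- Auto parallel optimization: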
What's Changed
- [Unified checkpoint] update optimizer async save signal by @DesmonDay in #8975
- Fix the run_dpo.py file path by @Mangodadada in #8952
- fix the loss base in llama_align_dygraph_dy2st_auto_bs2_bf16_DP2-MP1-… by @winter-wang in #8986
- [Bug fix] fix skip consumed_samples twice bug by @zhangyuqin1998 in #8980
- fix pip error in legacy benchmarks by @fightfat in #8978
- 【auto_parallel】Add checkpoint convertor by @xingmingyyj in #8847
- [llm]update finetune.md by @lugimzzz in #8990
- tool_helpers upgrade adds support for 32766 datasets. by @JunnYu in #8994
- add DCU inference docs by @YanhuiDua in #8983
- [Distributed]Add loss nan/inf checker by @ForFishes in #8943
- 【llm】update docs by @lugimzzz in #8999
- [Feature] Fused Mixtral support by @penPenf28 in #8901
- [XPU] Add README.md for llama2-7b by @xiguapipi in #8979
- Add gcu llama readme by @EnflameGCU in #8950
- fix qwen model use_casual_mask by @deepllz in #9009
- [ZeroPadding] revert zero_padding #8973 by @DrownFish19 in #9003
- [LLM Inference] Fix step.cu bug by @yuanlehome in #8995
- Refine checkpoint converter by @zhangbo9674 in #9001
- [Feature] fused mixtral wint4 by @penPenf28 in #9013
- llm inference docs by @Sunny-bot1 in #8976
- [LLM Inference] Support Qwen2_Moe Inference Model by @CJ77Qi in #8892
- fix llama3 static run by @yuanlehome in #8849
- [paddle inference cpu]update cpu inference by @bukejiyu in #8984
- fix the tipc ce case by @wawltor in #8748
- [Cherry-pick] Add is_distributed field in sharding reshard param_meta by @sneaxiy in #9028
- [Tokenizer] Support for loading added_tokens_decoder by @DrownFish19 in #8997
- [Inference] Add a8w8(fp8) a8w8c8(int8) quant_type support by @lixcli in #9032
- Fix checker of nan/inf by @ForFishes in #9029
- [Cherry-pick] add comm buffer size (#8963) by @ForFishes in #9031
- [Unified Checkpoint] Update async save info by @DesmonDay in #8982
- [llm]support pad to max_length & fix sp bug by @lugimzzz in #9040
- [Bugfix] fix bias optional by @penPenf28 in #9037
- fix setup.py for llm inference by @yuanlehome in #9041
- [Inference] Add cutlass gemm dequant op by @gzy19990617 in #8909
- [Inference] update fakequant support by @lixcli in #9047
- add test for pir sequence parallel on llama model by @liym27 in #9015
- Fix moe save load by @Meiyim in #9045
- Update quantization.md by @ZHUI in #9057
- 【Fix】Initialize dp degree in single GPU by @greycooker in #9056
- fix bos download by @westfish in #9023
- [Inference] Update fakequant script by @lixcli in #9054
- [AutoParallel][PIR] Fit pir grad merge by @AndSonder in #8985
- [MLU] Support rms_norm_mlu by @PeiyuLau in #8504
- [Inference] support llama3 a8w8c8_fp8 inference and cutlass_fp8_gemm by @ckl117 in #8953
- [Inference] Qwen2 support fp8 inference by @ckl117 in #8954
- [Version] update version info by @DrownFish19 in #9060
- [NPU] Fix baichuan2-13b-chat infer by @ronny1996 in #9070
- [MLU] Fix Llama attrntion_mask in npu and mlu by @DrownFish19 in #9075
- Fix the memory overflow bug of the tune_cublaslt_gemm operator by @Hanyonggong in #9076
- [Inference] Fix weight_only_int4 bug by @lixcli in #9073
- [Auto Parallel] fix data stream bug of dist.to_static by @zhangyuqin1998 in #9077
- fix hang when Flag_dataloader_use_file_descriptor=True by @deepllz in #9080
- fix llm predict install error by @fightfat in #9088
- [PIR] add pir grad merge test by @AndSonder in #9074
- Update readme by @EnflameGCU in #9046
- [LLM] Add tensor parallel for chatglmv2 by @SevenSamon in #9014
- [data] update tool_helpers version and add unittest by @JunnYu in #9093
- fix baseline because of PR#8769 by @fightfat in #9092
- fix use paddle.incubate.jit.inference(model) errors by @chang-wenbin in #9016
- [CI] Fix paddlepaddle install by @DesmonDay in #9102
- [LLM] fix train on npu by @SylarTiaNII in #9101
- Disable ut by @zhangbo9674 in #9108
- [AutoParallel] Enable CI for gradclip by @JZ-LIANG in #9059
- [Inference] Remove ceval from run_finetune by @lixcli in #9100
- [Bugfix] fix multi-gpu infer by @penPenf28 in #9107
- 【Inference】fix step kernel by @gzy19990617 in #9122
- [DCU] fix DCU w8a8c8 GEMM shape by @YanhuiDua in #9115
- [Inference] FP8 gemm auto-tune by @ckl117 in #9094
- Open ut llama_align_dygraph_dy2st_pir_auto_grad_merge_bs2_fp32_DP1-MP1-PP1 by @zhangbo9674 in #9120
- [LLM Inference] Support Qwen2_Moe Inference with MultiGPU by @CJ77Qi in #9121
- [Unified Checkpoint] Fix uc lora config, fix release_grads by @DesmonDay in #9082
- [Inference]qwen2-a8w8c8 support use_fake_parameter by @ckl117 in #9109
- Add fast_ln spmd rules by @From00 in #9125
- fix pir dtype by @wanghuancoder in #9130
- Remove ring_flash_attention warning by @DrownFish19 in #9119
- [DOC] Fix LLM page 404 Not Found by @DrRyanHuang in #9127
- Add hardware flops for pretraining by @ZHUI in #9069
- [Benchmark] Fix amp level bug in some gpt tests by @zhangbo9674 in #9116
- [Auto Parallel] Fix ckpt_converter for auto_parallel by...
v3.0.0-beta1
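PaddleNLP has been upgraded from v3.0.0-beta0 to v3.0.0-beta1, bringing a number of important updates and enhancements. It introduces the new Yuan, mamba, and jamba models and optimizes the LLM inference code, improving compatibility and efficiency.
For core performance, this release adds a fast tokenizer, implements MoE optimizer parameter broadcast, and accelerates layer normalization. It also fixes several bugs, including a safetensors shape slicing issue and an mmap issue on Windows, improving stability and compatibility.
Documentation and tests were comprehensively updated to keep the documentation accurate and the code readable. Support for Chinese domestic hardware was also strengthened, including DCU and XPU optimizations, along with configuration updates for PIR mode and auto parallel.
Major Changes and New Features
1. New Models and Features
- New models: the Yuan model was introduced in #8654; the mamba and jamba models were added in #8513 and #8517 respectively, and follow-up pull requests fixed related bugs to keep the models running stably.
- LLM inference optimization: across multiple pull requests, we optimized the LLM inference code and added support for new models and parameters, further improving inference efficiency and compatibility.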
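2. Core Performance Optimizations
- Fast tokenizer: in #8832, we added a fast tokenizer based on the tokenizers library, significantly improving tokenization speed.
- MoE optimization: in #8810, we implemented optimizer parameter broadcast for MoE (Mixture of Experts), improving training efficiency.
- Layer normalization acceleration: across multiple pull requests, we added fast_rmsnorm, enabled use_fast_layer_norm, and updated the benchmark configurations, further speeding up training. In particular, #8717 adds support for use_fast_layer_norm during fine-tuning, giving users more flexibility.
- Training performance optimization: in #8803, we added the enable_sp_async_reduce_scatter option, which improves training performance.
- Dict argument support: in #8446, we added support for dict arguments to the trainer's argparser, making argument passing more flexible. In #8904, we also updated the tensorboard requirement to ensure compatibility with the latest version.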
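For readers unfamiliar with the operation that fast_rmsnorm accelerates, here is a minimal reference sketch of RMSNorm in plain Python. It illustrates the standard formula only and is not PaddleNLP's fused kernel; the function name `rms_norm`, the `eps` default, and the toy inputs are assumptions chosen for this example.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm over one hidden vector (plain lists of floats).

    Hypothetical helper for illustration; not a PaddleNLP API.
    """
    # Mean of squares over the hidden dimension.
    mean_sq = sum(v * v for v in x) / len(x)
    # Rescale to roughly unit root-mean-square, then apply the learned gain.
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


hidden = [1.0, -2.0, 3.0, -4.0]
gain = [1.0, 1.0, 1.0, 1.0]
normed = rms_norm(hidden, gain)
```

With a gain of all ones, the output keeps the sign pattern of the input and has a root-mean-square close to 1. A fused kernel computes the same result in a single pass on the device.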
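3. Bug Fixes
- safetensors fix: #8702 fixes a safetensors shape issue.
- mmap fix on Windows: #8734 fixes an mmap issue, improving Windows compatibility.
- Other bug fixes: including fixes in #8687, #8730, and several other pull requests.
4. Documentation and Test Updates
- Documentation: across multiple pull requests, we updated documentation, cleaned up code style, and updated version information to keep the documentation accurate and readable.
- README fixes and enhancements: #8741 fixes broken links in the README; several contributors also updated the README and added new test cases, keeping the documentation in sync with the code.
5. Other Important Changes
Enhanced support for Chinese domestic hardware
- DCU support: #8580 implements high-performance LLM training and inference on DCU, extending PaddleNLP's hardware coverage.
- XPU optimization: #8527 adds LoRA optimization for XPU; #8697 implements allgather on XPU and #8710 fixes the gather issue for unified checkpoint, further improving training efficiency on XPU.
PIR mode support
- Export and loading: #8689 changes how the llama model is exported in PIR mode; #8712 and #8766 add support for loading or saving the Llama2-7b model in three modes (legacy IR, PIR model file, and PIR JSON file), giving users more flexibility and compatibility.
Auto parallel optimization
- Configuration updates: #8679 changes max_steps in the Llama2-7b config to suit auto parallel; #8767 and #8828 refine saving and loading in the auto trainer; #8750 updates the loss for global clip, further improving the efficiency and accuracy of auto parallel.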
What's Changed
- [DCU] high performance LLM train and inference for DCU by @yuguo-Jack in #8580
- fix benchmark dir and add CUDA_DEVICE_MAX_CONNECTIONS to qwen by @fightfat in #8678
- bug fix by @wtmlon in #8687
- [XPU] add lora optimization by @dynamicheart in #8527
- [pir save] Modiy export llama model file in pir mode by @xiaoguoguo626807 in #8689
- [AutoParallel]Change max_steps in Llama2-7b config for auto-parallel. by @heavyrain-lzy in #8679
- [benchmark] Change the mirror source for pip by @mmglove in #8699
- update loss base of auto-parallel tests by @zhiqiu in #8701
- Add new mistral by @wtmlon in #7425
- [Safetensors] Fix safetensors shape by @DesmonDay in #8702
- [BUG] Round num_samples down to prevent prefetching beyond the maximum dataset length... by @JunnYu in #8690
- xpu use allgather by @FeixLiu in #8697
- add fast_rmsnorm by @deepllz in #8680
- enable use_fast_layer_norm for llama2 benchmark by @deepllz in #8714
- fix xpu gather for unified ckpt by @FeixLiu in #8710
- [inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8712
- fix fast_ln backward by @deepllz in #8719
- finetune support use_fast_layer_norm by @tianhaodongbd in #8717
- bug fix by @FeixLiu in #8730
- disable lora by @lugimzzz in #8674
- [Safetensors] Fix mmap for Windows system by @DrownFish19 in #8734
- correct broken links in readme by @jzhang533 in #8741
- revert benchmark fix by @ronny1996 in #8747
- [LLM] Add Yuan model by @zhaogf01 in #8654
- fix nlp dir and auto_parallel_ci exit -6 by @fightfat in #8744
- [LLM] Update sequence parallel linear import by @DrownFish19 in #8706
- [Bug fixes] Fix ring attention by @zhangyuqin1998 in #8740
- update a100 loss by @zhiqiu in #8708
- [PaddleNLP 3.0] Update README by @DrownFish19 in #8681
- [AutoParallel] update loss for global clip by @JZ-LIANG in #8750
- [NPU] Fix sequence parallel lib import by @DrownFish19 in #8760
- [DEV] Update develop version show by @DrownFish19 in #8754
- [inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8766
- add benchmark baichuan2 scripts by @fightfat in #8683
- Add the missing truncation=True in llm/predictor.py by @lszxb in #8768
- fix the ce for the unittest by @wawltor in #8772
- Enable parallel_config to use commas as delimiters. by @Difers in #8677
- fix incorrect token counting in llm/predictor.py by @lszxb in #8769
- Refine savable by @ZHUI in #8758
- [CodeStyle] remove markdownlint-cli by @DrownFish19 in #8779
- [XPU] use allgather and fp32 multinomial for XPU by @houj04 in #8787
- fix version show by @DrownFish19 in #8791
- [BUG] Add 20 redundant data in post pretrain by @JunnYu in #8789
- vera-pissa method added by @TranscenderNing in #8722
- update version by @DrownFish19 in #8792
- [Inference LLM] refine some code in llama wint8/4 by @yuanlehome in #8796
- [DCU] Llama a8w8 inference performance optimization by @Deleter-D in #8800
- [Prediction] Update LLM prediction. by @DesmonDay in #8778
- [Trainer] Add enable_sp_async_reduce_scatter by @DesmonDay in #8803
- [AutoParallel] Refine auto_trainer save load by @zhangbo9674 in #8767
- [MoE] Optimizer parameter broadcast by @DesmonDay in #8810
- [Doc] Update README by @DrownFish19 in #8817
- support Llama3.1 8B 128K generation on single GPU 80GB by @GuoxiaWang in #8811
- add paddle nv-embed-v1 by @Li-Z-Q in #8785
- fix pad_token_id bug by @yuanlehome in #8814
- [DCU] fix llama inference bug on DCU by @Deleter-D in #8815
- [Doc] Add LLaMA3.1 by @DrownFish19 in #8824
- [BUG] Fix build train valid test datasets by @JunnYu in #8826
- Add tune_cublaslt_gemm operator by cublaslt gemm algorithm and generate algo cache file by @Hanyonggong in #8799
- fix tune_cublaslt_gemm compile bug by @yuanlehome in #8844
- [AutoParallel] Refine save and load ckpt for auto_trainer by @zhangbo9674 in #8828
- [Unified Checkpoint] update merge tensor parallel by @DesmonDay in #8856
- [Trainer] update clear_grad by @DesmonDay in #8829
- [Unified Checkpoint] Fix tie_word_embeddings by @DesmonDay in #8795
- [Inference LLM] support static c8 by @yuanlehome in #8833
- support sft mapdataset by @greycooker in #8840
- Cherry pick some changes from incubate branch by @sneaxiy in #8862
- support nested list of dict inputs by @deepllz in #8876
- Fix the bug with issues code 8641. by @smallbenxiong in #8880
- Fix the issue of P-tuning official sample error by @guangyunms in #8884
- modify Paddlemix qwen dytostatic by @xiaoguoguo626807 in #8869
- [llm]fix zeropadding by @lugimzzz in #8895
- Fix the fast_ln operator error when dynamic semi-auto parallel is enabled by @Wennie396 in #8891
- enable_sp_async_reduce_scatter for qwen_72b && llama2_70b by @deepllz in #8897
- Update run_pretrain.py by @...
v3.0.0-beta0
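We are pleased to announce the v3.0.0 beta release of the PaddlePaddle large-model suite: embracing large models with a fully upgraded experience. The details are as follows:
- A unified large-model toolchain, with end-to-end integration for Chinese domestic computing chips;
- Full support for industrial-grade large-model workflows, including PaddlePaddle 4D parallel configuration, efficient fine-tuning strategies, efficient alignment algorithms, and high-performance inference;
- The in-house fast-converging RsLoRA+ algorithm, the Unified Checkpoint storage mechanism with automatic scale-up and scale-down, and generalized support for FastFFN and FusedQKV to speed up large-model training and inference;
- Continued support and updates for mainstream models, providing efficient solutions.
Large-Model Fine-Tuning, Alignment, Training, and Inference Optimizations
- PEFT:
- DPO:
- Domestic chip support:
- Performance optimization:
- Other
- Added model memory monitoring in #8269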
Models Added
- Added the Gemma model in #8082
- google/gemma-7b
- google/gemma-7b-it
- google/gemma-2b
- google/gemma-2b-it
-
- meta-llama/Meta-Llama-3-8B
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Meta-Llama-3-70B
- meta-llama/Meta-Llama-3-70B-Instruct
- Added Qwen2 models in #8338 #8584 #8601
- Qwen/Qwen1.5-0.5B
- Qwen/Qwen1.5-0.5B-Chat
- Qwen/Qwen1.5-1.8B
- Qwen/Qwen1.5-1.8B-Chat
- Qwen/Qwen1.5-4B
- Qwen/Qwen1.5-4B-Chat
- Qwen/Qwen1.5-7B
- Qwen/Qwen1.5-7B-Chat
- Qwen/Qwen1.5-14B
- Qwen/Qwen1.5-14B-Chat
- Qwen/Qwen1.5-32B
- Qwen/Qwen1.5-32B-Chat
- Qwen/Qwen1.5-72B
- Qwen/Qwen1.5-72B-Chat
- Qwen/Qwen1.5-110B
- Qwen/Qwen1.5-110B-Chat
- Qwen/Qwen1.5-MoE-A2.7B
- Qwen/Qwen1.5-MoE-A2.7B-Chat
- Qwen/Qwen2-0.5B
- Qwen/Qwen2-0.5B-Instruct
- Qwen/Qwen2-1.5B
- Qwen/Qwen2-1.5B-Instruct
- Qwen/Qwen2-7B
- Qwen/Qwen2-7B-Instruct
- Qwen/Qwen2-72B
- Qwen/Qwen2-72B-Instruct
- Qwen/Qwen2-57B-A14B
- Qwen/Qwen2-57B-A14B-Instruct
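Core Framework Upgrades
- Feature optimization:
- AutoParallel optimization
- Distributed capability optimization:
- Chat capability optimization:
- Added chat template in #8226
- Other
Bug Fixes
- Fixed the sharding < 100 limitation bug in #8146
- Fixed TP/PP parameter merging in #8239
- Fixed the inconsistency between tensor.shape and paddle.shape(tensor) in #8260
- Fixed a bug with fp16 + delay_scale_loss_scale + sharding_stage1_overlap in #8314
- Added pipelines usage documentation and hints in #8292 #8308 #8202 #8353
- Fixed tokenizer input in the text feature extraction task in #8331
- Fixed import errors in #8332 #8367
Structure Changes
PaddleNLP file structure reorganization in #8609 #8613 #8605 #8614 #8617 #8626 #8618 #8625 #8619 #8629 #8601 #8627 #8666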
What's Changed
- [dist]pip requirements-dev.txt by @Liujie0926 in #8258
- add scaling by @lugimzzz in #8256
- [LLM]Support Gemma model by @Southpika in #8082
- [BugFix] Try except sequence parallel utils by @DesmonDay in #8189
- Update CodeCov GitHub Action by @sijunhe in #8268
- [AutoParallel] Open recompute strategy for llama model by @zhangbo9674 in #8265
- Fix sharding < 100 limitation bug by @sneaxiy in #8146
- use tensor.shape bug not paddle.shape(tensor) by @wanghuancoder in #8260
- [dist CI]update paddlenlp install for CI by @Liujie0926 in #8267
- [Bug Fix]Fix merge parameters in pp by @Southpika in #8239
- [LLM] add memory stats to logger of trainer by @SylarTiaNII in #8269
- Add p2p_comm_overlap for Llama-2-70b benchmark. by @Xreki in #8276
- add a100 test ground truth by @zhiqiu in #8249
- [paddle-pipelines] faq semantic search question answering reamde by @w5688414 in #8292
- [paddle-pipelines] Add pipelines documentation by @w5688414 in #8308
- Support llama-3 by @ZHUI in #8307
- [Distributed] [CustomDevices] Adapt SP on lora && polish MC2 APIs by @SylarTiaNII in #8303
- fix bug for fp16 + delay_scale_loss_scale + sharding_stage1_overlap by @FeixLiu in #8314
- [paddle-pipelines] Update mkdocs by @w5688414 in #8310
- [benchmark]update llama2_ips by @Liujie0926 in #8322
- [dist CI]fix before_hook by @Liujie0926 in #8283
- benchmark llama worker=1 by @wanghuancoder in #8305
- 【AutoParallel】Add llama2 UT for auto-parallel by @heavyrain-lzy in #8300
- Add system env log for llama test by @zhangbo9674 in #8321
- [LLM] Support fuse attention q, k, v weights by @DrownFish19 in #8202
- [Distributed] fix lora by @SylarTiaNII in #8325
- fix try import by @w5688414 in /~https://github.com/PaddlePaddle/Pa...
v2.8.1
What's Changed
- [Trainer] Fix sharding overlap bug by @DesmonDay in #8334
- [Cherry-pick] update truncate by @KB-Ding in #8375
- [BugFix] Fix llama3 eot_id. by @ZHUI in #8373
- [Trainer] update distributed dataloader by @DesmonDay in #8426
- [BugFix] Fix load rng compatibility. by @ZHUI in #8451
- Cherry pick/fast_safe_open by @ZHUI in #8458
- 【cherry pick】adapter new type promotion rule for Paddle 2.6 by @zxcd in #8463
- Quick fix from pretrained. by @ZHUI in #8487
- Release/2.8 by @Galaxy1458 in #8437
- Fix from_pretrained os.path.split by @DesmonDay in #8508
- [fea] Cherry-picked MOE updates from develop by @bo-ke in #8531
- [LLM] relocate tensor_parallel_output to avoid conflict (#8419) by @DesmonDay in #8533
- Update sequence_parallel for predict by @DesmonDay in #8547
- Cp/fix by @ZHUI in #8569
- Do not save moe_group by @DesmonDay in #8570
- [Release] 2.8.1 by @ZHUI in #8636
Full Changelog: v2.8.0...v2.8.1
v2.8.0
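We are pleased to announce v2.8.0 of the PaddlePaddle large-model suite. In this release, we deeply optimized the suite's fine-tuning and alignment capabilities for large models and improved its training and inference capabilities on Chinese domestic computing hardware. The details are as follows:
- Distinctive fine-tuning and efficient alignment: the in-house fast-converging RsLoRA+ algorithm substantially improves PEFT training convergence speed and training quality; high-performance generation acceleration is introduced into the RLHF PPO algorithm, removing the generation-speed bottleneck in PPO training and giving PPO training a substantial performance lead.
- Faster large-model training: generalized support for multiple training performance optimizations, such as FastFFN and FusedQKV, makes large-model training faster and more stable.
Large-Model Fine-Tuning, Alignment, Training, and Inference Optimizations
- Fine-tuning
- Inference
- Added static graph inference for QWenVL #7808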
Models Added
- Added static graph inference for QWenVL #7808
- Added Deberta and Debertav2 models #8227
- deepset/deberta-v3-large-squad2
- microsoft/deberta-v2-xlarge
- microsoft/deberta-v3-base
- microsoft/deberta-v3-large
- microsoft/deberta-base
- Added mixtral-of-experts #7803
- mistralai/Mixtral-8x7B-Instruct-v0.1
- mistralai/Mixtral-8x7B-v0.1
- Added LLama3 #8315
- meta-llama/Meta-llama-3-8b
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Meta-llama-3-70b
- meta-llama/Meta-Llama-3-70B-Instruct
Core Framework Upgrades
- Trainer upgrades
- AutoParallel upgrades
- Other
Other Support
- Added the Matryoshka representation learning retrieval strategy, which saves compute and storage resources. #8165
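To make the compute and storage saving concrete, the sketch below shows the retrieval-time side of Matryoshka representation learning: embeddings are trained so that a leading prefix of the vector is itself a usable embedding, so an index can keep only the first few dimensions. This is a toy illustration rather than the implementation from #8165; the helper names and the hand-written 8-dimensional vectors are assumptions made for the example.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(v * v for v in head))
    return [v / norm for v in head] if norm > 0 else head


def cosine(a, b):
    """For unit-length vectors the dot product equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


# Toy 8-dimensional embeddings (made-up values, not model outputs).
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.02, -0.01, 0.03]
docs = {
    "doc_a": [0.8, 0.2, 0.25, -0.1, -0.04, 0.01, 0.02, -0.03],
    "doc_b": [-0.5, 0.7, -0.1, 0.4, 0.03, -0.02, 0.01, 0.05],
}


def rank(dim):
    """Rank documents by cosine similarity using only the first `dim` dims."""
    q = truncate_and_normalize(query, dim)
    scores = {
        name: cosine(q, truncate_and_normalize(vec, dim))
        for name, vec in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


full_ranking = rank(8)   # full-size embeddings
short_ranking = rank(4)  # half the storage and half the dot-product work
```

In this toy case the 4-dimensional prefix yields the same ranking as the full vector. Whether that holds in practice depends on the model having been trained with a Matryoshka-style objective.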
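Bug Fixes
- Changed log levels and added timelog timing logs, compatible with different devices. #8261
- Fixed inconsistent randomly initialized shared weights in pipeline parallel, covering models such as GPT and OPT. #7772
- Disabled downloading from the huggingface hub in CI and unit tests #7798 #8198
- Fixed duplicated concatenation of query and history when chat template is enabled in the LLM gradio app. #7992
- Fixed a key error when downloading GPT models. #8253
- Fixed LlamaRotaryEmbedding #7882
- Fixed an allreduce dtype issue #7876
- Fixed an issue caused by the framework dev branch removing the paddle.jit.dy2static.utils_helper API #7989
- Fixed the read-data timer when ignore_data_skip=False and skip_profile_timer=False. #8177
- Fixed Wandb unit tests #8066 #8056
- Fixed an error when the Trainer parses JSON and command-line list arguments at the same time #7860
- Fixed inference issues in the Gradio UI #7740 #7788
- Fixed basic Tokenizer issues #7797 #7870
- Fixed loading rng state on custom devices. #7894
- Fixed garbled BF16 loss output in auto parallel #7874
- Initialized models in float to fix an AMP error in static-graph auto parallel #8033 #8199
- Fixed incorrect use of the ShardDataloader interface under pipeline parallelism #8014
- Fixed a llama precision issue on custom devices. #7895
- Fixed an NPU AICPU operator issue #7976
- Fixed FusedLinearWithGradAdd passing too few arguments. #8178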
What's Changed
- [Unified Checkpoint] Add unified checkpoint training args doc. by @DesmonDay in #7756
- [AutoParallel] Auto Trans PP to VPP by @zhaoyinglia in #7747
- Add codecov check by @zjjlivein in #7760
- [CE] Delete gpt_for_sequence_classification by @ZHUI in #7757
- [DOC] Update trainer.md by @ZHUI in #7761
- [Release] Change version to 2.7.0 by @ZHUI in #7764
- [benchmark]close skip_memory_metrics for ips by @Liujie0926 in #7732
- [Release] Update release.yml to release tags by @ZHUI in #7765
- [AutoParallel] Add Sequence Parallel for Static LLaMA by @JZ-LIANG in #7746
- [New Features] support dynamic src_length by @wj-Mcat in #7740
- Fix unified_checkpoint bug by @DrownFish19 in #7770
- [DONE] aistudio, hf hub, bos update download by @JunnYu in #7608
- [Trainer] Fix dist dataloader eval by @DesmonDay in #7777
- [Paddle-pipelines] Update convert_files_to_dicts_splitter by @w5688414 in #7748
- [PEFT]fix lora model tp when existing other trainable module by @lugimzzz in #7781
- [Paddle-Pipelines] update faiss by @qingzhong1 in #7793
- Fix shared weights sync for PipelineLayer by @DrownFish19 in #7772
- [tests] download slow by @JunnYu in #7798
- [INFER][LLM] Support qwen in fined grained dybatch v1 by @DanGuge in #7644
- Add CE for Distributed Hybrid Parallel by @iosmers in #7782
- add MP2-SP2-pp4-vpp2-SD2-stage1-mbs2-acc8 ce by @tianhaodongbd in #7774
- [Pretrain] Fix eval during pretrain by @DesmonDay in #7806
- pipeline parallel benchmark by @zhangting2020 in #7759
- [Bug fixes] fix br gradio by @wj-Mcat in #7788
- delete useless code for write_cache_kv.cu by @yuanlehome in #7812
- [llm]support qlora pp by @lugimzzz in #7801
- Trainer support simultaneously parse JSON files and cmd arguments. by @greycooker in #7768
- [LLM] Support block_attention/cachekv quant for llama by @RichardWooSJTU in #7649
- [Bug Fix] fix paddle multipy_fwd_func warning message by @BeingGod in #7818
- [llm]fix lora by @lugimzzz in #7824
- fused rms spmd by @liuzhenhai93 in #7830
- [Pretrain] Fix eval during pretrain by @DesmonDay in #7827
- [neural search][fix bug of evaluate.py] by @ZeyuTeng96 in #7832
- [neural search] fix the bug of reading files when calculating the recall scores by @shenghwa in #7836
- [Bug fixes] update chatglm tokenizer by @wj-Mcat in #7797
- [semantic_indexing] fix bug of evaluate.py by @ZeyuTeng96 in #7843
- [faq] fix bug of evaluate.py by @ZeyuTeng96 in #7840
- [text_classification_retrieval_based] fix bug of evaluate.py by @ZeyuTeng96 in #7844
- [LLM] add Qwen-7B-Chat to PaddleNLP unit test by @ziangqin-baidu in #7823
- Support 5.2 bloom by @zhoutianzi666 in #7846
- [unified checkpoint] Fix last checkpoint save by @DrownFish19 in #7854
- [unified checkpoint] fix checkpoint names by @DrownFish19 in #7795
- [New Features]add ranks testing for test_predictor by @wj-Mcat in #7800
- [Auto Parallel] Support dynamic semi-auto training in Llama2 model by @haohongxiang in #7851
- [CI] add ci approval pipelines by @zjjlivein in #7859
- [fix] fix a bug of trainer/argparser.py by @greycooker in #7860
- [Improvement] fix ops improting in utils by @wj-Mcat in #7865
- [Add CE] Add CE for Hybrid Parallism by @iosmers in #7817
- [Unified Checkpoint] Cherry pick empty cache. by @ZHUI in #7868
- Add PPO training. by @guoshengCS in #7305
- Update reward_main.py by @wawltor in #7880
- Update ppo_main.py by @wawltor in #7881
- [LLM] revert benchmark codes by @RichardWooSJTU in #7871
- [LLM]support QWenVL second part by @DanGuge in #7808
- [Bug Fixes] update chatglm1 tokenizer by @wj-Mcat in #7870
- 【AutoParallel】Support 'master_grad' in Llama in static auto-parallelism by @heavyrain-lzy in #7658
- [Bug Fix] fix slice bug in LlamaRotaryEmbedding by @MarioLulab in #7882
- 【AutoParallel】Support bf16 loss in static by @heavyrain-lzy in #7874
- [Bug Fix] fix allreduce tensor dtype by @BeingGod in #7876
- [CE] Add Qwen into CE process by @ziangqin-baidu in #7887
- [Hackathon 5th No.73] ToT by @ErnestinaQiu in #7660
- [CustomDevice] fix loading rng state on custom devices by @SylarTiaNII in #7894
- [LLM] ...
v2.7.2
This release fixes a few minor issues.
What's Changed
- [Unified Checkpoint] fix checkpoint names by @DrownFish19 in #7794
- [Unified Checkpoint] Fix last checkpoint save by @DrownFish19 in #7810
- [PEFT] Cherry pick lora fix by @lugimzzz in #7826
- [Unified Checkpoint] Fix unified checkpoint by empty cache. by @ZHUI in #7855
- [Fix Download] update converted logic & fix hf hub download subfolder bug by @JunnYu in #7911
- [Cherry-pick] logger level by @KB-Ding in #7920
- [Cherry-pick] RuntimeTimer for the toolkit (#7913) by @KB-Ding in #7921
- [Release] 2.7.2 for paddlenlp bugfix. by @ZHUI in #7892
Full Changelog: v2.7.1...v2.7.2
v2.7.1
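This release fixes a few minor issues.
What's Changed
- Fixed several issues encountered when resuming training by @ZHUI in #7771
- Fixed GPT initialization in Pipeline mode by @DrownFish19 in #7775
- Fixed an issue with dist dataloader during evaluation. by @DesmonDay in #7778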
Full Changelog: v2.7.0...v2.7.1
PaddleNLP 2.7.0 Release Note
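We are pleased to announce v2.7.0 of the PaddlePaddle large-model suite. In this release, we deeply optimized the suite's large-model capabilities, with major improvements in usability, performance, and stability.
Overall, the highlights of this release are:
- Unified entry point for the large-model toolchain. The implementation code for pre-training, fine-tuning, compression, inference, and deployment is unified under the PaddleNLP/llm directory.
- Brand-new large-model toolchain documentation. A one-stop guide that takes users from getting started with large models to production deployment. See: https://paddlenlp.readthedocs.io/zh/latest/llm/finetune.html
- Unified Checkpoint, a unified checkpoint storage mechanism. When saving a checkpoint, model weights, optimizer weights, and so on are stored in a unified safetensors format regardless of the distributed strategy, and training can be resumed with dynamic scale-up or scale-down, greatly improving the generality of large-model storage.
- Upgraded efficient fine-tuning. Supports using efficient fine-tuning together with LoRA, and supports algorithms such as QLoRA.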
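End-to-End Large-Model Training and Inference Workflow
- Pre-training
- Unified the pre-training entry point to llm/run_pretrain.py.
- Added pre-training support for models such as qwen, with flash attention support.
- Fine-tuning
- Supports using LoRA together with Linear quantization
- Supports using pipeline-parallel models together with LoRA
- Supports the NEFTune method
- Added QLoRA support
- Compression
- Supports PTQ and QAT quantization, including A8W8, WINT8, WINT4, and A8W4
- Supports quantization algorithms such as SmoothQuant, GPTQ, and AWQ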
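As a rough illustration of what a weight-only INT8 scheme involves, the sketch below applies per-output-channel absmax quantization to a small weight matrix and then dequantizes it. It assumes WINT8 refers to weight-only INT8 quantization, and it is not PaddleNLP's quantization code; the function names and the toy weights are hypothetical.

```python
def quantize_weight_int8(weight):
    """Per-output-channel absmax quantization of a 2-D weight (list of rows).

    Returns integer rows in [-127, 127] plus one float scale per row.
    Hypothetical helper for illustration; not a PaddleNLP API.
    """
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # Map the largest magnitude in the row onto 127; avoid a zero scale.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate float weights from integers and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weight = [[0.12, -0.5, 0.33], [1.5, 0.02, -0.75]]
q_weight, scales = quantize_weight_int8(weight)
restored = dequantize(q_weight, scales)
max_err = max(
    abs(a - b) for ra, rb in zip(weight, restored) for a, b in zip(ra, rb)
)
```

Each weight then needs one byte instead of two or four, and the round-trip error per element is bounded by half of that row's scale. Activations stay in floating point in a weight-only scheme, which is the difference from A8W8.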
Unified Checkpoint
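- With large models, training is usually distributed across multiple devices, and the model weights obtained when saving a checkpoint are usually sharded, for example split according to tensor parallelism and pipeline parallelism. Storing checkpoints directly according to the distributed strategy is simple and straightforward, but it has the following problems:
- It is unfriendly to downstream inference: when users want to take a checkpoint saved mid-training and run downstream inference, they have to merge the model weights manually.
- It copes poorly with resuming training when the distributed strategy or the number of training nodes changes. Users often have to process the checkpoint manually, which adds operational complexity.
- To address these problems as far as possible and make things easier for users, we upgraded the large-model storage framework and propose a unified storage scheme for large models: Unified Checkpoint. Its core idea is to store model weights, optimizer weights, and so on in a unified safetensors format, no longer distinguishing between distributed strategies when saving a checkpoint, which improves the generality of large-model storage.
- Unified Checkpoint has the following features:
- Weights are stored independently of the distributed strategy, in a unified safetensors format;
- It flexibly supports scaling large-model training up or down, and adapts to switching between different distributed training strategies.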
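The idea of strategy-independent storage can be sketched with a toy example: shards held by tensor-parallel ranks are merged into one full weight, which can later be re-split for a different parallel degree. This is a conceptual illustration in plain Python, not the Unified Checkpoint implementation; the function names, the column-wise split, and the tiny matrix are assumptions made for the example.

```python
def split_columns(matrix, parts):
    """Split a 2-D weight (list of rows) column-wise into `parts` equal shards."""
    cols = len(matrix[0])
    assert cols % parts == 0, "column count must divide evenly"
    step = cols // parts
    return [
        [row[i * step:(i + 1) * step] for row in matrix] for i in range(parts)
    ]


def merge_columns(shards):
    """Concatenate column shards back into the full weight, row by row."""
    rows = len(shards[0])
    return [sum((shard[r] for shard in shards), []) for r in range(rows)]


full_weight = [[1, 2, 3, 4], [5, 6, 7, 8]]

# What four tensor-parallel ranks would each hold during training.
tp4_shards = split_columns(full_weight, 4)

# Saving: merge into one strategy-independent tensor.
unified = merge_columns(tp4_shards)

# Resuming with two ranks: re-split the unified tensor for the new layout.
tp2_shards = split_columns(unified, 2)
```

Because the stored tensor is the merged one, downstream inference can load it directly, and a resumed run only needs to know its own parallel layout, not the layout used when the checkpoint was written.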
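Models Added
- moka-ai/m3e-base retrieval model
- BAAI/bge-small-zh-v1.5 retrieval model
Core Framework Upgrades
- Trainer upgrades
- With "--skip_memory_metrics 0", real-time GPU memory and host memory usage is displayed
- "--unified_checkpoint" and "--unified_checkpoint_config" support saving models under hybrid parallelism and restarting with dynamic scale-up or scale-down.
- Added the PretrainModelPipe base class to support pipeline-parallel training.
Other Support
- Added display of the paddlenlp commit id via paddlenlp.version.commit
- Added support for AI Studio download and saving to the aistudio hub
Bug Fixes
- Fixed some dist_dataloader issues
- Fixed some dynamic-to-static conversion issues for models
- Fixed some GPT training bugs and removed GPT2. Fixed some seed-setting issues
- Fixed some issues with the baichuan model under pipeline parallelism.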
New Contributors
- @Wennie396 made their first contribution in #6897
- @Wong4j made their first contribution in #7008
- @yuanlehome made their first contribution in #7080
- @Xreki made their first contribution in #7105
- @Tom-Zheng made their first contribution in #7092
- @TimeYWL made their first contribution in #7122
- @From00 made their first contribution in #7168
- @RichardWooSJTU made their first contribution in #7186
- @heavyrain-lzy made their first contribution in #7269
- @LokeZhou made their first contribution in #7337
- @JZ-LIANG made their first contribution in #7301
- @WAI-clear made their first contribution in #7402
- @tianhaodongbd made their first contribution in #7293
- @zzjjay made their first contribution in #7504
- @anexplore made their first contribution in #7558
- @niuliling123 made their first contribution in #7528
- @zxcd made their first contribution in #7577
- @MayYouBeProsperous made their first contribution in #7575
- @iosmers made their first contribution in #7613
- @AndSonder made their first contribution in #7343
- @zhink made their first contribution in #7679
- @kingTLE made their first contribution in #7708
Full Changelog: v2.6.1...v2.7.0
v2.6.1
What's Changed
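In v2.6.1, we fixed a large number of bugs, improving the stability of LLM models and related components. Besides bug fixes, the main new features are:
- LLM: added the qwen model; the InTokens data flow is now compatible with Pipeline Parallel; LLM fine-tuning supports loading from multiple training files and warm start; and the recompute granularity options for the LLaMA model were enhanced
- Trainer: added the hybrid_parallel_topo_order option and fixed model saving under sharding stage3.
- Paddle-pipelines: added support for ERNIE-Bot-turbo and ERNIE-embedding, updated the hierarchical search example, and enhanced the ChatPaper UI
- Megatron datasets: added support for loading megatron datasets, with the ernie-1.0 and T5 data types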
New Contributors
- @xiezheng-XD made their first contribution in #6764
- @carryyu made their first contribution in #6676
- @xiaoxiaohehe001 made their first contribution in #6798
- @MARD1NO made their first contribution in #6865
- @zhoutianzi666 made their first contribution in #6905
- @lchdl made their first contribution in #6964
- @LaiXinyi823 made their first contribution in #6659
Full Changelog: v2.6.0...v2.6.1