Commit 4509f43 (1 parent: ee6b420)
Showing 42 changed files with 10,235 additions and 88 deletions.
README.md
@@ -1,2 +1,61 @@
# ChineseEHRBert
A Chinese Electronic Health Record (EHR) BERT pretrained model.

[中文版](./README_zh.md)

# cleaner
The cleaner prepares raw .txt files for pretraining a Chinese BERT model. It splits each line of the original text into smaller lines, each of which is a complete sentence ending in a punctuation mark; this format is required by the next-sentence-prediction pretraining task.
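As an illustration only, a minimal Python sketch of this punctuation-based splitting; `split_sentences` is a hypothetical name, and the actual logic lives in `parser.py`:
```
import re

def split_sentences(line):
    # Split after common Chinese and ASCII sentence-ending punctuation,
    # keeping the punctuation mark attached to its sentence.
    parts = re.split(r"(?<=[。!?!?;;])", line.strip())
    return [p for p in parts if p]

# One long record line becomes two complete sentences:
print(split_sentences("患者三天前发热。伴咳嗽,无咳痰。"))
# ['患者三天前发热。', '伴咳嗽,无咳痰。']
```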
## usage
```
cd ./cleaner/
python parser.py [-h] [--input INPUT] [--output OUTPUT] [-s] [--log LOG]
```
- --input: input directory
- --output: output directory
- -s: write the output as a single file
- --log: logging frequency
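For example, a hypothetical invocation (the directory names are placeholders, not paths from this repo):
```
python parser.py --input ./raw_txt/ --output ./cleaned/ -s --log 10000
```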

# train
Pre-train a BERT model on the cleaned text. The text must first be converted to .tfrecord files, which are then fed to Google's BERT pretraining code. Note that a cleaned file may be too large to fit in RAM, so our script splits such files and generates multiple .tfrecord files.
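A minimal sketch of the splitting step, assuming a hypothetical `split_file` helper (not the repo's actual function); each chunk can then be converted to a .tfrecord, e.g. with `create_pretraining_data.py` from Google's BERT repo:
```
import os

def split_file(file_path, split_path, split_line=500000):
    # Stream the cleaned file line by line so it never has to fit in RAM,
    # starting a new chunk file every `split_line` lines.
    os.makedirs(split_path, exist_ok=True)
    out, count = None, 0
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            if count % split_line == 0:
                if out:
                    out.close()
                chunk = os.path.join(split_path, f"part_{count // split_line}.txt")
                out = open(chunk, "w", encoding="utf-8")
            out.write(line)
            count += 1
    if out:
        out.close()
```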
## usage
Split the file and convert it to .tfrecord:
```
cd ./train/
python make_pretrain_bert.py [-h] [-f FILE_PATH] [-s SPLIT_LINE]
                             [-p SPLIT_PATH] [-o OUTPUT_PATH] [-l MAX_LENGTH]
                             [-b BERT_BASE_DIR]
```
- -f: path to the cleaned file
- -s: split line count, default=500000
- -p: where to save the split files
- -o: where to save the .tfrecord files
- -l: maximum sequence length
- -b: BERT base directory
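A hypothetical invocation (all paths are placeholders; `chinese_L-12_H-768_A-12` is the directory name of Google's Chinese BERT-Base release):
```
python make_pretrain_bert.py -f ./cleaned.txt -s 500000 -p ./split/ \
    -o ./tfrecord/ -l 128 -b ./chinese_L-12_H-768_A-12/
```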

Adjust the parameters in **pretrain128.sh** and **pretrain512.sh** to your own setup (judging by the names, they pretrain at maximum sequence lengths of 128 and 512, respectively):
```
sh pretrain128.sh
sh pretrain512.sh
```

# test
Test four Chinese medical NLP tasks with BERT in one line: two NER tasks, one QA task, and one RE task.
```
cd ./test/
sh run_test.sh
```
Tasks include [CCKS2019NER](https://www.biendata.com/competition/CCKS2019_1/), [cMedQA2](/~https://github.com/zhangsheng93/cMedQA2), [Tianchi\_NER](https://tianchi.aliyun.com/dataset/dataDetail?spm=5176.12281978.0.0.75926bacsx0LyL&dataId=22288), and [Tianchi\_RE](https://tianchi.aliyun.com/dataset/dataDetail?spm=5176.12281978.0.0.75926bacsx0LyL&dataId=22288).

# Results
Results comparing the original Chinese BERT with ChineseEHRBert are in preparation.

# Citation

# Author
- [Zheng Yuan](/~https://github.com/GanjinZero)
- Peng Zhao
- Chen Yu
- [Sheng Yu](http://www.stat.tsinghua.edu.cn/teambuilder/faculty/yusheng/)
README_zh.md
@@ -0,0 +1,61 @@
# ChineseEHRBert
A Chinese electronic health record BERT pretrained model.

[English Version](./README.md)

# cleaner
The cleaner converts files into the format needed for BERT pretraining, splitting the original text into lines at punctuation marks.

## usage
```
cd ./cleaner/
python parser.py [-h] [--input INPUT] [--output OUTPUT] [-s] [--log LOG]
```
- --input: input directory
- --output: output directory
- -s: whether to write the output as a single file
- --log: logging frequency

# train
The .tfrecord files must be generated before pretraining. Because the training text can be very large, the script splits it first.

## usage
Split the files and generate .tfrecord:
```
cd ./train/
python make_pretrain_bert.py [-h] [-f FILE_PATH] [-s SPLIT_LINE]
                             [-p SPLIT_PATH] [-o OUTPUT_PATH] [-l MAX_LENGTH]
                             [-b BERT_BASE_DIR]
```
- -f: path to the cleaned input files
- -s: split line count, default=500000
- -p: where to save the split files
- -o: where to save the .tfrecord files
- -l: maximum sentence length
- -b: BERT base directory (download from Google)
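If you do not have the BERT base directory yet, Google's Chinese BERT-Base release can be downloaded from the google-research/bert repo; the URL below is the 2018-11-03 release and should be verified there:
```
wget https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
unzip chinese_L-12_H-768_A-12.zip
```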

Adjust the parameters in **pretrain128.sh** and **pretrain512.sh** to your own needs:
```
sh pretrain128.sh
sh pretrain512.sh
```

# test
Test four Chinese NLP tasks in one line: two NER tasks, one RE task, and one QA task. See **./test/readme.md** for details.
```
cd ./test/
sh run_test.sh
```
The tasks are [CCKS2019NER](https://www.biendata.com/competition/CCKS2019_1/), [cMedQA2](/~https://github.com/zhangsheng93/cMedQA2), [Tianchi\_NER](https://tianchi.aliyun.com/dataset/dataDetail?spm=5176.12281978.0.0.75926bacsx0LyL&dataId=22288), and [Tianchi\_RE](https://tianchi.aliyun.com/dataset/dataDetail?spm=5176.12281978.0.0.75926bacsx0LyL&dataId=22288).

# Results
Results will compare fine-tuning with Google's original Chinese BERT against fine-tuning with ChineseEHRBert. They are in preparation.

# Citation

# Author
- [Zheng Yuan](/~https://github.com/GanjinZero)
- Peng Zhao
- Chen Yu
- [Sheng Yu](http://www.stat.tsinghua.edu.cn/teambuilder/faculty/yusheng/)