Add test
GanjinZero committed Feb 2, 2020
1 parent ee6b420 commit 4509f43
Showing 42 changed files with 10,235 additions and 88 deletions.
45 changes: 0 additions & 45 deletions NER/README.md

This file was deleted.

61 changes: 60 additions & 1 deletion README.md
@@ -1,2 +1,61 @@
# ChineseEHRBert
A Chinese EHR Bert Pretrained Model.
A Chinese Electronic Health Record Bert Pretrained Model.


[中文版](./README_zh.md)

# cleaner
The cleaner cleans txt files used to train a Chinese BERT model. It splits each original line into smaller lines, where each small line is a complete sentence ending with a punctuation mark. This is required by the next sentence prediction task during pretraining.

## usage
```
cd ./cleaner/
python parser.py [-h] [--input INPUT] [--output OUTPUT] [-s] [--log LOG]
```
- --input: input directory
- --output: output directory
- -s: write output as a single file
- --log: log frequency
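The splitting rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual parser.py: the punctuation set and the function name are assumptions.

```python
import re

# Hypothetical sketch of the cleaner's rule: each output line is one
# complete sentence ending with a (Chinese) punctuation mark.
SENT_END = "。!?!?;;"

def split_line(line: str) -> list[str]:
    """Split a raw line into complete sentences, keeping the punctuation."""
    parts = re.split(f"([{SENT_END}])", line.strip())
    sentences = []
    # parts alternates between sentence text and the punctuation that ended it
    for text, punct in zip(parts[0::2], parts[1::2]):
        if text.strip():
            sentences.append(text.strip() + punct)
    return sentences

print(split_line("患者入院。完善检查!予以治疗。"))
# → ['患者入院。', '完善检查!', '予以治疗。']
```

Each returned element is then written on its own line, giving the one-sentence-per-line files that next sentence prediction expects.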

# train
Pre-train a BERT model with the cleaned text. We first generate .tfrecord files, then pre-train with Google's code. Note that a cleaned file may be too large to load into RAM, so our script splits these files and generates multiple .tfrecord files.

## usage
Split the file and convert it to .tfrecord:
```
cd ./train/
python make_pretrain_bert.py [-h] [-f FILE_PATH] [-s SPLIT_LINE]
[-p SPLIT_PATH] [-o OUTPUT_PATH] [-l MAX_LENGTH]
[-b BERT_BASE_DIR]
```
- -f: cleaned file path
- -s: split line count, default=500000
- -p: split file save path
- -o: .tfrecord save path
- -l: max length
- -b: bert base dir
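The split step can be sketched as follows. This is a sketch under assumptions: the real make_pretrain_bert.py's chunk-naming scheme and I/O details are not shown here and may differ.

```python
from pathlib import Path

def split_file(file_path: str, split_path: str, split_line: int = 500000) -> list[Path]:
    """Split a large cleaned file into chunks of at most `split_line` lines,
    mirroring the split step before .tfrecord generation (hypothetical names)."""
    out_dir = Path(split_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks, buf, idx = [], [], 0
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            buf.append(line)
            if len(buf) >= split_line:
                # flush a full chunk to disk
                chunk = out_dir / f"part_{idx:04d}.txt"
                chunk.write_text("".join(buf), encoding="utf-8")
                chunks.append(chunk)
                buf, idx = [], idx + 1
        if buf:
            # flush the final, possibly short, chunk
            chunk = out_dir / f"part_{idx:04d}.txt"
            chunk.write_text("".join(buf), encoding="utf-8")
            chunks.append(chunk)
    return chunks
```

Each resulting chunk is then converted to a .tfrecord independently, which keeps peak memory bounded by the chunk size rather than the full corpus.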

Change the parameters in **pretrain128.sh** and **pretrain512.sh** to fit your specific requirements.
```
sh pretrain128.sh
sh pretrain512.sh
```

# test
Test four Chinese medical NLP tasks with BERT in one line: two NER tasks, one QA task, and one RE task.
```
cd ./test/
sh run_test.sh
```
Tasks include [CCKS2019NER](https://www.biendata.com/competition/CCKS2019_1/), [cMedQA2](/~https://github.com/zhangsheng93/cMedQA2), [Tianchi\_NER](https://tianchi.aliyun.com/dataset/dataDetail?spm=5176.12281978.0.0.75926bacsx0LyL&dataId=22288), [Tianchi\_RE](https://tianchi.aliyun.com/dataset/dataDetail?spm=5176.12281978.0.0.75926bacsx0LyL&dataId=22288).

# Results
Results compare fine-tuning with the original BERT against ChineseEHRBert. Results are in preparation.

# Citation

# Author
- [Zheng Yuan](/~https://github.com/GanjinZero)
- Peng Zhao
- Chen Yu
- [Sheng Yu](http://www.stat.tsinghua.edu.cn/teambuilder/faculty/yusheng/)
61 changes: 61 additions & 0 deletions README_zh.md
@@ -0,0 +1,61 @@
# ChineseEHRBert
A Chinese Electronic Health Record Bert Pretrained Model.


[English Version](./README.md)

# cleaner
The cleaner converts files into the format required for BERT pretraining, splitting the original files into lines at punctuation marks.

## usage
```
cd ./cleaner/
python parser.py [-h] [--input INPUT] [--output OUTPUT] [-s] [--log LOG]
```
- --input: input directory
- --output: output directory
- -s: whether to output a single file
- --log: log frequency

# train
The tfrecord files must be generated before pretraining. Because the training text may be very large, the script splits it first.

## usage
Split the file and generate .tfrecord:
```
cd ./train/
python make_pretrain_bert.py [-h] [-f FILE_PATH] [-s SPLIT_LINE]
[-p SPLIT_PATH] [-o OUTPUT_PATH] [-l MAX_LENGTH]
[-b BERT_BASE_DIR]
```
- -f: cleaned input file path
- -s: number of lines per split, default=500000
- -p: save path for split files
- -o: save path for .tfrecord files
- -l: maximum sentence length (in characters)
- -b: BERT base directory (download from Google)

Adjust the parameters in **pretrain128.sh** and **pretrain512.sh** to your needs.
```
sh pretrain128.sh
sh pretrain512.sh
```

# test
Test four Chinese NLP tasks in one line: two NER tasks, one RE task, and one QA task. See **./test/readme.md** for details.
```
cd ./test/
sh run_test.sh
```
The following tasks are included: [CCKS2019NER](https://www.biendata.com/competition/CCKS2019_1/), [cMedQA2](/~https://github.com/zhangsheng93/cMedQA2), [Tianchi\_NER](https://tianchi.aliyun.com/dataset/dataDetail?spm=5176.12281978.0.0.75926bacsx0LyL&dataId=22288), [Tianchi\_RE](https://tianchi.aliyun.com/dataset/dataDetail?spm=5176.12281978.0.0.75926bacsx0LyL&dataId=22288).

# Results
Results include fine-tuning with both Google's pretrained Chinese BERT and ChineseEHRBert. Results are in preparation.

# Citation

# Author
- [Zheng Yuan](/~https://github.com/GanjinZero)
- Peng Zhao
- Chen Yu
- [Sheng Yu](http://www.stat.tsinghua.edu.cn/teambuilder/faculty/yusheng/)
41 changes: 0 additions & 41 deletions albert/make_pretrain_albert.py

This file was deleted.

2 changes: 1 addition & 1 deletion cleaner/README.md
@@ -15,4 +15,4 @@ python3 parser.py -h  # view usage

During and at the end of processing, the number of valid lines processed so far is displayed (valid meaning non-empty, i.e., lines containing characters other than spaces and newlines); use --log to adjust the output frequency.

Run with -h for detailed usage.
