Commit 4509f43 (1 parent: ee6b420)
Showing 42 changed files with 10,235 additions and 88 deletions.
README.md
@@ -1,2 +1,61 @@
# ChineseEHRBert
A Chinese Electronic Health Record (EHR) BERT pretrained model.

[中文版](./README_zh.md)

# cleaner
The cleaner prepares raw .txt files for pretraining a Chinese BERT model. It splits each line of the original text into smaller lines, each of which is a complete sentence ending in a punctuation mark; this format is required by the next-sentence-prediction pretraining task.
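As an illustration only, a minimal Python sketch of this punctuation-based splitting; `split_sentences` is a hypothetical name, and the actual logic lives in `parser.py`:
```
import re

def split_sentences(line):
    # Split after common Chinese and ASCII sentence-ending punctuation,
    # keeping the punctuation mark attached to its sentence.
    parts = re.split(r"(?<=[。!?!?;;])", line.strip())
    return [p for p in parts if p]

# One long record line becomes two complete sentences:
print(split_sentences("患者三天前发热。伴咳嗽,无咳痰。"))
# ['患者三天前发热。', '伴咳嗽,无咳痰。']
```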
## usage
```
cd ./cleaner/
python parser.py [-h] [--input INPUT] [--output OUTPUT] [-s] [--log LOG]
```
- --input: input directory
- --output: output directory
- -s: write the output as a single file
- --log: logging frequency
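For example, a hypothetical invocation (the directory names are placeholders, not paths from this repo):
```
python parser.py --input ./raw_txt/ --output ./cleaned/ -s --log 10000
```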

# train
Pre-train a BERT model on the cleaned text. The text must first be converted to .tfrecord files, which are then fed to Google's BERT pretraining code. Note that a cleaned file may be too large to fit in RAM, so our script splits such files and generates multiple .tfrecord files.
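A minimal sketch of the splitting step, assuming a hypothetical `split_file` helper (not the repo's actual function); each chunk can then be converted to a .tfrecord, e.g. with `create_pretraining_data.py` from Google's BERT repo:
```
import os

def split_file(file_path, split_path, split_line=500000):
    # Stream the cleaned file line by line so it never has to fit in RAM,
    # starting a new chunk file every `split_line` lines.
    os.makedirs(split_path, exist_ok=True)
    out, count = None, 0
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            if count % split_line == 0:
                if out:
                    out.close()
                chunk = os.path.join(split_path, f"part_{count // split_line}.txt")
                out = open(chunk, "w", encoding="utf-8")
            out.write(line)
            count += 1
    if out:
        out.close()
```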
## usage
Split the file and convert it to .tfrecord:
```
cd ./train/
python make_pretrain_bert.py [-h] [-f FILE_PATH] [-s SPLIT_LINE]
                             [-p SPLIT_PATH] [-o OUTPUT_PATH] [-l MAX_LENGTH]
                             [-b BERT_BASE_DIR]
```
- -f: path to the cleaned file
- -s: split line count, default=500000
- -p: where to save the split files
- -o: where to save the .tfrecord files
- -l: maximum sequence length
- -b: BERT base directory
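A hypothetical invocation (all paths are placeholders; `chinese_L-12_H-768_A-12` is the directory name of Google's Chinese BERT-Base release):
```
python make_pretrain_bert.py -f ./cleaned.txt -s 500000 -p ./split/ \
    -o ./tfrecord/ -l 128 -b ./chinese_L-12_H-768_A-12/
```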

Adjust the parameters in **pretrain128.sh** and **pretrain512.sh** to your own setup (judging by the names, they pretrain at maximum sequence lengths of 128 and 512, respectively):
```
sh pretrain128.sh
sh pretrain512.sh
```

# test
Test four Chinese medical NLP tasks with BERT in one line: two NER tasks, one QA task, and one RE task.
```
cd ./test/
sh run_test.sh
```
Tasks include [CCKS2019NER](https://www.biendata.com/competition/CCKS2019_1/), [cMedQA2](/~https://github.com/zhangsheng93/cMedQA2), [Tianchi\_NER](https://tianchi.aliyun.com/dataset/dataDetail?spm=5176.12281978.0.0.75926bacsx0LyL&dataId=22288), and [Tianchi\_RE](https://tianchi.aliyun.com/dataset/dataDetail?spm=5176.12281978.0.0.75926bacsx0LyL&dataId=22288).

# Results
Results comparing the original Chinese BERT with ChineseEHRBert are in preparation.

# Citation

# Author
- [Zheng Yuan](/~https://github.com/GanjinZero)
- Peng Zhao
- Chen Yu
- [Sheng Yu](http://www.stat.tsinghua.edu.cn/teambuilder/faculty/yusheng/)
README_zh.md
@@ -0,0 +1,61 @@
# ChineseEHRBert
A Chinese electronic health record BERT pretrained model.

[English Version](./README.md)

# cleaner
The cleaner converts files into the format needed for BERT pretraining, splitting the original text into lines at punctuation marks.

## usage
```
cd ./cleaner/
python parser.py [-h] [--input INPUT] [--output OUTPUT] [-s] [--log LOG]
```
- --input: input directory
- --output: output directory
- -s: whether to write the output as a single file
- --log: logging frequency

# train
The .tfrecord files must be generated before pretraining. Because the training text can be very large, the script splits it first.

## usage
Split the files and generate .tfrecord:
```
cd ./train/
python make_pretrain_bert.py [-h] [-f FILE_PATH] [-s SPLIT_LINE]
                             [-p SPLIT_PATH] [-o OUTPUT_PATH] [-l MAX_LENGTH]
                             [-b BERT_BASE_DIR]
```
- -f: path to the cleaned input files
- -s: split line count, default=500000
- -p: where to save the split files
- -o: where to save the .tfrecord files
- -l: maximum sentence length
- -b: BERT base directory (download from Google)
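If you do not have the BERT base directory yet, Google's Chinese BERT-Base release can be downloaded from the google-research/bert repo; the URL below is the 2018-11-03 release and should be verified there:
```
wget https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
unzip chinese_L-12_H-768_A-12.zip
```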

Adjust the parameters in **pretrain128.sh** and **pretrain512.sh** to your own needs:
```
sh pretrain128.sh
sh pretrain512.sh
```

# test
Test four Chinese NLP tasks in one line: two NER tasks, one RE task, and one QA task. See **./test/readme.md** for details.
```
cd ./test/
sh run_test.sh
```
The tasks are [CCKS2019NER](https://www.biendata.com/competition/CCKS2019_1/), [cMedQA2](/~https://github.com/zhangsheng93/cMedQA2), [Tianchi\_NER](https://tianchi.aliyun.com/dataset/dataDetail?spm=5176.12281978.0.0.75926bacsx0LyL&dataId=22288), and [Tianchi\_RE](https://tianchi.aliyun.com/dataset/dataDetail?spm=5176.12281978.0.0.75926bacsx0LyL&dataId=22288).

# Results
Results will compare fine-tuning with Google's original Chinese BERT against fine-tuning with ChineseEHRBert. They are in preparation.

# Citation

# Author
- [Zheng Yuan](/~https://github.com/GanjinZero)
- Peng Zhao
- Chen Yu
- [Sheng Yu](http://www.stat.tsinghua.edu.cn/teambuilder/faculty/yusheng/)