Error in How to eval KAZU-NER model #1

Open · Kik099 opened this issue Aug 19, 2024 · 19 comments
@Kik099 commented Aug 19, 2024

I got this error "FileNotFoundError: Unable to find '\Users\kkiko\KAZU-NER-exp\BC5CDR_test\dev.prob_conll' at C:\Users\kkiko\KAZU-NER-exp\BC5CDR_test\prob_conll".

I found that following the steps in the README.md does not create the file dev.prob_conll.

What do I need to do?

@wonjininfo (Member)

Hi @Kik099

Thank you for your interest in our work.

The label2prob.py script can create any of the splits: the train.prob_conll, dev.prob_conll, and test.prob_conll files.

However, I just noticed that the README only shows the sample command for the test split; the same command applies to all splits.

```bash
export DATA_DIR=${HOME}/KAZU-NER-exp/BC5CDR_test  # Please use the absolute path to avoid unexpected errors
```

Once DATA_DIR is set, you can run the following command:

```bash
ls ${DATA_DIR}
```

This should display the train.tsv, dev.tsv, and test.tsv files.

Please run the following command for each of the train.tsv, dev.tsv, and test.tsv splits:

```bash
export IS_IO=""  # Set this if you are using IO tagging.

python label2prob.py --label ${DATA_DIR}/labels.txt --file_path ${DATA_DIR}/test.tsv --output_path ${DATA_DIR}/test.prob_conll ${IS_IO}
python label2prob.py --label ${DATA_DIR}/labels.txt --file_path ${DATA_DIR}/train.tsv --output_path ${DATA_DIR}/train.prob_conll ${IS_IO}
python label2prob.py --label ${DATA_DIR}/labels.txt --file_path ${DATA_DIR}/dev.tsv --output_path ${DATA_DIR}/dev.prob_conll ${IS_IO}
```
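
After these three commands finish, a quick listing should confirm the converted files (the comment shows the expected additions; exact contents may vary):

```bash
ls ${DATA_DIR}
# labels.txt  train.tsv  dev.tsv  test.tsv
# train.prob_conll  dev.prob_conll  test.prob_conll
```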

Please let me know if this does not work. I will update the README as well.

Thanks,
WonJin

@Kik099 commented Aug 20, 2024

Hi @wonjininfo

Your suggestion fixed that error. As you can see, it started running:

```text
Running tokenizer on prediction dataset #0: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 2.10ba/s]
Running tokenizer on prediction dataset #3: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 2.61ba/s]
Running tokenizer on prediction dataset #1: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 2.23ba/s]
Running tokenizer on prediction dataset #2: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 2.24ba/s]
Running tokenizer on prediction dataset #6: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 2.57ba/s]
Running tokenizer on prediction dataset #5: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 2.24ba/s]
Running tokenizer on prediction dataset #4: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 2.14ba/s]
Running tokenizer on prediction dataset #7: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 2.33ba/s]
```

But a few minutes later an error appeared. I put the output in the attached file, so it is easy to see.

Do you know what I need to do?

error_file.json

And thanks for your help.

@wonjininfo (Member)

Hi @Kik099

Thank you for sharing the log file.

I found it a bit challenging to trace the issue at the moment, but I suspect it may be related to the number of labels. Let's start with the simplest approach: if you're only trying to evaluate the model, and not training it, we can copy the test.prob_conll file to create train.prob_conll and dev.prob_conll. This gives us three identical files with different names. Please try running the process again with these files.
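
For example (assuming DATA_DIR is set as in the earlier steps):

```bash
cp ${DATA_DIR}/test.prob_conll ${DATA_DIR}/train.prob_conll
cp ${DATA_DIR}/test.prob_conll ${DATA_DIR}/dev.prob_conll
```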

Additionally, could you please share the exact command line you used in the shell and the version of transformers you're working with? This will help me replicate and trace the error.

If you're uncomfortable sharing this information here (as this is a publicly open space), feel free to email me at
wonjin.info (at) gmail.com

@Kik099 commented Aug 21, 2024

Hi @wonjininfo,

I followed your instructions but encountered the same error even with three identical files under different names.

Regarding the libraries I have installed, here is the list:

```text
datasets 1.18.3 pypi_0 pypi
torch 1.8.1 pypi_0 pypi
transformers 4.30.2 pypi_0 pypi
seqeval 1.2.2 pypi_0 pypi
accelerate 0.20.3 pypi_0 pypi
aiohttp 3.8.6 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
async-timeout 4.0.3 pypi_0 pypi
asynctest 0.13.0 pypi_0 pypi
attrs 24.2.0 pypi_0 pypi
ca-certificates 2024.7.2 haa95532_0
certifi 2024.7.4 pypi_0 pypi
charset-normalizer 3.3.2 pypi_0 pypi
colorama 0.4.6 pypi_0 pypi
dill 0.3.7 pypi_0 pypi
filelock 3.12.2 pypi_0 pypi
frozenlist 1.3.3 pypi_0 pypi
fsspec 2023.1.0 pypi_0 pypi
huggingface-hub 0.16.4 pypi_0 pypi
idna 3.7 pypi_0 pypi
importlib-metadata 6.7.0 pypi_0 pypi
joblib 1.3.2 pypi_0 pypi
multidict 6.0.5 pypi_0 pypi
multiprocess 0.70.15 pypi_0 pypi
numpy 1.21.6 pypi_0 pypi
openssl 1.1.1w h2bbff1b_0
packaging 24.0 pypi_0 pypi
pandas 1.3.5 pypi_0 pypi
pip 22.3.1 py37haa95532_0
psutil 6.0.0 pypi_0 pypi
pyarrow 12.0.1 pypi_0 pypi
python 3.7.13 h6244533_1
python-dateutil 2.9.0.post0 pypi_0 pypi
pytz 2024.1 pypi_0 pypi
pyyaml 6.0.1 pypi_0 pypi
regex 2024.4.16 pypi_0 pypi
requests 2.31.0 pypi_0 pypi
safetensors 0.4.4 pypi_0 pypi
scikit-learn 1.0.2 pypi_0 pypi
scipy 1.7.3 pypi_0 pypi
setuptools 65.6.3 py37haa95532_0
six 1.16.0 pypi_0 pypi
sqlite 3.45.3 h2bbff1b_0
threadpoolctl 3.1.0 pypi_0 pypi
tokenizers 0.13.3 pypi_0 pypi
tqdm 4.66.5 pypi_0 pypi
typing-extensions 4.7.1 pypi_0 pypi
urllib3 2.0.7 pypi_0 pypi
vc 14.40 h2eaa2aa_0
vs2015_runtime 14.40.33807 h98bb1dd_0
wget 3.2 pypi_0 pypi
wheel 0.38.4 py37haa95532_0
wincertstore 0.2 py37haa95532_2
xxhash 3.5.0 pypi_0 pypi
yarl 1.9.4 pypi_0 pypi
zipp 3.15.0 pypi_0 pypi
```

When I attempted to install the following libraries:

```text
torch==1.8.2
transformers==4.9.2
datasets==1.18.3
seqeval>=1.2.2
```

I encountered an error indicating that the installation of transformers==4.9.2 and datasets==1.18.3 led to a conflict.

Here's the error message:

```text
pip install torch==1.8.2 transformers==4.9.2 datasets==1.18.3 seqeval>=1.2.2
ERROR: Could not find a version that satisfies the requirement torch==1.8.2 (from versions: 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2, 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1)
ERROR: No matching distribution found for torch==1.8.2
```

To resolve this, I installed torch==1.8.1 instead. However, this led to another error:

```text
pip install torch==1.8.1 transformers==4.9.2 datasets==1.18.3 seqeval>=1.2.2
ERROR: Cannot install datasets==1.18.3 and transformers==4.9.2 because these package versions have conflicting dependencies.
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
```

Please let me know how I can resolve this issue.

Best regards,
Rodrigo Saraiva

@wonjininfo (Member)

Thanks for sharing. I will work on that and get back to you soon.

@Kik099 commented Aug 22, 2024

I have another question: I already trained the model. How do I run it now? I am writing a thesis and will discuss your article in it.

I have the following files in the directory "_tmp\output\MultiLabelNER-test":

```text
all_results.json
config.json
multi_label_seq_eval (folder)
pytorch_model.bin
special_tokens_map.json
tokenizer.json
tokenizer_config.json
trainer_state.json
training_args.bin
train_results.json
vocab.txt
```

@wonjininfo (Member) commented Aug 22, 2024

Hi @Kik099,

> I already trained the model.

Does this mean that the previous error is no longer occurring?

Regarding your other question:

> How do I run it now?

Could you please clarify what you mean? I want to make sure I understand your question correctly.

@Kik099 commented Aug 22, 2024

Hi @wonjininfo

> Does this mean that the previous error is no longer occurring?

Unfortunately, the error is still appearing during the evaluation phase.

> Could you please clarify what you mean?

Additionally, after training the model, I would like to use it for token classification. How can I input text to the model to obtain token classifications?

@Kik099 commented Aug 26, 2024

Hi @wonjininfo
Did you understand what I explained?

Best,

Rodrigo Saraiva

@wonjininfo (Member)

Hi Rodrigo,

I spent a few hours resolving the dependency issues and identified some points that no longer worked and needed updating. I have updated them in the README.


Recommended solution: Python v3.7.13 is suggested for compatibility. Install:

```bash
pip install transformers==4.10.3 datasets==1.18.3 seqeval==1.2.2
```


If you use a higher version of Python, you will need a different version of tokenizers.

If you encounter the error `error: can't find Rust compiler` while installing transformers, this may be related to your Python version.
As noted in this comment, older versions of tokenizers are not compatible with newer versions of Python.


Alternative solution (not recommended): Alternatively, you can use Python v3.10.12 and install the libraries using the following commands:

```bash
# Tested torch version: torch==2.1.0, CUDA 12.1
# pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121

pip install transformers==4.16.2 tokenizers==0.12 datasets==1.18.3 seqeval==1.2.2
```

After that, I followed my README and, for testing purposes, copied the test file over the development file using `cp ${DATA_DIR}/test.prob_conll ${DATA_DIR}/dev.prob_conll`.
This let me run the code without any errors, but unfortunately, I couldn't reproduce your issue.
I tested it on Ubuntu 22.04.4 LTS with Python 3.10.12. The problem might be due to differences in the OS, but I'm not entirely sure.

@wonjininfo (Member)

> How can I input text to the model to obtain token classifications?

To input text into the model for token classification, you need to preprocess your text to match the format of the test.prob_conll file.

1. Preprocess your text:
   - Ensure your text has the same format as the test.prob_conll file.
   - Use this file as a replacement for `--test_file ${DATA_DIR}/test.prob_conll`.
2. Specify your model:
   - If you are not using our pre-trained model ("dmis-lab/KAZU-NER-module-distil-v1.0"), set the location of your model using the $BERT_MODEL variable: `--model_name_or_path $BERT_MODEL`.
3. Update labels if needed:
   - If your model is trained on a different set of entities, make sure to update ${DATA_DIR}/labels.txt accordingly.

A sketch of how these pieces fit together follows below.
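
As referenced above, a rough invocation might look like the following. Note that the entry-point name run_eval.py is a placeholder, not the repo's actual script; use whatever evaluation command the README specifies, keeping its other flags as documented there:

```bash
export DATA_DIR=${HOME}/KAZU-NER-exp/BC5CDR_test
export BERT_MODEL="dmis-lab/KAZU-NER-module-distil-v1.0"  # or the path to your own model

# "run_eval.py" is a placeholder name for the repo's evaluation script:
python run_eval.py \
  --model_name_or_path ${BERT_MODEL} \
  --test_file ${DATA_DIR}/test.prob_conll  # replace with your preprocessed file
```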

@Kik099 commented Aug 29, 2024

Hi @wonjininfo

Thank you so much for your reply—I really appreciate it.

To summarize, it seems that there was a demo website for testing KAZU (http://kazu.korea.ac.kr/), which is currently not working. On that website, we could input a simple phrase and obtain the token classifications.

To achieve this in the KAZU project (/~https://github.com/AstraZeneca/KAZU), we can run the following code:

```python
# Note: import paths may differ across KAZU versions.
from hydra.utils import instantiate
from kazu.data.data import Document
from kazu.pipeline import Pipeline

def kazu_test(cfg):
    pipeline: Pipeline = instantiate(cfg.Pipeline)
    text = "EGFR mutations are often implicated in lung cancer"
    doc = Document.create_simple_document(text)
    pipeline([doc])
    print(f"{doc.get_entities()}")
```

Running this script produces the token classifications for the words in the text.

Are you saying that to achieve this in this project, we need to convert the phrase to the test.prob_conll format?

If so, which code should I run—the evaluation code?

@wonjininfo (Member)

For the demo website, it is currently managed by my former colleagues, who are postgraduate students at Korea University. I have asked them to reboot the server.

Regarding the KAZU project, that's a good point; I was primarily focused on this repository. You can certainly use the KAZU repository (/~https://github.com/AstraZeneca/KAZU), but please note that KAZU is designed for industrial use. It includes additional matching algorithms using ontologies and various other features, including preprocessing (from plain text to final output in JSON format). Some of these features might not be suitable for other domains (i.e., non-biomedical/clinical domains), and removing them could be challenging due to the large codebase.

In contrast, this repository focuses exclusively on the core module, emphasizing the neural-model aspect of NER (without linking). It is more academically oriented and does not offer end-to-end processing from plain text to final output, so users need to handle preprocessing and post-processing themselves.

Our label2prob.py script converts CoNLL format to our prob_conll format. Once you have your data in CoNLL format, you can use this script to convert it into the required input format. For converting plain text to CoNLL format, other researchers may have shared scripts online, but we did not include such scripts here, as our full pipeline is intended to be used via the KAZU repository.

@Kik099 commented Sep 3, 2024

Hi @wonjininfo

So you are saying that I cannot use this model to predict plain text directly; to do that, I need to have the plain text in the .prob_conll format?

If that is the case, how can I get plain text into that format?
Do I need to set all token probability values to 0.0?

Or did I understand it wrong?

@wonjininfo (Member)

You can use this model with the training and evaluation code to predict any text, but you'll need to write or find some preprocessing code to convert plain text into a CoNLL-like format.

So, the short answer is no: you can't use it as-is. You'll need to write a few dozen lines of code to get it working. I haven't used these myself, but you might find these resources useful: spacy-conll or this Stack Overflow answer. Still, a few tweaks will be required.
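
As a rough illustration of the preprocessing, here is a minimal sketch (not code from this repo): it whitespace-tokenizes plain text and writes a CoNLL-style file with placeholder O labels, which label2prob.py could then convert to prob_conll. The tab separator and the dummy labels are assumptions; match whatever format your train.tsv files actually use.

```python
# Minimal sketch: plain text -> CoNLL-style file with dummy "O" labels.
# Whitespace tokenization is naive; a real pipeline should use a proper
# tokenizer (e.g., via the spacy-conll package mentioned above).
def text_to_conll(text: str, output_path: str) -> None:
    with open(output_path, "w", encoding="utf-8") as f:
        for sentence in text.split(". "):   # crude sentence splitting
            for token in sentence.split():  # crude word tokenization
                f.write(f"{token}\tO\n")    # token + placeholder label
            f.write("\n")                   # blank line ends a sentence

text_to_conll("EGFR mutations are often implicated in lung cancer", "pred.tsv")
```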

@Kik099 commented Oct 22, 2024

Hi @wonjininfo

I have trained the model.
Can I now run the evaluation on the trained model? If yes, how?
Can I point the BERT model to the folder '_tmp/output/MultiLabelNER-test'?

Will this be enough?
I did that and this error appeared:
errorEval.json


@Kik099 commented Nov 23, 2024

Hi @wonjininfo, sorry, but can you help me, please?

@wonjininfo (Member)

Sorry, I am not sure I understand your question, but I tried to analyze your error log.

If you are asking whether you can use your own fine-tuned model instead of the BERT model, then yes: put its location in `export BERT_MODEL="<HERE>"`.
Your log seems to show that you are doing this correctly.
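
For example (a sketch assuming the training output directory from your earlier message; adjust the path to wherever your checkpoint actually lives):

```bash
export BERT_MODEL="${HOME}/KAZU-NER-exp/_tmp/output/MultiLabelNER-test"
```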

The error I see in your log file, errorEval.json, is that your report might have a different structure than mine.

As you can see in this API document:
https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.classification_report.html
it is a dictionary (when called with output_dict=True).

In my case the key is "_", but this might differ if you are using different settings or a different version of sklearn. You should check the type of your report variable, as well as its keys, and modify evaluate_multi_label.py accordingly:

```python
total_report[entity_type] = {
    "precision": report['_']["precision"],
    "recall": report['_']["recall"],
    "f1": report['_']["f1-score"],
    "number": report['_']["support"],
    "accuracy": accuracy_score(y_true=ref_for_type, y_pred=pred_for_type),
}
```

I suggest you use pdb to check it:
https://docs.python.org/3/library/pdb.html
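
A quick way to do that (a debugging sketch, not code from the repo; place it in evaluate_multi_label.py just before the failing lookup):

```python
import pdb; pdb.set_trace()  # drops into the debugger at this point
# At the (Pdb) prompt, inspect the structure:
#   type(report)     -> confirm it is a dict (requires output_dict=True)
#   report.keys()    -> find the actual key to use in place of '_'
```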

@Kik099 commented Nov 23, 2024

Thanks for the answer, @wonjininfo.

I will try what you suggested; thanks a lot for your help.
I will let you know once I have tested it.
