This repo contains code and data for the paper Pun Generation with Surprise.
- Python 3.6
- Pytorch 0.4
conda install pytorch=0.4.0 torchvision -c pytorch
- Fairseq(-py)
git clone -b pungen https://github.com/hhexiy/fairseq.git
cd fairseq
pip install -r requirements.txt
python setup.py build develop
- Pretrained WikiText-103 model from Fairseq
curl --create-dirs --output models/wikitext/model https://dl.fbaipublicfiles.com/fairseq/models/wiki103_fconv_lm.tar.bz2
tar xjf models/wikitext/model -C models/wikitext
rm models/wikitext/model
We approximate relatedness between a pair of words with a long-distance skip-gram model trained on BookCorpus sentences.
The original BookCorpus data is parsed by scripts/preprocess_raw_text.py, and you can see a sample file in sample_data/bookcorpus/raw/train.txt.
Preprocess BookCorpus data:
python -m pungen.wordvec.preprocess --data-dir data/bookcorpus/skipgram \
--corpus data/bookcorpus/raw/train.txt \
--min-dist 5 --max-dist 10 --threshold 80 \
--vocab data/bookcorpus/skipgram/dict.txt
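To illustrate what "long-distance" means here, the sketch below extracts skip-gram training pairs whose positions are between a minimum and maximum distance apart. It is only an illustration under assumptions: the function name long_distance_pairs is hypothetical, and reading --min-dist and --max-dist as an inclusive token-distance range is a guess at the flags' meaning, not a description of pungen.wordvec.preprocess.

```python
from typing import Iterator, List, Tuple


def long_distance_pairs(
    tokens: List[str], min_dist: int = 5, max_dist: int = 10
) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs whose positions are min_dist..max_dist apart.

    Assumption: both bounds are inclusive and context is taken on both sides.
    """
    for i, center in enumerate(tokens):
        for dist in range(min_dist, max_dist + 1):
            # Look to the left and to the right of the center word.
            for j in (i - dist, i + dist):
                if 0 <= j < len(tokens):
                    yield center, tokens[j]


sentence = (
    "the butcher sharpened his knife before he went to meet the customers".split()
)
pairs = list(long_distance_pairs(sentence, min_dist=5, max_dist=10))
```

Adjacent words such as "butcher" and "sharpened" are skipped, while distant words such as "butcher" and "customers" form a pair, so the resulting embeddings would tend to reflect topical relatedness rather than local syntax.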
Train skip-gram model:
python -m pungen.wordvec.train --weights --cuda --data data/bookcorpus/skipgram/train.bin \
--save_dir models/bookcorpus/skipgram \
--mb 3500 --epoch 15 \
--vocab data/bookcorpus/skipgram/dict.txt
The edit model takes a word and a template (masked sentence) and combines the two coherently.
Preprocess data:
for split in train valid; do \
PYTHONPATH=. python scripts/make_src_tgt_files.py -i data/bookcorpus/raw/$split.txt \
-o data/bookcorpus/edit/$split --delete-frac 0.5 --window-size 2 --random-window-size; \
done
python -m pungen.preprocess --source-lang src --target-lang tgt \
--destdir data/bookcorpus/edit/bin/data --thresholdtgt 80 --thresholdsrc 80 \
--validpref data/bookcorpus/edit/valid \
--trainpref data/bookcorpus/edit/train \
--workers 8
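As a rough picture of the training data for the edit model, the sketch below builds one (word, template, target) example by masking a window around a chosen word. Everything in it is an assumption made for illustration: the function name make_edit_example, the MASK symbol, and the way the window is applied are hypothetical and may differ from what scripts/make_src_tgt_files.py actually does.

```python
from typing import List, Tuple

MASK = "<placeholder>"  # hypothetical mask symbol, not taken from the repo


def make_edit_example(
    tokens: List[str], index: int, window_size: int = 2
) -> Tuple[str, List[str], List[str]]:
    """Return (word, template, target) for the word at position `index`.

    The template replaces the word and up to `window_size` neighbours on each
    side with a single mask token; the target is the original sentence.
    """
    start = max(0, index - window_size)
    end = min(len(tokens), index + window_size + 1)
    template = tokens[:start] + [MASK] + tokens[end:]
    return tokens[index], template, list(tokens)


word, template, target = make_edit_example(
    "she paid the bill at the front desk".split(), index=3, window_size=1
)
```

In this toy setup the model would see the word "bill" together with the masked template as its source and learn to produce the full sentence as its target.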
Training:
python -m pungen.train data/bookcorpus/edit/bin/data -a lstm \
--source-lang src --target-lang tgt \
--task edit --insert deleted --combine token \
--criterion cross_entropy \
--encoder lstm --decoder-attention True \
--optimizer adagrad --lr 0.01 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
--clip-norm 5 --max-epoch 50 --max-tokens 7000 --no-epoch-checkpoints \
--save-dir models/bookcorpus/edit/deleted --no-progress-bar --log-interval 5000
Build a sentence retriever based on BookCorpus. The input file should contain one tokenized sentence per line.
python -m pungen.retriever --doc-file data/bookcorpus/raw/sent.tokenized.txt \
--path models/bookcorpus/retriever.pkl --overwrite
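For intuition about what a sentence retriever over a one-sentence-per-line corpus provides, here is a minimal inverted-index sketch. It is not the implementation in pungen.retriever: the class TinyRetriever and its query method are hypothetical, and the real retriever may rank or filter sentences differently.

```python
from collections import defaultdict
from typing import Dict, List, Set


class TinyRetriever:
    """Hypothetical stand-in for a sentence retriever; not the repo's class."""

    def __init__(self, sentences: List[str]) -> None:
        self.sentences = sentences
        self.index: Dict[str, Set[int]] = defaultdict(set)
        for sent_id, sentence in enumerate(sentences):
            # Sentences are assumed to be pre-tokenized and space-separated.
            for token in sentence.split():
                self.index[token].add(sent_id)

    def query(self, word: str) -> List[str]:
        """Return all sentences containing `word`, in corpus order."""
        return [self.sentences[i] for i in sorted(self.index.get(word, ()))]


corpus = [
    "the hare ran across the field",
    "she brushed her hair",
    "a hare is not a rabbit",
]
retriever = TinyRetriever(corpus)
hits = retriever.query("hare")
```

Because lookups rely on whitespace splitting, the index only matches words correctly when the input is already tokenized, which is why the input format matters.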
Compute the correlation between local-global surprise scores and human funniness ratings.
We provide our annotated dataset in data/funniness_annotation:
- analysis_pun_scores.txt: sentences annotated with funniness scores from 1 to 5.
- analysis_zscored_pun_scores.txt: the same data where scores are standardized for each annotator.
python eval_scoring_func.py --human-eval data/funniness_annotation/analysis_zscored_pun_scores.txt \
--lm-path models/wikitext/wiki103.pt --word-counts-path models/wikitext/dict.txt \
--skipgram-model data/bookcorpus/skipgram/dict.txt \
models/bookcorpus/skipgram/sgns-e15.pt \
--outdir results/pun-analysis/analysis_zscored \
--features grammar ratio --analysis --ignore-cache
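The per-annotator standardization of the funniness scores can be pictured with the sketch below, which converts each rating to a z-score using that annotator's own mean and standard deviation. The (annotator, sentence, score) tuple layout and the function name zscore_by_annotator are hypothetical, and the use of the population standard deviation is an assumption; the provided files may have been produced differently.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Rating = Tuple[str, str, float]  # (annotator_id, sentence_id, score); assumed layout


def zscore_by_annotator(ratings: List[Rating]) -> List[Rating]:
    """Standardize each score with the mean and std of its own annotator."""
    by_annotator: Dict[str, List[float]] = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)
    stats = {a: (mean(s), pstdev(s)) for a, s in by_annotator.items()}

    standardized = []
    for annotator, sentence, score in ratings:
        mu, sigma = stats[annotator]
        # An annotator who gave identical scores everywhere gets z = 0.
        z = (score - mu) / sigma if sigma > 0 else 0.0
        standardized.append((annotator, sentence, z))
    return standardized


ratings = [
    ("A", "s1", 1.0), ("A", "s2", 3.0), ("A", "s3", 5.0),
    ("B", "s1", 4.0), ("B", "s2", 5.0), ("B", "s3", 3.0),
]
standardized = zscore_by_annotator(ratings)
```

Standardizing this way removes differences in how generously individual annotators use the 1 to 5 scale, so a strict and a lenient annotator become comparable before correlating with model scores.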
We generate puns given a pair consisting of a pun word and an alternative word.
We support pun generation with the following methods, specified by the --system argument.
- rule: the SURGEN method described in the paper
- rule+neural: in the last step of SURGEN, use a neural combiner to edit the topic words
- retrieve: retrieve a sentence containing the pun word
- retrieve+swap: retrieve a sentence containing the alternative word and replace it with the pun word
For arguments controlling the neural generator (e.g., --beam, --nbest), see fairseq.options. All results and logs are saved in outdir.
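The retrieve+swap baseline is simple enough to sketch in a few lines: find a sentence containing the alternative word, then substitute the pun word. The function retrieve_and_swap below is a hypothetical illustration rather than the code in generate_pun.py; in particular, taking the first match and replacing every occurrence of the alternative word are assumptions.

```python
from typing import List, Optional


def retrieve_and_swap(
    sentences: List[str], alter_word: str, pun_word: str
) -> Optional[str]:
    """Return the first sentence containing `alter_word`, with it swapped out.

    Every occurrence of `alter_word` in that sentence is replaced by `pun_word`.
    Returns None when no sentence contains the alternative word.
    """
    for sentence in sentences:
        tokens = sentence.split()
        if alter_word in tokens:
            return " ".join(pun_word if t == alter_word else t for t in tokens)
    return None


corpus = [
    "the dog slept all day",
    "she brushed her hair every morning",
]
pun = retrieve_and_swap(corpus, alter_word="hair", pun_word="hare")
```

The swapped sentence keeps a context that supports the alternative word ("hair"), which is what makes the pun word ("hare") surprising where it appears.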
python generate_pun.py data/bookcorpus/edit/bin/data \
--path models/bookcorpus/edit/deleted/checkpoint_best.pt \
--beam 20 --nbest 1 --unkpen 100 \
--system rule --task edit \
--retriever-model models/bookcorpus/retriever.pkl --doc-file data/bookcorpus/raw/sent.tokenized.txt \
--lm-path models/wikitext/wiki103.pt --word-counts-path models/wikitext/dict.txt \
--skipgram-model data/bookcorpus/skipgram/dict.txt models/bookcorpus/skipgram/sgns-e15.pt \
--num-candidates 500 --num-templates 100 \
--num-topic-word 100 --type-consistency-threshold 0.3 \
--pun-words data/semeval/hetero/dev.json \
--outdir results/semeval/hetero/dev/rule \
--scorer random \
--max-num-examples 100
If you use the annotated SemEval pun dataset, please cite our paper:
@inproceedings{he2019pun,
title={Pun Generation with Surprise},
author={He He and Nanyun Peng and Percy Liang},
booktitle={North American Association for Computational Linguistics (NAACL)},
year={2019}
}