This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Update vocabulary load to a system-agnostic newline (#4342)
* Update vocabulary load to a system-agnostic newline

Hello,
I had a problem training a model on a Linux machine and loading it on a Windows machine. The error was:
AssertionError: OOV token not found!

After some debugging, I found that vocabulary loading splits the file contents on '\n', which can give different results on Linux and Windows. This PR changes the split to an OS-agnostic method of newline splitting.

* Use a regex because str.splitlines() also splits on tabulation characters (for example, the vertical tab)

* Added to changelog

* Use a pre-compiled regex

Co-authored-by: Bruno Cabral <bruno@potelo.com.br>
bratao and Bruno Cabral authored Jun 10, 2020
1 parent 2012fea commit f4d330a
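
To illustrate the problem described in the commit message, here is a minimal sketch comparing three ways of splitting vocabulary text. The sample strings are hypothetical; only the regex pattern is taken from this commit.

```python
import re

# Same pattern as the one added in this commit.
_NEW_LINE_REGEX = re.compile(r"\n|\r\n")

# Hypothetical vocabulary contents written with Windows line endings.
windows_text = "@@UNKNOWN@@\r\nthe\r\ncat\r\n"

# Splitting on "\n" alone leaves a trailing "\r" on every token,
# so an exact lookup of "@@UNKNOWN@@" no longer matches.
naive_tokens = windows_text.split("\n")

# The regex consumes "\r\n" as a single separator.
regex_tokens = _NEW_LINE_REGEX.split(windows_text)

# str.splitlines() handles "\r\n" too, but it also treats characters such
# as "\x0b" (vertical tab) as line boundaries, which would break up a
# token that contains one.
token_with_vertical_tab = "a\x0bb\n"
splitlines_tokens = token_with_vertical_tab.splitlines()
regex_tokens_vt = _NEW_LINE_REGEX.split(token_with_vertical_tab)

print(naive_tokens)       # tokens keep their trailing "\r"
print(regex_tokens)       # clean tokens plus a final empty string
print(splitlines_tokens)  # the token is split in two
print(regex_tokens_vt)    # the token stays intact
```

Note that the pattern does not treat a lone "\r" as a line break, so files that use only carriage returns would still be read as a single line.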
Showing 2 changed files with 4 additions and 1 deletion.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -48,6 +48,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- To be consistent with PyTorch `IterableDataset`, `AllennlpLazyDataset` no longer implements `__len__()`.
Previously it would always return 1.
- Removed old tutorials, in favor of [the new AllenNLP Guide](https://guide.allennlp.org)
- Changed the vocabulary loading to consider new lines for Windows/Linux and Mac.

## [v1.0.0rc5](/~https://github.com/allenai/allennlp/releases/tag/v1.0.0rc5) - 2020-05-26

4 changes: 3 additions & 1 deletion allennlp/data/vocabulary.py
@@ -7,6 +7,7 @@
import copy
import logging
import os
import re
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Optional, Set, Union, TYPE_CHECKING

@@ -27,6 +28,7 @@
DEFAULT_PADDING_TOKEN = "@@PADDING@@"
DEFAULT_OOV_TOKEN = "@@UNKNOWN@@"
NAMESPACE_PADDING_FILE = "non_padded_namespaces.txt"
_NEW_LINE_REGEX = re.compile(r"\n|\r\n")


class _NamespaceDependentDefaultDict(defaultdict):
@@ -443,7 +445,7 @@ def set_from_file(
self._token_to_index[namespace] = {}
self._index_to_token[namespace] = {}
 with codecs.open(filename, "r", "utf-8") as input_file:
-    lines = input_file.read().split("\n")
+    lines = _NEW_LINE_REGEX.split(input_file.read())
     # Be flexible about having final newline or not
     if lines and lines[-1] == "":
         lines = lines[:-1]
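
The patched lines can be tried out in isolation with a sketch like the following. `read_vocab_lines` and `demo` are hypothetical helpers written for illustration and are not part of AllenNLP; only the regex and the trailing-newline check mirror the diff above. `codecs.open` opens the underlying file in binary mode, so no newline translation happens on read, which is why "\r\n" reaches the split unchanged.

```python
import codecs
import os
import re
import tempfile
from typing import List

# Same pattern as the one added in this commit.
_NEW_LINE_REGEX = re.compile(r"\n|\r\n")


def read_vocab_lines(filename: str) -> List[str]:
    """Hypothetical helper mirroring the patched lines of set_from_file."""
    with codecs.open(filename, "r", "utf-8") as input_file:
        lines = _NEW_LINE_REGEX.split(input_file.read())
    # Be flexible about having final newline or not
    if lines and lines[-1] == "":
        lines = lines[:-1]
    return lines


def demo() -> List[str]:
    """Write a small file with Windows line endings and read it back."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # Write bytes directly so the "\r\n" endings are stored as-is
        # regardless of the platform running this sketch.
        with open(path, "wb") as output_file:
            output_file.write("@@UNKNOWN@@\r\nthe\r\ncat\r\n".encode("utf-8"))
        return read_vocab_lines(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```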