Update vocabulary load to a system-agnostic newline (#4342)

* Update vocabulary load to a system-agnostic newline

  Hello, I had a problem when training a model on a Linux machine and loading it on a Windows machine. The error was: AssertionError: OOV token not found! After some debugging, I found that vocabulary loading splits the file contents on '\n', which can give different results for Linux and Windows line endings. This PR changes the split to an OS-agnostic method of newline splitting.

* Use a regex because the splitlines algo splits on tabulation chars
* Use a regex because the splitlines algo splits on tabulation chars
* Added to changelog
* Added to changelog
* Use a pre-compiled regex

Co-authored-by: Bruno Cabral <bruno@potelo.com.br>
allenai · Jun 10, 2020 · f4d330a
1 parent 2012fea
commit f4d330a
Showing 2 changed files with 4 additions and 1 deletion.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -48,6 +48,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - To be consistent with PyTorch `IterableDataset`, `AllennlpLazyDataset` no longer implements `__len__()`.
   Previously it would always return 1.
 - Removed old tutorials, in favor of [the new AllenNLP Guide](https://guide.allennlp.org)
+- Changed the vocabulary loading to consider new lines for Windows/Linux and Mac.
 
 ## [v1.0.0rc5](https://github.com/allenai/allennlp/releases/tag/v1.0.0rc5) - 2020-05-26
 

diff --git a/allennlp/data/vocabulary.py b/allennlp/data/vocabulary.py
@@ -7,6 +7,7 @@
 import copy
 import logging
 import os
+import re
 from collections import defaultdict
 from typing import Any, Callable, Dict, Iterable, List, Optional, Set, Union, TYPE_CHECKING
 
@@ -27,6 +28,7 @@
 DEFAULT_PADDING_TOKEN = "@@PADDING@@"
 DEFAULT_OOV_TOKEN = "@@UNKNOWN@@"
 NAMESPACE_PADDING_FILE = "non_padded_namespaces.txt"
+_NEW_LINE_REGEX = re.compile(r"\n|\r\n")
 
 
 class _NamespaceDependentDefaultDict(defaultdict):
@@ -443,7 +445,7 @@ def set_from_file(
             self._token_to_index[namespace] = {}
             self._index_to_token[namespace] = {}
         with codecs.open(filename, "r", "utf-8") as input_file:
-            lines = input_file.read().split("\n")
+            lines = _NEW_LINE_REGEX.split(input_file.read())
             # Be flexible about having final newline or not
             if lines and lines[-1] == "":
                 lines = lines[:-1]
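
To illustrate the change, here is a minimal sketch (not part of the commit) comparing the three splitting strategies mentioned in the commit message. The sample file contents and the `load_tokens` helper are hypothetical; only the regex pattern and the trailing-newline handling are taken from the diff above.

```python
import re

# Same pattern as the one added to vocabulary.py in this commit.
_NEW_LINE_REGEX = re.compile(r"\n|\r\n")

# Hypothetical vocabulary file contents, one token per line.
unix_text = "@@UNKNOWN@@\nhello\nworld\n"
windows_text = unix_text.replace("\n", "\r\n")


def load_tokens(text):
    """Hypothetical helper mirroring the split logic in set_from_file."""
    lines = _NEW_LINE_REGEX.split(text)
    # Be flexible about having a final newline or not (as in the diff).
    if lines and lines[-1] == "":
        lines = lines[:-1]
    return lines


# Old behaviour: splitting on "\n" leaves a stray "\r" on every token,
# so "@@UNKNOWN@@\r" no longer equals the expected OOV token.
print(windows_text.split("\n")[0] == "@@UNKNOWN@@")

# New behaviour: both line-ending styles yield the same tokens.
print(load_tokens(unix_text) == load_tokens(windows_text))

# str.splitlines() also breaks on other characters, such as the
# vertical tab "\x0b", which would split a token that contains one.
print("tok\x0ben".splitlines())
print(load_tokens("tok\x0ben\n"))
```

The regex keeps the set of separators limited to `\n` and `\r\n`, so tokens containing other characters that `str.splitlines()` treats as line boundaries stay intact.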