add Cleanup for copying over physical review article id as the page n… #7025

Merged: 11 commits, Oct 20, 2020
24 changes: 24 additions & 0 deletions src/main/java/org/jabref/logic/importer/fetcher/DoiFetcher.java
@@ -5,6 +5,7 @@
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

import org.jabref.logic.cleanup.FieldFormatterCleanup;
import org.jabref.logic.formatter.bibtexfields.ClearFormatter;
@@ -22,6 +23,7 @@
import org.jabref.model.entry.BibEntry;
import org.jabref.model.entry.field.StandardField;
import org.jabref.model.entry.identifier.DOI;
import org.jabref.model.entry.types.StandardEntryType;
import org.jabref.model.util.DummyFileUpdateMonitor;
import org.jabref.model.util.OptionalUtil;

@@ -75,6 +77,14 @@ public Optional<BibEntry> performSearchById(String identifier) throws FetcherExc
                fetchedEntry = BibtexParser.singleFromString(bibtexString, preferences, new DummyFileUpdateMonitor());
                fetchedEntry.ifPresent(this::doPostCleanup);

                // Check if the entry is an APS journal and add the article id as the page count if page field is missing
                if (fetchedEntry.isPresent() && fetchedEntry.get().hasField(StandardField.DOI)) {
                    BibEntry entry = fetchedEntry.get();
                    if (isAPSJournal(entry, entry.getField(StandardField.DOI).get()) && !entry.hasField(StandardField.PAGES)) {
                        setPageCountToArticleId(entry, entry.getField(StandardField.DOI).get());
                    }
                }

                return fetchedEntry;
            } else {
                throw new FetcherException(Localization.lang("Invalid DOI: '%0'.", identifier));
@@ -123,4 +133,18 @@ public Optional<String> getAgency(DOI doi) throws IOException {

        return agency;
    }

    private void setPageCountToArticleId(BibEntry entry, String doiAsString) {
        String articleId = doiAsString.substring(doiAsString.lastIndexOf('.') + 1);
        entry.setField(StandardField.PAGES, articleId);
    }

    private boolean isAPSJournal(BibEntry entry, String doiAsString) {
        if (!entry.getType().equals(StandardEntryType.Article)) {
            return false;
        }
        Pattern apsJournalSuffixPattern = Pattern.compile("([\\w]+\\.)([\\w]+\\.)([\\w]+)");
Member

I am wondering whether it would make sense to check for the string "phys" as well?

It's better to extract the Pattern into a static constant; otherwise it is recompiled on every call and you lose the advantage of precompiling it.
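
A minimal sketch of that suggestion, assuming the regex from the diff stays unchanged: the pattern is compiled once into a `static final` constant instead of inside the method. The class name `ApsSuffixMatcher` and the method `matchesApsSuffix` are hypothetical and not part of JabRef.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not JabRef code): shows the Pattern hoisted into a constant.
final class ApsSuffixMatcher {

    // Compiled once when the class is initialized; Pattern instances are immutable and thread-safe.
    private static final Pattern APS_JOURNAL_SUFFIX =
            Pattern.compile("([\\w]+\\.)([\\w]+\\.)([\\w]+)");

    private ApsSuffixMatcher() {
    }

    // Returns true if the part after the last '/' has the shape word.word.word.
    static boolean matchesApsSuffix(String doiAsString) {
        String suffix = doiAsString.substring(doiAsString.lastIndexOf('/') + 1);
        return APS_JOURNAL_SUFFIX.matcher(suffix).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesApsSuffix("10.1103/physreva.102.023315"));
        // A non-APS DOI with the same three-part shape is accepted by this pattern as well.
        System.out.println(matchesApsSuffix("10.1109/ICWS.2007.59"));
    }
}
```

The behavior is the same as the local variable in the diff; only the point at which the regex is compiled changes.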

Contributor Author

Can we be sure that every Physical Review journal contains that string? For example, this APS DOI doesn't contain the string "phys": https://doi.org/10.1103/PRXQuantum.1.010001
Here is a list of all journals:
https://journals.aps.org/browse

Member

I think the best approach is to check for the prefix 10.1103/. It seems all DOIs from APS are prefixed with that.
Otherwise the regex is too broad: it would capture a lot of unrelated things and create invalid data.

Contributor Author

In that case, a regex isn't really needed: in the string 10.1103, the 1103 denotes the organization (APS in this case). That's really all that has to be checked, if I am not mistaken.

Member

I would still use the regex, since it lets you test that the DOI has the right format, so that you really know the last number is the page number. It's also a bit easier to understand than manual parsing using isDigit.

Contributor Author

So we

  1. use the organization id to check if the entry is an APS journal
  2. use the regex to check if the DOI is of the right format (https://doi.org/10.1103/[journal].[volume].[articleID])
  3. set the page field if 1 and 2 are true
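
The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ApsDoiSketch` and `articleIdFor` are hypothetical names, the input is assumed to be a bare DOI (no `https://doi.org/` part), and the claim that `10.1103` identifies APS is taken from this thread.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch (not the merged JabRef code) of the three steps listed above.
final class ApsDoiSketch {

    // Assumption from the thread: every APS DOI starts with this prefix.
    private static final String APS_PREFIX = "10.1103/";

    // Expected suffix shape: [journal].[volume].[articleID]
    private static final Pattern APS_SUFFIX = Pattern.compile("(\\w+)\\.(\\w+)\\.(\\w+)");

    private ApsDoiSketch() {
    }

    // Returns the article id if the bare DOI passes both checks, otherwise an empty Optional.
    static Optional<String> articleIdFor(String doi) {
        // Step 1: organization check via the DOI prefix.
        if (!doi.startsWith(APS_PREFIX)) {
            return Optional.empty();
        }
        // Step 2: format check on the suffix.
        Matcher matcher = APS_SUFFIX.matcher(doi.substring(APS_PREFIX.length()));
        if (!matcher.matches()) {
            return Optional.empty();
        }
        // Step 3: the third group is what the caller would copy into the pages field.
        return Optional.of(matcher.group(3));
    }

    public static void main(String[] args) {
        System.out.println(articleIdFor("10.1103/physreva.102.023315"));
        System.out.println(articleIdFor("10.1109/ICWS.2007.59"));
    }
}
```

In the fetcher, the caller would still check that the entry is an article and has no pages field before using the returned id.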

        String suffix = doiAsString.substring(doiAsString.lastIndexOf('/') + 1);
        return apsJournalSuffixPattern.matcher(suffix).matches();
    }
}
@@ -24,6 +24,7 @@ public class DoiFetcherTest {
    private BibEntry bibEntryBurd2011;
    private BibEntry bibEntryDecker2007;
    private BibEntry bibEntryIannarelli2019;
    private BibEntry bibEntryStenzel2020;

    @BeforeEach
    public void setUp() {
@@ -68,6 +69,20 @@ public void setUp() {
                .withField(StandardField.JOURNAL, "Chemical Engineering Transactions")
                .withField(StandardField.PAGES, "871-876")
                .withField(StandardField.VOLUME, "77");
        bibEntryStenzel2020 = new BibEntry();
        bibEntryStenzel2020.setType(StandardEntryType.Article);
        bibEntryStenzel2020.setCitationKey("Stenzel_2020");
        bibEntryStenzel2020.setField(StandardField.AUTHOR, "L. Stenzel and A. L. C. Hayward and U. Schollwöck and F. Heidrich-Meisner");
        bibEntryStenzel2020.setField(StandardField.JOURNAL, "Physical Review A");
        bibEntryStenzel2020.setField(StandardField.TITLE, "Topological phases in the Fermi-Hofstadter-Hubbard model on hybrid-space ladders");
        bibEntryStenzel2020.setField(StandardField.YEAR, "2020");
        bibEntryStenzel2020.setField(StandardField.MONTH, "aug");
        bibEntryStenzel2020.setField(StandardField.VOLUME, "102");
        bibEntryStenzel2020.setField(StandardField.DOI, "10.1103/physreva.102.023315");
        bibEntryStenzel2020.setField(StandardField.PUBLISHER, "American Physical Society ({APS})");
        bibEntryStenzel2020.setField(StandardField.PAGES, "023315");
        bibEntryStenzel2020.setField(StandardField.NUMBER, "2");

    }

    @Test
@@ -108,4 +123,10 @@ public void testPerformSearchNonTrimmedDOI() throws FetcherException {
        Optional<BibEntry> fetchedEntry = fetcher.performSearchById("http s://doi.org/ 10.1109 /ICWS .2007.59 ");
        assertEquals(Optional.of(bibEntryDecker2007), fetchedEntry);
    }

    @Test
    public void testAPSJournalCopiesArticleIdToPageField() throws FetcherException {
        Optional<BibEntry> fetchedEntry = fetcher.performSearchById("10.1103/physreva.102.023315");
        assertEquals(Optional.of(bibEntryStenzel2020), fetchedEntry);
    }
}