add Cleanup for copying over physcial review article id as the page n… #7025
Thanks a lot for your PR. Your idea seems to work nicely and the code looks good as well. I have only one remark about the location of the code, plus a bit of fine-tuning. Please also add a test, and have a look at the fetcher tests that are currently failing (because the fetcher now also returns page information).
/**
* adds the article ID of a journal as the page count, but only if the page field is empty
*/
public class PageFieldCleanup implements CleanupJob {
Since this is only used for the DOI fetcher, I would suggest adding this functionality as a private method in DoiFetcher instead of a new class.
if (doiAsString.isPresent() && !entry.hasField(StandardField.PAGES)) {
String articleId = new String();
int index = doiAsString.get().length() - 1;
while (Character.isDigit(doiAsString.get().charAt(index))) {
Since this issue only concerns articles from Physical Review, I would suggest using a regex based on the format outlined in #7019 (comment). In particular, make sure it only applies to DOIs from Physical Review, and not to every DOI that ends in a number.
if (doiAsString.isPresent() && !entry.hasField(StandardField.PAGES)) {
String articleId = new String();
int index = doiAsString.get().length() - 1;
I would simply use substring together with lastIndexOf(".") to get the part after the last dot.
https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/lang/String.html#lastIndexOf(java.lang.String)
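A minimal sketch of that suggestion, assuming the DOI is available as a plain string. The class and method names (`DoiSuffix`, `afterLastDot`) are hypothetical and not part of JabRef; the example DOI is the one quoted later in this thread.

```java
import java.util.Optional;

final class DoiSuffix {

    /**
     * Returns the part of the DOI after its last dot, or an empty Optional
     * if the DOI contains no dot or ends with one.
     * Hypothetical helper for illustration only.
     */
    static Optional<String> afterLastDot(String doi) {
        int lastDot = doi.lastIndexOf(".");
        if (lastDot < 0 || lastDot == doi.length() - 1) {
            return Optional.empty();
        }
        return Optional.of(doi.substring(lastDot + 1));
    }

    public static void main(String[] args) {
        // Prints the article ID part of an APS DOI
        System.out.println(afterLastDot("10.1103/PRXQuantum.1.010001").orElse("<none>"));
    }
}
```

Note that this alone does not verify that the DOI belongs to APS, so it would still need to be combined with a check on the DOI itself.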
@@ -89,6 +90,7 @@ public String getName() {
}

private void doPostCleanup(BibEntry entry) {
new PageFieldCleanup().cleanup(entry);
It is not a good idea to put it here, because it would be called for every type of DOI, and APS DOIs are only a very specific subset.
I would suggest adding it as a CleanupPreset step:
https://github.com/JabRef/jabref/blob/1b35f8cb0040fdfb515974e78532598f07e11af2/src/main/java/org/jabref/logic/cleanup/CleanupPreset.java
and also adding it to the Cleanup dialog, maybe as "Move article id to pages (APS)".
I do think it's the right place: the DOI fetcher is not returning the right/full information, so we improve it here. Of course, you are right that the extraction should only be applied to DOIs from APS (see my comment above).
Thanks for the review, @Siedlerchr @tobiasdiez. I am guessing it's an issue with the Crossref website.
if (!entry.getType().equals(StandardEntryType.Article)) {
return false;
}
Pattern apsJournalSuffixPattern = Pattern.compile("([\\w]+\\.)([\\w]+\\.)([\\w]+)");
I am wondering whether it would make sense to check for the string "phys" as well.
It's also better to extract the Pattern to a static field; otherwise you won't get the advantage of compiling it only once.
Can we be sure that every Physical Review journal contains that string? For example, this APS DOI doesn't contain the string "phys": https://doi.org/10.1103/PRXQuantum.1.010001
Here is a list of all journals
https://journals.aps.org/browse
I think the best approach is to check for the prefix 10.1103/. It seems all DOIs from APS start with it.
Otherwise the regex is too broad, would capture a lot of unrelated DOIs, and would create invalid data.
In that case, a regex isn't really needed, because in the string 10.1103, 1103 denotes the organization (APS in this case). That's really all that has to be checked, if I am not mistaken.
I would still use the regex, since it lets you test that the DOI has the right format, so that you really know the last number is the page number. It's also a bit easier to understand than the manual parsing using isDigit.
So we:
1. use the organization id to check if the entry is an APS journal
2. use the regex to check if the DOI is of the right format (https://doi.org/10.1103/[journal].[volume].[articleID])
3. set the page field if 1 and 2 are true
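The plan above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code that was merged: the class `ApsArticleId`, the pattern, and the use of a plain `Map` as a stand-in for `BibEntry` are all hypothetical, and the DOI is assumed to be in bare form (without the `https://doi.org/` part).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ApsArticleId {

    // Assumed format: APS prefix "10.1103/" followed by journal.volume.articleId.
    // Compiled once as a constant so it is not rebuilt on every call.
    private static final Pattern APS_DOI =
            Pattern.compile("10\\.1103/(\\w+)\\.(\\w+)\\.(\\w+)");

    /** Steps 1 and 2: returns the article ID only for DOIs matching the assumed APS format. */
    static Optional<String> articleId(String doi) {
        Matcher matcher = APS_DOI.matcher(doi);
        return matcher.matches() ? Optional.of(matcher.group(3)) : Optional.empty();
    }

    /** Step 3: sets "pages" from the DOI, but only if no pages value exists yet. */
    static void fillPages(Map<String, String> fields) {
        String doi = fields.get("doi");
        if (doi == null || fields.containsKey("pages")) {
            return;
        }
        articleId(doi).ifPresent(id -> fields.put("pages", id));
    }

    public static void main(String[] args) {
        Map<String, String> fields = new HashMap<>();
        fields.put("doi", "10.1103/PRXQuantum.1.010001");
        fillPages(fields);
        System.out.println(fields.get("pages"));
    }
}
```

Using `matches()` rather than `find()` requires the whole DOI to fit the pattern, which keeps DOIs from other registrants that merely end in a number from being picked up.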
private boolean isAPSJournal(BibEntry entry, String doiAsString) {
if (!entry.getType().equals(StandardEntryType.Article)) {
return false;
}
Pattern apsSuffixPattern = Pattern.compile(APS_SUFFIX);
Move that line to a private static final field as well; then it's good!
Thanks, looks good to me now! Don't forget to add a changelog entry for the new feature.
Thanks! I hope you enjoyed the process, although it may have been a bit confusing at times. Sorry for that. Looking forward to your next PR. And now MERGE 🚀
Fixes #7019
Added a cleanup to copy over the article id as the page number for APS journals. This only happens if the page number doesn't exist already.