Shared data synchronization #7618
The change is not rejected; there is a merge. In case user1 modifies fieldA and user2 modifies fieldB, both updates go in. If both modify fieldA, the "resolve conflicts" dialog is shown, similar to git when two people modify the same line of code. One could implement something more intelligent here if domain-specific knowledge is used. For instance, if an author is added by user1 and another one by user2, the update algorithm could just add the two authors.
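A rough sketch of that field-wise merge rule as a three-way merge per field (class and method names are my own illustration, not JabRef's actual code):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Illustrative sketch of the field-wise merge described above.
public class FieldMergeSketch {

    record MergeResult(Map<String, String> merged, Map<String, String[]> conflicts) {}

    static MergeResult merge(Map<String, String> base,
                             Map<String, String> user1,
                             Map<String, String> user2) {
        Map<String, String> merged = new HashMap<>();
        Map<String, String[]> conflicts = new HashMap<>();
        Set<String> fields = new HashSet<>(base.keySet());
        fields.addAll(user1.keySet());
        fields.addAll(user2.keySet());
        for (String field : fields) {
            String b = base.get(field);
            String v1 = user1.get(field);   // null = field absent/removed
            String v2 = user2.get(field);
            if (Objects.equals(v1, v2)) {
                if (v1 != null) merged.put(field, v1);        // same value on both sides
            } else if (Objects.equals(v1, b)) {
                if (v2 != null) merged.put(field, v2);        // only user2 changed this field
            } else if (Objects.equals(v2, b)) {
                if (v1 != null) merged.put(field, v1);        // only user1 changed this field
            } else {
                conflicts.put(field, new String[] {v1, v2});  // both changed the same field
            }
        }
        return new MergeResult(merged, conflicts);            // conflicts -> "resolve conflicts" dialog
    }
}
```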
No. See above. JabRef implements the above merge algorithm. It does not work well if two persons are working on the same field at the same time: the user is always presented with the merge-entries dialog.
We implemented that in our Postgres setting.
This is, more or less, optimistic offline lock.
Stick with optimistic offline lock. It seems it works very well for the cases you described. Furthermore, it is very simple to implement. It works both in a "real time" data synchronization setting and in an offline setting.
Thanks @koppor, apparently I misread the code a bit. So what happens if the user is offline, changes something, and later goes online? When is the version counter incremented, and how do you recognize that there were changes to the local database in the last session that need to be pushed to the server?
Side comment: We can try the effects with a shared Postgres instance. See https://docs.jabref.org/collaborative-work/sqldatabase#try-it-out for details. I assume the tool is not crashing. (If we discuss crashing, we need a local database storing the local id.) I hope I recall everything correctly; otherwise, I would need to buy the POSA book. The secret sauce is that the client stores the version number of the record at the time it received the record from the server (call it the client-stored version).
In case another client pushed in between, the current version number on the server is not equal to the client-stored version.
All records need to be updated where the server's version number is greater than the client-stored version.
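A minimal sketch of that push/pull logic (all names are illustrative assumptions, not JabRef's actual classes):

```java
import java.util.Map;

// Sketch of the optimistic-offline-lock push/pull described above.
public class OfflineLockSketch {

    interface Server {
        int versionOf(String sharedId);                                  // current server version
        void write(String sharedId, Map<String, String> fields, int newVersion);
        Map<String, String> fetch(String sharedId);
    }

    static class LocalRecord {
        final String sharedId;
        Map<String, String> fields;
        int storedServerVersion;   // server version when this record was last synced

        LocalRecord(String sharedId, Map<String, String> fields, int version) {
            this.sharedId = sharedId;
            this.fields = fields;
            this.storedServerVersion = version;
        }
    }

    /** Push succeeds only if nobody else pushed in between. */
    static boolean tryPush(LocalRecord local, Server server) {
        int current = server.versionOf(local.sharedId);
        if (current != local.storedServerVersion) {
            return false;                              // conflict: show the merge dialog
        }
        server.write(local.sharedId, local.fields, current + 1);
        local.storedServerVersion = current + 1;
        return true;
    }

    /** Pull refreshes every record whose server version moved past ours. */
    static void pull(LocalRecord local, Server server) {
        int current = server.versionOf(local.sharedId);
        if (current > local.storedServerVersion) {
            local.fields = server.fetch(local.sharedId);
            local.storedServerVersion = current;
        }
    }
}
```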
Small note: Back in 2016, we did not have the concept of ADRs at hand. Everything discussed was between @obraliar and me. This synchronization concept was developed in the "StuPro 2015/2016".

It just came to mind that the key/value field storage is one consequence of that: with a version on each field, JabRef knows whether the field was updated remotely. Thus, we know whether we have to update a field, and it is possible to work on the same entry. Since we did not implement a "smart merge" on a field basis, one gets the "merge entries" dialog if two users concurrently work on one field.

Note that only Postgres and Oracle support the live synchronization (because they offer a pub/sub system). MySQL does not offer pub/sub; thus, the user has to poll for changes regularly in the case of MySQL.
Thanks Olly. I think I understood the algorithm. But I have trouble seeing how it works if not all changes are immediately pushed to the server.

Start with some entry at version = 0. Say you are offline and remove a field, then close JabRef (the version is still 0 since the change was not yet transmitted to the server, right?). In the meantime, someone else modifies the entry on the server, so that there version = 1. Now you start JabRef in online mode. The server notifies you about the update, and the algorithm recognizes that 0 = local.version < server.version = 1, so the server version of the deleted field is used (right?). But the "correct" behavior would be to delete the deleted field and merge the other changes. So how does one recognize that locally the field was removed, and not that the server version had the field added?

The easiest solution (and if I understood it correctly, one can actually prove that it is the minimal data structure doing this job) is to keep track of the local version as well, i.e. have an additional counter for each node that is increased when the local replica is modified. This is exactly the version / vector clock approach suggested above. This would be only a small modification of the offline lock algorithm to gracefully handle offline scenarios as well.
I thought a bit more about it and came to the realization that the "vector clock" solution is a bit more complicated than what we need. If we have a star architecture, then every sync goes via the central server. Thus, a client doesn't need to know the local version of another client (which is what the vector clock gives). Instead, it suffices for the client to keep track of local updates relative to the server version. Thus, my new proposal is to store, per entry, a "client version" (the server version at the last sync) together with a "local version" counting unsynced local changes.
Then the sync operations can be implemented as follows:
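A minimal sketch of what such a decision could look like under that proposal (names are illustrative):

```java
// Sketch of the proposed per-entry sync decision. "clientVersion" is the
// server version seen at the last sync; "localVersion" counts unsynced
// local changes.
public class SyncDecisionSketch {

    enum Action { NOTHING, PUSH, FETCH, MERGE }

    static Action decide(int clientVersion, int localVersion, int serverVersion) {
        boolean remoteChanged = serverVersion > clientVersion;  // someone else pushed
        boolean localChanged = localVersion > 0;                // we edited locally
        if (remoteChanged && localChanged) return Action.MERGE;
        if (remoteChanged)                 return Action.FETCH;
        if (localChanged)                  return Action.PUSH;
        return Action.NOTHING;
    }
}
```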
In addition, the server, local and client version number need to be updated accordingly after each of these operations.
If client and local version numbers are stored in the entry, then A7 is supported. However, manual changes to the BibTeX file are not supported (except if the user also increases the local version number). What do you think, @koppor?
What you are explaining in the table is the optimistic offline lock. (Assumption: "local version" = ".bib" file; then it is easy, too.) As soon as the "client" is started, it knows the version of the .bib file when it was last synched. Then, it loads the local .bib file. In case of any differences, it increases the version number. Then, it starts synchronizing with the server.

Cold start: The client is not aware of anything. In case the .bib file already exists on the server, a full comparison has to be made.

Please note that we put the version number on fields; thus, users can edit the same entry. The merge-entries dialog only pops up if they touch the same field.
I found our implementation: org.jabref.logic.shared.DBMSSynchronizer#synchronizeLocalDatabase, called from org.jabref.logic.shared.DBMSSynchronizer#pullChanges. The tests are in org.jabref.logic.shared.SynchronizationTestSimulator. Oh, wow, with that in mind, one "just" has to implement …
You are right, the algorithm is very similar to the optimistic lock. However, there you only have one "version" that is incremented upon change and used for the comparison. This works perfectly when you always have a connection to the server and can directly push changes. However, when you are offline and change an entry, this may lead to problems, as described above.
That's why I proposed to have a "local version" that keeps track of unsynced local changes (you can see this as another optimistic lock).
Keeping a copy of the last synced state would be another solution for this issue, indeed. Is this already implemented? Thanks also for the input on field vs. entry sync. I need to think about this in more detail. What are the advantages of the field sync in your opinion?
The client has to be smart enough to note that it locally changed something. This is what you described as "keeping a copy of the last synced state". I agree, we don't keep the copy. When we thought about these issues, we found out that BibTeX data is a key/value object: key1 could be authors, key2 journal, etc. Thus, we can put versions into fields! This also answers your question:
If the server touches field1 and we modify field2, we can easily (!) be aware of these changes. If we do not do that and instead version the complete entry as a whole, then we run into the issues you describe. Then, the local view is absolutely necessary to correctly merge the updates.
With a version on each field and notifications that fields are added and removed, it is not.
We somehow tried to avoid the local extra copy, but I agree it is necessary if one really works offline. I would store a synchronization state instead of the plain data view received from the server:

```yaml
field1:
  serverVersion: 1
  clientVersion: 1
field2:
  serverVersion: 1
  clientVersion: 2
```

Then, each sync attempt can use this information. If the server version changed, it needs to be fetched and merged accordingly. I thought this is easier than a BibTeX diff. However, a BibTeX diff can IMHO produce similar data. For @obraliar and me, the field versions seemed to be easier to implement and test; we found a proper BibTeX diff harder than a field diff.

In the case of Overleaf, I would do a BibTeX diff. Based on that, I would create a field diff and set the version numbers of the fields accordingly. Based on that, I would do the sync of the local state with the server state. That would be easier for me, as there is more control in each phase (diffing -> handling of field diffs).
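A sketch of how such a per-field sync state could drive a sync attempt, assuming that a client version ahead of the server version marks an unsynced local edit (names mirror the YAML above; this is an illustration, not JabRef's implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: per-field sync state as in the YAML example above.
public class FieldSyncSketch {

    record FieldState(int serverVersion, int clientVersion) {}

    static void sync(Map<String, FieldState> state, Map<String, Integer> currentServerVersions) {
        for (Map.Entry<String, FieldState> e : state.entrySet()) {
            String field = e.getKey();
            FieldState s = e.getValue();
            int serverNow = currentServerVersions.getOrDefault(field, s.serverVersion());
            boolean remoteChanged = serverNow > s.serverVersion();
            boolean localChanged = s.clientVersion() > s.serverVersion();  // assumed convention
            if (remoteChanged && localChanged) {
                System.out.println(field + ": merge (merge-entries dialog if needed)");
            } else if (remoteChanged) {
                System.out.println(field + ": fetch from server");
            } else if (localChanged) {
                System.out.println(field + ": push to server");
            } else {
                System.out.println(field + ": up to date");
            }
        }
    }

    public static void main(String[] args) {
        Map<String, FieldState> state = new LinkedHashMap<>();
        state.put("field1", new FieldState(1, 1));       // in sync
        state.put("field2", new FieldState(1, 2));       // unsynced local edit
        sync(state, Map.of("field1", 2, "field2", 1));   // field1 changed remotely
    }
}
```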
Thanks again. I thought a bit more about the per-field vs per-entry versioning system. Disadvantages I see with per-field are as follows:
Also, if I understand the code correctly, it currently only supports versions per entry anyway.
Am I missing something here?
(Notes from DevCall 2021-06-07)

TLDR: Version the BibTeX entry; make the merge server-side in one transaction --> offer options to the user.

Where to store the version?

Versioning:
- Optimistic offline lock
- "Relative time" (see also vector clocks)
- Timestamp-based sync

Personal notes:
- No --> server-side processing.
- Fuck :)
- Can't be true. Code lies. :)
I finally found a bit more time to think about this. Olly had a very good point during the DevCall, pointing out that users can also edit entries outside of JabRef, and this needs to be picked up by JabRef before it syncs (so the scenario is: user closes JabRef > edits the bib file manually > starts JabRef). Regardless of where we save the local version number (in the entry or in a separate file next to the entry), such user edits don't trigger an update of the version number, and thus the sync algorithm gets confused. This means we need a reliable way to make sure the local version is also increased when the user changes the bib file. My proposal for this is as follows: in a separate file (called "sync file" in the following), we save a hash of each entry's last synced state.
Once JabRef starts (or, more exactly, loads the bib file) and there is a sync file, JabRef will compute the hashes for all entries and compare them with the hashes in the sync file. If the hashes coincide, we do nothing. But if the hashes are different, the user changed the entry manually and we increase the local version number. Only after this update process is the sync with the server initiated. @JabRef/developers What do you think? I'll update the issue description with the new proposal.
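A sketch of the proposed check, assuming SHA-256 over each entry's serialized form (the hash function and all names are assumptions for illustration, not JabRef's actual implementation):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Sketch of the "sync file" check: hash each entry's serialized form and bump
// the local version if the stored hash no longer matches.
public class SyncFileCheckSketch {

    static String hash(String serializedEntry) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(serializedEntry.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);   // SHA-256 is always available
        }
    }

    /** Returns the (possibly bumped) local version for one entry. */
    static int reconcile(String serializedEntry, String storedHash, int localVersion) {
        if (!hash(serializedEntry).equals(storedHash)) {
            return localVersion + 1;   // entry was edited outside JabRef
        }
        return localVersion;           // untouched since the last sync
    }
}
```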
Does this mean we would need to keep two files for one library in the future - a library file and a hash file? That would probably be the worst solution imho.
One could also store the hash in the entry itself. But why do you think a separate file is not a good solution? There are certain advantages to storing this externally, like not polluting the entry with sync-related data (which makes it easier to send the entry to other people), and users cannot accidentally delete/modify the sync data, making everything more robust. On the other hand, it shouldn't be too hard to store the file out of sight of the user (say, in %AppData%) so that they don't even notice we keep this data.
My concern was regarding the portability of the libraries across different PCs. Would you have to share the hash files then, too?
No, the user doesn't need to copy these hash files; they are solely managed by JabRef. However, if the database is synced externally as well (say, using git or Dropbox), there is a small performance advantage to syncing the hash file as well (which can be taken as an argument for embedding the hash in the entry itself). For example, consider the following scenario (with no sharing of the hash file): the user syncs the bib file via git to a second machine; there, the entry hashes no longer match the stored hashes, so the git changes look like manual edits and an unnecessary sync with the server is triggered.
If the hash files are shared, then the changes via the git sync are not recognized as external changes and no unnecessary server query is invoked. That's the only small advantage I see for storing the hash in the entry itself.
Yes, that was more or less what I meant. 😅
With the upcoming implementation of an online version for JabRef, we face the issue of synchronizing data across multiple devices (and eventually different users).
Setting
Assumptions
JabRef's requirements are summarized as follows:
Algorithm:
For each entry, we store the following data on the client side: the "shared id", the server version number at the last sync, and a local version counter that is incremented upon every local change.
On the server side, the "shared id" and a "server version number" are stored.
Then the sync operations are implemented as follows:
As an example:
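A hedged sketch of such an example, using the client-side data just described (the scenario and names are illustrative):

```java
// Sketch: an entry was edited while offline, and another client pushed in the
// meantime. The decision must be MERGE, not a blind fetch of the server state.
public class EntrySyncExample {

    static class ClientEntry {
        String sharedId;
        int lastSyncedServerVersion;
        int localVersion;            // 0 = no unsynced local changes
    }

    enum SyncOp { NOTHING, PUSH, FETCH, MERGE }

    static SyncOp decide(ClientEntry entry, int currentServerVersion) {
        boolean remoteChanged = currentServerVersion > entry.lastSyncedServerVersion;
        boolean localChanged = entry.localVersion > 0;
        if (remoteChanged && localChanged) return SyncOp.MERGE;  // both sides changed
        if (remoteChanged)                 return SyncOp.FETCH;  // only the server changed
        if (localChanged)                  return SyncOp.PUSH;   // only we changed
        return SyncOp.NOTHING;
    }

    public static void main(String[] args) {
        ClientEntry entry = new ClientEntry();
        entry.sharedId = "entry-1";
        entry.lastSyncedServerVersion = 0;
        entry.localVersion = 1;                  // a field was removed while offline
        System.out.println(decide(entry, 1));    // server is at 1 now -> MERGE
    }
}
```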
Details and overview of (different) options
Lock systems
One node acquires a lock to write changes. All other write changes that happened during the same time are rejected or marked as conflicts that need to be resolved by the user. Here we should favor the optimistic offline lock (always allow users to make changes, but committing to the server requires the lock) over pessimistic locks (only allow writing when one holds a lock), since the chance for conflicts is low (A1) and users should always be able to make edits, even when offline (A4). The optimistic offline lock is the approach we currently use for sync with shared databases. In particular, we have a "version" counter that is incremented upon every write. If the user's version counter is smaller than the remote version counter, then this is an indication that there was an additional update to the remote database in the meantime, and the local change is rejected.
The current implementation may run into problems when a client is offline (if I'm not mistaken). For example, suppose that two users have the same entry with version counter zero. If user 1 changes the entry while being offline, and the server version is updated by user 2 in the meantime (thus the server's version counter is incremented), then user 1's counter is smaller than the server's counter, indicating that the entry of user 1 needs to be replaced by the remote version. But this would overwrite the changes of user 1.
In general, I have the impression that optimistic locks are great for handling concurrent transactions (two users trying to update the same entry at the same time), but have shortcomings when used as a data synchronization protocol.
Optimistic replication
Locking (in the way we use it) ensures that there is essentially only one valid replica (the server version) and all user copies are almost immediately synced, leading to essentially one copy of the data across multiple nodes. In contrast, optimistic replication accepts that for some short time there might be divergent replicas that only eventually converge to a final state. Having divergent versions of the data seems to be a realistic scenario, as users may work offline for some time (A6). The goal of optimistic replication is to have eventual consistency, which should be sufficient for our purposes.
There are different options varying along different axes.
Operations
In principle, we have easy access to the change operations and have implemented listeners already. However, sending these change operations to the server requires that the user is online. In the case that the user is offline, we would need to write a changelog to the bib file, which would be sent to the server once the user goes online again (and upon successful transfer to the server the changelog is reset). Using an external store does not work, because then the bib file is not enough to replay the changes, violating A7. In the same vein, when the user edits the bib file manually, we don't have any direct operation information (A8). For these reasons, we definitely need to support state transfer, at least as a fallback.
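A sketch of what such a changelog could look like, with state transfer as the fallback when replay is not possible (all names are assumptions for illustration):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

// Sketch: change operations are queued while offline and replayed on
// reconnect; if replay fails, the caller falls back to full state transfer.
public class ChangelogSketch {

    record ChangeOp(String sharedId, String field, String newValue) {}

    private final Deque<ChangeOp> pending = new ArrayDeque<>();

    void record(ChangeOp op) {
        pending.add(op);   // would be written through to the .bib-side changelog while offline
    }

    /** Replay queued operations; on any rejection, give up and state-sync instead. */
    boolean flush(Predicate<ChangeOp> sendToServer) {
        while (!pending.isEmpty()) {
            if (!sendToServer.test(pending.peek())) {
                return false;   // fall back to state transfer
            }
            pending.remove();   // operation acknowledged, reset this part of the log
        }
        return true;
    }
}
```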
Conflict-raising
Syntactic policies are simpler and generic but cause more conflicts, whereas semantic policies are more flexible but also more complicated and application-specific.
Ignoring conflicts is not an option in my opinion, due to A3. I would mainly go with a mostly-syntactic policy, enriching it with simple semantic rules. For example, additions of new, different fields shouldn't result in conflicts. But I wouldn't try to analyze the user's intent if both edit the same field. This marks some changes as conflicts that could in principle be reconciled, e.g. starting from "A is B", user 1 changing "A" to "JabRef" and user 2 changing "B" to "great" could be resolved to "JabRef is great". A border case is special fields, e.g. two keywords are added, or the read status is changed to "skimmed" on one device and to "read" on another. Proposal: start with a syntactic policy and enrich it by semantic rules based on user feedback.
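As an illustration of such a semantic rule, a three-way set merge for a list-valued field like keywords could look as follows (a sketch assuming comma-separated keywords; not an existing JabRef API):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of a semantic rule on top of the syntactic policy: concurrent
// additions to a keywords field are merged by a three-way set merge instead
// of raising a conflict.
public class SemanticMergeSketch {

    static List<String> split(String keywords) {
        List<String> result = new ArrayList<>();
        for (String k : keywords.split(",")) {
            if (!k.isBlank()) {
                result.add(k.trim());
            }
        }
        return result;
    }

    /** Keep base keywords that neither side removed; add what either side added. */
    static String mergeKeywords(String base, String mine, String theirs) {
        Set<String> baseSet = new LinkedHashSet<>(split(base));
        Set<String> mineSet = new LinkedHashSet<>(split(mine));
        Set<String> theirsSet = new LinkedHashSet<>(split(theirs));

        Set<String> merged = new LinkedHashSet<>();
        for (String k : baseSet) {
            if (mineSet.contains(k) && theirsSet.contains(k)) {
                merged.add(k);   // kept by both sides
            }
        }
        for (String k : mineSet)   { if (!baseSet.contains(k)) { merged.add(k); } }  // my additions
        for (String k : theirsSet) { if (!baseSet.contains(k)) { merged.add(k); } }  // their additions
        return String.join(", ", merged);
    }
}
```

For example, `mergeKeywords("a, b", "a, b, c", "a, b, d")` would yield `"a, b, c, d"` without user interaction.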
Propagation: Network Topology
The main communication path should be via the central JabRef server, resulting in a star topology. However, we also need to support the case where the user syncs the database manually (e.g. by using git).
Propagation: Degree of synchronicity
Using GraphQL subscriptions, it should be straightforward to implement a broadcast system for updates (while the user is online), supplemented by a pull after periods of being offline (e.g. on JabRef start or after a connection loss).
Maintaining a "happens-before" relationship
Vector clocks become problematic when the number of nodes becomes large (not an issue for us in the near future) or is dynamic (definitely an issue for us). For the latter, see, e.g., Dynamic Vector Clocks for Consistent Ordering of Events in Dynamic Distributed Applications.
Nice overview of many variations of vector clocks.
Old proposed algorithm
Every node has a vector clock (for each entry) and increments its own time once the entry is modified. Every update message from the server includes the vector clock, which is then used to determine the merge strategy by comparing it with the local vector clock:
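A sketch of the clock bookkeeping and comparison this implies (illustrative names; `CONCURRENT` is the case that needs a merge or the merge-entries dialog):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of a per-entry vector clock: a map node -> counter.
public class VectorClockSketch {

    enum Order { EQUAL, BEFORE, AFTER, CONCURRENT }

    static Order compare(Map<String, Integer> a, Map<String, Integer> b) {
        boolean aBehind = false;   // some component of a is smaller
        boolean bBehind = false;   // some component of b is smaller
        Set<String> nodes = new HashSet<>(a.keySet());
        nodes.addAll(b.keySet());
        for (String node : nodes) {
            int va = a.getOrDefault(node, 0);
            int vb = b.getOrDefault(node, 0);
            if (va < vb) { aBehind = true; }
            if (vb < va) { bBehind = true; }
        }
        if (aBehind && bBehind) return Order.CONCURRENT;  // true conflict -> merge
        if (aBehind)            return Order.BEFORE;      // a happened-before b -> take b
        if (bBehind)            return Order.AFTER;       // b happened-before a -> keep a
        return Order.EQUAL;                               // nothing to do
    }

    /** A node increments its own component when it modifies the entry. */
    static void tick(Map<String, Integer> clock, String node) {
        clock.merge(node, 1, Integer::sum);
    }
}
```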
When the user reconnects and wants to update (pull sync), then she sends all (shared) ids of entries and the local vector clock to the server. The response will be all entries with more recent times (in any of the clocks).
This is based on the SData description.
One question I still have is how to store the vector clock (in the entry?) and name the nodes (user + device name?).
Other approaches:
References: