Mark up text for automated string extraction #1477

Open
16 of 28 tasks
emmajclegg opened this issue Jul 12, 2024 · 20 comments
Labels
Multilingual Interface ODS Issue initiated by ODS
@emmajclegg
Collaborator

emmajclegg commented Jul 12, 2024

This is a first step to translating IATI Publisher's interface into French and Spanish, following the approach discussed here: #1420

YI will prepare text for automated extraction from IATI Publisher, ODS will review and get it translated, then YI will reintegrate text back into IATI Publisher.

Tasks

  • Understanding the requirement for automated string extraction
  • Create an API for updating the locale language #1560
  • Create API for sending the translated texts to the FE #1562
  • Create UI for updating the system language
  • Add placeholders in public pages and add english texts from those pages in php files with proper folder structure
  • Write a script to read the language files and download xls file in 3 columns for English, French, Spanish
  • Upload the translated files in the language files by reading the XLS file
  • Test the public pages for translation
  • Add placeholders in remaining pages, common buttons and notifications and add english texts from those pages in a json file
  • Upload the translated files in the language files by reading the XLS file
  • Complete the translation of the system
  • Document how to add translation for future development and changes

Test extraction by modules

@emmajclegg emmajclegg added the ODS Issue initiated by ODS label Jul 12, 2024
@emmajclegg emmajclegg self-assigned this Jul 12, 2024
@emmajclegg emmajclegg changed the title (Placeholder) Mark up text for automated string extraction Mark up text for automated string extraction Jul 15, 2024
@emmajclegg emmajclegg added Translation Work to provide a multi-lingual IATI Publisher interface and removed Translation Work to provide a multi-lingual IATI Publisher interface labels Jul 29, 2024
@emmajclegg emmajclegg added this to the Multilingual interface milestone Jul 29, 2024
@emmajclegg emmajclegg assigned praweshsth and unassigned emmajclegg Jul 30, 2024
@robredpath
Collaborator

Following our conversation this week, I wanted to share an outline of the process that you'll need to follow.

It's important that this is an automated process that is part of your standard workflow, so that any changes to text are detected and translated quickly in the future.

In my experience, this is usually achieved by marking up the text in some way. See, for example, this Django template from one of our projects, which wraps each sentence in {% blocktrans %} tags which signal to Django's i18n module that the string should be included in the translation process. Our workflow uses .pot/.po files.

I can see that there's already some files with lists of strings that look like a translation mechanism, which may be how this is already starting to be implemented? My instinct is that this is potentially quite a fragile and high-effort way to work, but it's up to you!

Either way, once the system is in place, then each time we make an update, the process is:

  • Extract strings
  • Package strings that have changed in a .pot file
  • Send to translators
  • Wait
  • Receive translated strings as a .po file
  • Re-integrate strings into the software

Apart from a few manual steps to authorise the translation and review the software output to make sure that nothing went wrong, this is an entirely manual process.
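As a rough illustration of the "extract strings" and "package" steps listed above, here is a minimal Python sketch. It assumes text is marked up with plain `{% blocktrans %}` tags (no arguments) and writes a bare-bones gettext-style template. The function names `extract_strings` and `to_pot` are hypothetical, and a real project would more likely rely on its framework's own extraction tooling than on a regular expression like this.

```python
import re

# Matches text wrapped in plain {% blocktrans %} ... {% endblocktrans %} tags.
# Assumption: tags carry no arguments; real templates may be more varied.
BLOCKTRANS = re.compile(
    r"\{%\s*blocktrans\s*%\}(.*?)\{%\s*endblocktrans\s*%\}", re.DOTALL
)


def extract_strings(template: str) -> list[str]:
    """Return each marked-up string once, in order of first appearance."""
    seen: dict[str, None] = {}
    for match in BLOCKTRANS.finditer(template):
        text = " ".join(match.group(1).split())  # collapse runs of whitespace
        if text:
            seen.setdefault(text, None)
    return list(seen)


def to_pot(strings: list[str]) -> str:
    """Render strings as minimal template entries: msgid plus an empty msgstr."""
    entries = []
    for s in strings:
        escaped = s.replace("\\", "\\\\").replace('"', '\\"')
        entries.append(f'msgid "{escaped}"\nmsgstr ""\n')
    return "\n".join(entries)
```

Because repeated strings are collapsed to a single entry, a phrase that appears on several pages is only sent to translators once.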

I don't think that IATI Publisher necessarily has to have a .pot/po file - based process, but if you were to build one then it would be very close to what we're going to need once we work out the file details with the translation company.

Does that make sense? I'm very happy to provide any more detail if it's useful.

@emmajclegg
Collaborator Author

Thanks @robredpath - I assume you meant "Apart from a few manual steps to authorise the translation..., this is an entirely automated process" ? Otherwise, no questions from me

@praweshsth praweshsth modified the milestones: Multilingual interface, August 2024 Aug 5, 2024
@praweshsth praweshsth assigned Sanilblank and unassigned praweshsth Aug 5, 2024
@Sanilblank
Collaborator

Hello @robredpath
Based on the template you have provided, it seems that you are referring to the localization mechanism of the Django framework. When we initially started the translation process in IATI Publisher, we used a similar approach. Laravel, the framework used for the backend, also provides a similar localization mechanism, and if Laravel's templating engine had been used for the frontend as well, our templates would look very similar to the one you shared. However, since we are using Vue.js, displaying the text on the frontend is a little more complex, even though Vue uses a similar templating structure.
Similar to your use of .pot/.po files, Laravel uses language files that contain arrays of translation strings, split across different files as required. The process and mechanism for both are quite similar, so I don't think we need to incorporate .pot/.po files directly into the system, as that would require a lot more research. The translations will be stored in files within the system. Since those files are fairly technical, providing them to you directly could cause confusion about how to process them. Instead, we were thinking of writing a script that takes all the strings and places them in an Excel file to be provided to you. You would add the translated strings and send the file back to us, and a second script would take those translations and put them in the format required by the system.
If you require .pot/.po files for the strings to be translated, we will need a bit of time to research how the files can be created. The rest of the process would be similar, i.e. a script will generate the file which will be sent to you, you will add the translations and send the file back to us, and a script will take the translations and put them into the system.
We are a bit confused on the use of the word 'automated' in your title and description.
By 'automated' I am assuming you mean that a file will contain all the English strings present in the system and, if any change occurs in the file, a script will detect the change and then generate the required Excel or .pot/.po file which will be sent to you for translation. Is my understanding correct? If it is not, could you provide a bit more explanation for this part?
Hope I have made things clear here. If anything seems confusing, I would be glad to go into more detail.

cc. @praweshsth @PG-Momik

@emmajclegg
Collaborator Author

Thanks for the information here @Sanilblank . @robredpath is away until Aug 29th unfortunately, but I will see if anyone else in our team can help with the file format question in the meantime.

We are a bit confused on the use of the word 'automated' in your title and description.
By 'automated' I am assuming you mean that a file will contain all the English strings present in the system and, if any change occurs in the file, a script will detect the change and then generate the required Excel or .pot/.po file which will be sent to you for translation. Is my understanding correct?

Yes, that's correct to my understanding. We remain in control of how often and when we run the re-translation, but your system should be capable of detecting what English text has and hasn't changed since the last translation.

By the extraction and re-integration of text into IATI Publisher being done in an automated way, we mean via a script as opposed to any manual copy and pasting.
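To make "detecting what English text has and hasn't changed" concrete, here is a minimal Python sketch. It assumes the English strings are available as a flat key-to-text mapping and that a snapshot of fingerprints was saved at the last translation run. The names `take_snapshot` and `diff_strings` are hypothetical, and this is one possible approach rather than how IATI Publisher does it.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable hash of an English string, used to spot edits."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def take_snapshot(current: dict[str, str]) -> dict[str, str]:
    """Record a fingerprint per key; store this after each translation run."""
    return {key: fingerprint(text) for key, text in current.items()}


def diff_strings(
    current: dict[str, str], snapshot: dict[str, str]
) -> dict[str, list[str]]:
    """Compare current English strings with the snapshot from the last run."""
    added = sorted(k for k in current if k not in snapshot)
    changed = sorted(
        k for k in current if k in snapshot and fingerprint(current[k]) != snapshot[k]
    )
    removed = sorted(k for k in snapshot if k not in current)
    return {"added": added, "changed": changed, "removed": removed}
```

Only the keys under `added` and `changed` would need to go to translators; `removed` keys can be dropped from the language files.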

@Sanilblank
Collaborator

@emmajclegg Thanks for the clarification.
I have another question regarding the extraction part. The system will generate the Excel (or other format) file containing the English strings to be translated, which will be sent to you. We will need a way to tell you which strings have been added or changed since the last translation. We could do this in several ways. The first is to include only the strings that have been added or updated (and so require translation). Another is to include all the strings along with their previous translations, which would also let you update translations that were done previously. The second approach would give you more flexibility, but it may be harder for you to see which texts actually require translating.
If you have any other ideas on this, we are open to hearing them. Please have a look and confirm which process we should move forward with.
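For illustration, the two options could be sketched in Python as a single export function with a switch. This assumes three flat dictionaries are available: the current English strings, the existing translations per key, and the English text as it stood at the last translation. The `status` column and the `build_rows` name are hypothetical additions, not something already agreed.

```python
def build_rows(
    english: dict[str, str],
    translations: dict[str, dict[str, str]],
    last_english: dict[str, str],
    only_pending: bool = False,
) -> list[dict[str, str]]:
    """Build export rows; only_pending=True gives the 'changed strings only' option."""
    rows = []
    for key in sorted(english):
        text = english[key]
        previous = translations.get(key, {})
        if key not in last_english:
            status = "new"
        elif last_english[key] != text:
            status = "changed"
        else:
            status = "ok"
        if only_pending and status == "ok":
            continue  # first option: leave out strings that need no work
        rows.append(
            {
                "key": key,
                "en": text,
                "fr": previous.get("fr", ""),
                "es": previous.get("es", ""),
                "status": status,
            }
        )
    return rows
```

With a status column, the full export keeps the flexibility of the second option while still showing which rows actually need translating.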

cc. @praweshsth @PG-Momik

@emmajclegg
Collaborator Author

Hi @Sanilblank - I don't want to give wrong information on this so will check with @robredpath once he's back (Aug 29th) and update here.

@robredpath
Collaborator

Hi @Sanilblank! Thanks for this - it's really useful to understand what you're thinking.

The exact format of the files doesn't really matter too much for us - I suggested .pot/.po files as they're fairly standard in our other applications and are straightforward to work with, but an .xlsx file would also be fine. The main thing is that the process is automated and repeatable.

By "automated" what I mean is that we expect the list of strings to be generated directly from the source code by software, without any manual steps - and for the translated strings to similarly be re-integrated automatically. This means that the process is easily repeatable, so that a small update can be made easily and large updates aren't too much of a problem.

We don't expect the automation to require zero human contact, but we want to make sure that everything gets translated as part of the regular updating process for the software: every time a form or button changes, or we add some new explanatory text, it should be translated promptly.

By way of example, for our documentation platform we run one command to generate the .pot files that we send to the translators, and then we check in the translated files to git and re-run the build process to generate the multi-lingual website. This gives us a very high level of repeatability and consistency, and it's easy for us to do which encourages us to do it often - even for very small changes.

In our documentation work we send the whole documentation site each time, and the translation platform figures out what's changed, and gets that translated. We then re-import the whole translated file back in. Our experience is that it's easier that way, rather than trying to manage lists of things that have changed. Ultimately, it is up to you, but that's our experience and recommendation.

Hope that helps - do let me know if you have any further questions

@emmajclegg emmajclegg modified the milestones: August 2024, September 2024 Sep 2, 2024
@Sanilblank
Collaborator

Hi @robredpath
I think I understand what you are saying, and I feel that we are on the same page regarding the automation process. As mentioned previously, we will write a script that checks the translations maintained in the system and generates an Excel file of all texts, whether already translated or requiring translation, which will be sent to you. You will add the translations as required and send the file back to us, and we will use a script to take the translations and insert them into the system.
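A minimal sketch of that round trip, in Python for brevity. It assumes the language files can be read as nested dictionaries and uses CSV from the standard library as a stand-in for the Excel file. The column names and the helpers `export_sheet` and `import_sheet` are hypothetical.

```python
import csv
import io


def flatten(tree: dict, prefix: str = "") -> dict[str, str]:
    """Turn nested language data into dotted keys, e.g. login.title."""
    flat = {}
    for name, value in tree.items():
        key = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten(value, key))
        else:
            flat[key] = value
    return flat


def unflatten(flat: dict[str, str]) -> dict:
    """Rebuild the nested structure from dotted keys."""
    tree: dict = {}
    for key, value in flat.items():
        node = tree
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return tree


def export_sheet(english_tree: dict) -> str:
    """Write one row per string with empty French and Spanish columns."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(["key", "en", "fr", "es"])
    for key, text in sorted(flatten(english_tree).items()):
        writer.writerow([key, text, "", ""])
    return buffer.getvalue()


def import_sheet(sheet_text: str, locale: str) -> dict:
    """Read the returned sheet and rebuild the nested data for one locale."""
    reader = csv.DictReader(io.StringIO(sheet_text, newline=""))
    flat = {row["key"]: row[locale] for row in reader if row[locale]}
    return unflatten(flat)
```

Rows left untranslated are skipped on import, so the interface can fall back to English for those keys.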

cc. @praweshsth @PG-Momik

@Sanilblank
Collaborator

Hi @robredpath
The above parts discussed between us are very clear now. For the part where we send the data present in the backend to the frontend, we researched online and found two methods.

  1. All the translation data is loaded in the app.blade.php file and saved as global data. The FE then uses this data to show text throughout the system. This increases the load on each page, so we could put the data in cache and use it everywhere. Even so, the amount of data stored in cache would be very large, so this approach may not be feasible. Also, when the user changes the language, the cached data is deleted.
  2. The BE will have APIs for sending the translated texts to the FE. When a page loads, the APIs for the translated texts it requires will be called. When an API for a certain set of translations is called, the result will be stored in the backend cache in Redis so that no processing is required next time. When a user changes the language, the cached data will be deleted. This approach does not send all the translated texts to the FE at once, which will help to reduce the load. We are leaning towards using this method.
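The second method can be sketched roughly as follows, in Python rather than PHP for brevity. An in-memory dictionary stands in for Redis, and the class name `TranslationCache`, its methods and the `loader` callback are all hypothetical.

```python
from typing import Callable


class TranslationCache:
    """Serve translations per group, loading each group only once per language."""

    def __init__(self, loader: Callable[[str, str], dict[str, str]]):
        self._loader = loader  # reads one group of strings for a locale
        self._cache: dict[tuple[str, str], dict[str, str]] = {}  # stands in for Redis
        self.locale = "en"

    def get_group(self, group: str) -> dict[str, str]:
        """Return the strings a page needs, processing them only on a cache miss."""
        key = (self.locale, group)
        if key not in self._cache:
            self._cache[key] = self._loader(self.locale, group)
        return self._cache[key]

    def change_locale(self, locale: str) -> None:
        """Switch language and delete cached data, as described above."""
        self.locale = locale
        self._cache.clear()
```

A page would call `get_group` for only the groups it displays, so the full set of translations is never sent in one go.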

This message is just to update you regarding the findings we have had and to give you an update about how we are proceeding for this feature.

cc. @praweshsth @PG-Momik

@BibhaT BibhaT modified the milestones: October 2024, November 2024 Nov 11, 2024
@emmajclegg
Collaborator Author

emmajclegg commented Nov 13, 2024

@Sanilblank - can you update on how this automated translation work is going please? We'd committed to implementing French & Spanish translation by the end of this year so I'm worried about delivery timelines slipping.

I'm keen to review the extracted English text (the step before getting it translated) as soon as possible, so I discussed briefly with @BibhaT & @PG-Momik yesterday the possibility of you sharing the text you've extracted so far, rather than waiting until all user-facing text has been incorporated. This would mean I could start reviewing earlier.

Any blockers or questions, let me know.

I'm also trying to close old IATI Publisher issues that are no longer relevant - let me know if it's ok to close #885 and #1279? These were the previous issues related to translation that I assume have been replaced by this current issue.

@Sanilblank
Collaborator

@emmajclegg Hello Emma
I have been informed about performing the translation in sections instead of performing everything at the same time.
I will be extracting the translation texts for the public pages and will send them to you by Wednesday at the latest. Once you confirm the format for the excel and the flow for the public pages is good, the remaining parts will be completed.

I'm also trying to close old IATI Publisher issues that are no longer relevant - let me know if it's ok to close #885 and #1279? These were the previous issues related to translation that I assume have been replaced by this current issue.

Yes those issues can be closed.

cc. @PG-Momik @BibhaT

@emmajclegg
Collaborator Author

To summarise from this morning's call,

We don't mind what order text is extracted from different modules of the system - @PG-Momik suggested choosing a "simple" module to start. To reconfirm, only user-facing interface text and messages will need translating, nothing that only super-admins can see.

The public-facing pages and registration workflow text was extracted by @Sanilblank last month. I'm summarising our feedback below from the email conversation:

  • can common strings that appear multiple times in the IATI Publisher interface be reused in the code, so that we are not translating them multiple times? This will help ensure similar text is translated consistently across the tool and reduce the time for translation. Note, if common strings appear as part of a wider sentence, then we should not split the sentence up.
  • let's avoid complex HTML in the extracted text as far as possible. Simple HTML, like basic links or styling tags, is fine but anything more complicated could affect translation. Removing line breaks was suggested as a minimum way to simplify the more complicated HTML examples.
  • some sentences were split up if they contained an email address, for example. We have a glossary of terms that the translators should not translate, so it is preferable to leave email addresses / names / other variables in the extracted text to make translation of the entire sentence possible.
  • whitespace, punctuation or other symbols at the beginning of extracted text can easily get lost in translation so we suggest avoiding these.
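Some of the feedback above could be checked by a script before a sheet is shared. The Python sketch below is one possible way to do that: the list of "simple" tags is an assumption, as is the function name `lint_string`, and the leading-character check is deliberately strict, so it would also flag strings that start with a placeholder.

```python
import re

# Assumption: tags treated as "simple" HTML; adjust to whatever is agreed.
SIMPLE_TAGS = {"a", "b", "i", "em", "strong"}
TAG = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)")


def lint_string(text: str) -> list[str]:
    """Return a list of problems found in one extracted string."""
    problems = []
    # A leading "<" is left to the HTML check below.
    if text and not text[0].isalnum() and text[0] != "<":
        problems.append("starts with whitespace, punctuation or a symbol")
    if "\n" in text or "\r" in text:
        problems.append("contains a line break")
    complex_tags = sorted({t.lower() for t in TAG.findall(text)} - SIMPLE_TAGS)
    if complex_tags:
        problems.append("contains complex HTML: " + ", ".join(complex_tags))
    return problems
```

Strings that keep an email address or a basic link inside the sentence pass these checks, which matches the preference for translating whole sentences.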

We expect to go through several test-runs of the text extraction, review, translation & reintegration process to resolve small problems that come up. This will be necessary before we release the French & Spanish interface to end users.

@BibhaT @PG-Momik @Sanilblank - I'm aware this is a big task and I'm worried that we don't have a good handle on timelines (considering it was something we were aiming to complete by the end of the year). Aside from bugs and user support issues, this translation work takes priority over any new work in the "proposed user story / task list".

Any questions, please let me know.

cc' @robredpath

@Sanilblank
Collaborator

Hello @emmajclegg ,
As Momik must have mentioned yesterday, I was on leave for some days because of health issues and did not have a chance to properly work on the translation tasks.
From the recent comment it is clear what changes need to be implemented based on the sample of the public pages translation which was sent previously.
The translation work is now mostly a manual task: copying the texts in different sections, pasting them into the language files and replacing the original texts with placeholders. Since most of the remaining work is manual, it is difficult to give an exact estimate of how long it will take to complete the translation of the entire system.
Since you have mentioned that this task takes priority, we will be giving more time and will be working towards finishing the tasks in a quick and effective manner. Also, if the estimations are required, we could finish a module like settings or organizations and then move towards creating an estimation based on the time taken to complete that one particular module.
Thank you,
Sanil Manandhar

cc. @PG-Momik @BibhaT

@emmajclegg
Collaborator Author

Thanks @Sanilblank - I appreciate you've been off recently, that's no problem.

I don't need an exact estimation for this at this point - the main thing is I'd like us to be making visible progress on it, rather than letting it potentially drag on for months.

Just let me know when you've picked the module to work on first and when roughly I should be expecting to receive a text file to review (as it helps me plan). I expect we'll want to test run the entire extraction, translation, reintegration process on a single module first which, as you say, will help us all understand time and effort required for the remaining ones.

@emmajclegg
Collaborator Author

emmajclegg commented Jan 7, 2025

To summarise where I think we got to on the questions from today's call:

  • IATI Standard element names - if these appear as part of a sentence in IATI Publisher, we will extract and translate the whole sentence (I think this makes sense for users' understanding). If the element names are standalone in the interface, we will still extract the strings but are unlikely to translate these into French and Spanish in the short term. We may translate them eventually, so want to keep this as an option.

  • IATI codelists - ODS are likely to translate IATI-maintained codelists into French & Spanish, but not externally maintained ones (e.g. OECD DAC lists). If and when we do get the codelists translated, we want to make sure IATI Publisher can display these. We worked on syncing codelists in issue #1407 (Check use of updated IATI codelists), so I believe IATI Publisher automatically detects when IATI Standard codelists change?

  • Import templates & PDF guidance files - we should prioritise translation of the IATI Publisher interface. Import templates and guidance are being looked at in separate issues, so are likely to change and translation can come later.

Any other questions, or anything to add, just let us know @PG-Momik @BibhaT

cc' @robredpath

@emmajclegg emmajclegg modified the milestones: December 2024, January 2025 Jan 7, 2025
@PG-Momik
Collaborator

PG-Momik commented Jan 8, 2025

@emmajclegg Adding to this. On the call I mentioned that I wasn't sure if codelists other than OrganizationRegistrationAgency were being synced. I've confirmed that the other codelists are being synced as well. 👍

@PG-Momik
Collaborator

PG-Momik commented Jan 8, 2025

@emmajclegg A sheet has been shared with the current extracted contents for the completed modules. Please have a look.

cc: @BibhaT

@emmajclegg
Collaborator Author

emmajclegg commented Jan 8, 2025

Ok thanks a lot @PG-Momik - I'll have a look over the next few days, prioritising the sheets labelled green in the extracted text spreadsheet.

One question - I see that a few standalone IATI-specific strings like "IATI organisation identifier", "publisher ID" and "default language" are appearing in the sheets multiple times, though the key field is similar in each case. Can you clarify if there's already been an attempt at deduplication here? I'm wondering how we avoid translating these important strings multiple times.
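To show what I mean by deduplication, here is a small Python sketch that groups identical English text across sheets. It assumes each sheet can be read as a key-to-text mapping, and it treats differences in case and spacing as the same string, which may or may not be what we want. The name `find_duplicates` is hypothetical.

```python
from collections import defaultdict


def find_duplicates(sheets: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each repeated English string to the sheet:key places where it appears."""
    groups: dict[str, list[str]] = defaultdict(list)
    for sheet, entries in sheets.items():
        for key, text in entries.items():
            normalised = " ".join(text.split()).casefold()
            groups[normalised].append(f"{sheet}:{key}")
    return {
        text: sorted(places) for text, places in groups.items() if len(places) > 1
    }
```

Each group in the result is a candidate for a single shared key, so the string is translated once and reused.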

cc'ing @robredpath for info. Rob - let me know what you think we need to run a first test of the translation and reintegration loop. "adminHeader" is the simplest sheet in that extracted text workbook, if useful as an easy example. Otherwise I'll let you know when there's a few sheets ready with de-duplicated and reviewed English text.

@emmajclegg
Collaborator Author

@PG-Momik - to update, I've looked over the remaining green (i.e. nearly finished) sheets in the extracted text spreadsheet and have left a few more comments.

I haven't edited any of the English text yet as it sounds like it makes sense for YI to re-extract an updated version of the spreadsheet before I do that (to save me re-doing the review before @robredpath sends the text for translation).

Happy to discuss any questions tomorrow.

@emmajclegg
Collaborator Author

@PG-Momik - thanks for sharing the latest spreadsheet of extracted text (Extracted Sheet - Jan 17). I assume this was for me to have a look.

Again, I've left a few comments to check where certain text appears in the interface and flag a few areas where it could be simplified.

  • Cells highlighted orange - I'm questioning whether these contain user-facing text, as they look more like log/error messages.
  • Cell highlighted yellow - suspected duplication (I will have left a comment in the sheet)

Happy to discuss more on Wednesday (I'm not around tomorrow, Tues 21st)

8 participants