Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GH-244]: Fixed issue "Status sync job exits if an error happens for a single user" by added additional logging #245

Merged
merged 10 commits into from
Mar 22, 2023

Conversation

Kshitij-Katiyar
Copy link
Contributor

@Kshitij-Katiyar Kshitij-Katiyar commented Jan 30, 2023

Summary

  • Users were getting the error in the server logs is an Error during user status sync job. err=not found during user status sync job. The status sync job was exiting if an error happened for a single user. So, I changed the logic to continue the process and log the error in server logs.

  • There will be at max 5 logs for a particular error/function.

  • Issue #244

… logging (#2)

* [MI-2664]:Fixed issue mattermost#244 on ms calender to add additional logging

* [MI-2664]:Added comment for skipping user

* [MI-2664]:Fixed unit test cases

* [MI-2664]:Fix lint errors

* [MI-2664]:Added conditional logging

* [MI-2664]:Added constants for limit

* [MI-2664]:Fixed review comments

* [MI-2664]:Fixed review comments
@mattermost-build
Copy link

Hello @Kshitij-Katiyar,

Thanks for your pull request! A Core Committer will review your pull request soon. For code contributions, you can learn more about the review process here.

Copy link
Contributor

@mickmister mickmister left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kshitij-Katiyar Thanks for the quick turnaround on this 👍 Mostly looks good, just a few requests

server/mscalendar/availability.go Outdated Show resolved Hide resolved
server/mscalendar/availability.go Outdated Show resolved Hide resolved
server/mscalendar/availability.go Outdated Show resolved Hide resolved
server/mscalendar/availability.go Show resolved Hide resolved
server/mscalendar/availability.go Outdated Show resolved Hide resolved
… of mscalender plugin (#3)

* [MI-2687]:Fixed review comments given by mattermost team on issue mattermost#244 of mscalender plugin

* [MI-2687]:Fixed unit test cases

* [MI-2687]:Fixed review fixes
Copy link
Contributor

@mickmister mickmister left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thanks @Kshitij-Katiyar!

@mickmister mickmister added the 3: QA Review Requires review by a QA tester label Jan 31, 2023
@DHaussermann
Copy link

@Kshitij-Katiyar one suggestion to consider here would be - Once the sync job completes, can it write to the logs how many users where sync and how many errored out? This would provide visibility and confirm that the the job is now attempting to sync the entire user base.

@Kshitij-Katiyar
Copy link
Contributor Author

Kshitij-Katiyar commented Feb 1, 2023

@Kshitij-Katiyar one suggestion to consider here would be - Once the sync job completes, can it write to the logs how many users where sync and how many errored out? This would provide visibility and confirm that the the job is now attempting to sync the entire user base.

sure @DHaussermann @mickmister We can add a summary after the job ends. Can you please also give suggestions of what are the things I should include in that? Some examples could be

  1. Can only show the number of users for each category like status synced-10 users and error-5 users.
  2. Along with the number we can show the mattermost userId for each user. I don't think it would be better to show usernames as during the process we only have user Ids during this process and for each user, I have to call an additional API to fetch the usernames.

Please give your views on this.

@mickmister
Copy link
Contributor

@Kshitij-Katiyar @DHaussermann I think we want to be cognizant of logging too much here. At the moment, we have a debug log before the job runs, and one after the job runs. These logs simply say that the job began and ended. There are some different levels of granularity we can add here:

  • Log how many users processed, and how many had errors
  • Include how many users had their status changed

The individual errors have the user ids in them, so no need to include the user ids in the aggregate log message at the end of the job. I think the first bullet point is appropriate here. If the second one is not a big hassle, then I think we should also include the "number of statuses changed".

@spirosoik
Copy link
Member

/update-branch

@mattermost-build
Copy link

We don't have permissions to update this PR, please contact the submitter to apply the update.

* [MI-2705]:Added summary for status sync job

* [MI-2705]:Fixed review comments

* Fixed lint errors (mattermost#243)

* [MI-2705]:Fixed review comments

* [MI-2705]:Fixed review comments

---------

Co-authored-by: Manoj Malik <manoj@brightscout.com>
@Kshitij-Katiyar
Copy link
Contributor Author

@mickmister @DHaussermann Added the functionality of showing the summary for the status sync job

@hanzei hanzei added this to the 244 milestone Feb 7, 2023
@hanzei hanzei linked an issue Feb 7, 2023 that may be closed by this pull request
@hanzei hanzei removed this from the 244 milestone Feb 7, 2023
Copy link
Collaborator

@hanzei hanzei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of unrelated changes in the PR. Could you please remove them?

server/config/config.go Outdated Show resolved Hide resolved
Comment on lines +80 to +84
if numberOfLogs < logTruncateLimit {
m.Logger.Warnf("Not able to load user %s from user index. err=%v", u.MattermostUserID, err)
} else if numberOfLogs == logTruncateLimit {
m.Logger.Warnf(logTruncateMsg)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the value of posing a truncated message instead of the "real" one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hanzei Please read this comment #244 (comment).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mickmister I don't understand your point in #244 (comment). If the goal is not to flood the logs, why are messages logged that only contain the information that the logs got trunkated? Isn't the number of log messages still the same?

Copy link
Contributor Author

@Kshitij-Katiyar Kshitij-Katiyar Feb 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hanzei @mickmister I think the number of logs will not be the same. We will log the first 4 logs coming from a particular function/code snippet, then for the 5th one we will log the Truncate and after that, we will not be logging anything. So the max number of logs will be 5 in any case. If it is more than that we are truncating means not logging anything for a particular function.

@Kshitij-Katiyar Kshitij-Katiyar requested review from hanzei and removed request for DHaussermann February 8, 2023 06:16
@mattermost-build
Copy link

This PR has been automatically labelled "stale" because it hasn't had recent activity.
A core team member will check in on the status of the PR to help with questions.
Thank you for your contribution!

@mattermost-build
Copy link

This PR has been automatically labelled "stale" because it hasn't had recent activity.
A core team member will check in on the status of the PR to help with questions.
Thank you for your contribution!

@m1lt0n
Copy link

m1lt0n commented Mar 10, 2023

@mickmister @hanzei @DHaussermann what's the state of this related to a) code review, b) QA review? I see it is approved for (a), so is it a matter of finalising the QA review on that?

@codecov-commenter
Copy link

codecov-commenter commented Mar 10, 2023

Codecov Report

Patch coverage: 43.90% and project coverage change: +0.37 🎉

Comparison is base (7c0c066) 23.31% compared to head (c25bd52) 23.68%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #245      +/-   ##
==========================================
+ Coverage   23.31%   23.68%   +0.37%     
==========================================
  Files          62       62              
  Lines        3049     3090      +41     
==========================================
+ Hits          711      732      +21     
- Misses       2259     2277      +18     
- Partials       79       81       +2     
Impacted Files Coverage Δ
server/command/availability.go 0.00% <0.00%> (ø)
server/mscalendar/availability.go 55.66% <45.00%> (-0.66%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@mickmister
Copy link
Contributor

@m1lt0n @DHaussermann Indeed it seems this is just awaiting QA review

@DHaussermann
Copy link

@Kshitij-Katiyar I can't test this. When I compile and deploy this to a cloud server (my local Darwin ARM seems to not be supported) The plugin starts up and then starts crashing until health-check eventually takes it down.

{
  "caller": "app/plugin_api.go:976",
  "level": "error",
  "msg": "Error during daily summary job. err=not found",
  "plugin_id": "com.mattermost.mscalendar",
  "timestamp": "2023-03-14 19:00:29.374 Z"
}

After this error it seems to crash
image

Steps:

  • Installed the plugin on a new cloud server where it's never run before
  • Configure the plugin and enable it
  • Nav to the user site and try running the connect command
    Observed:
    Plugin has crashed.

Can you please let me know if you can reproduce this?

@Kshitij-Katiyar
Copy link
Contributor Author

Kshitij-Katiyar commented Mar 16, 2023

@DHaussermann @mickmister I have fixed this. Actually in the code at one point I was accessing the attributes of a nil object, resulting in panic by mistake. This could only happen in an instance where there was no previous ms-calendar plugin installed. I have added a check now and also tested it on a new instance everything worked fine. Please test this again and let me know if you find any issues.

@DHaussermann
Copy link

Hi @Kshitij-Katiyar
I did some testing on this today.

  • The panic issue is fixed for new installs
  • I can see that the sync service starts and logs an output when it runs

The piece I'm struggling to test here is to see and error occur on 1 of the records and the job will still complete. from your summary "`err=not found during user status sync job..." How can I reproduce such an error?

I have this running locally with 2 users and have tried manually updating several values in KV store to invalidate them to try and cause such an error so I can see the sync complete. So far this has not worked. Updating oauth2_ values or user_ to try and invalidate them does cause the user to loose their connection but it still does not cause the job to error out on any of the records.

Are you able to offer any suggestions here?

@Kshitij-Katiyar
Copy link
Contributor Author

Kshitij-Katiyar commented Mar 21, 2023

Hi @Kshitij-Katiyar I did some testing on this today.

  • The panic issue is fixed for new installs
  • I can see that the sync service starts and logs an output when it runs

The piece I'm struggling to test here is to see and error occur on 1 of the records and the job will still complete. from your summary "`err=not found during user status sync job..." How can I reproduce such an error?

I have this running locally with 2 users and have tried manually updating several values in KV store to invalidate them to try and cause such an error so I can see the sync complete. So far this has not worked. Updating oauth2_ values or user_ to try and invalidate them does cause the user to loose their connection but it still does not cause the job to error out on any of the records.

Are you able to offer any suggestions here?

Hi @DHaussermann , I have done the changes for the part that even if we get an error from 1 user the job will still continue for other users. If you want to get or reproduce the above error user not found I can tell you what I have tried -

  1. Connect the user using the slash command.
  2. You can delete the key value for that user from the KV store either by going inside DB (if you can and I am using postgres) and run DELETE FROM pluginkeyvaluestore where pluginid ='com.mattermost.mscalendar' AND pkey = 'user_<key for your user>'; or using MM APIs to delete this KV pair.

Screenshot from 2023-03-21 13-05-07

{"timestamp":"2023-03-21 13:10:01.485 +05:30","level":"warn","msg":"Not able to load user 3p41xahsffbh9kqk1jcx9j1tpw from user index. err=not found","caller":"app/plugin_api.go:943","plugin_id":"com.mattermost.mscalendar"}

@DHaussermann
Copy link

Thanks @Kshitij-Katiyar taking another look and using your suggestion of deleting the records from the store - I can see the user not found errors

warn  [2023-03-21 10:10:00.389 -04:00] Not able to load user 6sbsdz3rgprs3f8s64nxft8s5a from user index. err=not found caller="app/plugin_api.go:979" plugin_id=com.mattermost.mscalendar
warn  [2023-03-21 10:10:00.399 -04:00] Not able to load user q6cs674637nguebthxp5r8m1rc from user index. err=not found caller="app/plugin_api.go:979" plugin_id=com.mattermost.mscalendar
debug [2023-03-21 10:10:00.399 -04:00] User status sync job finished.
Summary
Number of users processed:- 2
Number of users had their status changed:- 0
Number of users had errors:- 2 caller="app/plugin_api.go:970" plugin_id=com.mattermost.mscalendar

The issue yesterday was that Number of users had errors:- would continue showing as 0. I'm now not certain if Not able to load user <user ID> from user index was happening or not but even if it was, it would always show the error count as 0. Maybe your latest change may have affected this.

Also @mickmister does this last commit require code review?

@Kshitij-Katiyar
Copy link
Contributor Author

Thanks @Kshitij-Katiyar taking another look and using your suggestion of deleting the records from the store - I can see the user not found errors

warn  [2023-03-21 10:10:00.389 -04:00] Not able to load user 6sbsdz3rgprs3f8s64nxft8s5a from user index. err=not found caller="app/plugin_api.go:979" plugin_id=com.mattermost.mscalendar
warn  [2023-03-21 10:10:00.399 -04:00] Not able to load user q6cs674637nguebthxp5r8m1rc from user index. err=not found caller="app/plugin_api.go:979" plugin_id=com.mattermost.mscalendar
debug [2023-03-21 10:10:00.399 -04:00] User status sync job finished.
Summary
Number of users processed:- 2
Number of users had their status changed:- 0
Number of users had errors:- 2 caller="app/plugin_api.go:970" plugin_id=com.mattermost.mscalendar

The issue yesterday was that Number of users had errors:- would continue showing as 0. I'm now not certain if Not able to load user <user ID> from user index was happening or not but even if it was, it would always show the error count as 0. Maybe your latest change may have affected this.

Also @mickmister does this last commit require code review?

@DHaussermann Ok yes I have fixed it. It was a very small change.

@DHaussermann DHaussermann requested a review from hanzei March 21, 2023 19:24
err := m.Filter(withSuperuserClient)
if err != nil {
return "", err
return "", &StatusSyncJobSummary{}, errors.Wrap(err, "not able to filter the super user client")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit pick: In general, functions that return a non-nil error should have the other fields set to nil.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hanzei Actually here I am not returning nil because, in its parent functions, I am using an attribute of this object, and if I return nil from here that could give panic.

Copy link

@DHaussermann DHaussermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested and passed

  • When the sync fails for one or more users the job still completes
  • warn records are now shown is the server log for failed users
  • A count of processed and failed users is shown in the server logs after the job runs
  • Retested after last commit for any functional changes
    LGTM!

@DHaussermann DHaussermann added 4: Reviews Complete All reviewers have approved the pull request and removed 3: QA Review Requires review by a QA tester labels Mar 22, 2023
@DHaussermann
Copy link

@mickmister or @hanzei please merge when you have a chance.
We also need a v1.1.1 versions bump PR.

@mickmister
Copy link
Contributor

/update-branch

@mattermost-build
Copy link

We don't have permissions to update this PR, please contact the submitter to apply the update.

@Kshitij-Katiyar
Copy link
Contributor Author

/update-branch

@mickmister @hanzei Synced with master

@mattermost-build
Copy link

We don't have permissions to update this PR, please contact the submitter to apply the update.

@mickmister mickmister merged commit 347546a into mattermost:master Mar 22, 2023
@mickmister mickmister mentioned this pull request Mar 22, 2023
@hanzei hanzei mentioned this pull request Mar 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4: Reviews Complete All reviewers have approved the pull request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Status sync job exits if an error happens for a single user
8 participants