A Glean Administrator at one of our customers has notified us that the search results users are getting for Atlassian content appear to be stale. Glean Admins are our primary contacts at the customer and are responsible for making sure Glean is available to the end users at their company. Prepare a document and walk us through how you would go about resolving this issue. It is OK to speculate on capabilities that exist in Glean or Atlassian so that we can understand your thinking and logic. Include the following:
Steps you would take to validate the issue
Technical assumptions you have about Glean and Atlassian products to help you troubleshoot
Your assessment of what the problem actually is and where in Glean or Atlassian (or both) you think the issue resides
Logical steps you would take to resolve the issue
How and when you would communicate with the Glean Admin about the issue and what type of messages you would send to them along the way
Any steps you would take after the issue is resolved besides the customer communication, e.g. continuous improvement
Expect that we may ask questions and that it is ok if you ask clarifying questions of the panel along the way.
What I think the issue is: Based on the architecture flow I was given, the configuration for the Atlassian endpoint may be broken, either because a permission changed or because the configuration is no longer set up correctly.
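A rough sketch of how I would turn that hypothesis into a first check: make one authenticated request to the Atlassian endpoint with the connector's credentials and read the HTTP status. The mapping below relies only on standard HTTP status semantics; the `LikelyCause` names and the `classify_status` helper are hypothetical and not part of Glean or Atlassian.

```python
from enum import Enum


class LikelyCause(Enum):
    """Hypothetical labels for where the stale-data problem probably lives."""
    HEALTHY = "connector can reach Atlassian"
    BAD_CREDENTIALS = "token or secret is invalid or expired"
    PERMISSION_CHANGED = "credentials are valid but access was removed"
    BAD_ENDPOINT = "base URL or path is misconfigured"
    RATE_LIMITED = "Atlassian is throttling the connector"
    ATLASSIAN_SIDE = "Atlassian returned a server error"
    UNKNOWN = "needs manual investigation"


def classify_status(status_code: int) -> LikelyCause:
    """Map the HTTP status of a test request to a likely cause."""
    if 200 <= status_code < 300:
        return LikelyCause.HEALTHY
    if status_code == 401:  # Unauthorized: credentials were rejected
        return LikelyCause.BAD_CREDENTIALS
    if status_code == 403:  # Forbidden: authenticated but not allowed
        return LikelyCause.PERMISSION_CHANGED
    if status_code == 404:  # Not Found: wrong site URL or path
        return LikelyCause.BAD_ENDPOINT
    if status_code == 429:  # Too Many Requests
        return LikelyCause.RATE_LIMITED
    if 500 <= status_code < 600:
        return LikelyCause.ATLASSIAN_SIDE
    return LikelyCause.UNKNOWN
```

A 401 or 403 would support the changed-permission theory, a 404 would point at the configuration itself, and a 2xx would suggest looking inside Glean instead.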
Steps I would take to validate the issue:
Understanding the Problem
Go into their Glean console and see what they mean when they say the results are stale.
“Hi, can you describe to me what’s going wrong?”
And, if they want to share it, what they think is going on.
“Hi, can you show me the steps you took to see the stale data?”
Write out the steps (for future reference)
Customer opened up ....
Screen doesn’t appear to be showing....
“What should be appearing?”
If you log out of Glean and back in, does the issue resolve itself?
Every time you use Glean it needs your SSO session, so a bad cached cookie could be keeping results from updating.
Keep note of the answers and their timing so you can check whether the timeline matches up when you query your logs, or use it as a filter.
How many people is this affecting?
How much does it impact their ability to do their job?
How important is this customer?
If we need to escalate here is a script:
Hi, we have a major problem: a lot of customers (or a high-profile customer) are not able to perform work because they’re getting stale Glean data. I’ve made a ticket with all of the information I’ve gathered so far, and I will continue to troubleshoot and update the ticket. In the meantime, it would be helpful if we could work together or in parallel to solve this issue quickly.
Reproducing the problem with a test dummy Confluence doc
Have them create a Confluence doc so you can validate different flows within Glean (their user permission)
Create a Confluence doc titled “Test file to see Stale Atlassian Data - user permission”
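To make "stale" measurable for the test doc, a minimal sketch: compare when the doc was last edited in Confluence with when Glean last indexed it. Every name here is hypothetical, and `expected_sync` is an assumed value that would have to come from however often the connector is actually configured to crawl.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def staleness(source_modified: datetime,
              indexed_at: Optional[datetime],
              now: datetime) -> timedelta:
    """How long the latest edit has been waiting to show up in search."""
    if indexed_at is not None and indexed_at >= source_modified:
        return timedelta(0)  # the index already contains the latest edit
    return now - source_modified  # the edit is still missing from the index


def is_stale(source_modified: datetime,
             indexed_at: Optional[datetime],
             now: datetime,
             expected_sync: timedelta) -> bool:
    """True when the edit has waited longer than the assumed sync window."""
    return staleness(source_modified, indexed_at, now) > expected_sync


# Example: doc edited at 10:00 UTC, last indexed at 09:00, checked at 13:00.
edited = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
indexed = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
checked = datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc)
lag = staleness(edited, indexed, checked)
```

Recording this lag in the ticket gives Engineering a number to work with rather than a description.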
Check with Engineering to see if anything changed recently (database- or product-wise), using the date that you found
Ask in Slack
Look at GitHub commits
Create a Ticket
Let the customer know what you have done so far
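The timeline collected from the admin during validation can be used as a log filter. A minimal sketch, assuming log entries have already been pulled into memory; the `LogEntry` shape and the one-hour default margin are hypothetical and not a real Glean log format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical log record; real connector logs will look different."""
    timestamp: datetime
    level: str
    message: str


def entries_near(entries: Iterable[LogEntry],
                 reported_at: datetime,
                 margin: timedelta = timedelta(hours=1)) -> List[LogEntry]:
    """Keep entries within `margin` on either side of the reported time."""
    start, end = reported_at - margin, reported_at + margin
    return [e for e in entries if start <= e.timestamp <= end]


def errors_near(entries: Iterable[LogEntry],
                reported_at: datetime,
                margin: timedelta = timedelta(hours=1)) -> List[LogEntry]:
    """Same window, restricted to ERROR-level entries."""
    return [e for e in entries_near(entries, reported_at, margin)
            if e.level == "ERROR"]
```

If the errors in that window line up with a commit or config change found in Slack or GitHub, that goes into the ticket and into the update to the customer.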
Communicating the Problem:
Classify the Problem:
Issue with the Config/secrets:
Immediate remediation - go in and fix it asap
Create a ticket
This is so we can track whether issues like this keep happening
If they do, maybe we need to update our documentation or fix a flow in the code so that this human error can’t happen
Let the customer know you have fixed it
Ask them to try again and see if the data is still stale
Issue with Glean:
Triage the severity of this issue:
How important is the customer?
How many customers will be affected?
How much will this impact the customer? Can most of Glean’s functionality still be used?
Let engineering know. Submit a ticket.
If it’s a bigger issue, figure out who can email all of our customers to let them know about the issue and that we’re working to fix it.
Let the Customer know the fix and how long it might take
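The triage questions above could be written down as a small rule so that everyone classifies the issue the same way. This is only a sketch: the `SEV1`/`SEV2`/`SEV3` labels and the thresholds are my assumptions, not an existing Glean severity policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    """Answers to the three triage questions (hypothetical structure)."""
    customers_affected: int
    high_profile_customer: bool
    core_functionality_usable: bool


def triage(impact: Impact) -> str:
    """Return an assumed severity label for the issue."""
    widespread = impact.customers_affected > 1
    broad_or_key = widespread or impact.high_profile_customer
    if not impact.core_functionality_usable and broad_or_key:
        return "SEV1"  # work is blocked for many customers or a key one
    if not impact.core_functionality_usable or broad_or_key:
        return "SEV2"  # either blocking or broad, but not both
    return "SEV3"  # a single customer who can still mostly use Glean
```

The label would then decide how loudly to escalate and how often to update the customer.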
Fixing the Problem:
If the security team thinks that the issue is exploitable, it’ll be fixed and rolled out immediately (within a day)
Any relevant critical-severity issues found in GitHub vulnerability scans or container registry scans are fixed within 1 week of first detection. Detection typically happens within a few hours for both GitHub and Google Container Registry.
If it’s an issue from the GCP asset scanner or an Elastic CVE, it’ll be fixed within 5 weeks.
This is because the Elastic cluster is only reachable from the private VPC
If this is too long for the user and the mistake is on our side, we can try to roll back to a previous version artifact so that they can use Glean again
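The fix windows above can be turned into dates to quote to the customer. A minimal sketch: the `Finding` names and helper functions are hypothetical, and for simplicity every window is counted from the detection time, including the one-day window for exploitable issues.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Finding(Enum):
    """Hypothetical names for the three cases listed above."""
    EXPLOITABLE = "exploitable"
    CRITICAL_SCAN = "critical GitHub or container registry scan finding"
    GCP_ASSET_OR_ELASTIC_CVE = "GCP asset scanner finding or Elastic CVE"


# Windows taken from the notes above: 1 day, 1 week, 5 weeks.
FIX_WINDOW = {
    Finding.EXPLOITABLE: timedelta(days=1),
    Finding.CRITICAL_SCAN: timedelta(weeks=1),
    Finding.GCP_ASSET_OR_ELASTIC_CVE: timedelta(weeks=5),
}


def fix_deadline(finding: Finding, detected_at: datetime) -> datetime:
    """Latest time the fix should be rolled out."""
    return detected_at + FIX_WINDOW[finding]


def should_consider_rollback(finding: Finding,
                             detected_at: datetime,
                             customer_needs_by: datetime) -> bool:
    """True when the fix window ends after the customer needs Glean back."""
    return fix_deadline(finding, detected_at) > customer_needs_by
```

When `should_consider_rollback` is true and the mistake is on our side, that is the cue to raise the rollback option with Engineering.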
Next Steps after Resolution
Update test plan
If it is an issue on our side, update our QA testing docs to include a check for the issue found.