A Glean Administrator at one of our customers has notified us that the search results users are getting for Atlassian content appear to be stale. Glean Admins are our primary contacts at the customer and are in charge of ensuring the Glean system is provided to the end users at their company. Prepare a document and walk us through how you would go about resolving this issue. It is OK to speculate on capabilities that exist in Glean or Atlassian so that we can understand your thinking and logic. Include the following:
Steps you would take to validate the issue
Technical assumptions you have about Glean and Atlassian products to help you troubleshoot
Your assessment of what the problem actually is and where in Glean or Atlassian (or both) you think the issue resides
Logical steps you would take to resolve the issue
How and when you would communicate with the Glean Admin about the issue and what type of messages you would send to them along the way
Any steps you would take after the issue is resolved besides the customer communication, e.g. continuous improvement
Expect that we may ask questions, and that it is OK if you ask clarifying questions of the panel along the way.
What I think the issue is:
Based on the architecture flow I was given, the most likely issue is that the configuration for the Atlassian endpoint is broken, either because a permission changed or because the configuration is no longer set up correctly.
Steps I would take to validate the issue:
Understanding the Problem
Go into their Glean console and see what they mean when they say that things are stale.
“Hi, can you describe to me what’s going wrong?” And, if they want to share it, what they think is going on.
“Hi, can you show me the steps you took to see the stale data?”
  Write out the steps (for future reference).
  Screen doesn’t appear to be showing.... “What should be appearing?”
  If you log out of Glean and back in, does the issue resolve itself? Every time you use Glean it needs your SSO, so a bad cached cookie might have kept things from updating.
  If this is the problem, jump to
“What does the error say?”
  If there is an error message about query blacklisting, check that the issue isn’t with their configured Cloud storage regex matching.
“Does this only happen to you, or is this happening company-wide? How much is this affecting your/people’s ability to do their jobs?”
  If it’s just them, jump to the section
  Check if you can search within your own Glean and see the same issue. If so, jump to
Does this only happen to Atlassian products? Is it all of them? (Test using Confluence, Jira, Trello, etc.)
  If it’s happening company-wide and only to specific Atlassian products, jump to the section
Have there been any outages with or ?
  If this is the problem, jump to
“How long has this been happening?”
  Keep note of this so you can check whether the timeline matches up when you query your logs, or use it as a filter.
Evaluate escalation
How many people is this affecting?
How much does it impact their ability to do their job?
How important is this customer?
If we need to escalate, here is a script:
“Hi, we have a major problem: a lot of customers/a high-profile customer cannot do their work because they’re getting stale Glean data. I’ve made a ticket with all of the information I’ve gathered so far, and I will keep troubleshooting and updating the ticket. In the meantime, it would help if we could work together or in parallel to solve this quickly.”
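To keep the escalation decision consistent between cases, the three questions above could be written down as a small rule. This is only a sketch of how I would think about it: the `Impact` fields, the `should_escalate` name, and the threshold of 50 users are my own assumptions, not an existing Glean policy.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    users_affected: int   # how many people are seeing stale results
    blocks_work: bool     # True if people cannot do their jobs
    high_profile: bool    # True if this is a high-profile customer


def should_escalate(impact: Impact, user_threshold: int = 50) -> bool:
    """Hypothetical rule: escalate when work is blocked AND either many
    users are affected or the customer is high profile."""
    widespread = impact.users_affected >= user_threshold
    return impact.blocks_work and (widespread or impact.high_profile)


if __name__ == "__main__":
    # One blocked user at a high-profile customer still triggers escalation.
    print(should_escalate(Impact(users_affected=1, blocks_work=True, high_profile=True)))
```

The threshold is a parameter so it can be tuned once we know what "a lot of people" means for a given customer.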
Reproducing the problem with a test dummy Confluence doc
Have them create a Confluence doc so you can validate different flows within Glean (their user permission).
  Create a Confluence doc titled “Test file to see Stale Atlassian Data - user permission”
Validate the secrets/configs are set up correctly
  Follow the instructions in
Test the Data Ingestion flow
  Follow the instructions in
Validate the Data processing pipeline
  Follow the instructions in
Have the user query for the doc
  Follow the instructions in
Check with Engineering to see if anything was changed recently (DB- or product-wise), using the date that you found
Let the customer know what you have done so far
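To turn "is it stale?" into a number, I would measure how long the test doc takes to show up in search. The sketch below assumes a `search` callable that takes a query string and returns the titles of matching results. That callable is a placeholder for however search is actually run (by hand in the UI or through whatever API is available); it is not a real Glean client. The timeout and polling interval are also arbitrary choices.

```python
import time
from typing import Callable, Iterable, Optional

TEST_TITLE = "Test file to see Stale Atlassian Data - user permission"


def wait_until_indexed(
    search: Callable[[str], Iterable[str]],
    title: str = TEST_TITLE,
    timeout_s: float = 600.0,
    interval_s: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll `search` until `title` appears in the results.

    Returns the elapsed seconds when the doc is found, or None if the
    timeout is reached first (which suggests ingestion is stuck).
    """
    start = clock()
    while True:
        if title in search(title):
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

A return value of `None`, or a lag far above what the customer normally sees, would point toward the ingestion or processing steps rather than the user's session.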
Communicating the Problem:
Classify the Problem:
Issue with the config/secrets:
  Immediate remediation - go in and fix it ASAP.
  Log the issue so we can track whether issues like this keep happening.
    If they do: maybe we need to update our documentation or fix a flow in the code so that this human error can’t happen.
  Let the customer know you have fixed it.
  Ask them to try again and see if the data is still stale.
Triage the severity of this issue:
  How important is the customer?
  How many customers will be affected?
  How much will this impact the customer? Is most of Glean’s functionality still available?
Let engineering know. Submit a ticket. If it’s a bigger issue, figure out who can email all of our customers to let them know that there is an issue and that we’re working to fix it.
Let the customer know the fix and how long it might take.
Fixing the Problem:
Timeline:
If the security team thinks that the issue is exploitable, it’ll be fixed and rolled out immediately (within a day).
Any relevant critical-severity issues found in GitHub vulnerability scans or container registry scans are fixed within 1 week of when they were first detected. Detection typically happens within a few hours for both GitHub and Google Container Registry.
If it’s an issue with the GCP assets scanner and Elastic CVEs, it’ll be fixed within 5 weeks. This is because the Elastic cluster is only reachable from the private VPC.
If this is too long for the user and the mistake is on our side, we can try to roll back to a previous version artifact so that they can use Glean again.
Next Steps after Resolution
If it is an issue on our side, update our QA testing docs to include a check for the issue found.
If it’s a permissions thing, update
Update any relevant documentation.
If the issue was because of a secret/config change, see if we need to set up more alerts to notify us when configs/secrets are changed.
Once the solution is rolled out, follow up with the customer to see if it is fixed on their end too.
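One way the alert on config changes could work is to keep a fingerprint of the last known-good connector configuration and compare it on a schedule. The sketch below is my own illustration under that assumption: the function names and the example keys (`base_url`, `scopes`) are hypothetical and do not describe how Glean stores its Atlassian configuration. Secret values themselves should not be passed in; a key ID or rotation timestamp is enough to detect a change.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Hash a canonical JSON form of the config so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def config_changed(previous_fingerprint: str, current_config: dict) -> bool:
    """True if the current config no longer matches the stored fingerprint."""
    return config_fingerprint(current_config) != previous_fingerprint


if __name__ == "__main__":
    # Hypothetical connector settings, for illustration only.
    known_good = {"base_url": "https://example.atlassian.net", "scopes": ["read"]}
    stored = config_fingerprint(known_good)

    edited = {"base_url": "https://example.atlassian.net", "scopes": []}
    if config_changed(stored, edited):
        print("Connector config changed; notify the on-call channel.")
```

Comparing fingerprints rather than raw values means the alerting job never has to read or log the configuration contents.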