Glean Collaboration Exercise

Scenario:

A Glean Administrator at one of our customers has notified us that the search results that users are getting for Atlassian content appear to be stale. Glean Admins are our primary contacts at the customer and are in charge of ensuring the Glean system is provided to the end users at their company. Prepare a document and walk us through how you would go about resolving this issue. It is OK to speculate on capabilities that exist in Glean or Atlassian so that we can understand your thinking and logic. Include the following:

Steps you would take to validate the issue

Technical assumptions you have about Glean and Atlassian products to help you troubleshoot

Your assessment of what the problem actually is and where in Glean or Atlassian (or both) you think the issue resides

Logical steps you would take to resolve the issue

How and when you would communicate with the Glean Admin about the issue and what type of messages you would send to them along the way

Any steps you would take after the issue is resolved besides the customer communication, e.g. continuous improvement

Expect that we may ask questions and that it is ok if you ask clarifying questions of the panel along the way.

What I think the issue is: Based on the architecture flow I was given, the most likely cause is a broken configuration for the Atlassian endpoint, either because a permission changed or because the configuration is no longer set up correctly.

Steps I would take to validate the issue:

Understanding the Problem

Go into their Glean console and see what they mean when they say that things are stale.

“Hi, can you describe to me what’s going wrong?”

If they are willing, also ask what they think is going on.

“Hi, can you show me the steps you took to see the stale data?”

Write out the steps (for future reference)

Customer opened up ....

Screen doesn’t appear to be showing....

“What should be appearing?”

If you log out of Glean and back in, does the issue resolve itself?

Reasoning: Glean relies on your SSO session every time you use it, so a bad cached cookie could be preventing results from updating.

If this is the problem, jump to Communicating the Problem.

⁠

“Does an error pop up?”

“What does the error say?”

⁠

If there is an error message about query blacklisting, check that the issue isn’t with their configured Cloud storage regex matching. Check here to verify.

“Does this only happen to you or is this happening company wide? How much is this affecting you/people’s ability to do their jobs?”

⁠

If it’s just them, jump to the Checking Configurations section.

Check whether you can search within your own Glean and see the same issue. If so, jump to Testing the Data ingestion flow.

“Does this only happen to Atlassian products? Is it all of them?” (Test using Confluence, Jira, Trello, etc.)

If it’s happening company-wide and only to specific Atlassian products, jump to the Checking Configurations section.

Have there been any outages with Atlassian or GCP? If this is the problem, jump to Communicating the Problem.

⁠

“How long has this been happening?”

⁠

Keep note of this so that when you query your logs you can check whether the timeline matches up, or use it as a filter.
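The jump rules in the questions above can be collected into one small routing function. This is only a sketch in Python: the field names, the order in which the checks are applied, and the fallback of continuing on to Evaluate escalation are assumptions of this example, not part of any Glean tooling.

```python
from dataclasses import dataclass


@dataclass
class Answers:
    """Answers gathered from the customer; all field names are hypothetical."""
    relogin_fixes_it: bool = False              # logging out and back in resolved it
    vendor_outage: bool = False                 # Atlassian or GCP reported an outage
    only_reporter_affected: bool = False        # nobody else at the company sees it
    reproducible_in_own_glean: bool = False     # we see the same issue in our own Glean
    only_some_atlassian_products: bool = False  # e.g. Confluence stale, Jira fine


def next_section(a: Answers) -> str:
    """Map the answers to the section of this document to jump to next."""
    if a.relogin_fixes_it or a.vendor_outage:
        return "Communicating the Problem"
    if a.only_reporter_affected:
        return "Checking Configurations"
    if a.reproducible_in_own_glean:
        return "Testing the Data ingestion flow"
    if a.only_some_atlassian_products:
        # Reached only when the issue is company-wide (checked above).
        return "Checking Configurations"
    # Assumed fallback: nothing matched, so keep working down the checklist.
    return "Evaluate escalation"


print(next_section(Answers(only_some_atlassian_products=True)))
```

Writing the rules down this way also makes it obvious when two answers conflict, which is a good prompt to ask the customer a follow-up question.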

Evaluate escalation

How many people is this affecting?

How much does it impact their ability to do their jobs?

How important is this customer?

If we need to escalate here is a script:

Hi, we have a major problem: a lot of customers/a high-profile customer cannot work because they are getting stale Glean data. I’ve made a ticket with all of the information I’ve gathered so far, and I will keep troubleshooting and updating it. In the meantime, it would help if we could work together or in parallel to solve this quickly.

Reproducing the problem with a test dummy Confluence doc

Have them create a Confluence doc so you can validate different flows within Glean (their user permission)

Create a Confluence doc titled “Test file to see Stale Atlassian Data - user permission”

Publish the doc

Validate the secrets/configs are set up correctly

Follow the instructions in this doc.

Test the Data Ingestion flow

Follow the instructions in this doc.

Validate the Data processing pipeline

Follow the instructions in this doc.

Have the user query for the doc

Validate the Query

Follow the instructions in this doc.
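The checks above follow the order in which the test doc moves through the system, so the first check that fails points at where to look. A minimal sketch, assuming each check has been run by hand and recorded as pass/fail; the stage labels and the function name are hypothetical:

```python
# Stage labels mirror the steps above; they are illustrative, not Glean identifiers.
STAGES = ["secrets/configs", "data ingestion", "data processing", "query"]


def first_failing_stage(results):
    """Return the earliest stage that did not pass, or None if all passed.

    A stage missing from `results` counts as not yet verified, so it is returned.
    """
    for stage in STAGES:
        if not results.get(stage, False):
            return stage
    return None


checks = {
    "secrets/configs": True,
    "data ingestion": True,
    "data processing": False,
    "query": False,
}
print(first_failing_stage(checks))  # data processing
```

A failure in a later stage is expected once an earlier one fails, so only the earliest failure is reported.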

⁠

Consult with Engineering

Check with Engineering to see whether anything changed recently (database- or product-wise), using the date that you found

Ask in Slack

Look at GitHub commits
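To narrow the conversation with Engineering, the date the customer gave can be used to shortlist changes that landed shortly before the problem started. The sketch below assumes the changes have already been collected by hand (from Slack or commit history) into a list of records; the record shape, the three-day lookback, and the sample entries are all made up for illustration.

```python
from datetime import datetime, timedelta


def changes_before_onset(changes, onset, lookback=timedelta(days=3)):
    """Return changes made in the window [onset - lookback, onset], newest first."""
    window_start = onset - lookback
    hits = [c for c in changes if window_start <= c["when"] <= onset]
    return sorted(hits, key=lambda c: c["when"], reverse=True)


# Placeholder data, not real changes.
onset = datetime(2024, 5, 10, 9, 0)
changes = [
    {"when": datetime(2024, 5, 1, 12, 0), "summary": "example change A"},
    {"when": datetime(2024, 5, 9, 17, 30), "summary": "example change B"},
    {"when": datetime(2024, 5, 11, 8, 0), "summary": "example change C"},
]
for change in changes_before_onset(changes, onset):
    print(change["when"].isoformat(), change["summary"])
```

Widening `lookback` is reasonable when the customer is unsure of the exact start date.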

Create a Ticket

Let the customer know what you have done so far

Communicating the Problem:

Classify the Problem:

Issue with the Config/secrets:

Immediate remediation: go in and fix it ASAP

Create a ticket

This is so we can track whether issues like this keep happening

If they do, we may need to update our documentation or fix a flow in the code so that this human error can’t recur

Let the customer know you have fixed it

Ask them to try again and see if the data is still stale

Issue with Glean:

Triage the severity of this issue:

How important is the customer?

How many customers will be affected?

How much will this impact the customer? Is most of Glean’s functionality still available?

Let engineering know. Submit a ticket.

If it’s a bigger issue, find out who can email all of our customers to let them know about the issue and that we’re working on a fix.

Let the customer know the fix and how long it might take

Fixing the Problem:

Timeline:

If the security team thinks that the issue is exploitable, it’ll be fixed and rolled out immediately (within a day)

Any relevant critical-severity issues found in GitHub vulnerability scans or container registry scans are fixed within 1 week of when they were first detected. Detection is typically within a few hours for both GitHub and Google Container Registry.

If it’s an issue from the GCP assets scanner and Elastic CVEs, it’ll be fixed within 5 weeks.

The longer window applies because the Elastic cluster is only reachable from the private VPC

If this is too long for the user and the mistake is on our side, we can try to roll back to a previous version artifact so that they can use Glean again
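When telling the customer how long a fix might take, the windows above can be turned into a concrete latest-by date. A minimal sketch: the category keys and function name are hypothetical labels for this example, and only the durations come from the timeline above.

```python
from datetime import datetime, timedelta

# Durations taken from the timeline above; the keys are illustrative names.
FIX_WINDOWS = {
    "exploitable": timedelta(days=1),
    "critical_scan_finding": timedelta(weeks=1),
    "gcp_assets_scanner_or_elastic_cve": timedelta(weeks=5),
}


def fix_deadline(category, detected_at):
    """Return the latest date by which a fix is expected for this category."""
    return detected_at + FIX_WINDOWS[category]


detected = datetime(2024, 1, 1)
print(fix_deadline("critical_scan_finding", detected).date())  # 2024-01-08
```

An unknown category raises `KeyError` on purpose, so an unclassified issue is not silently given a deadline.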