Glean architecture

Data flow

Query flow

Users perform searches on a global web app hosted on Glean’s cloud, which also hosts the client code. It has no complicated logic; it simply serves the static files that together form the “client code”.
Once they go there, the client code is loaded in the browser. But since the user isn’t logged in (the client checks local storage and notices there’s no state), their identity can’t be authenticated, so they’re required to log in first.
The client then begins the “login flow”. The user types in their email, which is used to ask the global web app what real server should be used for that email. The global web app responds with something like “https://customerdomain-be.glean.com/”. That URL is an alias for the QE (query endpoint), which runs in the project that hosts all of the customer’s data. That project is managed by the customer’s IT team and sits inside the customer’s firewall.
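A minimal sketch of this lookup step, assuming a hypothetical backend-for-email endpoint on the global web app (the host, path, and response shape are illustrative, not Glean’s actual API):

```python
import requests

def lookup_backend(email: str) -> str:
    """Ask the global web app which customer backend (QE) serves this email."""
    resp = requests.post(
        "https://app.glean.com/api/backend-for-email",  # assumed endpoint
        json={"email": email},
        timeout=10,
    )
    resp.raise_for_status()
    # e.g. {"backend": "https://customerdomain-be.glean.com/"}
    return resp.json()["backend"]
```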
Afterwards, everything the user does on the client goes directly to that URL (the project’s QE); the global web app is no longer involved. The QE authenticates the user using enterprise SSO and serves the user’s queries.
The queries and search results are transmitted over HTTPS (after SSO authentication) between the user’s browser and the QE server running inside the customer’s GCP project, in their own cloud account.

Data ingestion flow

For every enterprise data source that’s connected to Glean, content and identity connectors run in the cloud project and fetch data and the permissions map from that source.
A webhook is a mechanism for sending real-time notifications from one application to another, enabling real-time data synchronization between two different systems. For example, if an e-commerce website wants to receive real-time notifications when a new order is placed (so the order can be processed as quickly as possible), it would set up a webhook that sends an HTTP POST request to its server whenever a new order comes in; the server then processes the request and updates the system accordingly (see the sketch after this list). Purposes include:
integrating different applications or services
automating workflows
triggering events or actions based on specific conditions or data changes
real-time data synchronization
providing notifications to users or administrators
mostly it provides real-time communication and data exchange between different systems or applications to streamline workflows, increase efficiency, and enhance the user experience
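A minimal sketch of the e-commerce webhook receiver from the example above, using Flask (the route, payload fields, and process_order helper are hypothetical):

```python
from flask import Flask, request

app = Flask(__name__)

def process_order(order_id: str, items: list) -> None:
    print(f"processing order {order_id} with {len(items)} items")  # placeholder

@app.post("/webhooks/new-order")
def new_order():
    # The store sends an HTTP POST here the moment an order is placed,
    # instead of this server polling the store for new orders.
    order = request.get_json()
    process_order(order["order_id"], order["items"])
    return "", 204
```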
The connectors run periodically and in response to webhook events.
They store the fetched information in Glean’s document and identity store.
Newly fetched content from the dataflow pipeline is stored in the secure search index.
Code running inside the GCP project fetches content from the enterprise applications over HTTPS, either over the public web (if the app is hosted on the internet, e.g. Google Drive) or over a private internal connection (if the app is hosted inside the customer’s network, e.g. Jira).
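A hedged sketch of what a content connector’s fetch loop could look like, assuming a generic REST datasource with a change feed (the endpoint, field names, and store interface are illustrative, not Glean’s actual connector code):

```python
import time
import requests

def run_connector(base_url: str, token: str, store) -> None:
    """Periodically fetch changed documents plus their permissions over HTTPS."""
    cursor = store.load_cursor()  # resume from the last sync point
    while True:
        resp = requests.get(
            f"{base_url}/api/changes",
            params={"since": cursor},
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        resp.raise_for_status()
        payload = resp.json()
        for doc in payload["documents"]:
            # Each document is stored together with its ACL so permissions
            # can be enforced later at query time.
            store.save_document(doc["id"], doc["content"], doc["acl"])
        cursor = payload["next_cursor"]
        store.save_cursor(cursor)
        time.sleep(300)  # a webhook crawl hint can also wake the loop early
```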

Data Processing pipelines

Once the data is fetched it is further processed within the GCP project.
All data processing happens using Google Dataflow pipelines. The data never leaves the project.
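Dataflow pipelines are written with Apache Beam; a minimal sketch of a processing step in that style (bucket names and transforms are placeholders, not Glean’s actual pipeline):

```python
import json
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "ReadDocs" >> beam.io.ReadFromText("gs://<projectid>-raw-docs/*.json")
        | "ParseJson" >> beam.Map(json.loads)
        | "ExtractText" >> beam.Map(lambda d: {"id": d["id"], "text": d["body"]})
        | "ToJson" >> beam.Map(json.dumps)
        | "WriteIndexInput" >> beam.io.WriteToText("gs://<projectid>-index-input/docs")
    )
```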

Glean SDLC process

The deployment pipeline only deploys builds that have been validated and signed by the central trusted service
It’s highly unlikely this is caused by an error in the code, because the builds are tested for vulnerabilities. To detect vulnerabilities Glean uses GitHub vulnerability scanning, GCP Security Command Center’s Web Security Scanner, the GCP asset scanner, and the Container Registry vulnerability scanner.
The central, locked-down build service periodically reads code from trusted branches, builds the relevant Docker containers, and signs them using Binary Authorization. Because the service is locked down, only release engineers can trigger a build or modify the pipeline. The service also has authorization policies that only allow engineers to trigger release builds and deploys.
The central deployment workflow only has the capability to invoke a specific cloud function in the customer’s GCP project, and that function takes the name of the release to upgrade to.
That system self-upgrades to the signed release specified above by downloading the release from a trusted location after verifying its integrity using Binary Authorization (code is first digitally signed by a trusted entity, e.g. the software vendor or a trusted third party; when submitted for deployment it’s checked against the trusted signature, and only if it matches can it be installed and run).
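A conceptual sketch of the sign-then-verify idea behind Binary Authorization (the real check is enforced by GCP on container image attestations; the Ed25519 key pair here is only illustrative):

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

signer = Ed25519PrivateKey.generate()   # trusted build service's signing key
verifier = signer.public_key()          # public key pinned in the deploy policy

release_digest = b"sha256:<container-image-digest>"  # placeholder digest
signature = signer.sign(release_digest)              # produced at build time

try:
    verifier.verify(signature, release_digest)       # checked before deploy
    print("signature matches: release can be installed and run")
except InvalidSignature:
    print("signature mismatch: refuse to deploy")
```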
The releases go through an internal soak plus automated and manual QA testing, which includes P0 security and permission tests, before they’re deployed to customers.
To debug things, Glean can also get access to the customer’s project. The keys are stored in a locked-down Vault instance, and Glean team members investigating an issue can obtain a key valid for 1 hour after providing sufficient justification.
SOC 2 (Service Organization Control 2) Type 2 is a compliance framework established by the American Institute of CPAs to assess the effectiveness of a company’s internal controls over the security, availability, processing integrity, confidentiality, and privacy of its systems and data. It’s often required for companies that provide services to other companies. Type 2 is assessed over a period of time (6 months or so): to get it, a company has to demonstrate it has implemented and adhered to controls designed to protect its systems and data over an extended period.

Software upgrade trust model

The customer’s Glean GCP project has a “deployer” service account (glean-deployer) whose key is shared with the Glean on-call team. It has minimal IAM privileges: the IAM role to invoke Cloud Functions in the GCP project and the ability to view the contents of the config Cloud Storage bucket, but nothing else.
In order to deploy a software upgrade, Glean’s central build server uses the deployer service account’s key to invoke the deploy_build cloud function exposed in the GCP project.
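A hedged sketch of what that invocation could look like (the function URL follows GCP’s Cloud Functions convention; the key file name and payload shape are assumptions, not Glean’s actual deploy code):

```python
from google.auth.transport.requests import AuthorizedSession
from google.oauth2.service_account import IDTokenCredentials

FUNCTION_URL = "https://<region>-<projectid>.cloudfunctions.net/deploy_build"

# Authenticate as glean-deployer; the ID token is scoped to this function.
creds = IDTokenCredentials.from_service_account_file(
    "glean-deployer-key.json", target_audience=FUNCTION_URL
)
session = AuthorizedSession(creds)

# The function takes only the name of the signed release to upgrade to.
resp = session.post(FUNCTION_URL, json={"release": "<release-name>"})
resp.raise_for_status()
```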

Security

The system runs inside a GCP project and inherits all security settings of the customer’s cloud environment.
Therefore, if the customer changed a security setting in their cloud environment, it could mean that certain apps can no longer run inside the GCP project.
The only system that is exposed as a web service is the Query Endpoint service. It receives a query from a user signed in to Enterprise SSO (Okta) and serves search results as a response back to the user
it also receives activity data (search results viewed/clicked by the user, views of whitelisted enterprise applications) reported by the Glean front-end app.
All requests to the Query Endpoint require an authenticated cookie issued to the user as part of the SSO login.
Potentially, if the authenticated cookie isn’t there, requests can’t reach the Query Endpoint, so what’s being served may be the frontend cache. Worth checking whether Glean still runs after clearing the cache.
For data sources that support webhooks, there are endpoints listening for notifications about modified content from the datasource. The webhook notification payload is signed with a secret established out-of-band with the datasource to prevent spoofing. These endpoints return no data, and no other requests are allowed. The webhook requests serve primarily as crawl hints: they trigger the system to make the API call to fetch the modified content.
Check if this is a datasource that supports webhooks, because if so, the endpoints listening for notifications about modified content might be broken, which would mean the datasource isn’t being updated.
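A minimal sketch of the payload signature check described above, assuming an HMAC-SHA256 scheme over the raw request body (each datasource defines its own signing convention; the header value passed in is hypothetical):

```python
import hashlib
import hmac

def verify_webhook(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Return True if the webhook body was signed with the shared secret."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking signature bytes via timing.
    return hmac.compare_digest(expected, signature_header)
```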

User Data Access Enforcement

Glean continuously syncs the ACLs for each document to mirror permissions within its system in near real time.
Data access rules can be enforced at the product, object, and record level.
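A hedged sketch of what record-level enforcement amounts to at query time (the data model and helpers are illustrative): each indexed document carries its synced ACL, and results are filtered against the querying user’s identity and groups:

```python
def allowed(user: str, groups: set, acl: dict) -> bool:
    """Check a document's synced ACL against the querying user's identity."""
    return user in acl.get("users", ()) or bool(groups & set(acl.get("groups", ())))

def filter_results(results: list, user: str, groups: set) -> list:
    return [r for r in results if allowed(user, groups, r["acl"])]

# Example: a doc shared with the "eng" group is visible to a member of "eng".
doc = {"title": "design doc", "acl": {"users": ["alice@corp.com"], "groups": ["eng"]}}
print(filter_results([doc], "bob@corp.com", {"eng"}))    # [doc]
print(filter_results([doc], "eve@corp.com", {"sales"}))  # []
```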

Control Plane

All control plane operations (deploying code, data files, and configs; updating system metadata; etc.) are done via Cloud APIs (e.g. the gcloud SDK). They are fully authenticated (IAM) and made over secure channels. These operations are logged as part of GCP’s comprehensive admin activity logging in Stackdriver.
Since these are done via Cloud APIs, it’s very unlikely they would change without the previous versions still being supported. More likely either the customer changed configs on their side, or there was some IAM permission change.

* Wonder if I can ask Greg what common mistakes people who have done the collaboration have made.

Logging, Monitoring, Alerting, Tracing, Audit trail

All of this is done using Stackdriver.
Query/Activity logs: user identity and other details for each query on Glean
System deployment logs: full log for all software upgrade operations

Logs stored in GCP project

System logs
Non-PII logs (400-day retention) available in the Stackdriver GCP console
PII logs are in the glean_sensitive_logs_bigquery and audit_logs BigQuery tables in the GCP project (30-day retention)
PII: information like employee emails or permission group names, not content stored in the document body.
IAM roles for the BigQuery tables only allow the customer’s GCP admins, not Glean employees
only in rare debugging scenarios do Glean employees look up specific log entries using debugging APIs to debug production issues; this has to be authorized by the engineering leadership team
System audit log
comprehensive GCP audit logs (400 day retention)
logs changes to GCP system components
enabled by default unless GCP organization policies are set to prevent some type of audit logging
Glean employees can view admin activity and system events audit logs which don’t contain PII
User Activity Logs
Available for searches and actions done by the customer’s employees in Glean
not accessible to Glean employees unless the customer has allowed it for debugging purposes
GCP storage bucket (270-day retention) scio-<projectid>-query-endpoint-access:
logs for all search queries being made
each entry has user identity and the query performed
GCP storage buckets (270-day retention) scio-<projectid>-search-query, scio-<projectid>-search-result, scio-<projectid>-search-result-feedback:
has entries for queries
results we return per query
clicks/views for the results
used mainly by our ranking pipeline to improve the search
GCP Audit Log
e.g. querying all access from a certain user or service account with a filter like:
logName:"cloudaudit.googleapis.com"
protoPayload.authenticationInfo.principalEmail="<user or service account email, e.g. 550282177806@cloudbuild.gserviceaccount.com>"
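A sketch of running that filter programmatically with the google-cloud-logging client (the project id is a placeholder):

```python
from google.cloud import logging as gcp_logging

client = gcp_logging.Client(project="<projectid>")
FILTER = (
    'logName:"cloudaudit.googleapis.com" '
    'protoPayload.authenticationInfo.principalEmail='
    '"550282177806@cloudbuild.gserviceaccount.com"'
)
# Iterate over matching audit log entries, most useful fields first.
for entry in client.list_entries(filter_=FILTER, page_size=50):
    print(entry.timestamp, entry.log_name)
```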
GCP Error Reporting dashboard
counts, analyzes, and aggregates the crashes in the running cloud services
stack traces visible to Glean employees
Anonymized (non-PII) logs sent to Glean’s central server
the system sends anonymized, non-PII logs from the project back to Glean’s central server (scio-apps)
completely anonymized (user ids, document urls, query terms, and other PII are scrubbed and sent in hashed form to Glean’s central server)
the hash can only be used to correlate actions done by a user or on a document in a search session without knowing any details of the user, query, or document
sent to a BigQuery table in Glean’s central GCP project by a GCP log sink
Workflow for anonymized logs
Glean code has specialized APIs for logging info that needs to be anonymized to specific named logs in Google Stackdriver
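A hedged sketch of the hashing idea (function names are illustrative, not Glean’s actual logging API): PII fields are hashed before they leave the project, so the central server can correlate actions within a session without learning the underlying values:

```python
import hashlib

def anonymize(value: str, salt: bytes) -> str:
    """Hash a PII value (user id, document URL, query term) with a per-project salt."""
    return hashlib.sha256(salt + value.encode()).hexdigest()

def search_event(user_id: str, query: str, doc_url: str, salt: bytes) -> dict:
    # Only hashed identifiers end up in the named log that the log sink
    # exports to the BigQuery table in Glean's central GCP project.
    return {
        "user": anonymize(user_id, salt),
        "query": anonymize(query, salt),
        "doc": anonymize(doc_url, salt),
    }
```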


Data Retention

mirrors the data retention policy of the data source that it’s indexing

Data Encryption

All of the data is encrypted by GCP natively using AES-256 or better. All in-transit data goes over TLS 1.2 or better, and communication with the VPC is protected by Google; it’s encrypted if it moves out of their infrastructure.