Glean architecture

Data Flow

Query Flow

Users perform searches on a global web app hosted on Glean's cloud, which also hosts the client code. The app doesn't have complicated logic: it just serves the static files that together form the "client code".
Once they go there, the client code is loaded in the browser. But since the user isn't logged in (the client checks local storage and notices there's no state), their identity can't be authenticated, so they're required to log in first.
The client begins the "login flow". You type in your email, which is used to ask the global web app which real server should be used for that email. The global web app responds with "https://customerdomain-be.glean.com/". That URL is an alias for the QE (query endpoint), which runs in the project that hosts all of the customer's data. That project is managed by the customer's IT team and sits inside the customer's firewall.
Afterwards, everything the user does on the client goes directly to that URL (the project's QE); the global web app is no longer involved. The QE authenticates the user using enterprise SSO and serves the user's queries.
The queries and search results are transmitted over HTTPS (post SSO authentication) between the user’s browser and the QE server running inside the customer’s GCP project inside their cloud account
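A rough sketch of that backend lookup in Python; the endpoint path and response field are assumptions for illustration, not Glean's actual API:

    import requests

    # Hypothetical endpoint and response shape, for illustration only
    resp = requests.post("https://app.glean.com/api/backend-lookup",
                         json={"email": "alice@customerdomain.com"})
    backend_url = resp.json()["backendUrl"]  # e.g. "https://customerdomain-be.glean.com/"
    # Every subsequent request goes straight to backend_url (the customer's QE)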

Data ingestion flow

For every enterprise data source that's connected to Glean, content and identity connectors run in the cloud project and fetch data and the permissions map from that source.
A webhook is a mechanism for sending real-time notifications from one application to another, enabling real-time data synchronization between two different systems or applications. For example, if an e-commerce website wants real-time notifications when a new order is placed (so it can process the order as quickly as possible), it would set up a webhook to send an HTTP POST request to its server whenever a new order is placed; the server can then process the request and update the system accordingly (see the sketch after this list). Purposes include:
integrating different applications or services
automating workflows
triggering events or actions based on specific conditions or data changes
real-time data synchronization
providing notifications to users or administrators
mostly, it provides real-time communication and data exchange between different systems or applications to streamline workflows, increase efficiency, and enhance the user experience
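A minimal sketch of the e-commerce example in Python with Flask; the route and handler names are hypothetical:

    from flask import Flask, request

    app = Flask(__name__)

    def process_order(order):
        # stand-in for real fulfillment logic
        print("processing order", order.get("id"))

    @app.route("/orders/webhook", methods=["POST"])
    def handle_new_order():
        process_order(request.get_json())  # payload POSTed by the e-commerce platform
        return "", 204                     # acknowledge receipt right away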
The connectors run periodically and in response to webhook events.
They store the fetched information in Glean's document and identity store.
Newly fetched content from the dataflow pipeline is stored in the secure search index.
code running inside the GCP project → fetches content from the enterprise applications over HTTPS over the public web (if the app is hosted on the internet, e.g. Google Drive) or over a private internal connection (if the app is hosted inside the customer's network, e.g. Jira)

Data Processing pipelines

Once the data is fetched, it is further processed within the GCP project.
All data processing happens using Google Dataflow pipelines. The data never leaves the project.
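A toy sketch of what such a pipeline could look like in Apache Beam; the topic name and transforms are made up, not Glean's actual pipeline:

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class WriteToSearchIndex(beam.DoFn):
        def process(self, doc):
            # stand-in for writing into the secure search index
            print("indexing", doc.get("id"))

    with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
        (p
         | "ReadCrawled" >> beam.io.ReadFromPubSub(
               topic="projects/customer-project/topics/crawled-docs")
         | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         | "Index" >> beam.ParDo(WriteToSearchIndex()))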

GLEAN SDLC process

The deployment pipeline only deploys builds that have been validated and signed by the central trusted service
It's highly unlikely this is caused by an error in the code, because releases are tested for vulnerabilities. To detect vulnerabilities Glean uses GitHub vulnerability scanning, GCP Security Command Center's Web Security Scanner, the GCP asset scanner, and the Container Registry scanner.
the central locked-down build service periodically reads code from trusted branches, builds the relevant Docker containers, and signs them using Binary Authorization. Because the service is locked down, only release engineers can trigger a build or modify the pipeline. The service also has authorization policies that allow only those engineers to trigger release builds and deploys.
The central deployment workflow only has the capability to invoke a specific cloud function in the customer’s GCP app and that function takes the name of the release to upgrade to.
That system self-upgrades to the signed release specified above by downloading the release from a trusted location after verifying its integrity using Binary Authorization (code is first digitally signed by a trusted entity, e.g. the software vendor or a trusted third party; when submitted for deployment it's checked against the trusted signature, and if it matches it can be installed and run).
The releases go through an internal soak and automated and manual QA testing which includes P0 security and permission tests before they’re deployed to customers.
To debug things, Glean can also get access to the customer's project. The keys are stored in a locked-down Vault instance, and Glean team members investigating the issue can obtain a key valid for 1 hour after providing sufficient justification.
SOC 2 (Service Organization Control 2) Type 2 is a compliance framework established by the American Institute of CPAs to assess the effectiveness of a company's internal controls over the security, availability, processing integrity, confidentiality, and privacy of its systems and data. It's often required of companies that provide services to other companies. Type 2 compliance is assessed over a period of time (6 months or so): to earn it, a company has to demonstrate it has implemented and adhered to controls designed to protect its systems and data over an extended period.

Software upgrade trust model

the customer's Glean GCP project has a "deployer" service account (glean-deployer) whose key is shared with the Glean on-call team. It has minimal IAM privileges: the IAM role to invoke Cloud Functions in the GCP project and the ability to view the contents of the config Cloud Storage bucket, but nothing else.
In order to deploy a software upgrade, Glean's central build server uses the deployer service account's key to invoke the deploy_build Cloud Function exposed in the GCP project.
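A sketch of what that invocation could look like with the google-auth library; the function URL, key file, and payload are assumptions:

    import requests
    from google.auth.transport.requests import Request
    from google.oauth2 import service_account

    # Hypothetical URL; the real function lives in the customer's project
    FUNCTION_URL = "https://us-central1-customer-project.cloudfunctions.net/deploy_build"

    # The deployer service account's key, with minimal IAM privileges
    creds = service_account.IDTokenCredentials.from_service_account_file(
        "glean-deployer-key.json", target_audience=FUNCTION_URL)
    creds.refresh(Request())

    # The function takes only the name of the signed release to upgrade to
    requests.post(FUNCTION_URL,
                  headers={"Authorization": f"Bearer {creds.token}"},
                  json={"release": "release-1234"})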

Security

The system runs inside a GCP project and inherits all security settings of the customer's cloud environment.
therefore, if the customer changed a security setting in their cloud environment, certain apps might no longer be able to run inside the GCP project
The only system that is exposed as a web service is the Query Endpoint service. It receives a query from a user signed in to Enterprise SSO (Okta) and serves search results as a response back to the user
it also receives activity data (search results viewed/clicked by the user, views of whitelisted enterprise applications) reported by the Glean front-end app.
All requests to the Query Endpoint require an authenticated cookie issued to the user as part of the SSO login.
Potentially, if the authenticated cookie isn't there, the request can't reach the Query Endpoint, so what's being served may be the frontend cache. Check whether Glean still runs after clearing the cache.
For data sources that support webhooks, there are endpoints listening for notifications about modified content from the datasource. The webhook notification payloads are signed with a secret established out-of-band with the datasource to prevent spoofing. These endpoints return no data and no other requests are allowed. The webhook requests serve primarily as crawl hints and trigger the system to make the API call to fetch the modified content.
check if this is a datasource that supports webhooks, because if so, the endpoints listening for notifications about modified content might be broken and therefore the datasource isn't being updated
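A minimal sketch of verifying such a signed payload, assuming an HMAC-SHA256 signature header (the exact scheme varies by datasource):

    import hashlib
    import hmac

    def verify_webhook(payload: bytes, signature_header: str, shared_secret: bytes) -> bool:
        # Recompute the HMAC over the raw payload using the out-of-band secret
        expected = hmac.new(shared_secret, payload, hashlib.sha256).hexdigest()
        # Constant-time comparison guards against timing attacks
        return hmac.compare_digest(expected, signature_header)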

User Data Access Enforcement

Glean continuously syncs the ACLs for each document to mirror permissions within our system in near real time
Can enforce data access rules at the product, object, and record level
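A toy sketch of record-level enforcement at query time, assuming a hypothetical shape for the mirrored ACLs:

    def filter_results(results, user, user_groups):
        # Keep only documents whose mirrored ACL covers the requesting user
        allowed = []
        for doc in results:
            acl = doc["acl"]  # synced from the datasource in near real time
            if user in acl["users"] or user_groups & set(acl["groups"]):
                allowed.append(doc)
        return allowed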

Control Plane

All control plane operations (deploying code, data files, and configs; updating system metadata; etc.) are done via Cloud APIs (e.g. the gcloud SDK). They are fully authenticated (IAM) and made over secure channels. These operations are logged as part of GCP's comprehensive admin activity logging in Stackdriver.
since these are done via Cloud APIs, it's very unlikely that these would change without the previous versions still being supported. More likely either the customer changed their configs on their side, or there was some IAM permission change.

* Wonder if I can ask Greg what common mistakes people who have done the collaboration have made.

Logging, Monitoring, Alerting, Tracing, Audit trail

do all of this using Stackdriver
Query/Activity logs: user identity and other details for each query on Glean
System deployment logs: full log for all software upgrade operations

Logs stored in GCP project

System logs
Non-PII logs (400-day retention) available in the Stackdriver GCP console
PII logs are in the glean_sensitive_logs_bigquery and audit_logs BigQuery tables in the GCP project (30-day retention)
PII: information like employee emails or permission group names, not content stored in the document body.
IAM roles for these BigQuery tables only allow the GCP project's admins, not Glean employees
only in rare debugging scenarios do Glean employees look up specific log entries using debugging APIs to debug production issues. This has to be authorized by the engineering leadership team
System audit log
comprehensive GCP audit logs (400-day retention)
logging of changes to GCP system components is enabled
enabled by default unless GCP organization policies are set to prevent some type of audit logging
Glean employees can view admin activity and system events audit logs which don’t contain PII
User Activity Logs
Available for searches and actions done by the customer’s employees in Glean
not accessible to Glean employees unless the customer has allowed it for debugging purposes
GCP storage bucket (270-day retention) scio-<projectid>-query-endpoint-access:
logs for all search queries being made
each entry has user identity and the query performed
GCP storage buckets (270-day retention) scio-<projectid>-search-query, scio-<projectid>-search-result, scio-<projectid>-search-result-feedback
has entries for queries
results we return per query
clicks/views for the results
used mainly by our ranking pipeline to improve the search
GCP Audit Log
e.g. querying all access from a certain user or service account with a filter like the following:
logName:"cloudaudit.googleapis.com"
protoPayload.authenticationInfo.principalEmail="<user or service account email, e.g. 550282177806@cloudbuild.gserviceaccount.com>"
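The same lookup can be scripted with the google-cloud-logging client; the project id is a placeholder:

    from google.cloud import logging

    client = logging.Client(project="customer-project-id")
    filt = ('logName:"cloudaudit.googleapis.com" '
            'protoPayload.authenticationInfo.principalEmail='
            '"550282177806@cloudbuild.gserviceaccount.com"')
    for entry in client.list_entries(filter_=filt, max_results=10):
        print(entry.timestamp, entry.log_name)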
GCP Error Reporting dashboard
counts, analyzes, and aggregates the crashes in the running cloud services
stack traces visible to Glean employees
Anonymized (non-PII) logs sent to Glean’s central server
the system sends anonymized non-PII logs from the project back to Glean's central server (scio-apps)
completely anonymized (user ids, document urls, query terms, and other PII are scrubbed and sent in hashed form to Glean's central server)
the hash can only be used to correlate actions done by a user or on a document in a search session without knowing any details of the user, query, or document
sent to a BigQuery table in Glean’s central GCP project by a GCP log sink
Workflow for anonymized logs
Glean code has specialized APIs for logging info that needs to be anonymized; these write to specific named logs in Google Stackdriver
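A sketch of what such a specialized API might look like; the helper and log name are hypothetical, not Glean's actual code:

    import hashlib
    from google.cloud import logging

    client = logging.Client()
    logger = client.logger("anonymized_events")  # a specific named log

    def log_anonymized(event: str, user_id: str, doc_url: str, salt: bytes):
        # Identifiers are hashed so actions can still be correlated without exposing PII
        h = lambda v: hashlib.sha256(salt + v.encode()).hexdigest()
        logger.log_struct({"event": event, "user": h(user_id), "doc": h(doc_url)})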


Data Retention

mirrors the data retention policy of the data source that it’s indexing

Data Encryption

All of the data is encrypted by GCP natively using AES-256 or better. All in-transit data goes over TLS 1.2 or better, and communication within the VPC is protected by Google; it's encrypted if it moves out of Google's infrastructure.
For sensitive content like API tokens, an additional layer of AES-256 encryption is applied using Google-managed symmetric keys that are automatically rotated monthly
the ability to invoke the key management service is restricted to very few service accounts in the GCP project using IAM roles
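A sketch of that extra layer using the Cloud KMS client; the keyring and key names are placeholders:

    from google.cloud import kms

    def encrypt_token(token: bytes) -> bytes:
        client = kms.KeyManagementServiceClient()
        # Placeholder key path; real keyring/key names are project-specific
        key_name = client.crypto_key_path(
            "customer-project-id", "global", "glean-keyring", "secrets-key")
        return client.encrypt(request={"name": key_name, "plaintext": token}).ciphertext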

Least privilege access model

follows a least privilege access model
Owner: customer IT admin will be the project owner. Glean doesn’t have access to this service account. Not used other than for performing exceptional operations like deleting the project.
Monitoring: this account has access to Stackdriver monitoring, non-PII logging, and alerting dashboards. Glean has access to it. It has no access to customer data.
Query debugging: customer can specify a restricted set of search queries that the Glean team can use for debugging product issues.
need to go through this
Glean Access details for GCP account - External

Security considerations for system components

Content Connectors

make outbound HTTPS connections to lots of external systems using credentials stored in a Cloud SQL database
If the data isn't being updated, maybe one of these credentials is incorrect
they also make connections to other Glean components; these are made using GCP's client libraries over secure channels and authenticated using Google account credentials
these don't really seem like they can break, but maybe check whether Google is facing issues or has an outage and that's why it's broken

Secrets Store

When an application is connected to Glean, the credentials (client id, secret, tokens, etc.) are stored in a secure store (secrets store). This resides inside the GCP project
the secret store relies on GCP’s native KMS (key management service), which rotates the key every month
the content connectors query the secrets store to fetch the credentials needed to make API calls to the enterprise applications. The credentials are fetched and then discarded; they're not stored in any other component
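A sketch of that fetch-and-discard pattern; the interfaces are hypothetical:

    def crawl_source(source_id, secrets_store, connector):
        # Credentials are fetched just-in-time for the API calls...
        creds = secrets_store.get(source_id)
        try:
            connector.fetch_content(creds)
        finally:
            # ...and dropped right after; never written to any other component
            del creds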

Tasks Queue

use several Cloud Tasks queues for managing crawl tasks.
the crawlers will perform a small part of the crawl and then themselves post messages into the tasks queue to schedule crawls of the remaining parts
potentially, if the queue is not set up correctly or is broken, things might not be updating
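A sketch of a crawler scheduling the remaining work via Cloud Tasks; queue and handler names are made up:

    import json
    from google.cloud import tasks_v2

    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path("customer-project-id", "us-central1", "crawl-queue")

    def schedule_remaining_crawl(cursor: str):
        # The crawler enqueues a follow-up task pointing back at its own handler
        task = {"app_engine_http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "relative_uri": "/tasks/crawl",
            "body": json.dumps({"cursor": cursor}).encode()}}
        client.create_task(request={"parent": parent, "task": task})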

Scheduler

CRON (used to kick off the crawls periodically)

SQL (for Config Store, Secrets Store, Document Store, and Identities and Permissions Store)

Cloud SQL instances running MySQL
the instances have private IPs
enable client TLS (Transport Layer Security: a cryptographic protocol designed to provide secure communication over a computer network such as the internet; it's the successor to the Secure Sockets Layer (SSL) protocol and is used to encrypt data as it's transmitted between two endpoints)
we run Cloud SQL proxies that allow the other subsystems to connect to the SQL instances without having to go through the public IPs.
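A sketch of a subsystem connecting through the local proxy; the database name and credentials are placeholders:

    import os
    import pymysql

    # The Cloud SQL proxy listens on localhost and handles auth + TLS to the instance
    conn = pymysql.connect(host="127.0.0.1", port=3306,
                           user="glean_app", password=os.environ["DB_PASSWORD"],
                           database="config_store")
    with conn.cursor() as cur:
        cur.execute("SELECT id, name FROM datasources")
        print(cur.fetchall())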

Dataflow

Google-managed Dataflow pipelines are used to index the crawled content
internal to the cloud project
use private IPs for the dataflow workers

Pub/Sub

use Google Cloud Pub/Sub for triggering the Dataflow pipeline
internal to the cloud project

Elasticsearch in Kubernetes

run one or more instances of Elasticsearch in Google Kubernetes Engine (GKE) clusters
Internal to the cloud project
shielded VMs, antivirus & file integrity monitoring
enable master authorized networks on the per-cluster GKE master (API server) for extra protection; the master acts as a proxy for requests to Elasticsearch.
operate over HTTPS
authenticated via Google IAM
a lot of things are authenticated via Google IAM, so I wonder if you can tell that Google IAM isn't the problem when nothing else is broken. Do they all share the same key or nah?
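A sketch of querying the internal Elasticsearch over HTTPS; the host, index, and auth token are placeholders (the real path goes through the IAM-authenticated GKE master proxy):

    import requests

    resp = requests.post("https://es.internal:9200/documents/_search",
                         headers={"Authorization": "Bearer <iam-token>"},
                         json={"query": {"match": {"body": "quarterly report"}}})
    print(resp.json()["hits"]["total"])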

Query Endpoint (QE)

search frontend
app engine app (public IP but only HTTPS)