DataFlow
QueryFlow
Users perform searches on a global web app hosted on Glean's cloud. Glean hosts the client code there; it has no complicated logic and simply serves the static files that together form the "client code". When a user visits, the client code is loaded in the browser. Since the user isn't logged in yet (the client checks local storage and finds no state), their identity can't be authenticated, so they're required to log in first.

The client then begins the "login flow". The user types in their email, which is used to ask the global web app which real server should be used for that email. The global web app responds with "https://customerdomain-be.glean.com/". That URL is an alias for the QE (query endpoint), which runs in the project that hosts all of the customer's data. That project is managed by the customer's IT team and sits inside the customer's firewall. From then on, everything the user does on the client goes directly to that URL (the project's QE); the global web app is no longer involved. The QE authenticates the user using enterprise SSO and serves the user's queries. Queries and search results are transmitted over HTTPS (after SSO authentication) between the user's browser and the QE server running inside the customer's GCP project in their cloud account. A sketch of the backend-discovery step follows below.
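A minimal sketch of that backend-discovery step, assuming a hypothetical endpoint path and response field (the real discovery API used by the global web app isn't documented in these notes):

    import requests

    def discover_backend(email: str) -> str:
        """Ask the global web app which customer backend serves this user."""
        resp = requests.get(
            "https://<global-web-app>/api/backend-for-email",  # placeholder URL
            params={"email": email},
            timeout=10,
        )
        resp.raise_for_status()
        # Expected to look like "https://customerdomain-be.glean.com/"
        return resp.json()["backendUrl"]  # hypothetical field name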
Data ingestion flow
For every enterprise data source connected to Glean, content and identity connectors run in the cloud project and fetch data and the permissions map from that source.

A webhook is a mechanism for sending real-time notifications from one application to another. It enables real-time data synchronization between two different systems or applications. For example, if an e-commerce website wants to receive real-time notifications when a new order is placed so it can process the order as quickly as possible, it would set up a webhook to send an HTTP POST request to its server whenever a new order is placed; the server can then process the request and update the system accordingly. Purposes include:
- integrating different applications or services
- triggering events or actions based on specific conditions or data changes
- real-time data synchronization
- providing notifications to users or administrators
Mostly, webhooks provide real-time communication and data exchange between different systems or applications to streamline workflows, increase efficiency, and enhance the user experience.

The connectors run periodically and in response to webhook events. They store the fetched information into Glean's document and identity store, and newly fetched content from the dataflow pipeline is stored into the secure search index. Code running inside the GCP project fetches content from the enterprise applications over HTTPS, either over the public web (if the app is hosted on the internet, e.g. Google Drive) or over a private internal connection (if the app is hosted inside the customer's network, e.g. Jira). A sketch of a signature-checked webhook handler follows below.
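A minimal sketch of a webhook receiver that verifies a payload signature derived from the out-of-band secret before acting on the notification. The header name, secret value, and schedule_crawl helper are assumptions for illustration:

    import hashlib
    import hmac

    from flask import Flask, abort, request

    app = Flask(__name__)
    SHARED_SECRET = b"secret-established-out-of-band"  # placeholder value

    def schedule_crawl(datasource, payload):
        """Placeholder: enqueue a crawl of the modified content via the normal API."""

    @app.route("/webhook/<datasource>", methods=["POST"])
    def handle_webhook(datasource):
        # Signature header name varies by datasource; "X-Signature" is a placeholder.
        claimed = request.headers.get("X-Signature", "")
        expected = hmac.new(SHARED_SECRET, request.get_data(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(claimed, expected):
            abort(403)  # reject spoofed notifications
        # Treat the event only as a crawl hint.
        schedule_crawl(datasource, request.get_json(silent=True))
        return "", 204  # webhook endpoints return no data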
Data Processing pipelines
Once the data is fetched, it is further processed within the GCP project. All data processing happens using Google Dataflow pipelines, and the data never leaves the project. A minimal pipeline sketch follows below.
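A hypothetical Apache Beam sketch (Beam is the SDK behind Dataflow) of the kind of in-project processing described above; the real pipelines aren't public, and the bucket paths and field names are placeholders:

    import json

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "ReadCrawledDocs" >> beam.io.ReadFromText("gs://example-bucket/crawled/*.json")
            | "ParseJson" >> beam.Map(json.loads)
            | "ExtractIndexFields" >> beam.Map(
                lambda doc: json.dumps({"id": doc["id"], "title": doc["title"]}))
            | "WriteForIndexing" >> beam.io.WriteToText("gs://example-bucket/indexed/docs")
        )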
GLEAN SDLC process
The deployment pipeline only deploys builds that have been validated and signed by the central trusted service. It's highly unlikely this is caused by an error in the code, because they test for vulnerabilities: to detect vulnerabilities they use GitHub vulnerability scanning, GCP Security Command Center's Web Security Scanner, the GCP asset scanner, and the Container Registry scanner.

The central locked-down build service periodically reads code from trusted branches, builds the relevant Docker containers, and signs them using binary authorization. Because the service is locked down, only release engineers can trigger a build or modify the pipeline; it also has authorization policies that only allow those engineers to trigger release builds and deploys. The central deployment workflow only has the capability to invoke a specific cloud function in the customer's GCP project, and that function takes the name of the release to upgrade to. The system then self-upgrades to the signed release by downloading it from a trusted location after verifying its integrity using binary authorization (the code is first digitally signed by a trusted entity, e.g. the software vendor or a trusted third party; when submitted for deployment, it's checked against the trusted signature, and only if it matches can it be installed and run). A sketch of this verify-before-install idea follows below.

Releases go through an internal soak plus automated and manual QA testing, which includes P0 security and permission tests, before they're deployed to customers. To debug things, Glean can also get access to the customer's project: the keys are stored in a locked-down Vault instance, and Glean team members investigating an issue can obtain a key valid for 1 hour after providing sufficient justification.

SOC 2 (Service Organization Control 2) Type 2 compliance: SOC 2 is a framework established by the American Institute of CPAs to assess the effectiveness of a company's internal controls over the security, availability, processing integrity, confidentiality, and privacy of its systems and data. It's often required of companies that provide services to other companies. Type 2 is assessed over a period of time (around 6 months): to earn it, a company has to demonstrate it has implemented and adhered to controls designed to protect its systems and data over an extended period.
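A minimal sketch of the verify-before-install idea behind binary authorization, using an Ed25519 signature check; this illustrates the concept, not GCP Binary Authorization's actual implementation:

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

    def verify_release(artifact: bytes, signature: bytes, pubkey_bytes: bytes) -> bool:
        """Allow a deploy only if the artifact carries a valid signature
        from the trusted build service."""
        public_key = Ed25519PublicKey.from_public_bytes(pubkey_bytes)
        try:
            public_key.verify(signature, artifact)
            return True
        except InvalidSignature:
            return False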
Software upgrade trust model
The customer's Glean GCP project has a "deployer" service account (glean-deployer) whose key is shared with the Glean on-call team. It has minimal IAM privileges: the IAM role to invoke Cloud Functions in the GCP project and the ability to view the contents of the config Cloud Storage bucket, but nothing else. To deploy a software upgrade, Glean's central build server uses the deployer service account's key to invoke the deploy_build cloud function exposed in the GCP project, as sketched below.
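A minimal sketch of how such a deploy could be triggered with the deployer service account's key: mint an ID token scoped to the function and POST the release name. The URL, key file, and release name are placeholders, not Glean's actual tooling; deploy_build is the function named above:

    import google.auth.transport.requests
    from google.oauth2 import service_account

    FUNCTION_URL = "https://us-central1-customer-project.cloudfunctions.net/deploy_build"

    credentials = service_account.IDTokenCredentials.from_service_account_file(
        "glean-deployer-key.json",     # the shared deployer key
        target_audience=FUNCTION_URL,  # ID token audience must match the function URL
    )
    session = google.auth.transport.requests.AuthorizedSession(credentials)

    # The function takes only the name of the signed release to upgrade to.
    resp = session.post(FUNCTION_URL, json={"release": "release-2024-01-15"})
    resp.raise_for_status()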
Security
The system runs inside a GCP project and inherits all security settings of the customer's cloud environment. Therefore, if the customer's cloud environment changed a security setting, certain apps might no longer be able to run inside the GCP project.

The only system exposed as a web service is the Query Endpoint service. It receives a query from a user signed in to enterprise SSO (Okta) and serves search results back to the user. It also receives activity data (search results viewed/clicked by the user, views of whitelisted enterprise applications) reported by the Glean front-end app. All requests to the Query Endpoint require an authenticated cookie issued to the user as part of the SSO login. Potentially, if the authenticated cookie isn't there, requests can't reach the Query Endpoint, so what's being served might be the front-end cache; check whether Glean still runs after clearing the cache.

For data sources that support webhooks, there are endpoints listening for notifications about modified content from the datasource. The webhook notification payload is signed with a secret established out-of-band with the datasource to prevent spoofing. These endpoints return no data, and no other requests are allowed. The webhook requests serve primarily as crawl hints and trigger the system to make the API call to fetch the modified content. Check whether this is a datasource that supports webhooks, because if so, the endpoints listening for notifications about modified content might be broken, and that would explain why the datasources aren't being updated.
User Data Access Enforcement
Glean continuously syncs the ACLs for each document to mirror permissions within its system in near real time. It can enforce data access rules at the product, object, and record level, as sketched below.
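A minimal sketch of record-level enforcement at query time, assuming each document carries a mirrored ACL of allowed users and groups (the data model here is an assumption, not Glean's actual schema):

    from dataclasses import dataclass, field

    @dataclass
    class Document:
        doc_id: str
        allowed_users: set = field(default_factory=set)
        allowed_groups: set = field(default_factory=set)

    def user_can_view(user: str, groups: set, doc: Document) -> bool:
        # Record-level check against the mirrored ACL.
        return user in doc.allowed_users or bool(groups & doc.allowed_groups)

    def filter_results(user: str, groups: set, results: list) -> list:
        """Drop any search result the user isn't permitted to see."""
        return [d for d in results if user_can_view(user, groups, d)]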
Control Plane
All control plane operations (deploying code, data files, and configs; updating system metadata; etc.) are done via Cloud APIs (e.g. the gcloud SDK). They are fully authenticated (IAM) and made over secure channels. These operations are logged as part of GCP's comprehensive admin activity logging in Stackdriver. Since these are done via Cloud APIs, it's very unlikely they would change without the previous versions still being supported; more likely, either the customer changed their configs on their side, or there was some IAM permission change.
* Wonder if I can ask Greg what common mistakes people who have done the collaboration have made.
Logging, Monitoring, Alerting, Tracing, Audit trail
All of this is done using Stackdriver.
- Query/activity logs: user identity and other details for each query on Glean
- System deployment logs: full log of all software upgrade operations
Logs stored in GCP project
- Non-PII logs (400-day retention) are available in the Stackdriver GCP console.
- PII logs are in the glean_sensitive_logs_bigquery and audit_logs BigQuery tables in the GCP project (30-day retention). PII here means information like employee emails or permission group names, not content stored in the document body. IAM roles for these BigQuery tables only allow the GCP project's admin, not Glean employees. Only in rare debugging scenarios do Glean employees look up specific log entries using debugging APIs to debug production issues, and this has to be authorized by the engineering leadership team.
- Comprehensive GCP audit logs (400-day retention): logging of changes to GCP system components is enabled by default unless GCP organization policies are set to prevent some type of audit logging. Glean employees can view the admin activity and system events audit logs, which don't contain PII. Logs are available for searches and actions done by the customer's employees in Glean, but are not accessible to Glean employees unless the customer has allowed it for debugging purposes.
- GCP storage bucket (270-day retention) scio-<projectid>-query-endpoint-access: logs for all search queries being made; each entry has the user identity and the query performed.
- GCP storage buckets (270-day retention) scio-<projectid>-search-query, scio-<projectid>-search-result, scio-<projectid>-search-result-feedback: the results returned per query and the clicks/views on those results, used mainly by the ranking pipeline to improve search.

Sample GCP log filter, e.g. querying all access from a certain user or service account:
logName:"cloudaudit.googleapis.com"
protoPayload.authenticationInfo.principalEmail="<user or service account email, e.g. 550282177806@cloudbuild.gserviceaccount.com>"
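The same filter can be run programmatically with the google-cloud-logging client; the project id below is a placeholder:

    from google.cloud import logging

    client = logging.Client(project="customer-project-id")
    filter_str = (
        'logName:"cloudaudit.googleapis.com" '
        'protoPayload.authenticationInfo.principalEmail='
        '"550282177806@cloudbuild.gserviceaccount.com"'
    )
    for entry in client.list_entries(filter_=filter_str, page_size=50):
        print(entry.timestamp, entry.log_name)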
The GCP Error Reporting dashboard counts, analyzes, and aggregates the crashes in the running cloud services; stack traces are visible to Glean employees.

Anonymized (non-PII) logs sent to Glean's central server: the system sends anonymized non-PII logs from the project back to Glean's central server (scio-apps). They are completely anonymized: user ids, document urls, query terms, and other PII are scrubbed and sent in hashed form, and the hash can only be used to correlate actions done by a user or on a document in a search session without knowing any details of the user, query, or document. They are sent to a BigQuery table in Glean's central GCP project by a GCP log sink. Workflow for anonymized logs: Glean code has specialized APIs for logging info that needs to be anonymized to Google Stackdriver, to specific named logs. A hashing sketch follows below.
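A minimal sketch of the scrub-and-hash idea: a keyed hash maps the same user or document to the same opaque value, so actions can be correlated within a session without revealing the raw value. The key and exact scheme are assumptions, not Glean's actual code:

    import hashlib
    import hmac

    ANON_KEY = b"per-deployment-secret"  # placeholder key

    def anonymize(value: str) -> str:
        # Same input -> same hash, enabling correlation without disclosure.
        return hmac.new(ANON_KEY, value.encode(), hashlib.sha256).hexdigest()

    event = {
        "user": anonymize("alice@customer.com"),
        "doc": anonymize("https://drive.example.com/doc/123"),
        "action": "click",  # non-PII fields pass through unchanged
    }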
Data Retention
Mirrors the data retention policy of the data source that it indexes.
Data Encryption
All of the data is encrypted by GCP natively using AES-256 or better. All in-transit data goes over TLS 1.2 or better, communication with the VPC is protected by Google, and data is encrypted if it moves out of Google's infrastructure. For sensitive content like API tokens, an additional layer of AES-256 encryption is applied using Google-managed symmetric keys that are automatically rotated monthly; the ability to invoke the key management service is restricted to very few service accounts in the GCP project using IAM roles. A sketch of that extra layer follows below.
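A minimal sketch of an additional AES-256 layer in an authenticated mode (AES-GCM). In the sketch the key is generated locally for illustration; in the real system the key is Google-managed and rotated monthly:

    import os

    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    key = AESGCM.generate_key(bit_length=256)  # placeholder for a KMS-managed key

    def encrypt_token(token: bytes) -> bytes:
        nonce = os.urandom(12)  # unique nonce per encryption
        return nonce + AESGCM(key).encrypt(nonce, token, None)

    def decrypt_token(blob: bytes) -> bytes:
        nonce, ciphertext = blob[:12], blob[12:]
        return AESGCM(key).decrypt(nonce, ciphertext, None)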
Least privilege access model
Glean follows a least privilege access model.
- Owner: the customer IT admin is the project owner. Glean doesn't have access to this service account. It isn't used other than for performing exceptional operations like deleting the project.
- Monitoring: this account has access to the Stackdriver monitoring, non-PII logging, and alerting dashboards. Glean has access; there is no access to customer data.
- Query debugging: the customer can specify a restricted set of search queries that the Glean team can use for debugging product issues.
Glean Access details for GCP account - External
Security considerations for system components
Content Connectors
Content connectors make outbound HTTPS connections to many external systems using credentials stored in a Cloud SQL database; if the data isn't being updated, maybe one of these credentials is incorrect. They also make connections to other Glean components, which are made using GCP's client libraries, authenticated using Google account credentials, and made over secure channels. These don't really seem like they can break, but maybe check whether Google is facing an outage and that's why it's broken.
Secrets Store
When an application is connected to Glean, the credentials (client id, secret, tokens, etc.) are stored in a secure store (the secrets store), which resides inside the GCP project. The secrets store relies on GCP's native KMS (key management service), which rotates the key every month. The content connectors query the secrets store to fetch the credentials needed to make API calls to the enterprise applications; credentials are fetched and then discarded, and are not stored in any other component. A fetch-and-discard sketch follows below.
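A minimal sketch of the fetch-and-discard pattern using the google-cloud-kms client to decrypt a stored credential just long enough to use it; all resource names are placeholders:

    from google.cloud import kms

    client = kms.KeyManagementServiceClient()
    key_name = client.crypto_key_path(
        "customer-project-id", "global", "glean-keyring", "secrets-key"
    )

    def fetch_credential(ciphertext: bytes) -> bytes:
        response = client.decrypt(request={"name": key_name, "ciphertext": ciphertext})
        return response.plaintext  # use immediately; never persist elsewhere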
Tasks Queue
Glean uses several Cloud Tasks queues for managing crawl tasks. The crawlers perform a small part of the crawl and then themselves post messages into the tasks queue to schedule crawls of the remaining parts (see the sketch below). Potentially, if the queue is not set up correctly or is broken, then things might not be updating.
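A minimal sketch of that self-scheduling pattern with the google-cloud-tasks client; the queue name, URL, and cursor shape are placeholders:

    import json

    from google.cloud import tasks_v2

    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path("customer-project-id", "us-central1", "crawl-queue")

    def schedule_remaining_crawl(datasource: str, cursor: str) -> None:
        """Post a task so the next crawler invocation resumes from `cursor`."""
        task = {
            "http_request": {
                "http_method": tasks_v2.HttpMethod.POST,
                "url": "https://crawler.internal.example/crawl",  # internal endpoint
                "body": json.dumps({"datasource": datasource, "cursor": cursor}).encode(),
            }
        }
        client.create_task(request={"parent": parent, "task": task})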
CRON, used to kick off the crawls periodically.
SQL (for Config Store, Secrets Store, Document Store and Identities and Permissions store)
These are Cloud SQL instances running MySQL. The instances have private IPs and enable client TLS (Transport Layer Security, the cryptographic protocol designed to provide secure communication over a computer network such as the internet; it's the successor to the Secure Sockets Layer (SSL) protocol and is used to encrypt data as it's transmitted between two endpoints). We run Cloud SQL proxies that allow the other subsystems to connect to the SQL instances without having to go through the public IPs.
Dataflow
A Google-managed Dataflow pipeline used to index the crawled content. Internal to the cloud project; the Dataflow workers use private IPs.
Pub Sub
Glean uses Google Cloud Pub/Sub for triggering the Dataflow pipeline. Internal to the cloud project. A publish sketch follows below.
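A minimal sketch of the trigger using the google-cloud-pubsub client; the project and topic names are placeholders:

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("customer-project-id", "index-content")

    # Publish a "content ready" message for the indexing pipeline to consume.
    future = publisher.publish(topic_path, b'{"doc_id": "123"}')
    future.result()  # block until the message is accepted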
Elasticsearch in Kubernetes
Glean runs one or more instances of Elasticsearch in Google Kubernetes Engine clusters. Internal to the cloud project, with shielded VMs, antivirus, and file integrity monitoring. Master authorized networks are enabled on the per-cluster GKE master (API server) for extra protection, and it acts as a proxy for requests to Elasticsearch, authenticated via Google IAM. A lot of things are authenticated via Google IAM, so I wonder whether you can rule out Google IAM as the problem when nothing else is broken. Do they all share the same key or not?
Query Endpoint (QE)
An App Engine app (public IP, but HTTPS only); the service that the client code (running in the browser) talks to. The endpoint enforces authentication via OAuth 2.0/SAML from a customer-specific provider (e.g. Okta).
Crawler
Does not accept any requests from outside the project. It's invoked by the Cloud Tasks and Cloud Scheduler services, and it enforces that requests come from these services using GCP-provided mechanisms.
Datasource web hook handler
Has public endpoints that are used for receiving webhook events from datasources. Requests from a datasource are authenticated using a secret established out of band between the Glean system and the datasource; no other requests are allowed. The events serve primarily as crawl hints and trigger the system to make the API call to fetch the modified content. Hence, if the webhook handler is broken, the crawl hints that trigger the system to fetch the modified content wouldn't happen.
Query blacklisting controls
The project admin can configure a text file in the project's Cloud Storage bucket with a list of terms that shouldn't appear in a search query; if the user's query matches one of the terms, an error is returned (see the sketch below). Check this too, to see whether the list of terms/regexes is too broad.
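A minimal sketch of the check, assuming the bucket file holds one term or regex per line; the file format and error shape are assumptions:

    import re

    def load_blocked_terms(path: str) -> list:
        # One term or regex per line, as assumed here.
        with open(path) as f:
            return [re.compile(line.strip(), re.IGNORECASE) for line in f if line.strip()]

    def validate_query(query: str, blocked: list) -> None:
        for pattern in blocked:
            if pattern.search(query):
                raise ValueError("query contains a blocked term")  # surfaced to the user as an error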
Bastion Host
A small GCE VM used as a proxy for maintenance operations that need to interact with nodes that only have private IPs (SQL, GKE). The host itself only has private IPs; we use SSH via IAP (Identity-Aware Proxy) tunneling to connect to it.
Glean Central Project
Hosts static files, DNS, and central applications. Some anonymized logs and metrics are also sent to the central project for analytics. It's used for easier and more secure integration of various SaaS applications with Glean (e.g. Slack, GitHub, etc.), and some webhook content passes through the central project. Glean never stores any customer data in the central project. All communication between the central project and customer projects happens over HTTPS and is additionally secured by encryption/content signatures using shared GCP KMS keys. If this is undesirable, applications can be configured to use isolated apps, which comes with some additional setup and maintenance cost.
scio-apps - the central GCP project that's owned and managed by the Glean team.
The Glean system runs in an isolated tenant (a single GCP project) that can be hosted either in the customer’s cloud account or Glean’s central cloud account (based on the choice made by the customer). In general, everything to do with the customer’s Glean instance is stored in that tenant and no information leaves the tenant. However, there are some anonymized logs/metrics that are sent to Glean’s central cloud service (which lives outside of the customer’s tenant). This document describes the data that is shared outside of the customer’s tenant.