DataFlow
QueryFlow
Users perform searches on a global web app hosted on Glean's cloud. Glean hosts the client code there; it has no complicated logic and simply serves the static files that together form the "client code". When a user visits, the client code is loaded in the browser. Since the user isn't logged in yet (the client checks local storage and finds no state), their identity can't be authenticated, so they're required to log in first.

The client then begins the "login flow". The user types in their email, which is used to ask the global web app which real server should be used for that email. The global web app responds with "https://customerdomain-be.glean.com/". That URL is an alias for the QE (query endpoint), which runs in the project that hosts all of the customer's data. That project is managed by the customer's IT team and sits inside the customer's firewall. From then on, everything the user does on the client goes directly to that URL (the project's QE); the global web app is no longer involved. The QE authenticates the user using enterprise SSO and serves the user's queries. Queries and search results are transmitted over HTTPS (after SSO authentication) between the user's browser and the QE server running inside the customer's GCP project in their cloud account. A sketch of the backend-discovery step follows below.
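A minimal sketch of that backend-discovery step, assuming a hypothetical endpoint path and response field (the real discovery API used by the global web app isn't documented in these notes):

    import requests

    def discover_backend(email: str) -> str:
        """Ask the global web app which customer backend serves this user."""
        resp = requests.get(
            "https://<global-web-app>/api/backend-for-email",  # placeholder URL
            params={"email": email},
            timeout=10,
        )
        resp.raise_for_status()
        # Expected to look like "https://customerdomain-be.glean.com/"
        return resp.json()["backendUrl"]  # hypothetical field name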
Data ingestion flow
For every enterprise data source connected to Glean, content and identity connectors run in the cloud project and fetch data and the permissions map from that source.

A webhook is a mechanism for sending real-time notifications from one application to another. It enables real-time data synchronization between two different systems or applications. For example, if an e-commerce website wants to receive real-time notifications when a new order is placed so it can process the order as quickly as possible, it would set up a webhook to send an HTTP POST request to its server whenever a new order is placed; the server can then process the request and update the system accordingly. Purposes include:
- integrating different applications or services
- triggering events or actions based on specific conditions or data changes
- real-time data synchronization
- providing notifications to users or administrators
Mostly, webhooks provide real-time communication and data exchange between different systems or applications to streamline workflows, increase efficiency, and enhance the user experience.

The connectors run periodically and in response to webhook events. They store the fetched information into Glean's document and identity store, and newly fetched content from the dataflow pipeline is stored into the secure search index. Code running inside the GCP project fetches content from the enterprise applications over HTTPS, either over the public web (if the app is hosted on the internet, e.g. Google Drive) or over a private internal connection (if the app is hosted inside the customer's network, e.g. Jira). A sketch of a signature-checked webhook handler follows below.
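A minimal sketch of a webhook receiver that verifies a payload signature derived from the out-of-band secret before acting on the notification. The header name, secret value, and schedule_crawl helper are assumptions for illustration:

    import hashlib
    import hmac

    from flask import Flask, abort, request

    app = Flask(__name__)
    SHARED_SECRET = b"secret-established-out-of-band"  # placeholder value

    def schedule_crawl(datasource, payload):
        """Placeholder: enqueue a crawl of the modified content via the normal API."""

    @app.route("/webhook/<datasource>", methods=["POST"])
    def handle_webhook(datasource):
        # Signature header name varies by datasource; "X-Signature" is a placeholder.
        claimed = request.headers.get("X-Signature", "")
        expected = hmac.new(SHARED_SECRET, request.get_data(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(claimed, expected):
            abort(403)  # reject spoofed notifications
        # Treat the event only as a crawl hint.
        schedule_crawl(datasource, request.get_json(silent=True))
        return "", 204  # webhook endpoints return no data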
Data Processing pipelines
Once the data is fetched, it is further processed within the GCP project. All data processing happens using Google Dataflow pipelines, and the data never leaves the project. A minimal pipeline sketch follows below.
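A hypothetical Apache Beam sketch (Beam is the SDK behind Dataflow) of the kind of in-project processing described above; the real pipelines aren't public, and the bucket paths and field names are placeholders:

    import json

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "ReadCrawledDocs" >> beam.io.ReadFromText("gs://example-bucket/crawled/*.json")
            | "ParseJson" >> beam.Map(json.loads)
            | "ExtractIndexFields" >> beam.Map(
                lambda doc: json.dumps({"id": doc["id"], "title": doc["title"]}))
            | "WriteForIndexing" >> beam.io.WriteToText("gs://example-bucket/indexed/docs")
        )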
GLEAN SDLC process
The deployment pipeline only deploys builds that have been validated and signed by the central trusted service. It's highly unlikely this is caused by an error in the code, because they test for vulnerabilities: to detect vulnerabilities they use GitHub vulnerability scanning, GCP Security Command Center's Web Security Scanner, the GCP asset scanner, and the Container Registry scanner.

The central locked-down build service periodically reads code from trusted branches, builds the relevant Docker containers, and signs them using binary authorization. Because the service is locked down, only release engineers can trigger a build or modify the pipeline; it also has authorization policies that only allow those engineers to trigger release builds and deploys. The central deployment workflow only has the capability to invoke a specific cloud function in the customer's GCP project, and that function takes the name of the release to upgrade to. The system then self-upgrades to the signed release by downloading it from a trusted location after verifying its integrity using binary authorization (the code is first digitally signed by a trusted entity, e.g. the software vendor or a trusted third party; when submitted for deployment, it's checked against the trusted signature, and only if it matches can it be installed and run). A sketch of this verify-before-install idea follows below.

Releases go through an internal soak plus automated and manual QA testing, which includes P0 security and permission tests, before they're deployed to customers. To debug things, Glean can also get access to the customer's project: the keys are stored in a locked-down Vault instance, and Glean team members investigating an issue can obtain a key valid for 1 hour after providing sufficient justification.

SOC 2 (Service Organization Control 2) Type 2 compliance: SOC 2 is a framework established by the American Institute of CPAs to assess the effectiveness of a company's internal controls over the security, availability, processing integrity, confidentiality, and privacy of its systems and data. It's often required of companies that provide services to other companies. Type 2 is assessed over a period of time (around 6 months): to earn it, a company has to demonstrate it has implemented and adhered to controls designed to protect its systems and data over an extended period.
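A minimal sketch of the verify-before-install idea behind binary authorization, using an Ed25519 signature check; this illustrates the concept, not GCP Binary Authorization's actual implementation:

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

    def verify_release(artifact: bytes, signature: bytes, pubkey_bytes: bytes) -> bool:
        """Allow a deploy only if the artifact carries a valid signature
        from the trusted build service."""
        public_key = Ed25519PublicKey.from_public_bytes(pubkey_bytes)
        try:
            public_key.verify(signature, artifact)
            return True
        except InvalidSignature:
            return False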
Software upgrade trust model
The customer's Glean GCP project has a "deployer" service account (glean-deployer) whose key is shared with the Glean on-call team. It has minimal IAM privileges: the IAM role to invoke Cloud Functions in the GCP project and the ability to view the contents of the config Cloud Storage bucket, but nothing else. To deploy a software upgrade, Glean's central build server uses the deployer service account's key to invoke the deploy_build cloud function exposed in the GCP project, as sketched below.
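A minimal sketch of how such a deploy could be triggered with the deployer service account's key: mint an ID token scoped to the function and POST the release name. The URL, key file, and release name are placeholders, not Glean's actual tooling; deploy_build is the function named above:

    import google.auth.transport.requests
    from google.oauth2 import service_account

    FUNCTION_URL = "https://us-central1-customer-project.cloudfunctions.net/deploy_build"

    credentials = service_account.IDTokenCredentials.from_service_account_file(
        "glean-deployer-key.json",     # the shared deployer key
        target_audience=FUNCTION_URL,  # ID token audience must match the function URL
    )
    session = google.auth.transport.requests.AuthorizedSession(credentials)

    # The function takes only the name of the signed release to upgrade to.
    resp = session.post(FUNCTION_URL, json={"release": "release-2024-01-15"})
    resp.raise_for_status()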
Security
The system runs inside a GCP project and inherits all security settings of the customer's cloud environment. Therefore, if the customer's cloud environment changed a security setting, certain apps might no longer be able to run inside the GCP project.

The only system exposed as a web service is the Query Endpoint service. It receives a query from a user signed in to enterprise SSO (Okta) and serves search results back to the user. It also receives activity data (search results viewed/clicked by the user, views of whitelisted enterprise applications) reported by the Glean front-end app. All requests to the Query Endpoint require an authenticated cookie issued to the user as part of the SSO login. Potentially, if the authenticated cookie isn't there, requests can't reach the Query Endpoint, so what's being served might be the front-end cache; check whether Glean still runs after clearing the cache.

For data sources that support webhooks, there are endpoints listening for notifications about modified content from the datasource. The webhook notification payload is signed with a secret established out-of-band with the datasource to prevent spoofing. These endpoints return no data, and no other requests are allowed. The webhook requests serve primarily as crawl hints and trigger the system to make the API call to fetch the modified content. Check whether this is a datasource that supports webhooks, because if so, the endpoints listening for notifications about modified content might be broken, and that would explain why the datasources aren't being updated.
User Data Access Enforcement
Glean continuously syncs the ACLs for each document to mirror permissions within its system in near real time. It can enforce data access rules at the product, object, and record level, as sketched below.
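A minimal sketch of record-level enforcement at query time, assuming each document carries a mirrored ACL of allowed users and groups (the data model here is an assumption, not Glean's actual schema):

    from dataclasses import dataclass, field

    @dataclass
    class Document:
        doc_id: str
        allowed_users: set = field(default_factory=set)
        allowed_groups: set = field(default_factory=set)

    def user_can_view(user: str, groups: set, doc: Document) -> bool:
        # Record-level check against the mirrored ACL.
        return user in doc.allowed_users or bool(groups & doc.allowed_groups)

    def filter_results(user: str, groups: set, results: list) -> list:
        """Drop any search result the user isn't permitted to see."""
        return [d for d in results if user_can_view(user, groups, d)]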
Control Plane
All control plane operations (deploying code, data files, and configs; updating system metadata; etc.) are done via Cloud APIs (e.g. the gcloud SDK). They are fully authenticated (IAM) and made over secure channels. These operations are logged as part of GCP's comprehensive admin activity logging in Stackdriver. Since these are done via Cloud APIs, it's very unlikely they would change without the previous versions still being supported; more likely, either the customer changed their configs on their side, or there was some IAM permission change.
* Wonder if I can ask Greg what common mistakes people who have done the collaboration have made.
Logging, Monitoring, Alerting, Tracing, Audit trail
All of this is done using Stackdriver.
- Query/activity logs: user identity and other details for each query on Glean
- System deployment logs: full log of all software upgrade operations
Logs stored in GCP project
- Non-PII logs (400-day retention) are available in the Stackdriver GCP console.
- PII logs are in the glean_sensitive_logs_bigquery and audit_logs BigQuery tables in the GCP project (30-day retention). PII here means information like employee emails or permission group names, not content stored in the document body. IAM roles for these BigQuery tables only allow the GCP project's admin, not Glean employees. Only in rare debugging scenarios do Glean employees look up specific log entries using debugging APIs to debug production issues, and this has to be authorized by the engineering leadership team.
- Comprehensive GCP audit logs (400-day retention): logging of changes to GCP system components is enabled by default unless GCP organization policies are set to prevent some type of audit logging. Glean employees can view the admin activity and system events audit logs, which don't contain PII. Logs are available for searches and actions done by the customer's employees in Glean, but are not accessible to Glean employees unless the customer has allowed it for debugging purposes.
- GCP storage bucket (270-day retention) scio-<projectid>-query-endpoint-access: logs for all search queries being made; each entry has the user identity and the query performed.
- GCP storage buckets (270-day retention) scio-<projectid>-search-query, scio-<projectid>-search-result, scio-<projectid>-search-result-feedback: the results returned per query and the clicks/views on those results, used mainly by the ranking pipeline to improve search.

Sample GCP log filter, e.g. querying all access from a certain user or service account:
logName:"cloudaudit.googleapis.com"
protoPayload.authenticationInfo.principalEmail="<user or service account email, e.g. 550282177806@cloudbuild.gserviceaccount.com>"
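The same filter can be run programmatically with the google-cloud-logging client; the project id below is a placeholder:

    from google.cloud import logging

    client = logging.Client(project="customer-project-id")
    filter_str = (
        'logName:"cloudaudit.googleapis.com" '
        'protoPayload.authenticationInfo.principalEmail='
        '"550282177806@cloudbuild.gserviceaccount.com"'
    )
    for entry in client.list_entries(filter_=filter_str, page_size=50):
        print(entry.timestamp, entry.log_name)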
The GCP Error Reporting dashboard counts, analyzes, and aggregates the crashes in the running cloud services; stack traces are visible to Glean employees.

Anonymized (non-PII) logs sent to Glean's central server: the system sends anonymized non-PII logs from the project back to Glean's central server (scio-apps). They are completely anonymized: user ids, document urls, query terms, and other PII are scrubbed and sent in hashed form, and the hash can only be used to correlate actions done by a user or on a document in a search session without knowing any details of the user, query, or document. They are sent to a BigQuery table in Glean's central GCP project by a GCP log sink. Workflow for anonymized logs: Glean code has specialized APIs for logging info that needs to be anonymized to Google Stackdriver, to specific named logs. A hashing sketch follows below.
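A minimal sketch of the scrub-and-hash idea: a keyed hash maps the same user or document to the same opaque value, so actions can be correlated within a session without revealing the raw value. The key and exact scheme are assumptions, not Glean's actual code:

    import hashlib
    import hmac

    ANON_KEY = b"per-deployment-secret"  # placeholder key

    def anonymize(value: str) -> str:
        # Same input -> same hash, enabling correlation without disclosure.
        return hmac.new(ANON_KEY, value.encode(), hashlib.sha256).hexdigest()

    event = {
        "user": anonymize("alice@customer.com"),
        "doc": anonymize("https://drive.example.com/doc/123"),
        "action": "click",  # non-PII fields pass through unchanged
    }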
Data Retention
Mirrors the data retention policy of the data source that it indexes.
Data Encryption
All of the data is encrypted by GCP natively using AES-256 or better. All in-transit data goes over TLS 1.2 or better, communication with the VPC is protected by Google, and data is encrypted if it moves out of Google's infrastructure. For sensitive content like API tokens, an additional layer of AES-256 encryption is applied using Google-managed symmetric keys that are automatically rotated monthly; the ability to invoke the key management service is restricted to very few service accounts in the GCP project using IAM roles. A sketch of that extra layer follows below.
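A minimal sketch of an additional AES-256 layer in an authenticated mode (AES-GCM). In the sketch the key is generated locally for illustration; in the real system the key is Google-managed and rotated monthly:

    import os

    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    key = AESGCM.generate_key(bit_length=256)  # placeholder for a KMS-managed key

    def encrypt_token(token: bytes) -> bytes:
        nonce = os.urandom(12)  # unique nonce per encryption
        return nonce + AESGCM(key).encrypt(nonce, token, None)

    def decrypt_token(blob: bytes) -> bytes:
        nonce, ciphertext = blob[:12], blob[12:]
        return AESGCM(key).decrypt(nonce, ciphertext, None)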
Least privilege access model
Glean follows a least privilege access model.
- Owner: the customer IT admin is the project owner. Glean doesn't have access to this service account. It isn't used other than for performing exceptional operations like deleting the project.
- Monitoring: this account has access to the Stackdriver monitoring, non-PII logging, and alerting dashboards. Glean has access; there is no access to customer data.
- Query debugging: the customer can specify a restricted set of search queries that the Glean team can use for debugging product issues.
Glean Access details for GCP account - External
Security considerations for system components
Content Connectors
Content connectors make outbound HTTPS connections to many external systems using credentials stored in a Cloud SQL database; if the data isn't being updated, maybe one of these credentials is incorrect. They also make connections to other Glean components, which are made using GCP's client libraries, authenticated using Google account credentials, and made over secure channels. These don't really seem like they can break, but maybe check whether Google is facing an outage and that's why it's broken.
Secrets Store
When an application is connected to Glean, the credentials (client id, secret, tokens, etc.) are stored in a secure store (the secrets store), which resides inside the GCP project. The secrets store relies on GCP's native KMS (key management service), which rotates the key every month. The content connectors query the secrets store to fetch the credentials needed to make API calls to the enterprise applications; credentials are fetched and then discarded, and are not stored in any other component. A fetch-and-discard sketch follows below.
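A minimal sketch of the fetch-and-discard pattern using the google-cloud-kms client to decrypt a stored credential just long enough to use it; all resource names are placeholders:

    from google.cloud import kms

    client = kms.KeyManagementServiceClient()
    key_name = client.crypto_key_path(
        "customer-project-id", "global", "glean-keyring", "secrets-key"
    )

    def fetch_credential(ciphertext: bytes) -> bytes:
        response = client.decrypt(request={"name": key_name, "ciphertext": ciphertext})
        return response.plaintext  # use immediately; never persist elsewhere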
Tasks Queue
Glean uses several Cloud Tasks queues for managing crawl tasks. The crawlers perform a small part of the crawl and then themselves post messages into the tasks queue to schedule crawls of the remaining parts (see the sketch below). Potentially, if the queue is not set up correctly or is broken, then things might not be updating.
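A minimal sketch of that self-scheduling pattern with the google-cloud-tasks client; the queue name, URL, and cursor shape are placeholders:

    import json

    from google.cloud import tasks_v2

    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path("customer-project-id", "us-central1", "crawl-queue")

    def schedule_remaining_crawl(datasource: str, cursor: str) -> None:
        """Post a task so the next crawler invocation resumes from `cursor`."""
        task = {
            "http_request": {
                "http_method": tasks_v2.HttpMethod.POST,
                "url": "https://crawler.internal.example/crawl",  # internal endpoint
                "body": json.dumps({"datasource": datasource, "cursor": cursor}).encode(),
            }
        }
        client.create_task(request={"parent": parent, "task": task})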
CRON, used to kick off the crawls periodically.
SQL (for Config Store, Secrets Store, Document Store and Identities and Permissions store)
These are Cloud SQL instances running MySQL. The instances have private IPs and enable client TLS (Transport Layer Security, the cryptographic protocol designed to provide secure communication over a computer network such as the internet; it's the successor to the Secure Sockets Layer (SSL) protocol and is used to encrypt data as it's transmitted between two endpoints). We run Cloud SQL proxies that allow the other subsystems to connect to the SQL instances without having to go through the public IPs.
Dataflow
A Google-managed Dataflow pipeline used to index the crawled content. Internal to the cloud project; the Dataflow workers use private IPs.
Pub Sub
Glean uses Google Cloud Pub/Sub for triggering the Dataflow pipeline. Internal to the cloud project. A publish sketch follows below.
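A minimal sketch of the trigger using the google-cloud-pubsub client; the project and topic names are placeholders:

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("customer-project-id", "index-content")

    # Publish a "content ready" message for the indexing pipeline to consume.
    future = publisher.publish(topic_path, b'{"doc_id": "123"}')
    future.result()  # block until the message is accepted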
Elasticsearch in Kubernetes
Glean runs one or more instances of Elasticsearch in Google Kubernetes Engine clusters. Internal to the cloud project, with shielded VMs, antivirus, and file integrity monitoring. Master authorized networks are enabled on the per-cluster GKE master (API server) for extra protection, and it acts as a proxy for requests to Elasticsearch, authenticated via Google IAM. A lot of things are authenticated via Google IAM, so I wonder whether you can rule out Google IAM as the problem when nothing else is broken. Do they all share the same key or not?
Query Endpoint (QE)
An App Engine app (public IP, but HTTPS only); the service that the client code (running in the browser) talks to. The endpoint enforces authentication via OAuth 2.0/SAML from a customer-specific provider (e.g. Okta).
Crawler
Does not accept any requests from outside the project. It's invoked by the Cloud Tasks and Cloud Scheduler services, and it enforces that requests come from these services using GCP-provided mechanisms.
Datasource web hook handler
Has public endpoints that are used for receiving webhook events from datasources. Requests from a datasource are authenticated using a secret established out of band between the Glean system and the datasource; no other requests are allowed. The events serve primarily as crawl hints and trigger the system to make the API call to fetch the modified content. Hence, if the webhook handler is broken, the crawl hints that trigger the system to fetch the modified content wouldn't happen.
Query blacklisting controls
The project admin can configure a text file in the project's Cloud Storage bucket with a list of terms that shouldn't appear in a search query; if the user's query matches one of the terms, an error is returned (see the sketch below). Check this too, to see whether the list of terms/regexes is too broad.
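A minimal sketch of the check, assuming the bucket file holds one term or regex per line; the file format and error shape are assumptions:

    import re

    def load_blocked_terms(path: str) -> list:
        # One term or regex per line, as assumed here.
        with open(path) as f:
            return [re.compile(line.strip(), re.IGNORECASE) for line in f if line.strip()]

    def validate_query(query: str, blocked: list) -> None:
        for pattern in blocked:
            if pattern.search(query):
                raise ValueError("query contains a blocked term")  # surfaced to the user as an error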
Bastion Host
A small GCE VM used as a proxy for maintenance operations that need to interact with nodes that only have private IPs (SQL, GKE). The host itself only has private IPs; we use SSH via IAP (Identity-Aware Proxy) tunneling to connect to it.
Glean Central Project
Hosts static files, DNS, and central applications. Some anonymized logs and metrics are also sent to the central project for analytics. It's used for easier and more secure integration of various SaaS applications with Glean (e.g. Slack, GitHub, etc.), and some webhook content passes through the central project. Glean never stores any customer data in the central project. All communication between the central project and customer projects happens over HTTPS and is additionally secured by encryption/content signatures using shared GCP KMS keys. If this is undesirable, applications can be configured to use isolated apps, which comes with some additional setup and maintenance cost.
scio-apps - the central GCP project that's owned and managed by the Glean team.
The Glean system runs in an isolated tenant (a single GCP project) that can be hosted either in the customer’s cloud account or Glean’s central cloud account (based on the choice made by the customer). In general, everything to do with the customer’s Glean instance is stored in that tenant and no information leaves the tenant. However, there are some anonymized logs/metrics that are sent to Glean’s central cloud service (which lives outside of the customer’s tenant). This document describes the data that is shared outside of the customer’s tenant.