Glean Architecture overview


[Figure: GCP architecture diagram (Screen Shot 2023-03-19 at 6.19.34 PM.png)]
The GCP architecture diagram is divided into three parts: the Public App Engine Service, the Internal App Engine Services, and the GCP private IPs.
Public App Engine Service
User machines log in and search through the Query endpoint, which also handles SSO authentication and query processing. The Query endpoint then calls the Query processing App Engine services, which process /search API calls and return the search results.
Query Endpoint → Query processing App Engine Services, SQL, and Kubernetes Search Index
The Datasource events App Engine service is called through the API from the Glean client on the user's machine.
It receives query and visit activity events, which are used to provide relevant search results to the user. No information is returned to the user. These calls must also be authenticated.
It also accepts webhooks from cloud applications (Slack, Salesforce, etc.). These must be authenticated with an app secret rather than SSO, because the caller is not authenticated as a user.
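The notes say only that webhooks are authenticated with an app secret; they do not say how. A common pattern is for the sender to sign the request body with the shared secret and for the receiver to recompute and compare the signature. The sketch below assumes that pattern (HMAC-SHA256 over the raw body). The function names and the signature format are hypothetical and are not taken from Glean or from any particular cloud application.

```python
import hashlib
import hmac


def sign(app_secret: bytes, body: bytes) -> str:
    """Return a hex HMAC-SHA256 signature of the body (assumed scheme)."""
    return hmac.new(app_secret, body, hashlib.sha256).hexdigest()


def is_authentic(app_secret: bytes, body: bytes, received_signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign(app_secret, body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), received_signature.encode())


if __name__ == "__main__":
    secret = b"example-app-secret"  # hypothetical secret shared with the sender
    payload = b'{"event": "message_posted"}'
    signature = sign(secret, payload)
    print(is_authentic(secret, payload, signature))
    print(is_authentic(secret, b"tampered", signature))
```

Because the secret identifies the sending application rather than a person, this check fits the point above that webhooks cannot go through SSO.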
Internal App Engine Services
Query processing App Engine services:
processes /search API calls and returns the search results
Crawler App Engine Services:
only called from Google Cloud Scheduler and Google Cloud Tasks
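The notes do not say how the "only called from Google Cloud Scheduler and Google Cloud Tasks" restriction is enforced. One possible approach, sketched below, is for the handler to reject requests that lack the headers those services attach. The header names used here (X-AppEngine-QueueName for Cloud Tasks and X-CloudScheduler for Cloud Scheduler) are assumptions to verify against the current Google Cloud documentation, and the check is only meaningful if the platform prevents external callers from setting them. The function name is hypothetical.

```python
# Headers assumed to be attached by Cloud Tasks and Cloud Scheduler.
# Stored in lowercase because HTTP header names are case-insensitive.
TRUSTED_HEADERS = ("x-appengine-queuename", "x-cloudscheduler")


def is_internal_caller(headers: dict) -> bool:
    """Return True if the request carries one of the assumed internal headers."""
    normalized = {name.lower(): value for name, value in headers.items()}
    return any(normalized.get(name) for name in TRUSTED_HEADERS)


if __name__ == "__main__":
    print(is_internal_caller({"X-AppEngine-QueueName": "crawl-queue"}))
    print(is_internal_caller({"User-Agent": "curl/8.0"}))
```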
GCP Private IPs
SQL store:
Identity information (used to enforce permissions in search results):
users in each application
roles/groups those users are members of
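As a rough illustration of how identity information could be used to enforce permissions in search results, the sketch below keeps only the documents a user may see, either directly or through group membership. The Identity and Document classes and their fields are hypothetical; the notes do not describe the actual schema of the SQL store.

```python
from dataclasses import dataclass, field


@dataclass
class Identity:
    """A user and the roles/groups they belong to (hypothetical shape)."""
    user: str
    groups: set = field(default_factory=set)


@dataclass
class Document:
    """A search result with its access list (hypothetical shape)."""
    doc_id: str
    allowed_users: set = field(default_factory=set)
    allowed_groups: set = field(default_factory=set)


def is_visible(doc: Document, identity: Identity) -> bool:
    """A document is visible if the user or any of their groups is allowed."""
    return identity.user in doc.allowed_users or bool(identity.groups & doc.allowed_groups)


def filter_results(results: list, identity: Identity) -> list:
    """Drop results the user is not permitted to see, preserving order."""
    return [doc for doc in results if is_visible(doc, identity)]


if __name__ == "__main__":
    alice = Identity("alice", {"sales"})
    results = [
        Document("d1", allowed_groups={"sales"}),
        Document("d2", allowed_users={"bob"}),
    ]
    print([doc.doc_id for doc in filter_results(results, alice)])
```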
Cloud Storage bucket:
document content from SQL → parse and process it using NLP and ML pipelines →
store ML models
entity concepts, synonyms, antonyms, important phrases, and other artifacts inferred from the parsed content
Ranking pipeline
processes the content from:
Cloud Storage
Document store + Cache / SQL:
Content Connector Handlers/Crawler App Engine Services:
Datasource webhooks
Teams/365/Salesforce/other app servers
Result:
Query Response:
populated search index hosted in the Kubernetes cluster
search index:
inverted index that maps each word to the IDs of the documents that contain it
metadata used to populate the search results
text content, stored in the index, needed to generate the snippets in the search results
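The inverted index described above can be illustrated with a minimal in-memory sketch: each word maps to the set of document IDs that contain it, and a multi-word query returns the documents containing every word. This is a toy version for illustration only. The function names, the tokenizer, and the AND-style matching are assumptions, and a production index hosted in a Kubernetes cluster would also handle ranking, metadata, and snippets.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs: dict) -> dict:
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return the IDs of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return matches


if __name__ == "__main__":
    docs = {
        "d1": "Quarterly sales report",
        "d2": "Sales onboarding guide",
    }
    index = build_index(docs)
    print(sorted(search(index, "sales")))
    print(sorted(search(index, "sales report")))
```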
The user first needs to log in. On a first login, the request goes through the identity Connector Handlers, and the user is validated against the Identity and Permissions store using the Scio App. On later logins, the user is simply signed in through SSO authentication and sent to the Query Endpoint. From that point the global-web-app is no longer involved; requests go to the Cloud project and Query endpoint of the user's own company.

The query is then parsed and sent to the Query Processing App Engine Services, which look it up in the search index and return the result.

The search index is built from two sources. The Ranking quality pipeline builds it from data in Cloud Storage (ML and NLP) and in the Document store + Cache. The Indexer also contributes to it; the Indexer is fed by the Doc builder pipeline, which reads from the Pub-sub system that pulls data from third-party cloud services such as Teams, 365, and Salesforce.

