Glean Architecture overview

[Figure: GCP architecture diagram]

The GCP architecture diagram is divided into three parts: the Public App Engine Service, the Internal App Engine Services, and the GCP Private IPs.

Public App Engine Service

User machines log in and search through the Query Endpoint, which also handles SSO authentication and query processing. The Query Endpoint then calls the Query Processing App Engine services, which process /search API calls and return the search results.

Query Endpoint → Query Processing App Engine Services, SQL, and Kubernetes Search Index

The Datasource Events App Engine service is called through the API from the Glean client's user machine.

It receives query and visit activity events, which are used to provide more relevant search results to the user. No information is returned to the user. These calls also need to be authenticated.

It also accepts webhooks from cloud applications (Slack, Salesforce, etc.). These must be authenticated using an app secret rather than SSO, because the caller is an application and not a user.
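The notes only say that webhooks are authenticated with an app secret; they do not say how. A common pattern is for the sender to sign the request body with the shared secret and for the receiver to recompute and compare the signature. The sketch below assumes HMAC-SHA256 over the raw body; the function names and the signing scheme are illustrative assumptions, not Glean's documented behavior.

```python
import hashlib
import hmac


def sign(app_secret: bytes, body: bytes) -> str:
    """Return a hex HMAC-SHA256 signature of the raw request body.

    Assumption: the cloud application and the receiver share `app_secret`.
    """
    return hmac.new(app_secret, body, hashlib.sha256).hexdigest()


def is_authentic(app_secret: bytes, body: bytes, signature: str) -> bool:
    """Check a webhook signature without leaking timing information."""
    expected = sign(app_secret, body)
    # compare_digest runs in constant time with respect to the contents.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

A receiver would reject any webhook for which `is_authentic` returns `False` before doing any further processing.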

Internal App Engine Services

Query processing App Engine services:

processes /search API calls and returns the search results

Crawler App Engine Services:

only called from Google Cloud Scheduler and Google Cloud Tasks

GCP Private IPs

SQL store:

Identity information (used to enforce permissions in search results):

users in each application

roles/groups those users are members of
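To illustrate how stored users and group memberships can be used to enforce permissions, here is a minimal sketch that filters candidate results down to the documents a user may see. The `Document` fields and the `visible_results` function are hypothetical; the notes do not describe the actual schema or where in the pipeline the check runs.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical document record with its access lists."""
    doc_id: str
    allowed_users: set[str] = field(default_factory=set)
    allowed_groups: set[str] = field(default_factory=set)


def visible_results(user: str, user_groups: set[str],
                    candidates: list[Document]) -> list[Document]:
    """Keep only documents the user can access directly or through a group."""
    return [
        doc for doc in candidates
        if user in doc.allowed_users or user_groups & doc.allowed_groups
    ]
```

The original ranking order of `candidates` is preserved, so the filter can run after ranking without reordering results.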

Cloud Storage bucket:

document content from SQL is parsed and processed using NLP and ML pipelines; the bucket stores the outputs:

ML models

entity concepts, synonyms, antonyms, important phrases, and other artifacts inferred from the parsed content

Ranking pipeline

processes the content from:

Cloud Storage

Document store + Cache / SQL:

Content Connector Handlers/Crawler App Engine Services:

Datasource webhooks

Teams/365/Salesforce/other app servers

Result:

Query Response:

populated search index hosted in the Kubernetes cluster

search index:

inverted index that maps each word to the IDs of the documents that contain it

metadata used to populate the search results

text content, stored in the index, that is needed to generate the snippets shown in the search results
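As a concrete picture of the inverted index described above, the following sketch builds a word-to-document-IDs map and answers a query by intersecting the ID sets. The tokenizer, the function names, and the all-words-must-match semantics are simplifying assumptions for illustration; the real index also carries metadata and snippet text and uses ranking.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of document IDs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return IDs of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

For example, with documents `d1` ("Quarterly sales report") and `d2` ("Sales onboarding guide"), the word `sales` maps to both IDs, while the query `sales onboarding` matches only `d2`.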

To summarize the flow: the user first needs to log in. If it is their first time, the login goes through the Identity Connector Handlers and the user is validated against the Identity and Permissions store using the Scio app. If it is not their first time, they are simply signed in through SSO authentication and sent to the Query Endpoint. At this point the global-web-app is no longer involved; requests go to the user's company-specific Cloud project and its Query Endpoint.

The query is then parsed and sent to the Query Processing App Engine services, which consult the search index and return the result.

The search index is built from two sources. The ranking quality pipeline creates it using data from Cloud Storage (ML and NLP) and the Document store + Cache. The Indexer also contributes to it; the Indexer is fed by the doc builder pipeline, which is in turn fed by the pub-sub system that pulls data from third-party cloud services such as Teams, 365, and Salesforce.
