GKE vs Autopilot vs Cloud Run analysis


❓ Problem statement

As a: Product Owner
I want to: understand the pros and cons of running services on Cloud Run compared to an AKS-like stack built on GKE in GCP

💡 Research insights

Who are we (MCS)?
What do we want to abstract away from the product teams?
What value do we want to bring to the teams?
Should we take only the niche of complex stateful applications?

📊 Solution hypothesis

We want to abstract the infrastructure complexity away from the development teams while still being able to tweak the full range of available configurations.
We will never be as cost-effective as the managed solutions from the providers.

Outcome

Cloud Run is advised by the Google team as the first choice for the product teams' stateless applications
GKE Autopilot is advised by the Google team for the product teams to set up independently for other kinds of applications
GKE Autopilot main concerns:
1. Service discovery. Currently no service mesh is in place, but Istio is on the roadmap for 2021.
2. No runtime security scanning for containers. Prisma Cloud should be integrated.
3. Monitoring. Should the Google monitoring tools be used, or should Splunk + Prometheus + Grafana be installed?
Apigee is not yet enabled in-house.
Projects in one region are to use a shared subnet; projects in different regions will share the same VPC.

⚖ Arguments

A team's customer journey in GCP now always starts with the GCP Core team, by choosing what type of project to request.
GCP project provisioning is already wrapped into a product/service offering by the GCP Core team.
The org node, folders and projects are managed by the GCP Core team.
We can either integrate with the GCP Core team to contribute to their offerings, or add value on top of them.
Our value: reducing complexity and accelerating product teams.
Our deliverables: decision trees, IaC for infra components?
Decision trees for compute: Cloud Functions vs App Engine vs Cloud Run vs GKE vs Anthos clusters vs Cloud Run for Anthos
IaC to roll out infra changes
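The "decision trees for compute" deliverable could be sketched as a small shell function. This is an illustration only: the criteria below and the order in which they are checked are assumptions, not an agreed MCS policy.

```shell
# Hypothetical sketch of a compute decision tree as a shell function.
# All criteria and their ordering are assumptions for illustration.
choose_compute() {
  local containerized=$1   # yes/no: is (or can) the workload be containerized?
  local stateless=$2       # yes/no: no local persistent state?
  local needs_k8s_api=$3   # yes/no: needs raw Kubernetes APIs (CRDs, DaemonSets, ...)?
  local multi_cloud=$4     # yes/no: must run across multiple clouds?

  if [ "$containerized" = no ]; then
    echo "Cloud Functions / App Engine"          # source-based platforms
  elif [ "$multi_cloud" = yes ]; then
    echo "Anthos clusters / Cloud Run for Anthos"
  elif [ "$needs_k8s_api" = yes ]; then
    echo "GKE Autopilot"                          # full k8s API, managed nodes
  elif [ "$stateless" = yes ]; then
    echo "Cloud Run"                              # first choice for stateless services
  else
    echo "GKE Autopilot"                          # stateful containerized workloads
  fi
}
```

A real decision tree would add more branches (GKE Standard for node-level control, for example), but the shape stays the same: a few yes/no questions leading to one recommended platform.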

🏐 Responsibility schema


Mapping of components

TL;DR: a managed tool removes complexity, letting the team focus on the core problem.

0. Network
Function: networking all the components together
GKE Autopilot option: VPC, subnetworks
Pricing in €: free
Complexity: Deployment ⭐⭐⭐🌑🌑 (okay): the design must come first; Terraform skills are required for IaC. Operations ⭐⭐⭐⭐🌑 (easy): not subject to frequent change.

1. cert-manager
Function: managing SSL certificates
GKE Autopilot option: certificates are provided by Google; no support for 3rd-party webhooks. Google-managed SSL certificates aren't supported for regional external HTTP(S) load balancers and internal HTTP(S) load balancers; for these load balancers, use self-managed SSL certificates. https://cloud.google.com/kubernetes-engine/docs/how-to/managed-certs
Pricing in €: free, https://cloud.google.com/load-balancing/docs/quotas#ssl_certificates
Complexity: Google-managed SSL certificates: Deployment ⭐⭐⭐⭐⭐ (very easy): a built-in GCP tool; you only need to activate an API; IaC is available. Operations ⭐⭐⭐⭐⭐ (very easy): certificates are renewed automatically.

2. external-dns
Function: DNS discovery
GKE Autopilot option: Cloud DNS; NodeLocalDNSCache (free, an add-on to kube-dns). It is also possible to install external-dns and kube-dns.
Pricing in €: DNS queries (port 53): 0.344299999 per 1,000,000 requests; managed zone: 0.172149999 per month
Complexity: Cloud DNS: Deployment ⭐⭐⭐⭐⭐ (very easy): a built-in GCP tool; you only need to activate an API; IaC is available; basic Terraform is enough for IaC. Operations ⭐⭐⭐⭐⭐ (very easy): console, commands and IaC are available; interest is max. external-dns: Deployment ⭐⭐⭐🌑🌑 (okay): a custom tool whose setup must be converted into a Terraform template. Operations ⭐⭐⭐⭐🌑 (easy): maintaining the custom Terraform template. Interest: okay; it's a custom tool, but it's free.

3. gatekeeper-system
Function: pod security policies
GKE Autopilot option: S1: Admission Controllers for pod security policies. External Gatekeeper will be supported in Autopilot in Q2 2022.
Pricing in €: included
Complexity: only Admission Controllers are available for pod security policies, so for now there is nothing to manage; a custom tool is coming in Q2 2022. Configuration ⭐🌑🌑🌑🌑 (hard): you need to know the theory and what you are doing; Terraform skills are a must for IaC, and it will take some capacity. Operations ⭐⭐🌑🌑🌑: should not be subject to frequent change.

4. ingress-nginx
Function: N1: egress proxy, NAT, firewall; N2: exposing apps
GKE Autopilot option: N1: Cloud NAT (private clusters only); N2: HTTP load balancing (it is also possible to install ingress-nginx). https://cloud.google.com/kubernetes-engine/docs/concepts/gateway-api
Pricing in €: N1: NAT gateway data processing (Frankfurt) 0.038733749 per GB; NAT gateway uptime (Frankfurt) 0.001205049 per hour. N2 (https://cloud.google.com/vpc/network-pricing#lb): network load balancing data processing (Frankfurt) 0.008607499 per GB; network internal load balancing data processing (Frankfurt) 0.008607499 per GB; internal HTTP(S) load balancing proxy instance (EMEA) 0.021518749 per hour; HTTPS external regional inbound data processing (Europe) 0.006972074 per hour; network load balancing forwarding rule minimum service charge (Frankfurt) 0.025822499 per hour; forwarding rule additional service charge (Frankfurt) 0.010328999 per hour.
Complexity: N1, Cloud NAT: Deployment ⭐⭐⭐⭐🌑 (easy): a built-in GCP tool; activate an API; IaC is available; a basic network design and basic Terraform are enough. Operations ⭐⭐⭐⭐⭐ (very easy): nothing to maintain; console, commands and IaC are available; interest is max. N2, HTTP load balancing (recommended and chosen): Deployment ⭐⭐⭐🌑🌑 (okay): a built-in GCP tool, but you still need to design the network and rules, then create the load balancer for your apps; Terraform skills are needed for IaC. Operations ⭐⭐⭐⭐🌑 (easy): rules and load balancers are maintained via console, commands or IaC. ingress-nginx: Deployment ⭐🌑🌑🌑🌑 (hard, and VERY hard to replace the NAT): a detailed network design is required to configure the rules, and extended Terraform skills are required. Operations ⭐⭐⭐🌑🌑 (okay): rules and load balancers are maintained via Terraform. Interest: compare the person-months to develop against Google's management fees.

5. kured*
Function: node management
GKE Autopilot option: nodes are managed by Google
Pricing in €: included in the compute cost
Complexity: N/A

6. linkerd*
Function: service mesh
GKE Autopilot option: none yet; Istio is on the roadmap for 2021. https://cloud.google.com/service-mesh/docs/managed/asmcli-experimental#gke-autopilot
Pricing in €: to request the roadmap
Complexity: recommended to use.

7. opentelemetry
Function: collecting telemetry
GKE Autopilot option: Cloud Logging, Cloud Trace and Error Reporting (OpenTelemetry integration is on the roadmap). https://hmgroup.atlassian.net/wiki/spaces/BTIGC/pages/713883827 https://cloud.google.com/stackdriver/docs/solutions/gke/managing-metrics#workload-metrics
Complexity: Cloud Logging, Cloud Trace and Error Reporting: Deployment ⭐⭐⭐⭐⭐ (very easy): built-in GCP tools; activate an API; no extra skills or capacity needed. Operations ⭐⭐⭐⭐⭐ (very easy): console and commands; no real maintenance. Interest is max: it is based on OpenTelemetry, but runs managed. Self-hosted opentelemetry: Deployment ⭐⭐⭐🌑🌑 (okay): use it if no GCP monitoring tools are used; configure it to send logs to Splunk. Operations ⭐⭐⭐⭐🌑 (easy): little to maintain; Terraform skills are required when changes are needed.

8. Prom-Operator
Function: monitoring
GKE Autopilot option: Cloud Monitoring. https://hmgroup.atlassian.net/wiki/spaces/BTIGC/pages/713883827
Complexity: Cloud Monitoring: Deployment ⭐⭐⭐⭐⭐ (very easy): a built-in GCP tool; activate an API; no extra skills or capacity needed. Operations ⭐⭐⭐⭐⭐ (very easy): console and commands; no real maintenance; interest is max. Prometheus: Deployment ⭐🌑🌑🌑🌑 (hard): still not done in MCS. Operations ⭐⭐🌑🌑🌑 (somewhat hard): extended Terraform is required.

9. twistlock
Function: S1: security scanning of images; S2: runtime security scanning of executing containers
GKE Autopilot option: S1: https://cloud.google.com/container-analysis/docs/container-scanning-overview. S2: Container Threat Detection is not yet available in Autopilot (on the 2022 roadmap); integrate Prisma Cloud if this requirement is a must. Contact: Mor Shabi <mshabi@paloaltonetworks.com>
Pricing in €: S1: $0.26 per scanned container image, https://cloud.google.com/container-analysis/pricing
Complexity: Container scanning: Deployment ⭐⭐⭐⭐⭐ (very easy): a built-in GCP tool; activate an API. Operations ⭐⭐⭐⭐⭐ (very easy): scans every new image in the available repos on create; interest is max. Prisma Cloud: Deployment ⭐⭐🌑🌑🌑 (expected to be somewhat hard): the GCP to Prisma Cloud connector is in beta now. Operations ⭐⭐⭐🌑🌑 (okay): has to be integrated into the build pipelines; Terraform is required.

10. velero
Function: backup
GKE Autopilot option: Backup for GKE (preview)
Pricing in €: in beta now
Complexity: no data yet

🌈 Design options

The options compared: MCS accelerator (AKS, the golden pattern), GKE Autopilot, Cloud Run, GKE, and Cloud Run for Anthos.

Can be run simultaneously
All five options: ✓

The sense
MCS accelerator (AKS): the golden pattern we compare every option to.
GKE Autopilot: auto health monitoring, health checks and auto-repair; pay-per-pod; no need to calculate resource limits; managed control and data plane.
Cloud Run: a managed serverless compute platform for highly scalable containerized applications that can be invoked via web requests or Pub/Sub events. Based on the open Knative standard; Google manages the control and data plane: autoscaling, load balancing, health checks and auto-repair.
GKE: Google Kubernetes Engine is a managed Kubernetes service that facilitates the orchestration of containers via declarative configuration and automation. It runs Certified Kubernetes; the control plane is mostly managed by Google.
Cloud Run for Anthos: a platform to manage multiple clouds in one place.

Pricing
MCS accelerator (AKS): pay per node
GKE Autopilot: pay per pod
Cloud Run: pay per use
GKE: pay per node (VM)
Cloud Run for Anthos: included as part of Anthos

Eliminating the idle cost
MCS accelerator (AKS): ✗ (current Azure costs are roughly 95% idle)
GKE Autopilot: ✓/✗ (pods get limits set equal to their requests)
Cloud Run: ✓ (though it is possible to configure always-allocated CPU and memory)
GKE: ✗
Cloud Run for Anthos: ✗

Scaling
MCS accelerator (AKS): ✓ the cluster is autoscaled
GKE Autopilot: ℹ pre-configured: Autopilot handles all node scaling and configuration; by default you configure horizontal pod autoscaling (HPA) and vertical pod autoscaling (VPA)
Cloud Run: ℹ automatic
GKE: ✓ optional: node auto-provisioning, cluster autoscaling, HPA, VPA
Cloud Run for Anthos: ✓ optional: node auto-provisioning, cluster autoscaling, HPA, VPA

Secret Manager
MCS accelerator (AKS): ✓/✗ pay per use
GKE Autopilot: ✓/✗ pay per use
Cloud Run: ✓ included
GKE: ✓/✗ pay per use
Cloud Run for Anthos: ✓ included

Cloud Monitoring, Cloud Logging, Cloud Trace and Error Reporting
MCS accelerator (AKS): ✓ monitoring and logging. L2: Dynatrace, Grafana, ServiceMonitors, Prometheus Operator, Splunk; L1: logging layer on Fluentd
GKE Autopilot: ✓ pre-configured: L1: system and workload logging, system monitoring; optional L2: system and workload monitoring. ✗ most external monitoring tools require access that is restricted; solutions from several Google Cloud partners are available on Autopilot, but not all are supported, and custom monitoring tools cannot be installed on Autopilot clusters
Cloud Run: ✓ included, GCP tools only
GKE: ✓ pay for use. Default: system and workload logging, system monitoring; optional: system-only logging, system and workload monitoring
Cloud Run for Anthos: ✓ included, GCP tools only

Node configuration: upgrading, scaling, OS-related settings, SSH access, privileged pods
MCS accelerator (AKS): ✓ IaC, kured
GKE Autopilot: ✗ this is the main feature: to troubleshoot Autopilot nodes, a user must contact Cloud Customer Care to obtain a member name required to access the cluster
Cloud Run: ✗
GKE: ✓
Cloud Run for Anthos: ✓

Sidecars (Istio, etc.)
MCS accelerator (AKS): ✓
GKE Autopilot: ✗ not yet, but already experimentally supported (https://cloud.google.com/service-mesh/docs/unified-install/managed-asmcli-experimental); ✗ MutatingWebhookConfiguration; ✗ Linux NET_ADMIN
Cloud Run: ✗
GKE: ✓
Cloud Run for Anthos: ✓

Scale-to-zero containers
MCS accelerator (AKS): ✓ ability to install Knative
GKE Autopilot: ✓ ability to install Knative
Cloud Run: ✓ main feature!
GKE: ✓ ability to install Knative
Cloud Run for Anthos: ✓

Custom machine types
GKE Autopilot: ✗; ✗ preemptible VMs
Cloud Run: ✗ limited CPU and memory
GKE: ✓
Cloud Run for Anthos: ✓ standard or custom machine types on Anthos, including GPUs

GPU, TPU
GKE Autopilot: ✗ currently
Cloud Run: ✗
GKE: ✓
Cloud Run for Anthos: ✓

Nodes per cluster
GKE Autopilot: 400; it is possible to lift this quota
GKE: 15,000 for GKE versions 1.18 and later

Pods per node
GKE Autopilot: up to 32, plus DaemonSet pods
Cloud Run: only 1
GKE: up to 110
Cloud Run for Anthos: up to 110

Containers per cluster
GKE Autopilot: 300,000
Cloud Run: up to 1,000 container instances by default; can be increased via a quota request
GKE: 300,000
Cloud Run for Anthos: 300,000

Image type
GKE Autopilot: pre-configured: Container-Optimized OS with containerd
GKE: one of the following: Container-Optimized OS with containerd, Container-Optimized OS with Docker, Ubuntu with containerd, Ubuntu with Docker, Windows Server LTSC, Windows Server SAC

Maximum memory size
MCS accelerator (AKS): any
GKE Autopilot: CPU is available in 0.25-vCPU increments (0.01 for DaemonSets) and must be in a ratio between 1:1 and 1:6.5 with memory; Autopilot replaces limits with the given requests
Cloud Run: 16 GiB max per container instance; files written to the local filesystem count towards available memory and may cause the container instance to go out of memory and crash
GKE: any

Storage
MCS accelerator (AKS): ✓
GKE Autopilot: ✓ only: "configMap", "csi", "downwardAPI", "emptyDir", "gcePersistentDisk", "hostPath", "nfs", "persistentVolumeClaim", "projected", "secret"
Cloud Run: ✗ no storage volumes or persistent disks; ✓ GCP services only
GKE: ✓
Cloud Run for Anthos: ✗ "configMap" is in beta

Networking
MCS accelerator (AKS): ✓ N1: ingress controller on nginx with cert-manager and external-dns; N2: network security groups
GKE Autopilot: ✓ pre-configured: VPC-native (alias IP), maximum 32 pods per node, intranode visibility, NodeLocalDNSCache, N1: HTTP load balancing, N2: Admission Controllers for pod security policies. Default: public cluster, default CIDR ranges, network name/subnet. Optional: private cluster (a must for us), N1: Cloud NAT (private clusters only), authorized networks (a must for us), network policy (a must for us)
Cloud Run: access to a VPC / Compute Engine network via Serverless VPC Access; services cannot be part of the Istio service mesh
GKE: optional: VPC-native (alias IP), maximum 110 pods per node, intranode visibility, CIDR ranges and max cluster size, network name/subnet, private cluster, Cloud NAT, network policy, authorized networks
Cloud Run for Anthos: access to a VPC / Compute Engine network; services participate in the Anthos Service Mesh

DNS service discovery
GKE Autopilot: ✓ NodeLocalDNSCache preconfigured
Cloud Run: ✗ must use the full (*.run.app) URL; ✓ possible with the third-party runsd
GKE: ✓ optional
Cloud Run for Anthos: ✓ optional

Supported protocols (only)
MCS accelerator (AKS): ✓
Cloud Run: ✓ HTTP/1, HTTP/2, WebSockets, gRPC, Pub/Sub push events; ✗ GraphQL, HTTP/2 server push
GKE: ✓
Cloud Run for Anthos: ✓ HTTP/1, HTTP/2, WebSockets, gRPC, Pub/Sub push events; ✗ GraphQL

Request timeout
Cloud Run: up to 60 minutes

Latency
GKE: https://blog.yongweilun.me/gke-ingress-is-slower-than-you-think https://cloud.google.com/blog/products/networking/using-netperf-and-ping-to-measure-network-latency
Cloud Run: average latency < 0.21 s at p50, < 1 s at p95, < 2 s at p99

Security
MCS accelerator (AKS): ✓ S1: open-policy-agent/gatekeeper for pod security policies; S2: Prisma Cloud
GKE Autopilot: pre-configured: Workload Identity, shielded nodes, secure boot, Filestore CSI driver, S1: Admission Controllers for pod security policies. Optional: customer-managed encryption keys (CMEK), application-layer secrets encryption, Google Groups for RBAC. ✗ NOT supported: binary authorization, Kubernetes alpha APIs, legacy authentication options, S2: Container Threat Detection, OPA Gatekeeper, Policy Controller
Cloud Run: ✓ managing access using IAM
GKE: optional: Workload Identity, shielded nodes, secure boot, application-layer secrets encryption, binary authorization, customer-managed encryption keys (CMEK), Google Groups for RBAC, Compute Engine service account
Cloud Run for Anthos: ✓ same as GKE

URLs
MCS accelerator (AKS): letsencrypt
GKE Autopilot: custom domains only, with manual SSL certificates
Cloud Run: automatic service URLs and SSL certificates
GKE: custom domains only, with manual SSL certificates
Cloud Run for Anthos: custom domains only, with manual SSL certificates

Upgrades, repair and maintenance
MCS accelerator (AKS): managed by us
GKE Autopilot: pre-configured: node auto-repair, node auto-upgrade, maintenance windows, surge upgrades
Cloud Run: managed
GKE: optional: node auto-repair, node auto-upgrade, maintenance windows, surge upgrades
Cloud Run for Anthos: optional: node auto-repair, node auto-upgrade, maintenance windows, surge upgrades

Zero downtime
MCS accelerator (AKS): managed by us
GKE Autopilot: managed by us
Cloud Run: ✓ managed traffic splitting between revisions
GKE: managed by us
Cloud Run for Anthos: ✓ managed traffic splitting between revisions

Container isolation
MCS accelerator (AKS): default Kubernetes container isolation
GKE Autopilot: Kubernetes Admission Controllers; the seccomp profile is preconfigured
Cloud Run: strict container isolation based on the gVisor sandbox
GKE: default Kubernetes container isolation
Cloud Run for Anthos: default Kubernetes container isolation

Create a Certificate Signing Request
MCS accelerator (AKS): ✓ cert-manager
GKE Autopilot: ✗ certificates are provided by Google; no support for 3rd-party webhooks. https://cloud.google.com/kubernetes-engine/docs/how-to/managed-certs
Cloud Run: ✗ certificates are provided by Google
GKE: ✓
Cloud Run for Anthos: ✗ certificates are provided by Google

Execution environments
Cloud Run: fully managed on Google infrastructure
GKE: GKE on Google Cloud
Cloud Run for Anthos: GKE on Anthos

Cluster add-ons
MCS accelerator (AKS): ✓ CA1 (service mesh): Istio, Linkerd
GKE Autopilot: pre-configured: HTTP load balancing. Default: Compute Engine persistent disk CSI driver, NodeLocal DNSCache. ✗ NOT supported: Calico network policy, Cloud Build, Cloud Run, Cloud TPU, Config Connector, GPUs, CA1: Istio on Google Kubernetes Engine, Kalm, usage metering. Read more: https://cloud.google.com/files/shifting-left-on-security.pdf https://cloud.google.com/security/infrastructure/design
Cloud Run: GCP services only
GKE: optional: Compute Engine persistent disk CSI driver, HTTP load balancing, NodeLocal DNSCache, Cloud Build, Cloud Run, Cloud TPU, Config Connector, Istio on Google Kubernetes Engine, Kalm, usage metering
Cloud Run for Anthos: same as GKE

Backup
MCS accelerator (AKS): Velero, https://dev.azure.com/hmcloud/MCS-Build/_git/mcs-platform-velero
GKE Autopilot: Backup for GKE (preview)
GKE: Backup for GKE (preview)
Cloud Run for Anthos: Backup for GKE (preview)

SLA
GKE Autopilot: Kubernetes API and node availability
Cloud Run: service level objective of at least 99.95%
GKE: Kubernetes API availability
Cloud Run for Anthos: Kubernetes API availability
See also part 1 of the analysis.

🔥 GKE Autopilot

Planned for the future: the ability to convert a GKE Autopilot cluster into a GKE Standard cluster while preserving all the Kubernetes configuration.
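For reference, provisioning an Autopilot cluster is a single command. The cluster name, region, and network names below are placeholders:

```shell
# Sketch: creating a private Autopilot cluster with gcloud.
# mcs-autopilot, my-vpc and my-subnet are placeholder names.
gcloud container clusters create-auto mcs-autopilot \
  --region europe-west3 \
  --enable-private-nodes \
  --network my-vpc --subnetwork my-subnet
```

In an IaC setup the same cluster would be declared in Terraform or Pulumi rather than created imperatively.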

IaC pattern with Pulumi

Networking

Cloud NAT provides outbound internet access to resources in private clusters.
Cloud NAT implements outbound NAT in conjunction with a default route to allow your instances to reach the internet. It does NOT implement inbound NAT: hosts outside of your VPC network can only respond to established connections initiated by your instances; they cannot initiate their own new connections to your instances via NAT.
Also, to cut costs, traffic between our services can be kept internal where possible.
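The Cloud NAT setup described above is a router plus a NAT configuration. A minimal sketch with gcloud, where mcs-router, mcs-nat, my-vpc and the region are placeholder names:

```shell
# Sketch: giving a private cluster's nodes outbound internet access
# via Cloud NAT. All resource names are placeholders.
gcloud compute routers create mcs-router \
  --network my-vpc --region europe-west3

gcloud compute routers nats create mcs-nat \
  --router mcs-router --region europe-west3 \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges
```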

💎 Cloud Run

Cloud Run is Google's own implementation of Knative, an open-source platform built on top of Kubernetes and Istio. It is the layer between core infrastructure services and the developer experience, similar to the MCS Accelerator.
Knative has three core elements:
Build: This component is responsible for generating a container image from source code.
Serving: This component simplifies exposing an app without having to configure resources such as ingress and a load balancer.
Eventing: Eventing provides a mechanism to consume and produce events in a pub/sub style. This is useful for invoking code based on external or internal events.

With Cloud Run there is no need to:

Create a Dockerfile with all the dependencies and installation steps,
Build the container image from the Dockerfile,
Push the image to the container registry,
Create a Kubernetes YAML file for the Deployment with the container,
Create another YAML file for exposing the Deployment as a Service,
Deploy the Pod and Service,
Access the App through the endpoint.
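Assuming the gcloud CLI and a hypothetical service, the steps above collapse into a single command. With --source, Cloud Build builds and pushes the image (from a Dockerfile or via buildpacks) before deploying:

```shell
# Sketch: deploying straight from source to Cloud Run.
# my-service and the region are placeholders.
gcloud run deploy my-service \
  --source . \
  --region europe-west3 \
  --allow-unauthenticated
```

The command prints the service URL when the deployment finishes; no Kubernetes YAML is involved.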

Is my app a good fit for Cloud Run?

To be a good fit for Cloud Run, the app needs to meet all of the following criteria:
Serves requests, streams, or events delivered via HTTP, HTTP/2, WebSockets, or gRPC.
Does not require a local persistent file system.
Does not require more than the documented per-instance CPU and memory limits.
Meets one of the following criteria:
Is containerized.
You can otherwise containerize it.
Other kinds of applications may not be a good fit for Cloud Run.
If your application does processing while it is not handling requests, or stores in-memory state, it may not be suitable.
A container instance can be shut down at any time, including instances kept warm via a minimum instance count. An idle instance is not guaranteed to live longer than 15 minutes, but while warm it handles a new request faster than a cold service. Outside of request handling, only a short CPU allowance (around 10 seconds) is allocated to the instance.
After 15 minutes without traffic, if no warm instances are configured, the service is scaled to zero.
Cloud Scheduler can be used to regularly request a service to keep it alive.
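Both keep-warm approaches mentioned above can be set up with gcloud. Service name, region and URL below are placeholders:

```shell
# Sketch: two ways to avoid cold starts on Cloud Run.

# 1) Keep at least one warm instance (billed while idle):
gcloud run services update my-service \
  --region europe-west3 --min-instances 1

# 2) Or ping the service every 10 minutes with Cloud Scheduler:
gcloud scheduler jobs create http keep-warm \
  --schedule "*/10 * * * *" \
  --uri https://my-service-abc123-ey.a.run.app/healthz \
  --http-method GET
```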

Networking

Using an external HTTP(S) load balancer to route requests to serverless backends
Firewall rules are in place when using the VPC network.
Cloud Run can connect to a VPC network (including a Shared VPC); this is used, for example, to reach other services in the same project.
A public or private service can be triggered by:
an HTTPS request
gRPC calls
Pub/Sub push messages
Cloud Scheduler jobs (running services on a schedule)
Cloud Tasks (executing asynchronous tasks)
Eventarc events

The IAM invoker permission is always enforced.
Available ingress settings:
All: makes the service reachable from the internet. For public services, unauthenticated invocations are allowed.
Internal: only requests originating from a VPC network, Pub/Sub or Eventarc within the same project or VPC Service Controls perimeter are allowed to reach the service.
Internal and Cloud Load Balancing: only accepts internal requests and requests coming through HTTP(S) Load Balancing.
IAM authentication can be combined with these settings to prevent bypassing them via the default URL.
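The ingress settings listed above map directly to a gcloud flag. A sketch, with the service name and region as placeholders:

```shell
# Sketch: restricting a service to internal traffic plus the load balancer.
# Valid --ingress values: all, internal, internal-and-cloud-load-balancing.
gcloud run services update my-service \
  --region europe-west3 \
  --ingress internal-and-cloud-load-balancing
```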

Internal services

When accessing internal services, call them as you normally would, using their public URLs: either the default run.app URL or a custom domain set up in Cloud Run.
For requests from other Cloud Run services or from Cloud Functions in the same project, connect the service or function to a VPC network and route all egress through the Serverless VPC Access connector.
Requests from resources within VPC networks in the same project are classified as internal, even if the originating resource has a public IP address.
There is no way to call internal services from traffic sources that do not originate from a VPC network, except for Pub/Sub or Eventarc. This means that Cloud Scheduler, Cloud Tasks and Workflows cannot call internal services.

Cloud Run for Anthos

With Cloud Run for Anthos, we actually get a Knative installation (managed by Google) on the GKE cluster.
The service is the main resource of Cloud Run for Anthos. Each service lives in a specific GKE cluster namespace.
Each service exposes a unique endpoint.
Traffic can be split between different revisions (releases).
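Deploying to Cloud Run for Anthos uses the same gcloud command as managed Cloud Run, but targets a GKE cluster. Cluster name, location, project and image below are placeholders:

```shell
# Sketch: deploying a service to Cloud Run for Anthos on a GKE cluster.
# All names are placeholders.
gcloud run deploy my-service \
  --platform gke \
  --cluster my-cluster --cluster-location europe-west3 \
  --image gcr.io/my-project/my-service
```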