What are some base assumptions based on the use cases of those present?
- AC: mobile games, we get a bunch of events coming in
  - current pipeline is on Google Cloud, using Dataflow to process events and Pub/Sub as the messenger
  - the streaming pipeline itself is incredibly straightforward: nothing gigantic, just basic validation, strip a couple of things out, add a couple of things in, and batch them into Snowflake
  - simple from a tasks point of view; it does "at least once" processing, which means they have to deduplicate at the end
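The "at least once, so deduplicate at the end" point can be sketched in a few lines (names here are illustrative, not the actual Dataflow job): the consumer remembers event ids it has already handled and drops redeliveries before the batch goes downstream.

```python
# Sketch of at-least-once deduplication: Pub/Sub-style delivery may hand
# the consumer the same event twice, so drop any id already seen before
# loading the batch into the warehouse.

def dedupe(events, seen):
    """Return only events whose 'id' has not been processed yet."""
    out = []
    for event in events:
        if event["id"] not in seen:
            seen.add(event["id"])
            out.append(event)
    return out

seen_ids = set()
batch_1 = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
batch_2 = [{"id": 2, "v": "b"}, {"id": 3, "v": "c"}]  # id 2 redelivered

print(dedupe(batch_1, seen_ids))  # both pass
print(dedupe(batch_2, seen_ids))  # only id 3 passes
```

In a real pipeline the seen-id state would live in durable storage (or the Snowflake load itself would dedupe on merge); an in-memory set is only for illustration.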
- AH: our flows have a batch mode and a real time mode
  - decompressed images: numpy arrays of ~20 MB, up to 10k of them; we map over them with control flow in between, depth-first
  - currently have an issue on dask because of out-of-memory problems; some of their nodes are actors
  - one image is the data structure of a bag, and a sequence of them is processed in terms of actors (remember that at this point in time the bag was here, now it's there)
  - when we do video handling, our sources are async; the frequency of updates is 20 fps, so the overhead can't be more than a millisecond for invoking 10 nodes
  - was playing a lot with `streamz`; his vision is that a topology defined in Prefect could be transformed into a `streamz` stream, because it is lightweight and has asynchronous function support
  - currently: cache everything in the graph that we know is the same from run to run, to avoid redundant work
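To make the millisecond-per-10-nodes argument concrete, here is a toy push-based node in the spirit of `streamz` (this is not the `streamz` API, just a sketch of why such a topology is lightweight): emitting into a chain of N nodes is just N plain function calls, with no scheduler in the path.

```python
# Toy push-based stream: each node holds downstream callbacks and pushes
# results forward on emit, so invoking a chain of nodes is only a few
# function calls -- cheap enough for tight per-frame latency budgets.

class Node:
    def __init__(self, func=lambda x: x):
        self.func = func
        self.downstream = []

    def map(self, func):
        child = Node(func)
        self.downstream.append(child)
        return child

    def sink(self, func):
        # a sink is just a map whose result is discarded
        return self.map(func)

    def emit(self, x):
        y = self.func(x)
        for child in self.downstream:
            child.emit(y)

results = []
source = Node()
source.map(lambda x: x * 2).sink(results.append)
source.emit(3)
source.emit(4)
print(results)  # [6, 8]
```

The real `streamz` adds async support and backpressure on top of essentially this shape, which is what makes it attractive as a compilation target for a Prefect topology.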
JC:
  - thinking of it more as an "infinite map", and kicking off downstream parts of the flow
  - the event source can use a poll or a push model; the push model is the same as hitting the API if it's something that is listening for streaming events
  - brain wants to think of it more as a map constantly running, and we may do things more downstream of it
AH: re: logging/browsing
  - a flow run per event would be impossible to manage, but we prefer that model because useful aspects like inner-flow conditionals and triggers exist for free
potential research route: can we get Prefect into `streamz` with a really slim flow?
backpressure
  - if your producer is producing faster than your consumer, we need to be able to throttle it at some level
  - however it is implemented, how much of this is Prefect's problem, and where do we put the limiting?
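One concrete place the limiting can live, sketched with a bounded queue (illustrative, not Prefect code): when the consumer falls behind, the full queue blocks the producer's `put()`, and that blocking is the backpressure.

```python
# Backpressure via a bounded queue: the producer is throttled to the
# consumer's pace because put() blocks while the queue is full.
import queue
import threading
import time

buf = queue.Queue(maxsize=2)   # bounded: producer blocks when full
consumed = []

def consumer():
    while True:
        item = buf.get()
        if item is None:       # sentinel: shut down
            break
        time.sleep(0.01)       # deliberately slow consumer
        consumed.append(item)

t = threading.Thread(target=consumer)
t.start()
for i in range(5):
    buf.put(i)                 # fast producer, throttled by maxsize
buf.put(None)
t.join()
print(consumed)  # [0, 1, 2, 3, 4]
```

The open question from the notes stands: whether this queue sits inside Prefect, in the event source, or in "something on top of their lambda".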
nice things about the PIN 14 model
a nice flow run record for every single event, with all the diagnostics that we set
proposal:
PIN 14, but we will microbatch with a configurable sliding window, plus a delay so we don't wait indefinitely for a full batch
essentially now we have a stateful actor, because it is making the decision of when to form a batch (so where is that state being stored, and what happens if it dies?)
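A minimal sketch of that microbatching actor (hypothetical helper, not PIN 14 code): flush when the batch hits `max_size`, or when `max_delay` has elapsed since the first buffered event, so a lone event is not stuck waiting for a full batch. The in-memory buffer is exactly the state that is lost if the actor dies.

```python
# Microbatcher: flush on size OR on elapsed delay since the first
# buffered event, whichever comes first.
import time

class MicroBatcher:
    def __init__(self, max_size, max_delay, flush):
        self.max_size = max_size
        self.max_delay = max_delay
        self.flush = flush       # callback that kicks off one flow run per batch
        self.buf = []
        self.first_at = None     # the stateful-actor state in question

    def add(self, event, now=None):
        now = time.monotonic() if now is None else now
        if not self.buf:
            self.first_at = now
        self.buf.append(event)
        if len(self.buf) >= self.max_size or now - self.first_at >= self.max_delay:
            self.flush(self.buf)
            self.buf = []

batches = []
b = MicroBatcher(max_size=3, max_delay=1.0, flush=batches.append)
b.add("a", now=0.0)
b.add("b", now=0.5)
b.add("c", now=0.6)   # size limit hit -> flush
b.add("d", now=2.0)
b.add("e", now=3.5)   # 1.5s since "d" -> delay limit hit -> flush
print(batches)  # [['a', 'b', 'c'], ['d', 'e']]
```

Note the delay-based flush only fires when another event arrives; a real implementation would also need a timer to flush an idle buffer, which is yet more state that has to survive a crash.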
  - what if the listener just picks up the events it missed while it was down?
  - related: kubernetes controllers, where on startup you check what you've missed during the downtime before moving on to new things
  - on startup you list them, somehow see what you've already processed, then look for changes to the bucket concurrently
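That controller-style catch-up can be sketched as follows (every function name here is a hypothetical stand-in for the real source listing and checkpoint store): on startup, list everything, skip what was already handled, process the remainder, then start watching for new changes.

```python
# Controller-style catch-up: reconcile the full listing against a record
# of what was already processed before resuming normal watching.

def catch_up(list_bucket, load_processed, handle):
    processed = load_processed()        # ids handled before the downtime
    for key in list_bucket():
        if key not in processed:
            handle(key)                 # process only what was missed
            processed.add(key)
    return processed                    # then begin watching for new changes

handled = []
done = catch_up(
    list_bucket=lambda: ["img1", "img2", "img3"],
    load_processed=lambda: {"img1"},    # img1 was done before the restart
    handle=handled.append,
)
print(handled)  # ['img2', 'img3']
```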
latency/overhead is an issue
  - make some cutoff based on Prefect internals; maybe do fancy microbatching ourselves
  - *** know the latency targets
JC: based on some dask feelings
  - batching every second (subsecond latency is almost certainly a no-go)
  - depending on what you're running on, kubernetes spin-up takes some amount of time
  - *** can we cache resources more intelligently?
*** “at least once” needs for validation of an event
*** can we make a 'superfast' agent that doesn't 'do' as much environment spin up for you?
cons of how to do this today:
- no backpressure mechanism, so you just die unless someone puts something on top of their lambda
- latency issues to get around
AC: something similar is going on in Spark Streaming after all: it microbatches so that the cost of the API overhead is amortized against the speed of the stream