Streaming in Prefect: contributor HQ

Contrib meetings

What are some baseline assumptions, given the use cases of those present?
- AC: mobile games; we get a bunch of events coming in
pipeline now is on Google Cloud, using Dataflow to process events and Pub/Sub as the messenger
the streaming pipeline itself is incredibly straightforward:
receive an event
check it isn't gigantic / basic validation
strip a couple of things out and add a couple of things in
batch them into Snowflake
simple point of view: tasks should be able to represent each of these steps
size: pretty small
speed: ~10k messages/sec
does 'at least once' processing now
means they have to deduplicate at the end (see the sketch after these notes)
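A minimal sketch of that pipeline shape in plain Python (the event schema, `validate`, `transform`, and the Snowflake sink are all illustrative names, not Prefect APIs); at-least-once delivery means the same event id can arrive twice, so the sink dedupes before batching:

```python
BATCH_SIZE = 500
seen_ids = set()   # in production this would need durable storage
batch = []

def validate(event: dict) -> bool:
    # "looks not gigantic / basic validation"
    return "id" in event and len(str(event)) < 1_000_000

def transform(event: dict) -> dict:
    # strip a couple of things out, add a couple of things in
    event.pop("debug_payload", None)   # hypothetical field
    event["ingested"] = True
    return event

def write_batch_to_snowflake(rows: list) -> None:
    print(f"writing {len(rows)} rows")   # stand-in for the real sink

def handle(event: dict) -> None:
    if not validate(event) or event["id"] in seen_ids:
        return                            # drop invalid events and duplicates
    seen_ids.add(event["id"])
    batch.append(transform(event))
    if len(batch) >= BATCH_SIZE:          # amortize the write cost
        write_batch_to_snowflake(batch)
        batch.clear()
```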

- AH: our flows have a batch mode and a real-time mode
decompressed images: numpy arrays of ~20MB, up to 10k of them that we map over
control flow in between; depth-first
currently have issues on Dask because of out-of-memory errors
some of their nodes are actors
one image is the data structure of a bag, and a sequence of them; they are processed in terms of actors (remember that at this point in time the bag was here, now it's there)
when we do video handling, our sources are async sources
frequency of updates is 20fps, so the overhead can't be more than a millisecond for invoking 10 nodes
was playing a lot with `streamz`
his vision is that the topology you define in Prefect could be transformed into a `streamz` stream, because it is lightweight and has asynchronous function support (see the sketch below)
currently: cache everything in the graph that we know is the same from run to run, to avoid recomputing it
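For reference, this is the kind of lightweight topology `streamz` already gives you (the per-frame functions here are placeholders; a Prefect task graph could in principle compile down to a chain like this):

```python
from streamz import Stream

def decompress(frame_bytes):
    # placeholder: turn a compressed frame into a ~20MB numpy array
    return frame_bytes

def detect(frame):
    # placeholder: one per-frame processing node
    return frame

source = Stream()
source.map(decompress).map(detect).sink(print)

for frame in [b"frame-1", b"frame-2"]:
    source.emit(frame)   # per-emit overhead is small, which matters at 20fps
```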

JC:
thinking of it more as an 'infinite map', and kicking off downstream parts of the flow
poll or a push model for the event source
the push model is the same as hitting the API
it's a webhook
if it's something that is listening for streaming events,
the brain wants to think of it more as a map:
constantly running, and we may do things downstream of it
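One way to read the 'infinite map' idea as code (everything below is a hypothetical sketch, not Prefect API): the event source feeds a queue forever, the flow is mapped over it forever, and poll vs push only changes how the queue gets fed:

```python
import asyncio

async def fetch_new_events():
    await asyncio.sleep(0)        # stand-in for asking an external system
    return [{"id": 1}]

async def run_downstream_flow(event):
    print("processed", event)     # stand-in for the downstream flow work

async def poll_source(queue: asyncio.Queue):
    # poll model: periodically ask the source for anything new
    while True:
        for event in await fetch_new_events():
            await queue.put(event)
        await asyncio.sleep(1)

# push model: the same queue is fed by a webhook instead, i.e. an HTTP
# handler that does `await queue.put(event)` whenever it is hit.

async def infinite_map(queue: asyncio.Queue):
    # "constantly running": each event kicks off the downstream work
    while True:
        await run_downstream_flow(await queue.get())
```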

AH: re: logging/browsing
a flow run per event would be impossible to manage; we prefer a model where
the useful aspects of inner-flow conditionals and triggers exist for free

potential research route: can we get Prefect into streamz, with a really slim flow?
a StreamFlowRunner? (hypothetical sketch below)
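A purely speculative sketch of what that slim runner could look like; nothing here exists in Prefect today, it just shows how an ordered chain of task functions could be compiled into a `streamz` pipeline:

```python
from streamz import Stream

class StreamFlowRunner:
    """Hypothetical: compile a linear chain of task functions into streamz."""

    def __init__(self, tasks):
        self.source = Stream()
        node = self.source
        for task in tasks:          # assumes a simple linear topology
            node = node.map(task)
        node.sink(self._record)

    def _record(self, result):
        print("final state:", result)   # stand-in for state/result handling

    def run(self, events):
        for event in events:
            self.source.emit(event)

# usage: StreamFlowRunner([validate, transform]).run(event_iterable)
```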

backpressure
if your producer is producing faster than your consumer, we need to be able to throttle it at some level, wherever that throttle lives
how much of this is Prefect’s problem, and where do we put the limiting?
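For illustration, the standard way to express backpressure in Python is a bounded queue: a full queue makes the producer's put() block, and that blocking is the throttle; whether the queue lives in the client, in Prefect, or in a broker is the open question:

```python
import asyncio

async def producer(queue: asyncio.Queue):
    for i in range(100):
        await queue.put(i)   # blocks when the queue is full: this IS the throttle

async def consumer(queue: asyncio.Queue):
    while True:
        event = await queue.get()
        await asyncio.sleep(0.01)   # simulate slow downstream work
        queue.task_done()

async def main():
    queue = asyncio.Queue(maxsize=10)   # the bound = how much lag we tolerate
    worker = asyncio.create_task(consumer(queue))
    await producer(queue)
    await queue.join()                  # let in-flight events drain
    worker.cancel()

asyncio.run(main())
```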

nice things about the PIN 14 model:
a nice flow run record for every single event
has all the diagnostics that we set

proposal:
PIN 14, but we will microbatch with a configurable sliding window, and add a delay so we don't wait forever for a full batch
essentially we now have a stateful actor, because it is making the decision of when to emit a batch (so where is that state stored, and what happens if it dies?)
what if listen just picks back up on missed events?
related: kubernetes controllers, where you check what you've missed during the downtime before moving on to new things
a list() and a watch() API (see the sketch below)
on startup you list everything, and somehow you see what you already processed
then watch for changes to the bucket concurrently
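A sketch of that kubernetes-controller pattern applied to a bucket; the bucket client (`list_keys()`, `watch()`) and `process()` are hypothetical names, not a real API:

```python
def process(key):
    print("processing", key)   # stand-in for the real per-event work

def reconcile(bucket, processed_keys):
    # list(): on startup, see everything that exists and catch up on misses
    for key in bucket.list_keys():
        if key not in processed_keys:
            process(key)
            processed_keys.add(key)

def run_controller(bucket, processed_keys):
    reconcile(bucket, processed_keys)
    # watch(): then react to new changes as they arrive
    for event in bucket.watch():
        if event.key not in processed_keys:
            process(event.key)
            processed_keys.add(event.key)
```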

latency/overhead is an issue
make some cutoff based on Prefect internals
maybe do fancy microbatching ourselves (see the sketch below)
*** know the latency targets
JC: based on some Dask feelings,
batching every second (subsecond latency is almost certainly a no-go)
depending on what you're running on:
kubernetes spin-up takes some amount of time
*** can we cache resources more intelligently?
*** “at least once” needs for validation of an event
*** can we make a 'superfast' agent that doesn't 'do' as much environment spin-up for you?
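What "fancy microbatching ourselves" might look like, purely as an illustration: flush on whichever comes first, a full batch or a configurable delay, so a slow stream never sits waiting on a full batch:

```python
import asyncio

async def microbatcher(queue: asyncio.Queue, flush, max_size=100, max_delay=1.0):
    """Flush when the batch fills, or when no event arrives for max_delay."""
    batch = []
    while True:
        # only arm the timer once we are actually holding events
        timeout = max_delay if batch else None
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout))
        except asyncio.TimeoutError:
            await flush(batch)   # delay expired with a partial batch: flush early
            batch = []
            continue
        if len(batch) >= max_size:
            await flush(batch)
            batch = []

# usage: await microbatcher(queue, flush=write_batch, max_size=500, max_delay=0.5)
```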

cons of how to do this today:
- no backpressure mechanism, so you just die unless someone puts something on top of their lambda
that gets around some latency issues
AC: Spark Streaming does something similar; it microbatches so that the cost of the API overhead is amortized against the speed of the stream
