What are some base assumptions based on the use cases of those present?
- AC: mobile games, we get a bunch of events coming in
  - current pipeline is on Google Cloud, using Dataflow to process events and Pub/Sub as the messenger
  - the streaming pipeline itself is incredibly straightforward: nothing gigantic, just basic validation, strip a couple of things out, add a couple of things in, and batch them into Snowflake
  - simple from a tasks point of view; it does "at least once" processing, which means they have to deduplicate at the end
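The "at least once, so deduplicate at the end" point can be sketched in a few lines (names here are illustrative, not the actual Dataflow job): the consumer remembers event ids it has already handled and drops redeliveries before the batch goes downstream.

```python
# Sketch of at-least-once deduplication: Pub/Sub-style delivery may hand
# the consumer the same event twice, so drop any id already seen before
# loading the batch into the warehouse.

def dedupe(events, seen):
    """Return only events whose 'id' has not been processed yet."""
    out = []
    for event in events:
        if event["id"] not in seen:
            seen.add(event["id"])
            out.append(event)
    return out

seen_ids = set()
batch_1 = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
batch_2 = [{"id": 2, "v": "b"}, {"id": 3, "v": "c"}]  # id 2 redelivered

print(dedupe(batch_1, seen_ids))  # both pass
print(dedupe(batch_2, seen_ids))  # only id 3 passes
```

In a real pipeline the seen-id state would live in durable storage (or the Snowflake load itself would dedupe on merge); an in-memory set is only for illustration.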
- AH: our flows have a batch mode and a real time mode
  - decompressed images: numpy arrays of ~20 MB, up to 10k of them; we map over them with control flow in between, depth-first
  - currently have an issue on dask because of out-of-memory problems; some of their nodes are actors
  - one image is the data structure of a bag, and a sequence of them is processed in terms of actors (remember that at this point in time the bag was here, now it's there)
  - when we do video handling, our sources are async; the frequency of updates is 20 fps, so the overhead can't be more than a millisecond for invoking 10 nodes
  - was playing a lot with `streamz`; his vision is that a topology defined in Prefect could be transformed into a `streamz` stream, because it is lightweight and has asynchronous function support
  - currently: cache everything in the graph that we know is the same from run to run, to avoid redundant work
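To make the millisecond-per-10-nodes argument concrete, here is a toy push-based node in the spirit of `streamz` (this is not the `streamz` API, just a sketch of why such a topology is lightweight): emitting into a chain of N nodes is just N plain function calls, with no scheduler in the path.

```python
# Toy push-based stream: each node holds downstream callbacks and pushes
# results forward on emit, so invoking a chain of nodes is only a few
# function calls -- cheap enough for tight per-frame latency budgets.

class Node:
    def __init__(self, func=lambda x: x):
        self.func = func
        self.downstream = []

    def map(self, func):
        child = Node(func)
        self.downstream.append(child)
        return child

    def sink(self, func):
        # a sink is just a map whose result is discarded
        return self.map(func)

    def emit(self, x):
        y = self.func(x)
        for child in self.downstream:
            child.emit(y)

results = []
source = Node()
source.map(lambda x: x * 2).sink(results.append)
source.emit(3)
source.emit(4)
print(results)  # [6, 8]
```

The real `streamz` adds async support and backpressure on top of essentially this shape, which is what makes it attractive as a compilation target for a Prefect topology.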
JC:
  - thinking of it more as an "infinite map", and kicking off downstream parts of the flow
  - the event source can use a poll or a push model; the push model is the same as hitting the API if it's something that is listening for streaming events
  - brain wants to think of it more as a map constantly running, and we may do things more downstream of it
AH: re: logging/browsing
  - a flow run per event would be impossible to manage, but we prefer that model because useful aspects like inner-flow conditionals and triggers exist for free
potential research route: can we get Prefect into `streamz` with a really slim flow?
backpressure
  - if your producer is producing faster than your consumer, we need to be able to throttle it at some level
  - however it is implemented, how much of this is Prefect's problem, and where do we put the limiting?
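One concrete place the limiting can live, sketched with a bounded queue (illustrative, not Prefect code): when the consumer falls behind, the full queue blocks the producer's `put()`, and that blocking is the backpressure.

```python
# Backpressure via a bounded queue: the producer is throttled to the
# consumer's pace because put() blocks while the queue is full.
import queue
import threading
import time

buf = queue.Queue(maxsize=2)   # bounded: producer blocks when full
consumed = []

def consumer():
    while True:
        item = buf.get()
        if item is None:       # sentinel: shut down
            break
        time.sleep(0.01)       # deliberately slow consumer
        consumed.append(item)

t = threading.Thread(target=consumer)
t.start()
for i in range(5):
    buf.put(i)                 # fast producer, throttled by maxsize
buf.put(None)
t.join()
print(consumed)  # [0, 1, 2, 3, 4]
```

The open question from the notes stands: whether this queue sits inside Prefect, in the event source, or in "something on top of their lambda".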
nice things about the PIN 14 model
a nice flow run record for every single event, with all the diagnostics that we set
proposal:
PIN 14, but we will microbatch with a configurable sliding window, plus a delay so we don't wait indefinitely for a full batch
essentially now we have a stateful actor, because it is making the decision of when to form a batch (so where is that state being stored, and what happens if it dies?)
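A minimal sketch of that microbatching actor (hypothetical helper, not PIN 14 code): flush when the batch hits `max_size`, or when `max_delay` has elapsed since the first buffered event, so a lone event is not stuck waiting for a full batch. The in-memory buffer is exactly the state that is lost if the actor dies.

```python
# Microbatcher: flush on size OR on elapsed delay since the first
# buffered event, whichever comes first.
import time

class MicroBatcher:
    def __init__(self, max_size, max_delay, flush):
        self.max_size = max_size
        self.max_delay = max_delay
        self.flush = flush       # callback that kicks off one flow run per batch
        self.buf = []
        self.first_at = None     # the stateful-actor state in question

    def add(self, event, now=None):
        now = time.monotonic() if now is None else now
        if not self.buf:
            self.first_at = now
        self.buf.append(event)
        if len(self.buf) >= self.max_size or now - self.first_at >= self.max_delay:
            self.flush(self.buf)
            self.buf = []

batches = []
b = MicroBatcher(max_size=3, max_delay=1.0, flush=batches.append)
b.add("a", now=0.0)
b.add("b", now=0.5)
b.add("c", now=0.6)   # size limit hit -> flush
b.add("d", now=2.0)
b.add("e", now=3.5)   # 1.5s since "d" -> delay limit hit -> flush
print(batches)  # [['a', 'b', 'c'], ['d', 'e']]
```

Note the delay-based flush only fires when another event arrives; a real implementation would also need a timer to flush an idle buffer, which is yet more state that has to survive a crash.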
  - what if the listener just picks up the events it missed while it was down?
  - related: kubernetes controllers, where on startup you check what you've missed during the downtime before moving on to new things
  - on startup you list them, somehow see what you've already processed, then look for changes to the bucket concurrently
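That controller-style catch-up can be sketched as follows (every function name here is a hypothetical stand-in for the real source listing and checkpoint store): on startup, list everything, skip what was already handled, process the remainder, then start watching for new changes.

```python
# Controller-style catch-up: reconcile the full listing against a record
# of what was already processed before resuming normal watching.

def catch_up(list_bucket, load_processed, handle):
    processed = load_processed()        # ids handled before the downtime
    for key in list_bucket():
        if key not in processed:
            handle(key)                 # process only what was missed
            processed.add(key)
    return processed                    # then begin watching for new changes

handled = []
done = catch_up(
    list_bucket=lambda: ["img1", "img2", "img3"],
    load_processed=lambda: {"img1"},    # img1 was done before the restart
    handle=handled.append,
)
print(handled)  # ['img2', 'img3']
```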
latency/overhead is an issue
  - make some cutoff based on Prefect internals; maybe do fancy microbatching ourselves
  - *** know the latency targets
JC: based on some dask feelings
  - batching every second (subsecond latency is almost certainly a no-go)
  - depending on what you're running on, kubernetes spin-up takes some amount of time
  - *** can we cache resources more intelligently?
*** “at least once” needs for validation of an event
*** can we make a 'superfast' agent that doesn't 'do' as much environment spin up for you?
cons of how to do this today:
- no backpressure mechanism, so you just die unless someone puts something on top of their lambda
- latency issues to get around
AC: something similar is going on in Spark Streaming after all: it microbatches so that the cost of the API overhead is amortized against the speed of the stream