Topic #1: Data Frame Library
The rise of Rust, DataFusion, and the Parquet format has pushed DataFrame-based processing to a new level of performance and memory efficiency. Here are some popular libraries in this trend:
WHEN TO USE:
Any data-processing pipeline that suits a DataFrame model, such as data exploration and analysis, ETL, or batch processing.
Topic #2: SQL with in-memory data
This is another trend in data processing, with two approaches:
- [python] SpyQL: “With the SpyQL command-line tool you can make SQL-like SELECTs powered by Python on top of text data (e.g. CSV and JSON). Data can come from files but also from data streams, such as Kafka, or from databases such as PostgreSQL.” So this CLI also targets stream processing.
- [python] Another command-line tool that queries various data formats with SQL. P.S.: it also has another version.
- [go] This tool has been around for quite a long time, but it was recently reimplemented, with significant performance improvements according to the author and users.
- [rust] “Create full-fledged APIs for static datasets without writing a single line of code”, powered by the columnq library.
- [rust] The Streaming Database for Real-time Analytics.
WHEN TO USE:
Think about distributed processing: when you have many data-processing tasks that can run against a single data source (a file or a database) that still fits in memory (say, < 16 GB of RAM), a viable solution is to execute these tasks in parallel, each following the same pattern:
for each task, load the data source from a file into memory, process it with SQL, then persist the result.
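The pattern above can be sketched with Python's built-in `sqlite3` as the in-memory SQL engine and a thread pool for parallelism (the CSV data, table name, and queries are invented for the example; the CLI tools above would instead query the file directly):

```python
import csv
import io
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the shared data source (a small CSV file).
CSV_DATA = """region,amount
east,10
west,20
east,5
"""

# Hypothetical per-task SQL queries, all run against the same source.
QUERIES = [
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region",
    "SELECT COUNT(*) FROM sales WHERE amount > 8",
]

def run_task(sql):
    # Each task loads the source into its own in-memory database,
    # runs its query, and returns the result (a real pipeline would
    # persist it instead).
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    rows = list(csv.DictReader(io.StringIO(CSV_DATA)))
    con.executemany("INSERT INTO sales VALUES (:region, :amount)", rows)
    result = con.execute(sql).fetchall()
    con.close()
    return result

with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_task, QUERIES))

print(results)
```

Because every task builds its own in-memory copy, the tasks share nothing and can run fully in parallel; the trade-off is the repeated load cost, which is acceptable while the source fits comfortably in RAM.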
Another option is to load the data into SQLite and then query it with SQL, but this approach spends time loading the whole dataset into SQLite, rather than working directly on the data file.
Why SQL? Because SQL applies to multiple environments: files, databases, in-memory data, etc.
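To make that portability concrete, here is a small sketch running the same SQL statement against two environments, an in-memory SQLite database and an on-disk one (the table and data are invented for the example):

```python
import os
import sqlite3
import tempfile

# One query, reused unchanged across environments.
QUERY = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
ROWS = [("east", 10), ("west", 20), ("east", 5)]

def run(db_path):
    # The same SQL works whether the database lives in memory or on disk.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", ROWS)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result

in_memory = run(":memory:")
on_disk = run(os.path.join(tempfile.mkdtemp(), "sales.db"))

print(in_memory == on_disk)  # the environment changes, the query does not
```

The same statement would also run, with at most dialect-level tweaks, against a server database such as PostgreSQL or against CSV files via the CLI tools above.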