The rise of Rust, DataFusion, and the Parquet format has pushed DataFrame-based processing to a new level of performance and memory efficiency. Here are some popular libraries in this trend:
: [python] The most mature next-gen DataFrame library in terms of functionality compared to pandas. We ran some benchmark experiments in July 2021 and found it handles big files better than pandas.
: [python] “With the SpyQL command-line tool you can make SQL-like SELECTs powered by Python on top of text data (e.g. CSV and JSON). Data can come from files but also from data streams, such as Kafka, or from databases such as PostgreSQL.” So, this CLI aims at stream processing.
: [go] This tool has been developed for quite a long time but was recently reimplemented, with significant performance improvements reported by the author and users.
: [rust] The streaming database for real-time analytics.
WHEN TO USE:
Before reaching for distributed processing: when you have many data processing tasks that all operate on a single data source (a file or database) that still fits in memory (say < 16 GB RAM), a viable solution is to execute these tasks in parallel, each following the same pattern:
for each task, load the data source from a file into memory, process it with SQL, then persist the result.
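The pattern above can be sketched with only the Python standard library. This is a minimal illustration, not a definitive implementation: the task names, queries, column names, and inline CSV data are all hypothetical stand-ins for a real data source.

```python
# Sketch: run several SQL tasks in parallel over one small data source,
# each task loading the data into its own in-memory SQLite database.
# All names and data below are hypothetical examples.
import csv
import io
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-task SQL queries over the same data source
TASKS = [
    ("count_rows", "SELECT COUNT(*) FROM data"),
    ("max_value", "SELECT MAX(value) FROM data"),
]

# Stands in for the source file; in practice, read the real CSV from disk
CSV_TEXT = "id,value\n1,10\n2,30\n3,20\n"

def run_task(task):
    name, query = task
    # Each task gets its own in-memory database: load, query, return
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (id INTEGER, value INTEGER)")
    rows = [(int(r["id"]), int(r["value"]))
            for r in csv.DictReader(io.StringIO(CSV_TEXT))]
    conn.executemany("INSERT INTO data VALUES (?, ?)", rows)
    result = conn.execute(query).fetchone()[0]
    conn.close()
    return name, result  # in the real pattern, persist the result instead

if __name__ == "__main__":
    with ThreadPoolExecutor() as pool:
        for name, result in pool.map(run_task, TASKS):
            print(name, result)
```

For CPU-heavy workloads a process pool would be the more natural choice; the thread pool here keeps the sketch simple since each task is dominated by I/O and SQLite work.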
Another option is to load the data into SQLite and then query it with SQL, but this approach spends time loading the whole dataset into SQLite, rather than working directly on the data file.
Why SQL? Because SQL applies across environments: files, databases, in-memory data, etc.