Data Processing

Topic #1: Data Frame Library

The rise of Rust, Data Fusion and Parque format has pushed Data Frame-based processing into a new level of performance and memory efficiency. Here is some popular library in this trend:

⁠

Vaex⁠

: [python] The most mature library of next-gen data-frame in terms of functionality in comparison of pandas. We made some benchmark experiments in July 2021 and found it is better at handling big files then pandas.

⁠

Polars⁠

: [rust, python, nodejs, javascript] Polars is always on my watch-list because of its elegant data frame functions, and multiple platform support.

⁠

Arquero⁠

: [nodejs] a new kid in the block specialized for nodejs, worth giving it a try

WHEN TO USE:

Any data-processing pipeline suit for data-frame, such as data exploration and analysis, ETL or batch processing.

Topic #2: SQL with in-memory data

This is another trend in processing data with 2 approaches:

CLI:

⁠

SpyQL⁠

: [python] “With the SpyQL command-line tool you can make SQL-like SELECTs powered by Python on top of text data (e.g. CSV and JSON). Data can come from files but also from data streams, such as as Kafka, or from databases such as PostgreSQL.” So, this CLI aims for stream processing

⁠

DSQ⁠

: [python] another command line aims to query various data type formats by SQL. P.S: it has another

GUI⁠

version.

⁠

OctoSQL⁠

: [go] this tool was developed quite a long time but recently has reimplemented and promised significant performance improvement according to the author and users.

Library/Service:

⁠

Roapi⁠

: [rust] “Create full-fledged APIs for static datasets without writing a single line of code” powered by columnq library

⁠

MaterializeDB⁠

: [rust]: The Streaming Database for Real-time Analytics

WHEN TO USE:

Think about distributed processing, when you have many data processing tasks that can perform in a single data source (file, database) that is still fit in memory (say < 16 GB RAM), a viable solution is to parallelly execute these tasks by the same pattern:

for each task, load data source from a file into memory and process it by SQL then persist the result

Another option is to load data into sqlite then query by SQL, but this approach takes some time for loading the whole data into sqlite, rather than directly working on the data file.

Why SQL: because SQL can apply to multiple environments: file, database, in-memory data, etc ...

Topic #1: Data Frame Library

WHEN TO USE:

Topic #2: SQL with in-memory data

CLI:

Library/Service:

WHEN TO USE:

Want to print your doc?
This is not the way.

Try clicking the ⋯ next to your doc name or using a keyboard shortcut (

CtrlP

) instead.