(slop) Update readme

Viktor Lofgren 2024-08-04 10:58:23 +02:00
parent 9bc665628b
commit c379be846c


@@ -25,7 +25,7 @@ average-age.0.dat.f64le.gz
The slop library offers some facilities to aid with data integrity, such as the SlopTable
class, which is a wrapper that ensures consistent positions for a group of columns, and aids
in closing the columns when they are no longer needed. Beyond that, you're on your own.
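
To make that concrete, here is a rough sketch of the usage pattern. The column classes and reader methods below (StringColumn, IntColumn, open, get, hasRemaining) are illustrative guesses, not the library's documented API:

```java
import java.io.IOException;
import java.nio.file.Path;

// Hypothetical sketch only: SlopTable is the real class name, but the column
// types and method signatures here are stand-ins for whatever the actual
// API looks like.
void printCities(Path dir) throws IOException {
    // try-with-resources: the table closes every column it handed out
    try (var table = new SlopTable(dir)) {
        var city = table.open(new StringColumn("city"));
        var population = table.open(new IntColumn("population"));

        // The table keeps the readers at consistent row positions, so the
        // values read in one iteration belong to the same logical row
        while (city.hasRemaining()) {
            System.out.println(city.get() + ": " + population.get());
        }
    }
}
```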
## Why though?
@@ -44,26 +44,42 @@ than a parquet file containing the equivalent information.
Slop is simple.

There isn't much magic going on under the hood in Slop. It's designed with the philosophy that a competent programmer
should be able to reverse engineer the format of the data by just looking
at a directory listing of the data files. Even though Slop itself is a very obscure library,
this gives the data a sort of portability.
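
For instance, given the file `average-age.0.dat.f64le.gz` from the example earlier in this README, the name itself documents the storage. A reasonable reading (the companion file below is made up to illustrate the same scheme) is column name, page number, file purpose, value encoding, and compression:

```
population.0.dat.s32le.gz     <- hypothetical companion column: "population",
                                 page 0, data file, int32 little-endian, gzip
average-age.0.dat.f64le.gz    <- from the example above: "average-age",
                                 page 0, data file, float64 little-endian, gzip
```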
### Relaxed 1BRC (no CSV ingestion time)
Below is a benchmark against DuckDB, another excellent columnar storage library, albeit
one that is more featureful and safer than Slop.

The benchmark is a relaxed 1BRC: aggregate a billion rows of temperature data by city,
then calculate max/min/avg. This omits the CSV ingestion time from the original
challenge, which means the numbers are not directly comparable with other 1BRC benchmarks.
| Impl                                    | Runtime | Size On Disk |
|-----------------------------------------|---------|--------------|
| Parallel Slop, s16                      | 0.64s   | 2.8 GB       |
| Parallel Slop, varint                   | 0.90s   | 2.8 GB       |
| DuckDB<sup>1</sup>                      | 2.6s    | 3.0 GB       |
| Slop, s16                               | 4.2s    | 2.8 GB       |
| Slop, s32                               | 4.5s    | 3.8 GB       |
| Parquet<sup>2</sup> (Snappy) in DuckDB  | 4.5s    | 5.5 GB       |
| Parquet<sup>2</sup> (Zstd) in DuckDB    | 5.5s    | 3.0 GB       |
| JDBC<sup>3</sup>                        | 6500s   | 3.0 GB       |
<sup>[1]</sup> Benchmark loads the data into DuckDB's native table format,
performs an aggregation within the database, and then fetches the results via JDBC.

<sup>[2]</sup> Benchmark loads the data from Parquet in DuckDB, performs an
aggregation within the database, and then fetches the results via JDBC.

<sup>[3]</sup> Benchmark loads the data into DuckDB's native table format,
then streams it as-is over JDBC to Java for processing, with a fetch size of 1000.
This is a very common usage pattern in Enterprise Java applications, although
usually you'd have an ORM between the JDBC layer and the application code, adding even
more overhead. The numbers are extrapolated from a 100M-row benchmark, as I value my time.
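
For context, the aggregation being benchmarked is simple to state in plain Java. Below is a minimal, self-contained sketch of the per-city max/min/avg computation, with a tiny inline dataset standing in for the billion-row input (the class names are made up for the example):

```java
import java.util.HashMap;
import java.util.Map;

public class RelaxedOneBrc {
    // Running min/max/sum/count per city; avg is derived at the end
    static class CityStats {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY, sum;
        long count;

        void accept(double temp) {
            min = Math.min(min, temp);
            max = Math.max(max, temp);
            sum += temp;
            count++;
        }
    }

    public static void main(String[] args) {
        // Dummy stand-in for the billion (city, temperature) rows
        String[] cities = { "Stockholm", "Stockholm", "Malmo" };
        double[] temps  = { -3.2, 4.1, 7.5 };

        Map<String, CityStats> stats = new HashMap<>();
        for (int i = 0; i < cities.length; i++) {
            stats.computeIfAbsent(cities[i], k -> new CityStats()).accept(temps[i]);
        }

        stats.forEach((city, s) -> System.out.printf(
                "%s: min=%.1f max=%.1f avg=%.1f%n",
                city, s.min, s.max, s.sum / s.count));
    }
}
```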
## Example
@@ -131,7 +147,9 @@ record Population(String city, int population, double avgAge) {
## Nested Records
Nested records are not supported in Slop, although array values are. If you need to store nested records,
your options are to flatten them, represent them as arrays, or serialize them into a byte array and
store that; a sketch of the last option follows.
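
As an illustration of the serialization option, the Address record and its byte layout below are invented for the example; the resulting byte array is what you'd put in a byte-array column:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative only: Address and its wire format are made up for this sketch
record Address(String street, String city) {

    // Pack the nested record into a byte[] suitable for a byte-array column
    byte[] toBytes() throws IOException {
        var bytes = new ByteArrayOutputStream();
        try (var out = new DataOutputStream(bytes)) {
            out.writeUTF(street);
            out.writeUTF(city);
        }
        return bytes.toByteArray();
    }

    // Reverse the packing when reading the column back
    static Address fromBytes(byte[] data) throws IOException {
        try (var in = new DataInputStream(new ByteArrayInputStream(data))) {
            return new Address(in.readUTF(), in.readUTF());
        }
    }
}
```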
## Column Types