From c379be846c01bb8eef66b7d7e6fdd42ef2b45628 Mon Sep 17 00:00:00 2001
From: Viktor Lofgren
Date: Sun, 4 Aug 2024 10:58:23 +0200
Subject: [PATCH] (slop) Update readme

---
 code/libraries/slop/readme.md | 50 ++++++++++++++++++++++++-----------
 1 file changed, 34 insertions(+), 16 deletions(-)

diff --git a/code/libraries/slop/readme.md b/code/libraries/slop/readme.md
index 99e52782..49ece70c 100644
--- a/code/libraries/slop/readme.md
+++ b/code/libraries/slop/readme.md
@@ -25,7 +25,7 @@ average-age.0.dat.f64le.gz
 The slop library offers some facilities to aid with data integrity,
 such as the SlopTable class, which is a wrapper that ensures
 consistent positions for a group of columns, and aids
-in closing the columns when they are no longer needed.
+in closing the columns when they are no longer needed. Beyond that, you're on your own.
 
 ## Why though?
 
@@ -44,26 +44,42 @@ than a parquet file containing the equivalent information.
 Slop is simple. There isn't much magic going on under the hood
 in Slop. It's designed with the philosophy that a competent programmer
-should be able to reverse engineer the format of the data by just
-looking at a directory listing of the data files.
+should be able to reverse engineer the format of the data by just looking
+at a directory listing of the data files. Despite Slop being a very obscure library,
+this gives the data a sort of portability.
 
 ### Relaxed 1BRC (no CSV ingestion time)
 
-Slop is reasonably competitive with DuckDB in terms of read speed,
-especially when reading from Parquet, and the data on disk tends
-to be smaller.
+This is a benchmark against DuckDB, another excellent columnar storage library,
+albeit one that is more featureful and safer than Slop.
 
-This is noteworthy given Slop is a single-threaded JVM application,
-and DuckDB is a multi-threaded C++ application.
+The benchmark is a relaxed 1BRC: aggregate a billion rows of temperature data by city,
+then calculate the max/min/avg for each city. This omits the CSV ingestion time from the
+original challenge, which means the numbers are not directly comparable with other 1BRC benchmarks.
 
-| Impl                       | Runtime | Size On Disk |
-|----------------------------|---------|--------------|
-| DuckDB in memory           | 2.6s    | 3.0 GB       |
-| Slop in vanilla Java s16   | 4.2s    | 2.8 GB       |
-| Slop in vanilla Java s32   | 4.5s    | 3.8 GB       |
-| Parquet (Snappy) in DuckDB | 4.5s    | 5.5 GB       |
-| Parquet (Zstd) in DuckDB   | 5.5s    | 3.0 GB       |
+| Impl                          | Runtime | Size On Disk |
+|-------------------------------|---------|--------------|
+| Parallel Slop, s16            | 0.64s   | 2.8 GB       |
+| Parallel Slop, varint         | 0.90s   | 2.8 GB       |
+| DuckDB[1]                     | 2.6s    | 3.0 GB       |
+| Slop, s16                     | 4.2s    | 2.8 GB       |
+| Slop, s32                     | 4.5s    | 3.8 GB       |
+| Parquet[2] (Snappy) in DuckDB | 4.5s    | 5.5 GB       |
+| Parquet[2] (Zstd) in DuckDB   | 5.5s    | 3.0 GB       |
+| JDBC[3]                       | 6500s   | 3.0 GB       |
+
+[1] Benchmark loads the data into DuckDB's native table format,
+performs an aggregation within the database, and then fetches the results via JDBC.
+
+[2] Benchmark loads the data from Parquet in DuckDB, performs an
+aggregation within the database, and then fetches the results via JDBC.
+
+[3] Benchmark loads the data into DuckDB's native table format,
+then streams it as-is over JDBC to Java for processing, with fetch size = 1000.
+This is a very common usage pattern in Enterprise Java applications, although
+usually you'd have an ORM between JDBC and the application code, adding even
+more overhead. The numbers are extrapolated from a 100M benchmark, as I value my time.
 
 ## Example
 
@@ -131,7 +147,9 @@ record Population(String city, int population, double avgAge) {
 
 ## Nested Records
 
-TBW
+Nested records are not supported in Slop, although array values are. If you need to store
+nested records, your options are flattening them, representing them as arrays, or
+serializing them into a byte array and storing that.
 
 ## Column Types
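
The serialize-to-bytes option mentioned under Nested Records could look like the following minimal plain-Java sketch. The `City` and `Coordinates` records and the wire format (a length-prefixed UTF-8 name followed by two doubles) are illustrative assumptions, not anything Slop prescribes; the resulting `byte[]` would then be stored in an ordinary byte-array column:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical nested record types, not taken from the Slop examples
record Coordinates(double lat, double lon) {}
record City(String name, Coordinates coords) {}

class CityCodec {
    // Flatten the nested record into a byte array: a length-prefixed
    // UTF-8 name followed by the two coordinate doubles
    static byte[] serialize(City city) throws IOException {
        var buffer = new ByteArrayOutputStream();
        try (var out = new DataOutputStream(buffer)) {
            byte[] name = city.name().getBytes(StandardCharsets.UTF_8);
            out.writeInt(name.length);
            out.write(name);
            out.writeDouble(city.coords().lat());
            out.writeDouble(city.coords().lon());
        }
        return buffer.toByteArray();
    }

    // Inverse of serialize(), reading the fields back in the same order
    static City deserialize(byte[] bytes) throws IOException {
        try (var in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            byte[] name = new byte[in.readInt()];
            in.readFully(name);
            return new City(new String(name, StandardCharsets.UTF_8),
                    new Coordinates(in.readDouble(), in.readDouble()));
        }
    }
}
```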
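
The JDBC streaming pattern described in footnote [3] amounts to roughly the following. This is a hedged sketch rather than the actual benchmark harness: the `measurements(city, temperature)` table and the DuckDB connection URL are assumptions, while `setFetchSize(1000)` mirrors the fetch size quoted in the footnote:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

class JdbcStreamingAggregation {
    // Per-city running aggregate for the max/min/avg calculation
    record Stats(double min, double max, double sum, long count) {
        Stats add(double t) {
            return new Stats(Math.min(min, t), Math.max(max, t), sum + t, count + 1);
        }
    }

    public static void main(String[] args) throws SQLException {
        Map<String, Stats> byCity = new HashMap<>();

        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:measurements.db");
             Statement stmt = conn.createStatement()) {

            stmt.setFetchSize(1000); // the fetch size used in the benchmark

            // Stream the rows as-is to Java and aggregate application-side,
            // rather than aggregating inside the database
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT city, temperature FROM measurements")) {
                while (rs.next()) {
                    String city = rs.getString(1);
                    double temp = rs.getDouble(2);
                    byCity.merge(city,
                            new Stats(temp, temp, temp, 1),
                            (old, fresh) -> old.add(temp));
                }
            }
        }

        byCity.forEach((city, s) -> System.out.printf(
                "%s min=%.1f max=%.1f avg=%.1f%n",
                city, s.min(), s.max(), s.sum() / s.count()));
    }
}
```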