(slop) Update readme

Viktor Lofgren 2024-08-04 10:58:23 +02:00
parent 9bc665628b
commit c379be846c


@@ -25,7 +25,7 @@ average-age.0.dat.f64le.gz
The slop library offers some facilities to aid with data integrity, such as the SlopTable
class, which is a wrapper that ensures consistent positions for a group of columns, and aids
in closing the columns when they are no longer needed. Beyond that, you're on your own.
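
To make that concrete, here is a rough sketch of the usage pattern. The column classes and reader methods below (StringColumn, IntColumn, open, get, hasRemaining) are illustrative guesses, not the library's documented API:

```java
import java.io.IOException;
import java.nio.file.Path;

// Hypothetical sketch only: SlopTable is the real class name, but the column
// types and method signatures here are stand-ins for whatever the actual
// API looks like.
void printCities(Path dir) throws IOException {
    // try-with-resources: the table closes every column it handed out
    try (var table = new SlopTable(dir)) {
        var city = table.open(new StringColumn("city"));
        var population = table.open(new IntColumn("population"));

        // The table keeps the readers at consistent row positions, so the
        // values read in one iteration belong to the same logical row
        while (city.hasRemaining()) {
            System.out.println(city.get() + ": " + population.get());
        }
    }
}
```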
## Why though?
@@ -44,26 +44,42 @@ than a parquet file containing the equivalent information.
Slop is simple.

There isn't much magic going on under the hood in Slop. It's designed with the philosophy that a competent programmer
should be able to reverse engineer the format of the data by just looking
at a directory listing of the data files. Even though Slop itself is a very obscure library,
this gives the data a sort of portability.
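
For instance, given the file `average-age.0.dat.f64le.gz` from the example earlier in this README, the name itself documents the storage. A reasonable reading (the companion file below is made up to illustrate the same scheme) is column name, page number, file purpose, value encoding, and compression:

```
population.0.dat.s32le.gz     <- hypothetical companion column: "population",
                                 page 0, data file, int32 little-endian, gzip
average-age.0.dat.f64le.gz    <- from the example above: "average-age",
                                 page 0, data file, float64 little-endian, gzip
```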
### Relaxed 1BRC (no CSV ingestion time)
Below is a benchmark against DuckDB, another excellent columnar storage library, albeit
one that is more featureful and safer than Slop.

The benchmark is a relaxed 1BRC: aggregate a billion rows of temperature data by city,
then calculate max/min/avg. This omits the CSV ingestion time from the original
challenge, which means the numbers are not directly comparable with other 1BRC benchmarks.
| Impl                                    | Runtime | Size On Disk |
|-----------------------------------------|---------|--------------|
| Parallel Slop, s16                      | 0.64s   | 2.8 GB       |
| Parallel Slop, varint                   | 0.90s   | 2.8 GB       |
| DuckDB<sup>1</sup>                      | 2.6s    | 3.0 GB       |
| Slop, s16                               | 4.2s    | 2.8 GB       |
| Slop, s32                               | 4.5s    | 3.8 GB       |
| Parquet<sup>2</sup> (Snappy) in DuckDB  | 4.5s    | 5.5 GB       |
| Parquet<sup>2</sup> (Zstd) in DuckDB    | 5.5s    | 3.0 GB       |
| JDBC<sup>3</sup>                        | 6500s   | 3.0 GB       |
<sup>[1]</sup> Benchmark loads the data into DuckDB's native table format,
performs an aggregation within the database, and then fetches the results via JDBC.

<sup>[2]</sup> Benchmark loads the data from Parquet in DuckDB, performs an
aggregation within the database, and then fetches the results via JDBC.

<sup>[3]</sup> Benchmark loads the data into DuckDB's native table format,
then streams it as-is over JDBC to Java for processing, with a fetch size of 1000.
This is a very common usage pattern in Enterprise Java applications, although
usually you'd have an ORM between the JDBC layer and the application code, adding even
more overhead. The numbers are extrapolated from a 100M-row benchmark, as I value my time.
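
For context, the aggregation being benchmarked is simple to state in plain Java. Below is a minimal, self-contained sketch of the per-city max/min/avg computation, with a tiny inline dataset standing in for the billion-row input (the class names are made up for the example):

```java
import java.util.HashMap;
import java.util.Map;

public class RelaxedOneBrc {
    // Running min/max/sum/count per city; avg is derived at the end
    static class CityStats {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY, sum;
        long count;

        void accept(double temp) {
            min = Math.min(min, temp);
            max = Math.max(max, temp);
            sum += temp;
            count++;
        }
    }

    public static void main(String[] args) {
        // Dummy stand-in for the billion (city, temperature) rows
        String[] cities = { "Stockholm", "Stockholm", "Malmo" };
        double[] temps  = { -3.2, 4.1, 7.5 };

        Map<String, CityStats> stats = new HashMap<>();
        for (int i = 0; i < cities.length; i++) {
            stats.computeIfAbsent(cities[i], k -> new CityStats()).accept(temps[i]);
        }

        stats.forEach((city, s) -> System.out.printf(
                "%s: min=%.1f max=%.1f avg=%.1f%n",
                city, s.min, s.max, s.sum / s.count));
    }
}
```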
## Example
@@ -131,7 +147,9 @@ record Population(String city, int population, double avgAge) {
## Nested Records
Nested records are not supported in Slop, although array values are. If you need to store nested records,
your options are to flatten them, represent them as arrays, or serialize them into a byte array and
store that; a sketch of the last option follows.
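
As an illustration of the serialization option, the Address record and its byte layout below are invented for the example; the resulting byte array is what you'd put in a byte-array column:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative only: Address and its wire format are made up for this sketch
record Address(String street, String city) {

    // Pack the nested record into a byte[] suitable for a byte-array column
    byte[] toBytes() throws IOException {
        var bytes = new ByteArrayOutputStream();
        try (var out = new DataOutputStream(bytes)) {
            out.writeUTF(street);
            out.writeUTF(city);
        }
        return bytes.toByteArray();
    }

    // Reverse the packing when reading the column back
    static Address fromBytes(byte[] data) throws IOException {
        try (var in = new DataInputStream(new ByteArrayInputStream(data))) {
            return new Address(in.readUTF(), in.readUTF());
        }
    }
}
```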
## Column Types