================================ Marko Zajc's 2022 r/place datasets ===============================
                                                                                    [third revision]

ABOUT:
 Because Reddit's own datasets are rather shit (20 gigabytes of CSVs (a human readable format) with
 human readable timestamps, string-encoded integers and colour values, and base 64 encoded hashes),
 I've decided to pack them into a more storage-efficient format (9/6 bytes per entry) combined with
 another dataset that maps my clamped user indices to Reddit's fixed-size 64-byte hashes (that they
 represent as base 64).

 Because Reddit's CSVs are not only hard to parse, but also straight up wrong (see csv/NOTES.txt),
 I've also parsed the raw r/place canvas images (from Reddit's hot-potato server) into the 'raster'
 dataset.

 The parsing code for entries.bin resides in PlacementDataReader.java (read its header comment for
 more information).
 The parsing code for the hashes.bin mapping resides in UserIdReader.java (read its header comment
 for more information).

 There are two datasets:
  - the 'csv' dataset generated from reddit's officially provided CSV files
  - the 'raster' dataset generated from reddit's stored images (n=2772954) of r/place

DIRECTORY STRUCTURE:
 |- entries.bin     a binary file containing entry structs
 |- hashes.bin      a binary file containing hash structs (may be absent)
 |- cache           snapshots of the canvas rendered from entries.bin in 100-second intervals
 |   - 0.png        T = 0s
 |   - 1.png        T = 100s
 |   - ...          and so on
 |- NOTES.txt       notes and remarks about the dataset

LICENSE:
 You're free to do whatever you like with the dataset itself, but please abide by the CC-BY-SA
 license when dealing with the parser code.

ACCURACY:
 The dataset maps the user id hashes to a 3-byte integer, but you can resolve them back with the
 hashes dataset (hashes.bin). It also flattens moderation rectangle events to separate pixels, and
 flags such events in the 48th bit (rect).
 Any additional accuracy remarks for each dataset are noted in its NOTES.txt file.

STRUCTS:

ENTRY ==================================================================== [ 9/6 bytes | 72/48 bits ]
NAME  |BIT|    0        1        2        3        4        5        6        7        8   |MAX VALUE
      |   |00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000|
delta |20 |10000011 00111000 0000         .        .        .        .        .        .   |537472
pos x |11 |    .        .        1111 1001111      .        .        .        .        .   |1999
pos y |11 |    .        .        .           1 11110011 11           .        .        .   |1999
color |5  |    .        .        .        .        .      11111      .        .        .   |31
rect* |1  |    .        .        .        .        .           1     .        .        .   |1
index*|24 |    .        .        .        .        .        .    10011110 01100111 01101011|10381163
====================================================================================================
 delta - (int)   the delta time (in milliseconds) since the last entry
 pos x - (int)   the x position of the pixel placement
 pos y - (int)   the y position of the pixel placement
 color - (byte)  index of the color placed, see below for mappings to the RGB space
 rect  - (bool)  whether or not this entry is a part of a 'moderation rectangle' event
 index - (int)   a unique id for the user, which is mapped to reddit's user id hashes in hashes.bin.
                 this field (ie the last 3 bytes) is not present in the absence of hashes.bin, which
                 changes the struct's size to 6 bytes.

HASH ======================================================================== [ 67 bytes / 536 bits ]
NAME |BIT|    0        1        2        3        4        5        6    - 60 trimmed |MAX VALUE
     |   |00000000 00000000 00000000 00000000 00000000 00000000 00000000 -            |
index|24 |10011110 01100111 01101011     .        .        .        .    -            |10381163
hash |512|    .        .        .    11111111 11111111 11111111 11111111 -            |2⁵¹²-1
=====================================================================================================
 index - (int)   same as the index of the entry struct
 hash  - (byte*) a 64-byte hash provided by reddit to track unique users.
                 the hashing method used is currently unknown

COLOR ID TO RGB MAPPING:
 00: 6D001A | 08: 7EED56 | 16: 6A5CFF | 24: 6D482F
 01: BE0039 | 09: 00756F | 17: 94B3FF | 25: 9C6926
 02: FF4500 | 10: 009EAA | 18: 811E9F | 26: FFB470
 03: FFA800 | 11: 00CCC0 | 19: B44AC0 | 27: 000000
 04: FFD635 | 12: 2450A4 | 20: E4ABFF | 28: 515252
 05: FFF8B8 | 13: 3690EA | 21: DE107F | 29: 898D90
 06: 00A368 | 14: 51E9F4 | 22: FF3881 | 30: D4D7D9
 07: 00CC78 | 15: 493AC1 | 23: FF99AA | 31: FFFFFF
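
EXAMPLE (UNOFFICIAL):
 The sketch below is not the Java parser mentioned above; it is a minimal Python illustration of
 how one entry struct could be unpacked. It assumes the fields are packed most-significant-bit
 first, in the order shown in the ENTRY table, with the optional 3-byte index stored big-endian
 after the first 6 bytes. That assumption is inferred from the table only, so treat
 PlacementDataReader.java as the authoritative reference. The names decode_entry, read_entries and
 color_rgb are hypothetical helpers made up for this example.

```python
from typing import Iterator, NamedTuple, Optional

# Palette copied from the COLOR ID TO RGB MAPPING table (ids 00..31, in order).
PALETTE = (
    "6D001A BE0039 FF4500 FFA800 FFD635 FFF8B8 00A368 00CC78 "
    "7EED56 00756F 009EAA 00CCC0 2450A4 3690EA 51E9F4 493AC1 "
    "6A5CFF 94B3FF 811E9F B44AC0 E4ABFF DE107F FF3881 FF99AA "
    "6D482F 9C6926 FFB470 000000 515252 898D90 D4D7D9 FFFFFF"
).split()


class Entry(NamedTuple):
    delta_ms: int          # milliseconds since the previous entry
    x: int
    y: int
    color: int             # palette id, 0..31
    rect: bool             # part of a 'moderation rectangle' event
    index: Optional[int]   # user index, None for 6-byte entries


def decode_entry(raw: bytes) -> Entry:
    """Decode one 6- or 9-byte entry (assumed MSB-first bit packing)."""
    if len(raw) not in (6, 9):
        raise ValueError("an entry is either 6 or 9 bytes long")
    head = int.from_bytes(raw[:6], "big")  # the first 48 bits as one integer
    index = int.from_bytes(raw[6:], "big") if len(raw) == 9 else None
    return Entry(
        delta_ms=head >> 28,        # top 20 bits
        x=(head >> 17) & 0x7FF,     # next 11 bits
        y=(head >> 6) & 0x7FF,      # next 11 bits
        color=(head >> 1) & 0x1F,   # next 5 bits
        rect=bool(head & 1),        # last bit of the 6th byte
        index=index,
    )


def read_entries(path: str, has_index: bool = True) -> Iterator[Entry]:
    """Yield entries from a file; pass has_index=False for 6-byte entries."""
    size = 9 if has_index else 6
    with open(path, "rb") as f:
        # Stop at end of file; a trailing partial struct is ignored.
        while len(chunk := f.read(size)) == size:
            yield decode_entry(chunk)


def color_rgb(color: int) -> tuple:
    """Map a palette id to an (r, g, b) tuple using the table above."""
    h = PALETTE[color]
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))
```

 Feeding it the example bit patterns from the ENTRY table (bytes 83 38 0F 9F F3 FF 9E 67 6B) gives
 delta=537472, x=1999, y=1999, color=31, rect=True, index=10381163, which matches the MAX VALUE
 column. Since delta is relative to the previous entry, absolute times are recovered by keeping a
 running sum of delta_ms while iterating.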