How to shuffle a big dataset (2018)

  • My immediate thought before reading the article is that shuffling is the inverse of sorting. To put it another way, you can take a sorted list and apply a merge sort to it, but with a comparison operation that is random. Merge sort can work on data sets that are too large to fit in memory, since you sort small blocks that do fit in memory, then merge them in a streaming fashion.
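    The merge step above can be made uniform if, instead of a plain random coin flip, you draw the next element from a run with probability proportional to how many elements remain in it (a 50/50 coin would bias the interleaving). A minimal in-memory sketch of that weighted random merge, under the assumption that each input run has itself already been shuffled:

    ```python
    import random

    def random_merge(a, b, rng=None):
        """Merge two runs, taking the next element from `a` with probability
        (remaining in a) / (remaining in a + remaining in b).
        If both runs are uniformly shuffled, the merged output is too."""
        rng = rng or random.Random()
        out = []
        i = j = 0
        while i < len(a) and j < len(b):
            rem_a, rem_b = len(a) - i, len(b) - j
            # Weight the choice of source run by its remaining length.
            if rng.random() < rem_a / (rem_a + rem_b):
                out.append(a[i]); i += 1
            else:
                out.append(b[j]); j += 1
        out.extend(a[i:])
        out.extend(b[j:])
        return out
    ```

    At scale, each run would be a shuffled chunk streamed from disk, and pairwise merging mirrors the structure of an external merge sort.
    
    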

  • If you have a system with a fast sort/order by operation but a slow (or missing) shuffle, one trick that I sometimes use is to generate a new random number for each record, then sort by that random number.

    Uniform sampling is also quite easy (and principled) - you can just take the first n records.
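    The random-key trick is easy to sketch; any external or distributed sort slots in where the in-memory sort is here (names below are illustrative, not from the article):

    ```python
    import random

    def shuffle_by_sort(records, rng=None):
        """Shuffle by tagging each record with a random key and sorting on it."""
        rng = rng or random.Random()
        keyed = [(rng.random(), rec) for rec in records]
        keyed.sort(key=lambda kr: kr[0])
        return [rec for _, rec in keyed]

    def uniform_sample(records, n, rng=None):
        """Uniform sample without replacement: shuffle, then take the first n."""
        return shuffle_by_sort(records, rng)[:n]
    ```

    Key collisions are vanishingly unlikely with 64-bit floats, and even a tie only makes the order between those two records depend on the sort's tie-breaking.
    
    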

  • I just finished an implementation of this algorithm for node.js: https://www.npmjs.com/package/big-shuffle

  • I did not see it mentioned in the article, but I wonder if (and why) the team rejected applying a permutation function to the index. The shuffled dataset would then be Xs[i] = X[f(i)]
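    One cheap way to get such an f, as a sketch: an affine map f(i) = (a*i + b) mod N is a bijection on range(N) whenever gcd(a, N) = 1. It is a weak permutation (far from uniformly random over all N! orderings); a keyed Feistel network or format-preserving cipher over the index range would look more convincingly random, but the affine version shows the idea:

    ```python
    from math import gcd

    def make_affine_perm(n, a, b):
        """Return f(i) = (a*i + b) % n, a bijection on range(n)
        provided gcd(a, n) == 1. Illustrative, not cryptographically random."""
        if gcd(a, n) != 1:
            raise ValueError("a must be coprime with n")
        return lambda i: (a * i + b) % n

    n = 10
    f = make_affine_perm(n, a=7, b=3)
    # Xs[i] = X[f(i)]: read the dataset out of order, no extra copy needed.
    shuffled_view = [f(i) for i in range(n)]
    ```

    The appeal is that f needs O(1) state, so the "shuffle" costs nothing up front; the price is a random-access read pattern over X, which may be why a storage-bound pipeline would reject it.
    
    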

  • Can you randomize memory pointers or something?