Repetitiveness and compressibility analysis in song lyrics

  • Great chart, unfortunate conclusion with some erroneous allusions. Bear with me here: Starting with ABBA, and fully blossomed in the form of Max Martin, Top 40 Pop has been dominated by Swedish composition techniques.

    Taylor Swift, Britney Spears, Kelly Clarkson, NSYNC, Bieber, Katy Perry, Demi Lovato...Max Martin's fingerprints are all over the hits. He has a defined style as well. Balanced lines. It's brilliant. Thus, it's not about the performers if you want to study the composition - you have to go to the actual composer(s).

    Just saying that this is a sound technique and approach, but it looks at the data set to the exclusion of pertinent considerations. Revised, it would make for an interesting story.

  • Looks like interesting work, but the main chart doesn't work for me in Firefox or Chromium - it seems to be yoked to your scroll position (why?!) so by the time you've scrolled down to the '2014' paragraph, which makes it chart the full time-series, you can't even see the graph in the first place... Data viz run amok.

  • Where'd they get the lyric data for this analysis? In my experience this data in bulk is all incredibly locked down!

  • Next: lossy lyrics compression where using words that sound the same could yield higher ratios! I wonder how well that would work for Sting where nobody can understand anything anyways.

  • Interesting analysis and great visual presentation. I would also be interested to see an analysis of the repetitiveness of intervals and rhythmic patterns used in popular songs. On many occasions people tend not to care much about lyrics in the presence of addictive grooves/riffs, "Get Lucky" by Daft Punk being a good example.

  • I like the use of compression to find out about repetitions.

    There is a theory out there called Kolmogorov complexity [0]. It says that something is as complex as the shortest description you need to reproduce it. In your case, lyrics are as complex as the number of symbols (letters? words? bytes?) you need to represent them.

    And one good way to estimate it (it can't be computed exactly) is what you did: compress it. If you use the same compression method for all the lyrics, you'll find that the simpler (more repetitive) ones are the ones whose size shrinks the most. In that case, the choice of compression method is somewhat irrelevant. Had you used Bzip, PPMD, etc., the results would probably be similar.
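
    A minimal sketch of that idea in Python, assuming the standard-library zlib, bz2 and lzma modules as stand-ins (I don't know which compressor the article actually used). The function name and the sample texts are made up for illustration; they are not from the article's data set.

```python
import bz2
import lzma
import zlib


def compression_ratios(text: str) -> dict[str, float]:
    """Return compressed size / original size for a few stdlib compressors.

    A lower ratio means the text is more repetitive (more compressible).
    Assumes `text` is non-empty.
    """
    raw = text.encode("utf-8")
    compressors = {
        "zlib": lambda b: zlib.compress(b, 9),
        "bz2": lambda b: bz2.compress(b, 9),
        "lzma": lzma.compress,
    }
    return {name: len(fn(raw)) / len(raw) for name, fn in compressors.items()}


# Hypothetical inputs: one very repetitive "lyric", one with little repetition.
repetitive = "la la la, na na na, hey hey hey\n" * 40
varied = " ".join(f"word{i * 7919 % 1000}" for i in range(200))

if __name__ == "__main__":
    for label, text in [("repetitive", repetitive), ("varied", varied)]:
        print(label, compression_ratios(text))
```

    The absolute ratios differ between compressors, but the ordering of the two texts should come out the same, which is the sense in which the choice of method matters little here.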

    In case you want to extend your research, for example, as 6stringmerc said, you might consider that the composer matters more than the actual artist.

    And, for that, you can use Normalized Compression Distance (NCD) [1]. That way you can measure how similar two lyrics are. Basically, you compress the two lyrics together. When they are similar, patterns from one are used by the compressor to also compress the second one, so similar lyrics compress better together than lyrics that aren't related.

    And by doing that you can even discover who was the composer of the songs, i.e., the authorship of the lyrics, since each person usually has the same writing style... [2]

    [0] https://en.wikipedia.org/wiki/Kolmogorov_complexity

    [1] https://en.wikipedia.org/wiki/Normalized_compression_distanc...

    [2] https://link.springer.com/chapter/10.1007%2F978-3-642-34475-...
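
    For anyone who wants to play with NCD, here is a rough sketch using the usual formula, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. It assumes zlib as the compressor (any real compressor only approximates the theory, and zlib's 32 KB window limits it to short texts such as lyrics). The helper names and the sample lines are invented for the example, not real lyrics.

```python
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes, used as a stand-in for complexity."""
    return len(zlib.compress(data, 9))


def ncd(a: str, b: str) -> float:
    """Normalized Compression Distance: near 0 for similar texts, near 1 for unrelated ones."""
    x, y = a.encode("utf-8"), b.encode("utf-8")
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample lines: a and b share most of their wording, c does not.
verse_a = "we dance all night under the neon lights, we dance all night until the morning light"
verse_b = "we dance all night under the city lights, we dance all night until the morning comes"
verse_c = "quarterly revenue figures exceeded projections, prompting the board to approve expansion"

if __name__ == "__main__":
    print("a vs b:", ncd(verse_a, verse_b))
    print("a vs c:", ncd(verse_a, verse_c))
```

    To group songs by likely writer, one option would be to compute this distance for every pair of songs and feed the matrix to a clustering algorithm, but that step is beyond this sketch.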

  • The visualization animations as you scroll through this article are fantastic, and a great way to implement storytelling with data. The content is pretty interesting too; I like the emphasis on aligning data metrics with intuition to really make the point.

  • According to the main chart, songs in the top 10 are more likely to be repetitive, and that discrepancy has been growing. That raises the question: is there a causal link between being repetitive and reaching the top 10? If so, the answer to "Who's responsible for this madness?" is: the listeners.

  • What was used to make these visualizations? These are beautiful.

  • I would love to see the source code for how this post was created! (Or at least pointers for resources on how to create something similar.)

  • Some data messiness in that final chart. Maroon 5 versus Maroon5, Surfin' U.S.A but also Surfin' U.s.a
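
    One possible way to catch duplicates like these, as a sketch only: build a comparison key that ignores case, spacing and punctuation, then group names that share a key. The function names are made up, and I have no idea how the article's data was actually cleaned.

```python
import re
from collections import defaultdict


def name_key(name: str) -> str:
    """Collapse case, spacing and punctuation differences into one key.

    Limitation: non-ASCII letters are dropped too, so accented names
    would need extra handling.
    """
    return re.sub(r"[^a-z0-9]", "", name.casefold())


def group_variants(names):
    """Return {key: sorted spellings} for keys that have more than one spelling."""
    groups = defaultdict(set)
    for n in names:
        groups[name_key(n)].add(n)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


if __name__ == "__main__":
    print(group_variants(["Maroon 5", "Maroon5", "Surfin' U.S.A", "Surfin' U.s.a", "ABBA"]))
```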

  • Music itself is about repetition; without repetition it's just noise.

  • Any idea how to do the downres transition at the top? It's really cool.