Microsoft says that it's okay to steal web content because it's 'freeware.'

  • Someone's reasons for sharing information are coloured by the situation at the time of sharing it, amongst many other factors.

    Two years ago (say) no one predicted the meteoric rise of LLMs and their voracious appetite for data sets for training. These beasties are not simply search engines that are better direction pointers to your stuff (with a frisson of ads) but insist on being the final word and keep you out. To be blunt: It is stealing.

    The implied contract for publishing on the web has changed again, just as it has several times in the past. The worst thing here is the use of the term "freeware". Describing original content, displayed for all to see, as "-ware" is outrageous.

    They might as well describe the content on Spotify and co as freeware ... bear with me: you could scrape wifi connections through your publicly available APs or even do some more broadband funky spectrum capture analysis and claim that is what an internet search engine does in its spare time and all is fine (lol).

    LLMs and GenAI are quite interesting things but I do not think that they are the last word in ... AI. Anyway, the latest cool thingie cannot be allowed to break whatever unspoken and somewhat undefined social contract is currently in place.

    This bloke from MS seems to have forgotten that there really is a social contract of some sort and that if you say: "fuck you lot, omnomnom ... mmmm data ... ... laters (lol)" there might be some comeback.

  • It’s a pro-AI position but not really controversial?

    My reading is that he is saying content made publicly and freely available, without an explicit license for usage, is fair game for training.

    > In his remarks, Suleyman claimed that all content shared on the web is available to be used for AI training unless a content producer says otherwise specifically.

    > "With respect to content that is already on the open web, the social contract of that content since the 90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it. That has been freeware, if you like. That's been the understanding," said Suleyman.

    > "There's a separate category where a website or a publisher or a news organization had explicitly said, 'do not scrape or crawl me for any other reason than indexing me so that other people can find that content.' That's a gray area and I think that's going to work its way through the courts."

  • It's ironic that Microsoft used copyright protection and IP law for years to secure a dominant market position, and now they don't need to play by the same rules because "something something AI".

  • Before we get too upset... can we verify this is MSFT's official position? I suspect this may be hyperbole. It could be Suleyman was constructing a hypothetical point that didn't survive translation into click-bait. That being said... MSFT has a history of chicanery. I'm off to try to find original sources. If anyone else has any, please provide a link.

    FWIW... I found a few videos related to Endicott's story:

    * This is a quick 5-minute video where Suleyman talks about how indeterminacy is good. So... you know... it's a good thing that Copilot can't tell you why it thinks it needs to dump 800 lines of Java code into your hello world program. At around 3:44, he confuses LLMs (with a surface understanding of syntax married with a Markov chain on steroids) with people (who, as best we can tell, have a different understanding of the thing represented). Corporate management confusing the map with the territory? Who could have foreseen such a thing: https://youtu.be/GsGFYoIx1YM

    * This one seems to be the longer version, though I'm still looking for where Endicott's quote comes from. Around the 14-minute mark the conversation turns towards "who owns the IP" used to train LLMs, and the terms "Fair Use" and "Freeware" are used around the 14m50s mark: https://youtu.be/lPvqvt55l3A

    [EDIT: So... yes... get out the pitchforks... Microsoft is saying anything on the web is inherently freeware or subject to fair use, even if you think you remember putting a copyright notice on it (and even though, under US copyright law, the creator automatically receives copyright protection upon creation of the work).]

  • Of course it's okay.

    I make an HTTP _REQUEST_, the server voluntarily fulfills the request.

    Why is it okay for a person to view your content, memorize it, and use it as a base for new content while it's not okay for an AI? At the end of the day it is the same thing.

  • The Windows source code was leaked onto the web many years ago, wasn't it? Guess that makes it freeware too.

  • With this logic, so is pirated software, right? It's free because it's on the internet.

  • This doesn't deprive the original owner, so they should use "share" or "pirate" instead.