Technology / Fri, 05 Jun 2026 the-decoder.com

Microsoft trained its MAI models on unlicensed web data despite promising "enterprise grade, clean and commercially licensed data"

Microsoft partly trained its new MAI models on unlicensed web data. The technical paper shows Microsoft used Common Crawl, among other sources, as Simon Willison noted. Microsoft had previously claimed the MAI models were trained only on "enterprise grade, clean and commercially licensed data."

Like other AI companies scraping the web, Microsoft is likely relying on fair use. The paper describes the data as a "mixture of publicly available and licensed human-generated data." For web data, Microsoft says it uses "a proprietary crawler that respects the Robots Exclusion Protocol (robots.txt) and related meta-tag and HTML controls, enabling site owners to manage how content on their sites is accessed and used."
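To make the opt-out mechanism concrete, here is a minimal sketch of how a robots.txt rule is evaluated, using Python's standard `urllib.robotparser`. The user-agent name `ExampleAIBot`, the file contents, and the helper `may_fetch` are invented for illustration; the paper quote above does not name Microsoft's crawler, and this is not a description of how that crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish. "ExampleAIBot" is an
# invented user-agent name, not the name of Microsoft's crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    print("ExampleAIBot allowed:", may_fetch("ExampleAIBot", url))
    print("OtherBot allowed:", may_fetch("OtherBot", url))
```

In this sketch the named crawler is blocked only because the site owner added an explicit rule for it. A site with no such rule is treated as open by default, and the protocol itself is a voluntary convention that depends on the crawler choosing to honor it.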

That puts the burden of protecting content on site owners, like assuming anyone who doesn't lock their door consents to a break-in. Fair use remains contested, and courts are still sorting it out. In short, Microsoft does what every other AI company does, yet sells its training data as especially "clean." It isn't.
