Microsoft's MAI Models: A Closer Look at Data Sourcing Practices

AI for Software Engineering (Copilots, SDLC, Testing) EN-US 05.06.2026

Editor: Martin Haak

Microsoft's approach to training its MAI models raises significant questions about data sourcing. Although the company says it uses only 'clean and commercially licensed data,' evidence suggests that unlicensed web data, including sources such as Common Crawl, played a role in training. This mirrors the practice of other AI labs, whose reliance on fair use puts the onus on website owners to block crawlers. For stakeholders, this carries potential legal and ethical implications. A final assessment would be premature, but the situation underscores the need for transparency in AI data practices.

Source:

Microsoft trained its MAI models on unlicensed web data despite promising "enterprise grade, clean and commercially licensed data" — The Decoder (EN-US)
