Microsoft's MAI Models: A Closer Look at Data Sourcing Practices
In short
- Microsoft's approach to training its MAI models raises significant questions about data sourcing.
- Although the company says it used only 'clean and commercially licensed data,' evidence suggests that unlicensed web data, including Common Crawl, played a role in the training process.
- This mirrors the practice of other AI labs, which rely on fair use and so leave it to website owners to restrict crawler access.
Microsoft's approach to training its MAI models raises significant questions about data sourcing. Although the company says it used only 'clean and commercially licensed data,' evidence suggests that unlicensed web data, including Common Crawl, played a role in the training process. This mirrors the practice of other AI labs, which rely on fair use and so place the onus on website owners to restrict crawler access. For stakeholders, the implications include potential legal and ethical questions. A final assessment would be premature, but the situation underscores the need for transparency about how AI training data is sourced.
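To make the "onus on website owners" point concrete, here is a minimal sketch of how a site owner might check whether their robots.txt rules turn away Common Crawl's crawler. It assumes that restricting access means a robots.txt opt-out, which the article does not specify. `CCBot` is the user agent Common Crawl documents for its crawler; the `is_blocked` helper, the sample rules, and the `example.com` URL are hypothetical and only for illustration. robots.txt is advisory, so this shows what a compliant crawler would do, not what any given crawler actually did.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if the given robots.txt rules disallow user_agent for url.

    Hypothetical helper: it parses rules from a string rather than
    fetching them, so it can be run offline.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Sample rules (an assumption, not taken from any real site):
# opt out of Common Crawl's crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

print(is_blocked(ROBOTS_TXT, "CCBot"))         # True
print(is_blocked(ROBOTS_TXT, "SomeOtherBot"))  # False
```

A rule like this only affects future crawls by crawlers that honor it; it does nothing about pages already collected, which is part of why an opt-out model shifts the burden onto site owners.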