
Harvard releases a massive free AI training dataset funded by OpenAI and Microsoft


Harvard University announced Thursday that it is releasing a high-quality dataset of nearly a million public domain books that anyone can use to train large language models and other artificial intelligence tools. Funded by Microsoft and OpenAI, it contains books scanned as part of the Google Books project that are no longer protected by copyright.

About five times the size of the infamous Books3 collection that has been used to train artificial intelligence models such as Meta’s Llama, the Institutional Data Initiative’s database spans genres, decades, and languages, from the classics of Shakespeare, Charles Dickens, and Dante to obscure Czech math textbooks, a Welsh pocketbook, and dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an experiment meant to “level the playing field” by giving the general public, including small players in the AI industry and individual researchers, access to the kind of high-quality, curated content that usually only established tech giants have the resources to collect. The collection has also gone through review, he says.

Leppert believes the new public domain database can be used in conjunction with other licensed materials to build artificial intelligence models, noting that companies would still need additional training data to differentiate their models from those of competitors.

Burton Davis, Microsoft’s vice president and general counsel for intellectual property issues, emphasized that the company’s support for the project is consistent with its broader beliefs about the value of creating “accessible data pools” for AI startups that are “managed in the public interest.” In other words, Microsoft doesn’t necessarily plan to replace all of the AI training data it has used in its own models with public domain alternatives such as the books in Harvard’s new database. “We use publicly available data to train our models,” says Davis.

As dozens of lawsuits over the use of copyrighted material for AI training wind their way through the courts, the future of how AI tools are built hangs in the balance. If AI companies win their cases, they will be able to keep scraping the internet without needing to sign licensing agreements with copyright holders. But if they lose, AI companies may have to fundamentally rethink how they build their models. A wave of projects like Harvard’s database is moving forward under the assumption that, whatever happens, there will be an appetite for public domain datasets.

In addition to the book repository, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of articles from various newspapers that are now in the public domain. The initiative has asked Google to collaborate on public distribution of the dataset, but the search giant has not yet publicly agreed to host it, though Harvard says it is optimistic that will happen. (Google did not respond to WIRED’s requests for comment.)


