Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
Data is the new oil, as they say, and it may be making Harvard the new Exxon. On Thursday, the school announced the launch of a database of nearly a million public domain books that can be used to train AI models. Run under the Institutional Data Initiative, the project has received funding from both Microsoft and OpenAI, and contains books scanned by Google Books that are old enough that their copyright protection has expired.
A piece on the new project says the database includes a wide variety of books: “Shakespeare, Charles Dickens and Dante classics are included alongside obscure Czech math textbooks and Welsh pocket dictionaries.” As a general rule, copyright protection lasts for the lifetime of the author plus 70 years.
Foundational language models like ChatGPT, which behave like a convincing imitation of a real human, require enormous amounts of high-quality text for their training; generally, the more information they take in, the better the models are at imitating humans and delivering knowledge. But that thirst for data has caused problems, as the likes of OpenAI have hit walls in how much new information they can find, at least without stealing it.
Publishers including the Wall Street Journal and the New York Times have sued OpenAI and competitor Perplexity for ingesting their data without permission. Proponents of the AI companies have made various arguments to defend their activities. They will sometimes say that humans produce new works by researching and synthesizing material from other sources, and AI is no different. Drawing on existing material is legally considered fair use if the new work is materially different. But that doesn’t take into account that humans can’t swallow billions of pieces of text at the speed a computer can, so it’s not a fair comparison. The Wall Street Journal, in its lawsuit against Perplexity, said the startup was “copying on a massive scale.”
Players in the space have also argued that any content available on the open web is essentially fair game, and that the chatbot user is the one who accesses the copyrighted content by requesting it. In this view, a chatbot like Perplexity is like a web browser. It will be a while before these arguments play out in court.
OpenAI has struck deals with some content providers in response to criticism, and Perplexity has launched an advertising partner program with publishers, but it’s clear they’ve done so reluctantly.
At the same time that AI companies are running out of new content to use, commonly used web sources already included in training sets have quickly begun to restrict access. Companies including Reddit and X have been aggressive in limiting the use of their data because they’ve recognized its enormous value, especially as a source of real-time data to augment foundational models with more up-to-date information about the world.
Reddit is making hundreds of millions of dollars licensing its corpus of subreddits and comments to Google to train its models. Elon Musk’s X has an exclusive arrangement with his other company, xAI, to allow its models to access the social network’s content for training and up-to-date information retrieval. It’s kind of ironic that these companies strictly guard their own data, yet essentially believe that media publishers’ content has no value and should be free.
A million books will not be enough to cover the training needs of any AI company, especially considering that these books are old and do not contain modern information, such as the slang Gen Z kids use. To distinguish themselves from their competitors, companies will want to continue to access other data, especially unique data, so they don’t all build the same models. The Institutional Data Initiative’s database can at least offer some help to AI companies trying to train their initial foundational models without getting into any legal issues.