A fresh controversy surrounding artificial intelligence has emerged. This time, Meta faces allegations of using pirated content obtained via torrenting to train its large language model (LLM), Llama, which serves as the foundation for Meta AI. The case is one of the first copyright lawsuits aimed at a tech company over its AI training practices.
Documents Uncover Use of Pirated Content in Meta AI Training
According to a report by Wired, Meta was slapped with a lawsuit in 2023 for purportedly training Llama, the company’s LLM, with illegally obtained content. The suit, titled “Kadrey et al. v. Meta Platforms,” was initiated by authors Richard Kadrey and Christopher Golden, who alleged that Meta leveraged copyrighted material without proper authorization.
Meta had previously filed redacted documents with the court, but Judge Vince Chhabria of the United States District Court for the Northern District of California ordered the unredacted versions to be made public, and they have since been released.
The released documents show internal discussions among Meta employees about Meta AI and Llama. In one exchange, an engineer remarks that “torrenting from a [Meta-owned] corporate laptop doesn’t feel right,” pointing to the use of pirated material for AI training. Another conversation suggests that “MZ” (Mark Zuckerberg) approved the use of pirated sources.
There is evidence suggesting that Meta exploited content from LibGen, a vast repository of pirated books, magazines, and academic papers. Established in Russia in 2008, LibGen has faced several copyright lawsuits, yet the actual operators of this “piracy hub” remain unidentified. Reports also indicate that Meta tapped into other “shadow libraries” for AI training.
The company defends its actions by asserting that it used publicly available materials under the legal doctrine of “fair use,” which permits certain uses of copyrighted content without permission and is assessed case by case. Meta claims that it merely “uses text to statistically model language and generate original expression.”
What’s Happening with Apple Intelligence?
This is not the first time major tech companies have been accused of training their AI models on copyrighted material. Recently, an investigation revealed that Apple’s OpenELM model was trained on a dataset containing subtitles from over 170,000 YouTube videos.
Initially, this led to the belief that Apple was using copyrighted material to train Apple Intelligence. However, the company later clarified that OpenELM is an open-source model developed for research purposes and is not used to power Apple Intelligence.
Apple states that its AI capabilities available on iOS and macOS are trained “on licensed data, which includes data selected to enhance specific features, along with publicly available information gathered by our web crawler.”
It is noteworthy that numerous major publishers, including The New York Times and The Atlantic, have opted not to provide their content for training Apple Intelligence.