Scientific American: “Artists and writers are up in arms about generative artificial intelligence systems—understandably so. These machine learning models are only capable of pumping out images and text because they’ve been trained on mountains of real people’s creative work, much of it copyrighted. Major AI developers including OpenAI, Meta and Stability AI now face multiple lawsuits on this. Such legal claims are supported by independent analyses; in August, for instance, the Atlantic reported finding that Meta trained its large language model (LLM) in part on a data set called Books3, which contained more than 170,000 pirated and copyrighted books. And training data sets for these models include more than books. In the rush to build and train ever-larger AI models, developers have swept up much of the searchable Internet. This not only has the potential to violate copyrights but also threatens the privacy of the billions of people who share information online. It also means that supposedly neutral models could be trained on biased data. A lack of corporate transparency makes it difficult to figure out exactly where companies are getting their training data—but Scientific American spoke with some AI experts who have a general idea.”
Sorry, comments are closed for this post.