engadget: “A team led by computer scientists from MIT examined ten of the most-cited datasets used to test machine learning systems. They found that around 3.4 percent of the data was inaccurate or mislabeled, which could cause problems in AI systems that use these datasets. The datasets, which have each been cited more than 100,000 times, include text-based ones from newsgroups, Amazon and IMDb. Errors emerged from issues like Amazon product reviews being mislabeled as positive when they were actually negative and vice versa. Some of the image-based errors result from mixing up animal species. Others arose from mislabeling photos with less-prominent objects (‘water bottle’ instead of the mountain bike it’s attached to, for instance) … One of the datasets centers around audio from YouTube videos. A clip of a YouTuber talking to the camera for three and a half minutes was labeled as ‘church bell,’ even though one could only be heard in the last 30 seconds or so. Another error emerged from a misclassification of a Bruce Springsteen performance as an orchestra …”