Accurate, Focused Research on Law, Technology and Knowledge Discovery Since 2002

Daily Archives: April 19, 2023

Inside the secret list of websites that make AI like ChatGPT sound smart

Washington Post: “AI chatbots have exploded in popularity over the past four months, stunning the public with their awesome abilities, from writing sophisticated term papers to holding unnervingly lucid conversations. Chatbots cannot think like humans: They do not actually understand what they say. They can mimic human speech because the artificial intelligence that powers them has ingested a gargantuan amount of text, mostly scraped from the internet. This text is the AI’s main source of information about the world as it is being built, and it influences how it responds to users. If it aces the bar exam, for example, it’s probably because its training data included thousands of LSAT practice sites. Tech companies have grown secretive about what they feed the AI. So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data. To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA. (OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT.) The Post worked with researchers at the Allen Institute for AI on this investigation and categorized the websites using data from Similarweb, a web analytics company. About a third of the websites could not be categorized, mostly because they no longer appear on the internet. We then ranked the remaining 10 million websites based on how many “tokens” appeared from each in the data set. Tokens are small bits of text used to process disorganized information — typically a word or phrase.”

  • Note: you can search for a specific website midway through the article.

Why universities should return to oral exams in the AI and ChatGPT era

The Conversation: “Imagine the following scenario. You are a student and enter a room or Zoom meeting. A panel of examiners who have read your essay or viewed your performance, are waiting inside. You answer a series of questions as they probe your knowledge and skills. You leave. The examiners then consider the preliminary pre-oral…

Plastic Recycling is a Scam

Plastic Recycling is an Actual Scam | Climate Town / YouTube Plastic Legislation to get involved with: Depending on where you live, there’s a lot you can do. NCSL Overview:… Earth Day Plastic Action:… Cleveland Preemptive Ban:… Nat Geo breakdown:… CIF:… WSJ:… NRDC:… Follow Climate Town on Instagram:…

Two new scientific papers break down how the rich are destroying Earth

Salon: “As the climate crisis becomes more acute — exemplified in interminable wildfire “seasons”, intense drought and extreme weather — it’s becoming clear that saving the planet will involve more than politely asking consumers to recycle their yogurt cups. Indeed, many of climate change’s effects are largely spurred by resource hoarding and inequality via the…

Nearly 1,500 book bans implemented in the first half of this school year

The Hill: “Almost 1,500 school book bans were put into place around the U.S. in the first half of the current academic year, according to PEN America. An analysis from the group released Thursday found 1,477 book bans implemented in the first half of the 2022-2023 school year, affecting 874 unique books. The six months…

The State of Scholarly Metadata: 2023

“In late 2022, CCC and Media Growth Strategies undertook a thorough examination of metadata management across the research lifecycle. This in-depth review builds on an existing body of work to uncover multiple policy and system complexities and breakages, which – separately and together – create missed opportunities for the communities for whom Open Access (OA)…