The Download: GPT-4o’s polluted Chinese training data, and astronomy’s AI challenge

Soon after OpenAI released GPT-4o last Monday, some Chinese speakers started to notice that something seemed off about this newest version of the chatbot: the tokens it uses to parse text were full of spam and porn phrases.

Humans read in words, but LLMs read in tokens, which are distinct units in a sentence that have consistent and significant meanings. GPT-4o is supposed to be better than its predecessors at handling multi-language tasks, and many of the advances were achieved through a new tokenization tool that does a better job compressing texts in non-English languages.
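To make the compression point concrete, here is a minimal sketch of how a larger vocabulary lets a tokenizer cover the same text with fewer tokens. It is a toy: the `tokenize` function uses simple greedy longest-match, and both vocabularies are hypothetical and invented for illustration. It is not the algorithm or the vocabulary that GPT-4o actually uses.

```python
def tokenize(text, vocab):
    """Toy greedy longest-match tokenizer (an illustrative assumption,
    not OpenAI's real tokenizer). Unknown characters become single tokens."""
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies: the larger one also stores two-character phrases.
small_vocab = {"你", "好", "世", "界"}
large_vocab = small_vocab | {"你好", "世界"}

text = "你好世界"
small_tokens = tokenize(text, small_vocab)
large_tokens = tokenize(text, large_vocab)

print(len(small_tokens), small_tokens)
print(len(large_tokens), large_tokens)
```

The same property cuts both ways: whatever phrases appear often in the data used to build the vocabulary can end up as single tokens, which is why the quality of that data matters.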

But, at least when it comes to the Chinese language, the new tokenizer used by GPT-4o has introduced a disproportionate number of meaningless phrases—and experts say that’s likely due to insufficient data cleaning and filtering before the tokenizer was trained. If left unresolved, it could lead to hallucinations, poor performance, and misuse. Read the full story.

—Zeyi Yang

Astronomers are enlisting AI to prepare for a data downpour

In deserts across Australia and South Africa, astronomers are planting forests of metallic detectors that will together scour the cosmos for radio signals. When it boots up in five years or so, the Square Kilometer Array Observatory will look for new information about the universe’s first stars and the different stages of galactic evolution.

But after syncing hundreds of thousands of dishes and antennas, astronomers will quickly face a new challenge: combing through some 300 petabytes of cosmological data a year, enough to fill a million laptops. So in preparation for the information deluge, astronomers are turning to AI for assistance. Read the full story.
