The new tokenizer has 200,000 tokens in total, and about 25% of them are in non-English languages, says Deedy Das, an AI investor at Menlo Ventures. He used language filters to count the tokens in different languages; besides English, the most represented languages are Russian, Arabic, and Vietnamese.
“So the tokenizer’s main impact, in my opinion, is you get the cost down in these languages, not that the quality in these languages goes dramatically up,” Das says. When an LLM has better and longer tokens in non-English languages, it can process prompts faster and charge users less for the same answer. With the new tokenizer, “you’re looking at almost four times cost reduction,” he says.
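Das’s back-of-the-envelope numbers can be checked with OpenAI’s open-source tiktoken library. The sketch below compares the previous cl100k_base encoding with the new o200k_base encoding on a short Hindi prompt; the prompt is purely illustrative, and the exact ratio will vary with the text.

```python
import tiktoken

old = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4 and GPT-3.5
new = tiktoken.get_encoding("o200k_base")   # encoding used by GPT-4o

# Illustrative Hindi prompt ("Who is the Prime Minister of India?").
sample = "भारत के प्रधानमंत्री कौन हैं?"

old_count = len(old.encode(sample))
new_count = len(new.encode(sample))

# Fewer tokens for the same prompt means less compute and a smaller per-token bill.
print(f"cl100k_base: {old_count} tokens, o200k_base: {new_count} tokens")
print(f"roughly {old_count / new_count:.1f}x fewer tokens")
```

Because the new vocabulary packs more non-English text into each token, the same prompt usually produces fewer tokens, which is where the cost reduction Das describes comes from.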
Das, who also speaks Hindi and Bengali, took a look at the longest tokens in those languages. They reflect the discussions happening in those languages, so they include words like “Narendra” or “Pakistan,” but otherwise the list looks much like a list of common long words in English, such as “Prime Minister,” “university,” and “international.” They also don’t exhibit the problems seen in the Chinese tokens.
That likely reflects the training data in those languages, Das says. “My working theory is the websites in Hindi and Bengali are very rudimentary. It’s like [mostly] news articles. So I would expect this to be the case. There are not many spam bots and porn websites trying to happen in these languages. It’s mostly going to be in English.”
Polluted data and a lack of cleaning
However, things are drastically different in Chinese. According to multiple researchers who have looked into the new library of tokens used for GPT-4o, the longest tokens in Chinese are almost exclusively spam words used in pornography, gambling, and scamming contexts. Even shorter tokens, like three-character-long Chinese words, show a significant concentration in the same topics.
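The researchers’ vocabulary inspection can be loosely reproduced with the same tiktoken library, assuming o200k_base is the GPT-4o tokenizer in question; this is a sketch, not their exact methodology. It walks the vocabulary, keeps tokens that decode to text containing Chinese characters, and prints the longest ones. Swapping the Unicode range would adapt it to Devanagari or Bengali.

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumed to be GPT-4o's tokenizer

def contains_cjk(text: str) -> bool:
    """True if the text contains any CJK Unified Ideograph (the core Chinese block)."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

chinese_tokens = []
for token_id in range(enc.n_vocab):
    try:
        text = enc.decode_single_token_bytes(token_id).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        continue  # skip special tokens and partial byte sequences
    if contains_cjk(text):
        chinese_tokens.append(text)

# The longest entries are the ones researchers flagged as spam phrases.
for token in sorted(chinese_tokens, key=len, reverse=True)[:20]:
    print(repr(token))
```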
“The problem is clear: the corpus used to train [the tokenizer] is not clean. The English tokens seem fine, but the Chinese ones are not,” says Cai from Princeton University. Crawling spam and including it in training data is not rare, but usually significant effort goes into cleaning up the data before it’s used. “It’s possible that they didn’t do proper data cleaning when it comes to Chinese,” he says.
The content of these Chinese tokens could suggest that they have been polluted by a specific phenomenon: websites hijacking unrelated content in Chinese or other languages to boost spam messages.
These messages are often advertisements for pornographic videos and gambling websites. They could be real businesses or merely scams. The language is inserted into content-farm websites, or sometimes legitimate websites, so it can be indexed by search engines, circumvent spam filters, and surface in random searches. For example, Google indexed a search result page on a US National Institutes of Health website that listed a porn site in Chinese. The same site name also appeared in at least five Chinese tokens in GPT-4o.