Most likely, these works were massively copied and pasted onto PTT, blogs, or content-reposting sites, with Nine Knives' early works published directly on PTT story boards and then captured as training data.
If you ask about the details of Zhang Dachun's or Luo Yijun's novels, GPT usually starts fabricating, because these literary works are less discussed online, lack openly available electronic editions, and aren't directly reproduced on the web.
PTT: GPT's Taiwanese Language-Sense Teacher
We can almost confirm it: GPT understands netizen memes, can read terms like "push" (upvote), "boo" (downvote), and "veteran driver", and can perfectly reproduce the nihilistic tone of the Tech_Job board, speaking exactly like a Hsinchu Science Park engineer.
Why? Because PTT data was organized by academic circles into trainable corpora long ago and openly released in JSON format. For the model, it's paradise.
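To make that concrete, here's a minimal sketch of what ingesting such an open corpus might look like. The file name and field names below (board, title, pushes) are assumptions for illustration, not the actual schema of any particular released dataset.

```python
import json

# Hypothetical example: read an open PTT corpus released as JSON.
# "ptt_corpus.json" and the field names (board, title, pushes)
# are assumed here for illustration only.
with open("ptt_corpus.json", encoding="utf-8") as f:
    posts = json.load(f)

for post in posts[:3]:
    board = post.get("board", "?")    # e.g. "Tech_Job" or "Gossiping"
    title = post.get("title", "")
    pushes = post.get("pushes", [])   # the push/boo replies under a post
    print(f"[{board}] {title} ({len(pushes)} pushes)")
```

Because everything is plain text plus a little metadata, a few lines like these can turn years of board chatter into training examples, which is exactly why the model speaks PTT so fluently.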
By comparison, while Dcard is popular, its anti-crawling measures are strong. Apart from early articles or viral events, the last two years of its content may not have been captured by ChatGPT.
The "soul" behind Monday is actually learned from all the words you've left online over over the past decade decades. Yes, it remembremembers a bit of everything you you've said.
Next time you chat with ChatGPT, you might wonder: "Hey, has it really seen my PTT posts from ten years ago?"
Very likely.