On reading Kanji characters
> The way a number is read depends on context and might introduce confusion in the dataset.
I've always wondered about this sentence from the How-to, too.
The way a kanji is read depends on context! And most kanji have two or more readings on their own.
I'll list as many as I can think of.
A. same meaning / same character / different reading
I think this is what @Adrijaned is concerned about:
> as long as there is not multiple ways to pronounce what you have written in the context it is in, it should be fine.
- 〇 (Rei / Zero / Maru) meaning "Zero". Maru is used only in limited contexts.
- 四 (Shi / Yon) meaning "Four". Both are major readings.
- 七 (Shichi / Nana) meaning "Seven". Both are major readings, too.
- 明日 (Ashita / Asu / Myōnichi) meaning "tomorrow". Asu and Myōnichi are a bit formal.
- 昨日 (Kinō / Sakujitsu) meaning "yesterday". Sakujitsu is a bit formal, too.
- 重複 (Chōfuku / Jūfuku) meaning "duplicate". Do more people read it as Jūfuku?
- 経緯 (Keii / Ikisatsu) meaning "circumstance". Do more people read it as Keii?
- 世論 (Seron / Seiron / Yoron) meaning "public opinion". I'm sure most people don't know about Seiron. It is generally read as Yoron.
Certainly, the context can narrow down the reading to some extent. But it's a "trend", not an absolute. How a speaker reads a word depends on their knowledge and lifestyle (e.g. occupation, how much they read, etc.). Or, more to the point, it can be a matter of "preference". Therefore, when we are asked to read something correctly, we are perplexed. "They are all correct, aren't they?"
The speech algorithm needs to know how to read everything.
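Here's a rough sketch of what "needs to know how to read everything" runs into. It only uses the readings listed in section A; the table name `READINGS`, the function `find_ambiguous`, and the sample sentence are made up for illustration, and the plain substring search is an assumption standing in for a real tokenizer.

```python
# Toy reading table built only from the examples in section A.
# This is NOT a real lexicon; it just illustrates the problem.
READINGS = {
    "〇": ["rei", "zero", "maru"],
    "四": ["shi", "yon"],
    "七": ["shichi", "nana"],
    "明日": ["ashita", "asu", "myōnichi"],
    "昨日": ["kinō", "sakujitsu"],
    "重複": ["chōfuku", "jūfuku"],
    "経緯": ["keii", "ikisatsu"],
    "世論": ["seron", "seiron", "yoron"],
}


def find_ambiguous(sentence, readings=READINGS):
    """Return (word, readings) pairs for table entries found in the
    sentence that have more than one reading, in order of appearance."""
    hits = []
    for word, options in readings.items():
        pos = sentence.find(word)  # naive substring match, no tokenizer
        if pos != -1 and len(options) > 1:
            hits.append((pos, word, options))
    hits.sort()  # order by position in the sentence
    return [(word, options) for _, word, options in hits]


if __name__ == "__main__":
    # Hypothetical sample sentence, not from any dataset.
    for word, options in find_ambiguous("明日は世論調査の日です。"):
        print(word, "->", " / ".join(options))
```

Even this tiny table flags two words in one short sentence, and nothing in the text says which reading a given speaker will pick.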
B. same meaning / different character / same reading
Which character is used depends on the nuance of each character, or simply on preference.
- 暗黒 / 闇黒
- 日差し / 陽射し
C. different meaning / same character / different reading
The reading depends on the context and the word.
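- 小人 (Kobito, "dwarf") / 小人 (Shōnin, "child", as in admission fees)
- 最中 (Saichū, "in the middle of") / 最中 (Monaka, a wafer sweet)
- 落着 (Rakuchaku) / 落ち着く (Ochitsuku)
- 過去 (Kako) / 過ぎ去る (Sugisaru)
- 明るい (Akarui) / 暗い (Kurai) / 明暗 (Meian)
- ここは人気があります。(This place is popular.) Here 人気 is read Ninki.
- ここは人気があります。(There are signs of people here.) Here 人気 is read Hitoke.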
Yes, it's impossible to determine the reading from such a short context.
D. different meaning / different character / same reading
So-called 同音異義語 (meaning "homonyms").
- けんとう: 見当 / 拳闘 / 軒灯 / 健闘 / 検討 / 賢答 and more.
- せいかく: 正確 / 性格 / 正格 / 精確 / 醒覚 and more.
- いし: 石 / 意志 / 医師 / 遺志 / 遺子 and more.
- かなう: 適う / 叶う / 敵う
- 記事に書けている部分がある。(There are parts of the article that could be written about.)
- 記事に欠けている部分がある。(There is a part of the article that is missing.)
- 生地に欠けている部分がある。(There is a part of the fabric that is missing.)
- 生地に掛けている部分がある。(There is a part that is draped over the fabric.)
- Um, more?
All Japanese pronunciations can be written in hiragana, but here's why they shouldn't be. Of course, there is a difference in intonation between 書けて and 欠けて. But 記事 and 生地 are the same. If we're trying to figure out the meaning from a hiragana sentence, we're going to need more "background".
- ここで履き物を脱ぎます。(This is where you take off your footwear.)
- ここでは着物を脱ぎます。(This is where you take off your kimono.)
It's a common pun. Like "Ice Cream" and "I Scream"? It's pronounced a little differently, though.
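To make the point from section D concrete, here's a minimal sketch that groups spellings by their hiragana reading. The pairs are taken from the lists above; `ENTRIES`, `group_by_reading`, and `candidates` are names I made up for illustration, and a real system would need a proper dictionary plus context.

```python
from collections import defaultdict

# (spelling, hiragana reading) pairs taken from the lists in section D.
ENTRIES = [
    ("見当", "けんとう"), ("拳闘", "けんとう"),
    ("健闘", "けんとう"), ("検討", "けんとう"),
    ("正確", "せいかく"), ("性格", "せいかく"),
    ("石", "いし"), ("意志", "いし"), ("医師", "いし"),
    ("記事", "きじ"), ("生地", "きじ"),
]


def group_by_reading(entries):
    """Build a reverse index: hiragana reading -> list of spellings."""
    groups = defaultdict(list)
    for spelling, reading in entries:
        groups[reading].append(spelling)
    return dict(groups)


def candidates(reading, groups):
    """Return every spelling that a hiragana string could stand for."""
    return groups.get(reading, [])


if __name__ == "__main__":
    groups = group_by_reading(ENTRIES)
    for reading, spellings in groups.items():
        print(reading, "->", " / ".join(spellings))
```

Going from kanji to hiragana throws information away: each hiragana string maps back to several words, so the lookup can only return a list of candidates, never a single answer.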