Sentence collection for Japanese language (日本語の文章収集について)

Hi all.
I've been writing sentences for the collection.
I'm publishing the sentences I've submitted, for reference, along with my thoughts and questions (sorry, those notes are written in kana orthography and seiji).

Now, let me ask you a few questions.

1. Numbers

Japanese version: 數字について (About numbers)

In Japanese, too, numbers can be read in different ways. As the how-to says, they should probably be avoided.
However, I think a numeral that is part of a word should be an exception. For example:

一緒 (issho), meaning "together"
十分 (jūbun), meaning "enough"
三十路 (miso-ji), meaning "thirty years old"
五十嵐 (Igarashi), a family name
九十九髪 (tsukumo-gami), meaning "old woman's gray hair" or "old woman with gray hair"
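
A rough sketch of how this exception could be checked mechanically, in Python. It assumes a hand-maintained allowlist of words. The names KANJI_NUMERALS, ALLOWED_WORDS and has_free_numeral are made up for illustration and are not part of the Sentence Collector. It removes the allowlisted words first, then flags any numeral that is left over.

```python
# Hypothetical helper, not part of any Common Voice tool.
KANJI_NUMERALS = set("〇一二三四五六七八九十百千万")
ARABIC_DIGITS = set("0123456789０１２３４５６７８９")

# Words in which a numeral is part of the word itself (the examples above).
ALLOWED_WORDS = ["九十九髪", "三十路", "五十嵐", "一緒", "十分"]


def has_free_numeral(sentence: str) -> bool:
    """Return True if a numeral remains after removing allowlisted words."""
    remaining = sentence
    # Remove longer words first so a shorter entry cannot split a longer one.
    for word in sorted(ALLOWED_WORDS, key=len, reverse=True):
        remaining = remaining.replace(word, "")
    return any(ch in KANJI_NUMERALS or ch in ARABIC_DIGITS for ch in remaining)


print(has_free_numeral("一緒に行こう。"))  # numeral is part of 一緒
print(has_free_numeral("三人で行こう。"))  # 三 is a real number here
```

A plain allowlist cannot see context, so 十分 would also pass when it means "ten minutes". A real check would need a human review step or a morphological analyzer.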

2. っ/ッ at the end of a word

Japanese version: メモの項 (Notes section)

For example,

負けてたまるかっ。
マジかよッ！

Some words spelled this way are established exclamations, for example "あっ" and "えっ".
There is no fixed pronunciation for this spelling; it depends on the speaker.
It is written to express a kind of "momentum", in other words, stress.
Can I include it in a sentence?
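
If such sentences ever need to be found (or filtered out) later, a simple check might be enough. The following Python sketch is only a heuristic under my own assumptions: it treats a っ/ッ as "word-final" when it is followed by punctuation, whitespace, or the end of the text. The function name has_trailing_sokuon is hypothetical.

```python
import unicodedata

SOKUON = {"っ", "ッ"}


def has_trailing_sokuon(sentence: str) -> bool:
    """Return True if a っ/ッ is followed by punctuation, whitespace, or nothing."""
    for i, ch in enumerate(sentence):
        if ch not in SOKUON:
            continue
        nxt = sentence[i + 1:i + 2]  # empty string at the end of the sentence
        if nxt == "" or nxt.isspace() or unicodedata.category(nxt).startswith("P"):
            return True
    return False


print(has_trailing_sokuon("負けてたまるかっ。"))
print(has_trailing_sokuon("私はそれを恋だなんて思ってないけどね。"))
```

An ordinary っ inside a word, as in 思って, is followed by another kana and is not flagged. A trailing っ written directly before another word with no punctuation in between would be missed.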

3. Sentence length

Japanese version: 文章の長さについて (About sentence length)

How about 35 characters or fewer, excluding punctuation?
According to the topic on the sentence length limit, sentences that can be spoken in under 10 seconds seem to be appropriate.

For reference, here are timings for a native speaker reading slowly:

今日は良い天気ですね。(3 seconds / 10 characters)
私はそれを恋だなんて思ってないけどね。(5 seconds / 18 characters)
お父さんは箔が付くからって言うけど、あたしとしちゃどうだって良いね。(7 seconds / 32 characters)
このサハラというのが本名から取ったにしろ、サハラ砂漠か何かから取ったにしろ、大して興味は無い。(11 seconds / 44 characters)
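
The counting rule above (characters excluding punctuation) can be sketched in Python as follows. This is an illustration only: MAX_CHARS is the limit proposed in this post, not an official rule, and the function names are made up. Punctuation is detected through the Unicode general category, which covers 、 and 。 as well as ASCII marks.

```python
import unicodedata

MAX_CHARS = 35  # limit proposed in this post, not an official rule


def count_chars(sentence: str) -> int:
    """Count characters, skipping whitespace and Unicode punctuation."""
    return sum(
        1
        for ch in sentence
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def within_limit(sentence: str, limit: int = MAX_CHARS) -> bool:
    return count_chars(sentence) <= limit


# The four timed samples from this post: (sentence, seconds when read slowly).
SAMPLES = [
    ("今日は良い天気ですね。", 3),
    ("私はそれを恋だなんて思ってないけどね。", 5),
    ("お父さんは箔が付くからって言うけど、あたしとしちゃどうだって良いね。", 7),
    ("このサハラというのが本名から取ったにしろ、サハラ砂漠か何かから取ったにしろ、大して興味は無い。", 11),
]

for sentence, seconds in SAMPLES:
    n = count_chars(sentence)
    # characters, seconds, characters per second, fits the proposed limit?
    print(n, seconds, round(n / seconds, 1), within_limit(sentence))
```

On these samples the slow reading rate comes out at roughly 3 to 5 characters per second, so a 35-character sentence should stay under 10 seconds, while the 44-character one does not.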

4. Range of kanji to be used

This is a difficult question. Can I go by a native speaker's sense of what is readable?
For example, 俺 (ore, meaning "I") is a very common first-person pronoun, and even children can read it, but I was surprised to see that someone in Common Voice read it as 彼 (kare, meaning "he").
It is not wrong to write it in hiragana. However, a sentence with many hiragana characters becomes difficult to read.

I am using machine translation for the English. Sorry if it sounds unnatural!
I hope this gives you some ideas.
Thank you!

Re: Post #2

Do I understand that those take place of English !?.,, or are similar in their intent?

Hmmm. Maybe. If someone can remove it from the data, I think it's okay.
How do we determine it? "Mood" is all I can say. A feeling. It works the same way as an exclamation mark. It expresses the speaker's feelings of surprise, emotion, anger, and joy.
Perhaps I'm just too concerned about it. But I'm glad you answered it. Thank you so much, @Adrijaned.

just choose whatever length you find suitable, and better shorter than longer.

The shorter the better. You're right. I should break up my sentences more.

as long as there is not multiple ways to pronounce what you have written in the context it is in, it should be fine.

I see. So I guess "abbreviations" are okay too. With the exception of a few, Japanese abbreviations have a fixed reading.

And this was my question. There are multiple ways to read kanji. I'll write about it later.

Re: Post #3

On reading Kanji characters

Re: Post #5

I don't know much about the system, so it was very helpful to get your input. Thank you, Craig.

The sentence collector should contain sentences in most instances, not single words. All languages may have ambiguity in how to read a single word, letter, or character, but there will typically be much less ambiguity for a whole sentence.

I don’t know Japanese, but in Chinese, handling of 多音字 (characters with multiple pronunciations) is an integral part of speech recognition systems. Such homographs are found for all languages except those with perfectly phonemic writing systems.

Again, the variation present in the dataset will help the model learn. We do our best to create sentences which might be similar to those found in the target application, and the model must take care of modeling the variation. Modern ASR models are more than capable of disambiguating 多音字 (homographs) when enough context is provided, and detecting and adapting to different accents and speaking styles, and it is potentially important that these phenomena are part of the dataset.

So, for example, the sentence "明日行くよ。" can be read either as "Ashita iku yo" or "Asu iku yo", and I can provide such a sentence to the Collector, right?
That is, as long as the context (the meaning of the sentence) is clear, the sentence writer does not have to worry about kanji with multiple readings. Rather, each reading is something the system needs in order to learn. Is this interpretation correct?

Yes, if a voice recognition system can learn Chinese, it will probably be fine with Japanese as well (and the existing system actually understands what we say). As you said, there are very few sentences that are difficult to interpret.

Re: Post #7

Hi Craig.

According to my dictionary (Wiktionary), the pronunciation ashita for 明日 is a colloquial form meaning “tomorrow”, while asu is a polite form with the same meaning, both native Japanese words with two different origins (not etymologically Chinese). Since I don’t know Japanese (明日 is an archaic or formal term in Chinese, while the same word has persisted in colloquial Beijing dialect but is written 明兒 and pronounced as a single syllable míngr), I’m unable to determine the degree of ambiguity of this sentence. Would one pronunciation be more likely than another here, if you showed the sentence to several different native speakers? If one pronunciation is more likely here, then a computer model could capture this pattern.

I looked it up (on the web) too. As you say, あした (ashita) and あす (asu) have different etymologies, but today they are used to mean the same thing: "the next day". As @safejourney pointed out, the example "明日行くよ。" is quite colloquial and casual in its wording, so, as I mentioned in post #4, most people will read it as ashita, though not always.

I wonder if there would be a way to make this sentence unambiguous? For example, add some more polite words to show that asu is the best reading, or add some colloquial words for ashita. In my opinion this would be the best approach, if it is indeed very ambiguous. It really depends on the application; I believe it could be problematic for text-to-speech (TTS), but ok for speech-to-text (ASR), since there is no guessing involved here for ASR.

Yes, it's reasonable to steer the reading through the way we write the sentence. But some people might read it as ashita even in a stiff sentence. I'd like to hear more opinions from Japanese speakers.
Sure, I think speech-to-text is fine. Fortunately, there are no words that are pronounced the same as 明日 (actually, there are a few, but we can narrow them down by context).

Hmmm, the reading used in everyday life is certainly limited.

In order for the system to properly convert speech into text, the sentences should still be appropriate (i.e. natural), right? I used to write some of my sentences in hiragana as a workaround for the how-to (e.g., I wrote 一回り as ひと回り). This kind of spelling is certainly in real use; I see it all the time on blogs and in magazines.
Also, people tend to substitute hiragana for kanji that are difficult or that may be seen as discriminatory. For example, 焼い弾 (shōidan; normally 焼夷弾, meaning "incendiary shell") in TV news, and 障がい者 (shōgaisha; normally 障害者, meaning "handicapped person") in public facilities. Of course, it depends on which characters we find difficult or discriminatory.
The names of plants and animals are not uniform, either. Some people write "cat" as 猫 (kanji), while others write ネコ (katakana).

Yeah, it doesn't matter either way (as long as it makes sense).

Re: Post #8

Hi safejourney!
Thank you for all the advice! I'm glad you took an interest in this topic!

Two is that some 漢字 you use are not natural for Japanese. For example, [漢字の讀み方について], people in Japan use normally 読み方. I think 讀み方 is used in a newspaper or an old book.

It is true that people who use 正字 (seiji) today are a minority. Of course, some individuals and groups do use them: for example, 國語問題協議會 (KOKUGOMONDAI KYOUGIKAI), and みんなのかなづかひ (Minna no kanazukai), a doujinshi that publishes writings that include seiji. I was surprised when I first saw them, too. Considering how few people can read them, it's probably best not to use them in sentence collection. So the sentences I sent to the Collector tool are in 新字 (shinji).