Sentence collection for Japanese language (日本語の文章収集について)
I am writing sentences for the collection.
I'm publishing the sentences I've submitted for reference. I also note my thoughts and questions (This note is written in the kana orthography and seiji, sorry).
Now, let me ask you a few questions.
1. Numbers
Even in Japanese, numbers can be read in more than one way. As written in the how-to, they should probably be avoided.
However, if it's part of a word, it would be an exception, I think. For example,
- 一緒 meaning "together"
- 十分 meaning "enough"
- 三十路 meaning "thirty years old"
- 五十嵐 is a person's family name
- 九十九髪 meaning "old woman's gray hair" or "old woman with gray hair"
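The exception described above, where a kanji numeral is part of an ordinary word or name, could be checked mechanically before submitting. The following is only a rough sketch: the allowlist holds just the five words listed above, and the function name `needs_numeral_review` and the set of numeral characters are assumptions made for illustration, not part of any Common Voice tool.

```python
import re

# Hypothetical allowlist: the words from the list above whose kanji
# numerals are part of the word itself. A real list would be much longer.
NUMERAL_WORDS = ["一緒", "十分", "三十路", "五十嵐", "九十九髪"]

# ASCII and full-width digits.
DIGITS = re.compile(r"[0-90-9]")
# Common kanji numerals (an assumed, non-exhaustive set).
KANJI_NUMERALS = re.compile(r"[〇一二三四五六七八九十百千万億兆]")


def needs_numeral_review(sentence: str) -> bool:
    """Return True if the sentence contains a digit, or a kanji numeral
    that is not covered by an allowlisted word."""
    if DIGITS.search(sentence):
        return True
    remainder = sentence
    # Remove longer words first so a short entry cannot split a longer one.
    for word in sorted(NUMERAL_WORDS, key=len, reverse=True):
        remainder = remainder.replace(word, "")
    return KANJI_NUMERALS.search(remainder) is not None


for s in ["一緒に行こう。", "三人で行こう。", "2時に会おう。"]:
    print(s, needs_numeral_review(s))
```

A check like this only flags candidates for a human to look at. It cannot tell whether an allowlisted spelling is being used in the intended sense.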
2. っ/ッ at the end of a word
Some of these words are indeed used as exclamations. For example, "あっ" and "えっ".
There is no specific pronunciation for this spelling. The pronunciation depends on the speaker.
It is written to convey a kind of "momentum", that is, emphasis.
Can I include it in a sentence?
3. Sentence length
How about 35 characters or less, excluding punctuation?
Going by the topic on the sentence length limit, sentences that can be spoken in less than 10 seconds seem to be appropriate.
For reference, these are the times a native speaker took to read each sentence slowly:
- 今日は良い天気ですね。(3 seconds / 10 characters)
- 私はそれを恋だなんて思ってないけどね。(5 seconds / 18 characters)
- お父さんは箔が付くからって言うけど、あたしとしちゃどうだって良いね。(7 seconds / 32 characters)
- このサハラというのが本名から取ったにしろ、サハラ砂漠か何かから取ったにしろ、大して興味は無い。(11 seconds / 44 characters)
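The counts above can be reproduced with a short script. This is a minimal sketch that assumes "punctuation" means any character in a Unicode punctuation category (which covers 。 and 、). The names `countable_length`, `within_limit`, and `MAX_CHARS` are made up for this example, and the 35-character limit is the one proposed in this post, not an official rule.

```python
import unicodedata

MAX_CHARS = 35  # limit proposed in this post (an assumption, not a rule)


def countable_length(sentence: str) -> int:
    """Count characters, skipping whitespace and Unicode punctuation."""
    return sum(
        1
        for ch in sentence
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def within_limit(sentence: str, limit: int = MAX_CHARS) -> bool:
    """Return True if the sentence fits within the character limit."""
    return countable_length(sentence) <= limit


# The timed samples from the list above: (sentence, seconds when read slowly).
SAMPLES = [
    ("今日は良い天気ですね。", 3),
    ("私はそれを恋だなんて思ってないけどね。", 5),
    ("お父さんは箔が付くからって言うけど、あたしとしちゃどうだって良いね。", 7),
    ("このサハラというのが本名から取ったにしろ、サハラ砂漠か何かから取ったにしろ、大して興味は無い。", 11),
]

for text, seconds in SAMPLES:
    n = countable_length(text)
    print(n, round(n / seconds, 2), within_limit(text))
```

With these four samples, only the 44-character sentence exceeds the proposed limit, and it is also the only one that took longer than 10 seconds to read.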
4. Range of kanji to be used
This is a difficult question. Can I rely on a native speaker's sense of which kanji are readable?
For example, 俺 (meaning "I") is a very common first-person pronoun, and even children can read it, but I was surprised to see that someone read it as kare (meaning "he") in Common Voice.
It is not wrong to write it in hiragana. However, a sentence with many hiragana characters becomes difficult to read.
I am using machine translation for the English. Sorry if it reads unnaturally!
I hope to inspire you.
Do I understand that those take the place of English !?., or are similar in their intent?
Hmmm. Maybe. If someone can remove it from the data, I think it's okay.
How do we determine it? "Mood" is all I can say. A feeling. It works the same way as an exclamation mark. It expresses the speaker's feelings of surprise, emotion, anger, and joy.
Perhaps I'm just too concerned about it. But I'm glad you answered it. Thank you so much, @Adrijaned.
just choose whatever length you find suitable, and better shorter than longer.
The shorter the better. You're right. I should break up my sentences more.
as long as there is not multiple ways to pronounce what you have written in the context it is in, it should be fine.
I see. So I guess "abbreviations" are okay too. With the exception of a few, Japanese abbreviations have a fixed reading.
And this was my question. There are multiple ways to read kanji. I'll write about it later.
On reading Kanji characters
I don't know much about the system, so it was very helpful to get your input. Thank you, Craig.
So, for example, the sentence "明日行くよ。" can be read either as "Ashita iku yo" or "Asu iku yo", and I can provide such a sentence to the Collector, right?
--that is, as long as the context (the meaning of the sentence) is clear, the speaker does not have to worry about multiple readings of the kanji. Rather, each reading is necessary for the system to learn. Is this interpretation correct?
Yes, if the voice recognition system can learn Chinese, it will probably be fine in Japanese as well (and the existing system actually understands what we speak). As you said, there are very few sentences that are difficult to interpret.
I looked it up (on the web) too. As you say, the etymologies of あした and あす are different. But today, they are used to mean the same thing: "the next day". As @safejourney pointed out, the example "明日行くよ。" is quite colloquial and casual in its wording. As I mentioned in post #4, most people will read it as Ashita, though not always.
Yes, it's reasonable to control it with the way we write it. But some people might read it as Ashita even in a stiff sentence. I'd like to hear more opinions from Japanese speakers.
Sure, I think speech-to-text is fine. Fortunately, there are no words that are pronounced the same as 明日 (actually, there are a few, but we can limit them by context).
Hmmm, the reading used in everyday life is certainly limited.
In order for the system to properly convert speech into text, the sentences should still be appropriate (i.e. natural), right? I used to write some of my sentences in hiragana as a workaround for the how-to's rules (e.g., I wrote 一回り as ひと回り. This kind of spelling is certainly in use. I see it all the time on blogs and in magazines).
Also, people tend to substitute hiragana for kanji that are difficult or that may be seen as discriminatory. For example, 焼い弾 (焼夷弾, meaning "incendiary bomb") in TV news, and 障がい者 (障害者, meaning "person with a disability") in public facilities. Of course, it depends on which characters we find difficult or discriminatory.
The names of plants and animals are not uniform, either. Some people write "cat" as 猫 (kanji), while others write ネコ (katakana).
Yeah, it doesn't matter either way (as long as it makes sense).
Thank you for all the advice! I'm glad you took an interest in this topic!
It is true that people who use 正字 (seiji) today are a minority (of course, there are individuals and groups that use them. For example, 國語問題協議會; みんなのかなづかひ is a doujinshi that publishes writings that include seiji). I was surprised when I first saw it, too. Considering how few people can read them, it's probably best not to use them in sentence collection. So the sentences I sent to the Collector tool are in 新字.