First Target Segment in Japanese language

Japanese language: Supplements for Digits, Counters, Popular readings, etc.

Linking voice, word, and number

I mentioned this in the topic "Japanese language: Supplements for Digits, Counters, Popular readings, etc.":

Etc.

Despite being verified by native speakers, the current Japanese target segment seems unnatural.

I'm not familiar with how voice recognition systems work, but would it be possible, for example, to show the speaker the Arabic numerals 0123456789 and link the recorded voice to the kanji text 零一二三四五六七八九 in the dataset?
Arabic numerals make it unambiguous that the prompt is a number, which matters if we can't expect annotations on the sentence card.
In fact, for a single-digit number, it should be enough to link the pronunciation (voice), the written form in the language, and the Arabic numeral, right?
(Of course, it might not work well in other languages. In any case, I'm not an expert in voice technology.)
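
To make the linking idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the function name `link_prompt` and the record fields `displayed` and `dataset_text` are hypothetical and do not come from any real dataset schema or tool. It converts digit by digit, so it covers single digits and digit strings only, not positional readings such as 十.

```python
# Hypothetical sketch: the names and record layout here are assumptions
# made for illustration, not an existing dataset format.

# Map each Arabic digit to the kanji digit quoted in the post.
DIGIT_TO_KANJI = dict(zip("0123456789", "零一二三四五六七八九"))


def link_prompt(displayed: str) -> dict:
    """Pair the Arabic-numeral prompt shown to the speaker with the
    kanji text that would be stored alongside the recording."""
    if not displayed or any(ch not in DIGIT_TO_KANJI for ch in displayed):
        raise ValueError(f"expected only the digits 0-9, got {displayed!r}")
    # Digit-by-digit conversion only; no handling of 十, 百, and so on.
    kanji = "".join(DIGIT_TO_KANJI[ch] for ch in displayed)
    return {"displayed": displayed, "dataset_text": kanji}


if __name__ == "__main__":
    print(link_prompt("7"))
    print(link_prompt("0123456789"))
```

The speaker would only ever see the `displayed` value, while the recording would be stored against `dataset_text`. Whether the real pipeline could carry two text fields per clip like this is something I can't confirm.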