First Target Segment in the Japanese language

Linking voice, word, and number

I mentioned this earlier in "Japanese language: Supplements for Digits, Counters, Popular readings, etc.":

Tell the speaker that the item is a number.
- For example, annotate the sentence card with something like:
  - This is part of the first target segment [digits 0-9 / Yes / No / Hey / Firefox]
  - Please read the number.
- Japanese is full of homonyms, so a word shown in hiragana alone doesn't tell the speaker that it is a number.
  - Why hiragana? (Perhaps to pin down a single reading.)
  - And does Common Voice want to collect different notations (e.g. hiragana, katakana, kanji)?
- Even with kanji, "一" (one), for example, is just a horizontal bar, as you can see, and is indistinguishable from a symbol.
- Speakers won't necessarily have read this topic.
- Granted, maybe it's enough that speakers can pronounce it. But we want them to pronounce it as a number, don't we?
Why are Hey (ヘイ) and Firefox (ファイアフォックス) excluded? Both can be pronounced in Japanese.
- ref. common-voice/singleword-benchmark.txt
The most popular reading is missing.
Some of the readings are uncommon.

Etc.

Despite being verified by native speakers, the current Japanese target segment seems unnatural.

I don't know much about speech recognition systems, but wouldn't it be possible, for example, to show the speaker the Arabic numerals 0123456789 and link the recorded voice to the kanji text 零一二三四五六七八九 in the dataset?
Arabic numerals make it unmistakable that the item is a number. (That is, if we can't hope for annotations on the sentence card.)
In fact, for single-digit numbers, as long as we can link the pronunciation (voice), the written form in the language, and the Arabic numeral, there should be no problem, right?
(Of course, it might not work well in other languages. In any case, I'm not an expert in voice technology.)
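
To make the idea above a bit more concrete, here is a minimal Python sketch of the linking I have in mind. Everything in it is made up for illustration: the names DigitPrompt, make_prompt and label_clip, the row layout, and the clip file name are my own assumptions, not how Common Voice actually stores its data. It only shows that the digit displayed to the speaker and the kanji text kept in the dataset can be two fields of the same record.

```python
from dataclasses import dataclass

# Kanji for 0-9, in the order given in the post above.
KANJI_DIGITS = "零一二三四五六七八九"


@dataclass(frozen=True)
class DigitPrompt:
    """Hypothetical record: what the speaker sees vs. what the dataset stores."""
    displayed: str     # shown on the sentence card, e.g. "1"
    dataset_text: str  # stored next to the clip, e.g. "一"


def make_prompt(digit: int) -> DigitPrompt:
    """Build the prompt for a single digit (0-9)."""
    if not 0 <= digit <= 9:
        raise ValueError(f"expected a single digit 0-9, got {digit!r}")
    return DigitPrompt(displayed=str(digit), dataset_text=KANJI_DIGITS[digit])


def label_clip(clip_path: str, prompt: DigitPrompt) -> dict:
    """Link a recorded clip to both notations (hypothetical row layout)."""
    return {
        "path": clip_path,
        "displayed": prompt.displayed,
        "sentence": prompt.dataset_text,
    }


# One prompt per digit; the speaker would only ever see `displayed`.
prompts = [make_prompt(d) for d in range(10)]
```

For example, label_clip("clip_0001.mp3", make_prompt(7)) would give a row whose "displayed" field is "7" and whose "sentence" field is "七", so the voice, the Arabic numeral, and the Japanese notation stay linked.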