Copyright Issues in the Japanese Sentence Collector

The Japanese-language Sentence Collector has the following problems:

“username”: “navta”, “source”: “http://www.edrdg.org/wiki/index.php/Tanaka_Corpus”
1. “sentence”: “あきらめたら、そこで試合終了ですよ。”
  - From SLAM DUNK.
  - Ref: あきらめたら、そこで試合終了ですよ。 - Google Search
2. “sentence”: “我が生涯に一片の悔いなし。”
  - From 北斗の拳.
  - Ref: 我が生涯に一片の悔いなし。 - Google Search
3. “sentence”: “僕は新世界の神となる。”
  - From DEATH NOTE.
  - Ref: 僕は新世界の神となる。 - Google Search
4. “sentence”: “あんたらの名前なんか興味ないね。どうせこの仕事が終わるとお別れだ。”
  - From ファイナルファンタジーVII.
  - Ref: あんたたちの名前なんか興味ないね。 - Google Search
  - The collected sentence differs slightly from the original (for example, あんたら instead of あんたたち).

Perhaps this is a problem with the corpus.

I went to the source page and checked the "Public Domain version", and it contains the sentences above. They come from famous manga and games, which are obviously not in the public domain. The "Public Domain version" file marks some sentences with a [Manga] flag, but some of the sentences above are not flagged. Honestly, I can't determine how much offending text is mixed in.

Three Non-Public Domain Sources

Japanese language collection, again.

“username”: “navta”
- “source”: “http://d.hatena.ne.jp/satoru_net/20151030/1446184756”
  - This source is an unauthorized reproduction of the ATR 503 sentences.
  - Corpus Name: ATR 503 sentences (Japanese name: ATR音素バランス503文)
  - The original is a paid product: ATRデジタル音声データベース｜ATR音声言語データベース｜ATR-Promotions
- “source”: “https://github.com/voice-statistics/voice-statistics.github.com/blob/master/assets/doc/balance_sentences.txt”
  - Creator: 日本声優統計学会
  - Corpus Name: 音素バランス文
  - License: CC BY-SA 4.0; the creator states that it is "distributed under a CC-BY-SA license" (CC-BY-SA ライセンスで配布しています．)
- “source”: “https://github.com/matbahasa/TALPCo/blob/master/data_jpn.txt”
  - Current: https://github.com/matbahasa/TALPCo/blob/master/jpn/data_jpn.txt
  - Corpus Name: TUFS Asian Language Parallel Corpus
  - License: CC BY 4.0; the corpus states that "TALPCo is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license."

Are the reviewers Rrock9312 and Rrock2139 the same person?

I checked common-voice/sentence-collector.json. The current Japanese source text is probably not all in the public domain, which is a shame.

If the current 5,528 sentences are deleted, will it no longer be possible to record voices (i.e., because fewer than 5,000 sentences remain)?
If the full text is not available, will the recordings also be removed from the dataset, so that the dataset cannot be provided? (In other words, can we no longer use the voices that have already been recorded?)
As far as I can tell, the current Sentence Collector contains the sources above plus the text I added, but are there any other texts? I'd like to check whether there are any. I don't trust the old collection. (Of course, I want a third party to verify the sources I used, too.)
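To answer the questions above, it would help to list every source in the collection and see how many sentences would remain if the doubtful ones were removed. Here is a rough Python sketch of that check. It assumes the export is a JSON list of objects with the "username", "source" and "sentence" keys quoted earlier; I have not confirmed the real layout of sentence-collector.json. The records, the `load_records` helper and the 5,000 threshold are all placeholders or assumptions.

```python
import json
from collections import Counter


def load_records(path):
    """Load the collector export; assumed here to be a JSON list of objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_by_source(records):
    """Count sentences per (username, source) pair."""
    return Counter((r.get("username"), r.get("source")) for r in records)


def remaining_after_removal(records, flagged_sources):
    """Count sentences left once every flagged source is dropped."""
    return sum(1 for r in records if r.get("source") not in flagged_sources)


# Made-up placeholder records, only to show the assumed shape of the input.
sample = [
    {"username": "user_a", "source": "https://example.org/a", "sentence": "placeholder 1"},
    {"username": "user_a", "source": "https://example.org/a", "sentence": "placeholder 2"},
    {"username": "user_b", "source": "https://example.org/b", "sentence": "placeholder 3"},
]

counts = count_by_source(sample)
left = remaining_after_removal(sample, {"https://example.org/a"})
MIN_SENTENCES = 5000  # assumed threshold, taken from the question above

print(counts.most_common())
print(left, "sentences left; enough to record:", left >= MIN_SENTENCES)
```

With the real file, `load_records(...)` would replace `sample`, and the set of flagged sources would come from whatever the license review finds.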

We may need to verify every source provided by navta. This person contributes to other languages as well (at least to the English collection).
Seriously, we should ask for volunteers to verify the sources for all languages. (I would really like Mozilla staff to verify them, but they will need time to do so.) There is too much to lose, including the recorded voices and the dataset itself.
- Go to the source and check its license. Then randomly extract text from the source and search the web for it, to check that the source is not itself an unauthorized reproduction.
- However, this is useless unless it is done by a trusted person. Just pressing the confirm button is not proof.
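The random spot check described above could look something like this. It is only a sketch: `sample_for_spot_check` and `exact_phrase_query` are names I made up, and the reviewer still has to run the searches and judge the results by hand.

```python
import random


def sample_for_spot_check(sentences, k=5, seed=None):
    """Pick up to k distinct, non-empty sentences at random for manual checking."""
    rng = random.Random(seed)  # a fixed seed lets a second reviewer redo the same check
    unique = sorted({s.strip() for s in sentences if s.strip()})
    return rng.sample(unique, min(k, len(unique)))


def exact_phrase_query(sentence):
    """Wrap a sentence in double quotes for an exact-phrase web search."""
    return '"' + sentence + '"'


# Placeholder strings; in practice these would be the sentences from one source.
sentences = ["placeholder 1", "placeholder 2", "placeholder 2", "  ", "placeholder 3"]

for picked in sample_for_spot_check(sentences, k=2, seed=0):
    print(exact_phrase_query(picked))
```

Recording the seed and the picked sentences alongside the review would give others something to verify, rather than just a pressed confirm button.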
commonvoice.mozilla.org should warn about this danger. Contributors are currently reading aloud text that is not in the public domain, and users of the dataset are using data that they are not allowed to use.

This is insane, and a betrayal of the people.