Text Corpus Link Collection [First edition]

Note: The original was a wiki post and the information may have already been updated.


A collection of links to text corpora. Feel free to add to it.
Even if there is no corpus in your language, a public domain corpus in another language can be translated. Translation is not easy, but it can be a good alternative.
Of course, this collection will also help those who use it for purposes other than Common Voice.

Sentence collection is the starting point for voice recordings and datasets, and is an important part of Common Voice. Please share any corpus you know of to help everyone.


How to fill each field

  1. Corpus: Link to the corpus.
    • Link to the content we (will) collect.
      • If a list exists, link to the root of the list (a page that shows all of the public domain works).
      • If only a portion of the content is in the public domain, mention that in the Note field.
    • Be precise and concise with the name of the corpus.
      • If you don't know the name, use the title of the page (page heading, browser title bar, etc.).
      • If there is a specific version of the corpus, state that as well.
      • e.g. The Sinumade Book of Adventures (2020 edition)
  2. Language: Write the language exactly as it appears in the Sentence Collector. For example, Chinese is written as "Chinese" regardless of region.
    • If there is more than one language, separate them with commas and list them in alphabetical order. Example: English, French, German
    • If the language is not in the Sentence Collector, append a + sign to the language name. Example: English+
  3. State: If possible, mark the following:
    • CC0: The corpus text itself states the permission, or it links to a document that states the permission.
    • PD: Public Domain. This is mainly intended for works whose copyright has expired. If the copyright holder has waived the rights to the work, mark it as CC0 instead.
  4. Permission: A link to a document that indicates the corpus permission.
    • Related documents other than the permission should be written in the Note field.
  5. Note: Anything to keep in mind when collecting. For example, that there are limitations on collection (e.g., only part of the corpus can be collected) or that the text needs to be edited.
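
The Language and State rules above can be sketched as a small helper. This is only an illustration: `KNOWN_LANGUAGES` is a hypothetical stand-in for the list of languages in the Sentence Collector (check the tool itself for the real list), and the function names are made up for this example.

```python
# Hypothetical subset of the languages offered by the Sentence Collector.
# This is an assumption for illustration; the real list lives in the tool.
KNOWN_LANGUAGES = {"Chinese", "English", "French", "German", "Japanese"}

# States described in this post; an empty string means "not marked yet".
VALID_STATES = {"CC0", "PD", ""}


def format_language_field(languages, known=KNOWN_LANGUAGES):
    """Build the Language field: alphabetical order, comma-separated,
    with a + sign appended to names that are not in the Sentence Collector."""
    # Strip whitespace, drop empty names and duplicates, then sort.
    names = sorted({name.strip() for name in languages if name.strip()})
    return ", ".join(name if name in known else name + "+" for name in names)


def check_state(state):
    """Return the State value unchanged, or raise if it is not CC0, PD, or empty."""
    if state not in VALID_STATES:
        raise ValueError(f"State must be CC0, PD, or empty, got {state!r}")
    return state


if __name__ == "__main__":
    print(format_language_field(["German", "English", "French"]))
    print(format_language_field(["Klingon", "English"]))
```

With the assumed language list, the first call prints `English, French, German` and the second prints `English, Klingon+`, because "Klingon" is not in the hypothetical set.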

Appropriate corpus

A corpus that has been confirmed to be in the public domain.

Corpus | Language | State | Permission | Note
Wikipedia |  |  |  | There are limitations. See: Sentence Extractor - Current Status and Workflow Summary
GitHub - irvin/cc0-sentences | Chinese | CC0 |  |
mlog | English | CC0 | Everything by me – Happy GNU Year & Public Domain Day – mlog |
zen habits | English | CC0 | Uncopyright : zen habits | Leo's ebooks are also in the public domain.
mnmlist | English | CC0 | » uncopyright mnmlist |
星空文庫 | Japanese | CC0 |  | Only the Public Domain category can be collected.
deztec.jp | Japanese | CC0 | Info/趣味のWebデザイン | Contacted 2020-10-28
死ぬまで憶えておいて | Japanese | CC0 | sinumade.net 槪要 | Written in uncommon orthography (historical kana orthography and Kyūjitai); needs to be edited.

Candidate corpus (DO NOT USE this corpus)

A corpus that has not been confirmed to be in the public domain.

Corpus | Language | State | Permission | Note

Invalid corpus (DO NOT USE this corpus)

A corpus that must not be used.
For example, a corpus that was used but found to be inappropriate.

Corpus | Language | State | Permission | Note
Tanaka Corpus (Public Domain version) | English, Japanese |  |  |

Supplement


Matters for consideration


Note

  1. The goal is to share the corpora each of us knows about so that collection in every language becomes more efficient and active. There is a concern that sharing this information may cause confusion in the work, so I would like to hear your opinions on this.
  2. Invalid corpora in particular should be shared so that volunteers' efforts are not wasted.
  3. Another aim is to identify inappropriate corpora.