Text Corpus for Collection

I believe that texts should be collected and validated by people who are fluent in the language in question.
So I may suggest a corpus in a language other than my native one, but people fluent in that language should validate it to confirm that it is a suitable corpus.

I've already had my share of painful experiences with Japanese collections, and I don't want anyone else to make the same mistake.

The White House

Copyright Policy | The White House:

Pursuant to federal law, government-produced materials appearing on this site are not copyright protected. The United States Government may receive and hold copyrights transferred to it by assignment, bequest, or otherwise.

Seriously? I'm not familiar with the law, but can we add this to the list of appropriate corpora?

2020-11-15: I didn't know this, but it seems that copyright doesn't apply to legal texts, news reports, etc. in Japan either.

What about your language? This kind of text is not "conversational" by any means, but it would let us increase the number of sentences.

Internet Archive

Most of the resources may be useful, but there is also content that clearly violates someone's rights. (For example, looking only at Japanese content, there are videos of famous anime and pictures of idols.)
If we can verify the license of each item, we may be able to add them to the list. What do you think?


This is in Spanish, which I can't read.
It may be CC0, but I'm not sure.
The Biblioteca Nacional de España comes up somewhat by searching for CC0.

Is there anyone who can look it up?
Are there any sentences in these resources that could be used?


I'm sure there are people here who are familiar with Mozilla, so I'm going to ask: Are there any CC0 works in Mozilla's public resources?

All I could find was an MDN code sample: Code samples and snippets - About MDN Web Docs - The MDN project | MDN


I mention OSCAR because I know some of you may want to add it to the list.

I've downloaded the Japanese file.
Some of the text contained unique, identifiable sentences; a quick Google search showed that they were extracted from personal sites, corporate promotions, reports of charitable activities, porn sites, etc. There were also a lot of proper nouns (names of identifiable individuals, groups and works).

I have contacted OSCAR about this and am waiting to hear back. (I asked what process they used to obtain the text, how they check that it is legitimate, etc.)
But whatever the reply, I will not add OSCAR to the list.

If you think it's appropriate, useful, and worth using in other languages, please add it to the list, and mention "Japanese files have concerns" in the Note field.

To: All

Write rather than Translate

I said earlier that translation is not easy, but that it can be a good alternative.
However, as @nukeador has already mentioned in Post #2 of "Problems finding public domain sentences", it is more efficient, and the quality of the text is more reliable, if you write your own text rather than translate it.
Translation requires an understanding of the foreign language and the ability to edit the wording properly.
For example, you can run a machine translation and then rework the generated text yourself into something completely different. As long as the foreign text is treated only as raw "material", I think anyone can put a foreign-language corpus to good use.

You may not be comfortable with the idea of writing the text yourself.
But, as I mentioned in "Ideas for finding public domain text", it can be as simple as writing a tweet, a chat message or an email.
If that doesn't work for you either, you can describe an everyday action you're doing, or a landscape. For example, "My neighbor's dog is annoying," or "I posted this on the forum but no one is liking it". Your soliloquy will help everyone.
I like sentences like those; they're easy, and many people will enjoy reading them.

Everyone has the secret to creating a corpus.

青空文庫 (Aozora Bunko)

Reference (Japanese Copyright Law): 著作権法 - e-Gov法令検索

In English

Aozora Bunko contains digital versions of out-of-copyright works.
I received a reply from Aozora Bunko stating that the requests about metadata and credits in the "取り扱い規準" (Handling Standards) are just "hopes and expectations".
Japanese copyright law defines a work as "思想又は感情を創作的に表現したもの" (a creative expression of thoughts or feelings), so a work must have "創作性" (creativity). On that basis, I concluded that the out-of-copyright works in Aozora Bunko remain in the public domain (i.e. Aozora Bunko is not the author).

When collecting, don't forget to check whether the work is in the public domain. Some works in Aozora Bunko are still under copyright; there are notes on them, but collectors should always check for themselves.
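
If you keep a catalog of candidate works, a small filter can make this check routine. The following is only a sketch: the column names `作品名` (title) and `作品著作権フラグ` (copyright flag), the values `あり`/`なし`, and the sample rows are all assumptions made up for illustration, so check them against whatever catalog file you actually use.

```python
import csv
import io

# Hypothetical catalog excerpt. The column names, flag values and rows
# are illustrative assumptions, not real Aozora Bunko data.
SAMPLE_CATALOG = """作品名,作品著作権フラグ
作品A,なし
作品B,あり
作品C,なし
"""


def public_domain_titles(catalog_file, title_col="作品名",
                         flag_col="作品著作権フラグ", free_value="なし"):
    """Return titles whose copyright flag says no copyright remains."""
    reader = csv.DictReader(catalog_file)
    return [row[title_col] for row in reader
            if row[flag_col].strip() == free_value]


titles = public_domain_titles(io.StringIO(SAMPLE_CATALOG))
print(titles)
```

A filter like this only narrows the list. Each remaining work should still be checked by a person before it is collected.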

Please point out any errors in the interpretation of copyright.

Note: "旧字旧仮名" (Kyūjitai and historical kana orthography) is not commonly used in Japan today, so please look for the "新字新仮名" version in the list, or have the collector convert the text to Shinjitai and modern kana usage.
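
For the kanji part of that conversion, a character table can do a first pass. This is a minimal sketch, assuming a tiny hand-made table of five characters; `KYUJITAI_TO_SHINJITAI` and `to_shinjitai` are hypothetical names, a real table would need to be far larger, and historical kana usage is not handled at all, so a fluent reader still has to review the result.

```python
# Tiny illustrative table of old-form (Kyūjitai) to new-form (Shinjitai) kanji.
# A real conversion table would need many more entries.
KYUJITAI_TO_SHINJITAI = str.maketrans({
    "國": "国",
    "學": "学",
    "體": "体",
    "舊": "旧",
    "假": "仮",
})


def to_shinjitai(text: str) -> str:
    """Replace the old-form kanji listed in the table; leave everything else as-is."""
    return text.translate(KYUJITAI_TO_SHINJITAI)


print(to_shinjitai("舊字舊假名"))  # prints 旧字旧仮名
```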