Text Corpus for Collection

I believe that texts should be collected and validated by people who are fluent in the language in question.
So I may suggest a corpus in a language other than my native one, but people fluent in that language should check whether it is a valid corpus.

I've already had my share of painful experiences with Japanese collections, and I don't want anyone else to make the same mistake.


The White House

Copyright Policy | The White House:

Pursuant to federal law, government-produced materials appearing on this site are not copyright protected. The United States Government may receive and hold copyrights transferred to it by assignment, bequest, or otherwise.

Seriously? I'm not familiar with the law, but can we add this to the list of appropriate corpora?

2020-11-15: I didn't know this, but it seems that copyright doesn't apply to legal texts, news reports, etc. in Japan either.

What about your language? This kind of text is not "conversational" by any means, but it could increase the number of sentences.

Internet Archive

Most of the resources may be useful, but there is also content that clearly violates someone's rights. (To take Japanese content as an example, there are videos of famous anime and pictures of idols.)
If we can verify the license of each item, we may be able to add them to the list. What do you think?

datos.bne.es

It's in Spanish, which I can't read.
Maybe it's CC0, but I don't know.
The Biblioteca Nacional de España turns up in some search results for CC0.

Is there anyone who can look into it?
Are there any sentences in these resources that could be used?

Mozilla

I'm sure there are people here who are familiar with Mozilla, so I'm going to ask: Are there any CC0 works in Mozilla's public resources?

All I could find were the MDN code samples: Code samples and snippets - About MDN Web Docs - The MDN project | MDN

OSCAR

I mention this because I know some of you may want to add it to the list.

I've downloaded the Japanese file.
Some of the text had unique, identifiable sentences; a quick Google search shows that they were extracted from personal sites, corporate promotions, reports of charitable activities, porn sites, etc. There were also a lot of proper nouns (names of identifiable individuals, groups and works).

I have contacted OSCAR about this and am waiting to hear back. (I asked what process they used to obtain the text, how they checked that it is legitimate, etc.)
But whatever the reply, I will not add OSCAR to the list.

If you think it's appropriate, useful, and worth using in other languages, please add it to the list and mention "Japanese files have concerns" in the Note field.

In Japanese

リストに追加したい人もいると思うので、触れておきます。

私は日本語ファイルをダウンロードしました。
コーパスの一部には、独特の、特定可能な文章がありました。Googleで検索してみると、それらは個人サイト、企業の宣伝、慈善団体の活動報告、ポルノサイト等から抽出されたことがわかります。固有名詞(特定可能な個人や集団、作品の名前)もたくさんありました。

私はこの件についてOSCARに問い合わせ、返事を待っているところです。
でも、どのような返事であれ、私はOSCARをリストに追加しません。

もし他の言語では適切で、有用であり、使用に値するというのであれば、「日本語ファイルには懸念がある」とNote欄に記載した上で、リストに追加して下さい。

To: All

Write rather than Translate

I said, "Translation is not easy, but it can be a good alternative."
However, as @nukeador has already mentioned in Post #2 on Problems finding public domain sentences, it would be more efficient, and the quality of the text would be more reliable, if you write your own text rather than translate it.
Translation requires an understanding of the foreign language and the ability to edit the words properly.
For example, you can run a machine translation and then rework the generated text yourself into something completely different. As long as it is treated only as raw "material", I think anyone can make use of a foreign-language corpus.

You may not be comfortable with the idea of writing it yourself.
But, as I mentioned in Ideas for finding public domain text, you can draw on your own tweets, chats or emails.
If you don't have any of those either, you can describe an everyday action or a landscape. Like, "My neighbor's dog is annoying," or "I posted this on the forum but no one is liking it". Your soliloquy will help everyone.
I like those sentences; they're easy, and many people will enjoy reading them.

Everyone has the secret to creating a corpus.


青空文庫 (Aozora Bunko)

In Japanese

青空文庫には、著作権切れの作品を電子化したコンテンツがあります。
青空文庫から返信を頂きましたが、「取り扱い規準」のメタ情報やクレジットの希望は、あくまで「希望・期待」である、とのことでした。
著作権の発生する規定が「思想又は感情を創作的に表現したもの」で、著作物に「創作性」が必要なことを考慮すると、青空文庫内の著作権切れ作品は、パブリックドメインのままである(=青空文庫は著作者ではない)と私は判断しました。

収集の際には、作品がパブリックドメインであるか確認することを忘れないで下さい(青空文庫には、著作権が存続する作品もあります。註記もありますが、収集する人が必ず自分で確認して下さい)。

著作権の解釈に関して誤りがあれば、ご指摘下さい。

註:旧字(正字)・旧仮名(歴史的仮名遣、正仮名遣)は、今日日常的には用いられていないので、リストから「新字新仮名」版を探すか、収集する人が新字・新仮名(現代仮名遣)に編集して下さい。

参考 (Japanese Copyright Law):著作権法 - e-Gov法令検索

In English

Aozora Bunko contains digital versions of out-of-copyright works.
I received a reply from Aozora Bunko stating that the requests about metadata and credit in the "取り扱い規準" (Handling Standards) are just "hopes and expectations".
The provision that gives rise to copyright covers "思想又は感情を創作的に表現したもの" (a creative expression of thoughts or feelings), so a work requires "創作性" (creativity). Considering this, I concluded that the out-of-copyright works in Aozora Bunko remain in the public domain (i.e. Aozora Bunko is not the author).

When collecting, don't forget to check whether the work is in the public domain (some works in Aozora Bunko are still under copyright. There are notes, but collectors should always check for themselves).

Please point out any errors in the interpretation of copyright.

Note: "旧字旧仮名" (Kyūjitai and historical kana orthography) is not commonly used in Japan today, so please look for the "新字新仮名" version in the list, or have the collector edit the text into Shinjitai and modern kana usage.
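
For collectors who edit the text themselves, the character part of that conversion could be sketched roughly as below. This is only an illustration, not a tool anyone here is known to use: the table `KYUJITAI_TO_SHINJITAI` and the function `to_shinjitai` are hypothetical names, and the table holds just a handful of well-known pairs. Historical kana usage depends on the word and its context, so it is not handled here and still needs a fluent person to edit it.

```python
# Hypothetical, deliberately tiny table of old-form (Kyūjitai) kanji
# and their new-form (Shinjitai) equivalents. A real table would be
# far larger and should be reviewed by a fluent speaker.
KYUJITAI_TO_SHINJITAI = {
    "國": "国",
    "學": "学",
    "體": "体",
    "會": "会",
    "舊": "旧",
    "氣": "気",
}

# str.maketrans accepts a dict whose keys are single characters.
_TABLE = str.maketrans(KYUJITAI_TO_SHINJITAI)


def to_shinjitai(text: str) -> str:
    """Replace the old-form kanji listed in the table; leave everything else as-is."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(to_shinjitai("舊字體の學校"))  # prints 旧字体の学校
```

Characters missing from the table pass through unchanged, so the output still has to be read and validated by someone fluent in Japanese.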