Lists and databases

See also the J-ling resource page >> Databases

That page lists some datasets built from pedagogical materials (Genki, Tobira, etc.). Excel files are available upon request.

Text-mining

Voyant Tools: “a web-based reading and analysis environment for digital texts”

OpenRefine: “a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data”

Film transcriptions (glossed and translated, available upon request)

More titles will be added. Currently available:

[film] My Neighbor Totoro (Excel file)

  • Available: transcription (one version with kanji, one entirely in hiragana), romaji transliteration, linguistic glosses, and English translations

[film] Whisper of the Heart (Excel file)

  • Available: transcription (one version with kanji, one entirely in hiragana), romaji transliteration, linguistic glosses, and English translations

[drama] Hanzawa Naoki (Excel file available in Japanese)

  • Available: transcription (one version with kanji, one entirely in hiragana), romaji transliteration, linguistic glosses, and English translations

[animation] Peeping Life Library (short animation video series)

  • #13 cheen rokku-goshi no yoru (transcription)

Partial linguistic data available (Japanese and English) 

  • [manga] What Did You Eat Yesterday? (昨日何食べた?), vol. 1, ch. 1, by Fumi Yoshinaga
  • [manga] Oishinbo (美味しんぼ), vol. 1, “The Secret of Dashi,” by Tetsu Kariya
  • [manga] Ametani Kantaro (さぼリーマン飴谷甘太朗), vol. 1, ch. 1, by Tensei Hagiwara (Japanese only so far)

Databases (in English)

The World Atlas of Language Structures Online (WALS)

The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors. See the entry on Japanese.

Databases (in Japanese)

Looking for Japanese data for your research?

Corpora
少納言 KOTONOHA 「現代日本語書き言葉均衡コーパス」 (Shoonagon BCCWJ: Balanced Corpus of Contemporary Written Japanese)
>> A corpus of written Japanese from various genres, totaling 100 million words. (To use it, you must first read and agree to the terms of use.)
タグ付き KYコーパス (Tagged KY Corpus)
>> Searchable L2 Japanese speech corpus