Lists and databases
See also the J-ling resource page >> Databases
I list some datasets containing pedagogical materials (Genki, Tobira, etc.) there. Excel files are available upon request.
Text-mining
Voyant Tools: “a web-based reading and analysis environment for digital texts”
OpenRefine: “a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data”
Film transcriptions (glossed and translated, available upon request)
More to be added.
[film] My Neighbor Totoro (Excel file)
- Transcription (in kanji and in all hiragana), romaji transliteration, linguistic glosses, and English translations available
[film] Whisper of the Heart (Excel file)
- Transcription (in kanji and in all hiragana), romaji transliteration, linguistic glosses, and English translations available
[drama] Hanzawa Naoki (Excel file available in Japanese)
- Transcription (in kanji and in all hiragana), romaji transliteration, linguistic glosses, and English translations available
[animation] Peeping Life Library (short animation video series)
- #13 cheen rokku-goshi no yoru (transcription)
Partial linguistic data available (Japanese and English)
- [manga] What Did You Eat Yesterday? (きのう何食べた?), vol. 1, ch. 1, by Fumi Yoshinaga
- [manga] Oishinbo (美味しんぼ), vol. 1, “The secret of Dashi,” by Tetsu Kariya
- [manga] Ametani Kantaro (さぼリーマン飴谷甘太朗), vol. 1, ch. 1, by Tensei Hagiwara (so far Japanese only)
Databases (in English)
The World Atlas of Language Structures Online (WALS)
The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors. See the entry on Japanese.
Databases (in Japanese)
Looking for Japanese data for your research?
- Research institution: National Institute for Japanese Language and Linguistics (NINJAL)
Corpora
少納言 KOTONOHA 「現代日本語書き言葉均衡コーパス」 (Shoonagon, BCCWJ: Balanced Corpus of Contemporary Written Japanese)
>> A corpus of written Japanese from various genres, about 100 million words. (To use it, read and “agree” to the terms of use.)
タグ付き KYコーパス (Tagged KY Corpus)
>> Searchable L2 Japanese speech corpus