Databases containing Japanese and other languages spoken in Japan

The World Atlas of Language Structures Online (WALS)

“The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors.”  [Check out the entry on Japanese.]

The World Loanword Database (WOLD)

The World Loanword Database “provides vocabularies (mini-dictionaries of about 1000-2000 entries) of 41 languages from around the world, with comprehensive information about the loanword status of each word. It allows users to find loanwords, source words and donor languages in each of the 41 languages, but also makes it easy to compare loanwords across languages.” [Includes Ainu and Japanese]

Surrey Lexical Splits Database

“This database was created by the Surrey Morphology Group (University of Surrey) as part of the AHRC-funded project ‘Lexical splits: a novel perspective on the structure of words’, to illustrate the wonderful diversity we find, in languages right across the world, in how the different forms of a single word can vary.” [Includes Ainu, an indigenous language in Japan]

Compilations of Databases

Open Language Archives Community (OLAC): “provides access to a wealth of information about thousands of languages, including details of text collections, audio recordings, dictionaries, and software, sourced from dozens of digital and traditional archives.” Search for Japanese on this site.

See also the J-ling resource page: I list some datasets containing linguistics and language materials (Genki, Tobira, etc.) there. I will be moving the content here eventually.

Film transcriptions (glossed and translated, available upon request)

more to be added:

[film] My Neighbor Totoro  (Excel file)

  • Transcription (with Kanji and all in Hiragana), Roma-ji transliteration, linguistic glosses, English translations available

[film] Whisper of the Heart (Excel file)

  • Transcription (with Kanji and all in Hiragana), Roma-ji transliteration, linguistic glosses, English translations available

[drama] Hanzawa Naoki (Excel file available in Japanese)

  • Transcription (with Kanji and all in Hiragana), Roma-ji transliteration, linguistic glosses, English translations available

[animation] Peeping Life Library (short animation video series)

  • #13 cheen rokku-goshi no yoru (transcription)

Partial linguistic data available (Japanese and English) 

  • [manga] What did you eat yesterday? (昨日何食べた?), vol.1, ch 1 by Fumi Yoshinaga
  • [manga] Oishinbo (美味しんぼ), vol.1 “The secret of Dashi” by Tetsu Kariya
  • [manga] Ametani Kantaro (さぼリーマン飴谷甘太朗) (so far Japanese only), vol. 1, ch 1 by Hagiwara Tensei

Databases in Japanese

National Institute for Japanese Language and Linguistics (NINJAL)

Contemporary Japanese, Japanese Dialects and Language Diversity, History of the Japanese Language, etc.


少納言 KOTONOHA 「現代日本語書き言葉均衡コーパス」 (Shoonagon)

Balanced Corpus of Contemporary Written Japanese (BCCWJ)
“Corpus consisting of written Japanese from various genres. 100 million words.”


タグ付き KYコーパス (Tagged KY Corpus)
Searchable L2 Japanese speech corpus

Tools for Text-mining

Voyant Tools: “a web-based reading and analysis environment for digital texts”

OpenRefine: “a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data”