Text lookup utility for Japanese texts from Aozora Bunko Corpus

ryancahildebrandt/aozora_annotator

Aozora Bunko Text Annotator



Purpose

This project aims to cut down on the many trips back and forth between your favorite Japanese dictionary and the text you're reading. The included script parses a text from the Aozora Bunko literature corpus and looks up helpful information for terms in the text via the Jotoba API.


Usage

Once you've cloned this repo and installed the necessary Ruby gems (listed in the Gemfile), make sure you have a copy of the Aozora Bunko database file in the ./data directory. Then you can start annotating!

The easiest way to use this tool is via the command line. From the repo directory:

# to show all CLI options and arguments
ruby azb.rb -h

# to search the database and return all texts with metainfo containing "源氏物語"
ruby azb.rb -s 源氏物語

# to pull information for text 165444, perform lookups, generate annotations, and render HTML and plaintext documents to the outputs directory
ruby azb.rb -i 165444

# to run the full pipeline as described above, this time with options!
ruby azb.rb -i 165444 -c -k -f 225%

The API lookups aren't always perfect, so if you plan to use the output as a teaching aid or as instructional material, you can fine-tune the lookups by editing the JSON file after the initial lookup fetch.
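
If hand-editing the JSON gets tedious, corrections can also be applied with a few lines of Ruby. The following is only a minimal sketch: the file layout (an array of objects with "term", "reading", and "meaning" keys) and the apply_corrections helper are assumptions made for illustration, and the file that azb.rb actually writes may be structured differently.

```ruby
require "json"

# Hypothetical lookup file layout: one object per term. The real file
# written by azb.rb may use different keys or nesting.
SAMPLE_LOOKUPS = <<~JSON
  [
    {"term": "死", "reading": "し", "meaning": "death"},
    {"term": "三十三", "reading": "さんじゅうさん", "meaning": "33"}
  ]
JSON

# Return a copy of the lookups with corrections applied.
# `corrections` maps a term to the fields that should replace the fetched ones.
def apply_corrections(lookups, corrections)
  lookups.map do |entry|
    entry.merge(corrections.fetch(entry["term"], {}))
  end
end

lookups = JSON.parse(SAMPLE_LOOKUPS)
corrections = { "三十三" => { "meaning" => "thirty-three" } }
fixed = apply_corrections(lookups, corrections)

puts JSON.pretty_generate(fixed)
```

Keeping the corrections in their own hash means they can be reapplied if the lookups are fetched again.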


Dataset

The data used for this project comes from the following sources:

  • Aozora Bunko Corpus for Japanese full texts
  • Jotoba and the Jotoba API for looking up terms. Jotoba brings together information from a range of free sources, including JMDict, Tofugu, and Tatoeba, and publishes a full list of its sources
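
To give a sense of what a term lookup might involve, here is a rough sketch in Ruby. The endpoint URL, the request fields (query, language, no_english), and the response shape (words, reading, senses, glosses) are assumptions based on the public Jotoba API and should be checked against its documentation; build_lookup_request and summarize are hypothetical helpers, not functions from azb.rb. The sample response is hand-written, and the sketch does not send any network request.

```ruby
require "json"
require "net/http"
require "uri"

# Assumed endpoint; verify against the Jotoba API documentation before use.
JOTOBA_WORDS_URI = URI("https://jotoba.de/api/search/words")

# Build (but do not send) a POST request for one term.
# The payload fields are assumptions about the API, not confirmed by this README.
def build_lookup_request(term)
  request = Net::HTTP::Post.new(JOTOBA_WORDS_URI, "Content-Type" => "application/json")
  request.body = JSON.generate(query: term, language: "English", no_english: false)
  request
end

# Pull the kana reading and the glosses out of the first word in a response.
# Returns nil when the response contains no words.
def summarize(response)
  word = response.fetch("words", []).first
  return nil if word.nil?

  {
    reading: word.dig("reading", "kana"),
    glosses: word.fetch("senses", []).flat_map { |sense| sense.fetch("glosses", []) }
  }
end

# Hand-written sample in the assumed response shape, not real API output.
sample = {
  "words" => [
    { "reading" => { "kanji" => "物語", "kana" => "ものがたり" },
      "senses" => [{ "glosses" => ["story", "tale"] }] }
  ]
}

request = build_lookup_request("物語")
summary = summarize(sample)
puts summary
```

To actually send the request, it could be passed to Net::HTTP.start with use_ssl: true and the body parsed with JSON.parse before calling summarize.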

Outputs

  • Annotation format breakdown

    • Alternating
      • One term with its annotations immediately between it and the next term
      • term (annotation) term (annotation)
    • Layered
      • One sentence with all its annotations on the following line
      • sentence
      • (sentence annotations)
      • sentence
      • (sentence annotations)
    • Parallel
      • Full text with readings rendered above and meanings below, similar to the commonly used furigana annotation style
      • (sentence readings)
      • sentence
      • (sentence meanings)
    • Side by side
      • One sentence with all its annotations displayed on the right of the page
      • sentence || (sentence annotations)
      • sentence || (sentence annotations)
  • Example outputs, generated from 三十三の死 by しづ素木:
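
Separately from the project's own example outputs, the four layouts in the format breakdown above can be sketched as plain-text renderers. This is an illustrative sketch only: the Term struct, the sample data, and the separators used inside the parentheses are assumptions, and the real script renders HTML and plaintext in its own way.

```ruby
# Hypothetical data model: one sentence is a list of terms, and terms
# without a lookup have a nil reading and meaning.
Term = Struct.new(:text, :reading, :meaning)

SENTENCE = [
  Term.new("三十三", "さんじゅうさん", "33"),
  Term.new("の", nil, nil),
  Term.new("死", "し", "death")
].freeze

# The sentence text with no annotations.
def plain(terms)
  terms.map(&:text).join
end

# Only the terms that have a lookup result.
def looked_up(terms)
  terms.select(&:reading)
end

def note(term)
  "#{term.reading}: #{term.meaning}"
end

# Alternating: term (annotation) term (annotation)
def alternating(terms)
  terms.map { |t| t.reading ? "#{t.text} (#{note(t)})" : t.text }.join(" ")
end

# Layered: the sentence, then all of its annotations on the following line
def layered(terms)
  notes = looked_up(terms).map { |t| "#{t.text} = #{note(t)}" }.join("; ")
  "#{plain(terms)}\n(#{notes})"
end

# Parallel: readings above the sentence, meanings below it
def parallel(terms)
  readings = looked_up(terms).map(&:reading).join(" ")
  meanings = looked_up(terms).map(&:meaning).join("; ")
  "(#{readings})\n#{plain(terms)}\n(#{meanings})"
end

# Side by side: sentence || (sentence annotations)
def side_by_side(terms)
  sentence, notes = layered(terms).split("\n", 2)
  "#{sentence} || #{notes}"
end

[:alternating, :layered, :parallel, :side_by_side].each do |format|
  puts "--- #{format}"
  puts send(format, SENTENCE)
end
```

Because all four renderers take the same list of terms, the lookups only need to be fetched once per text, whichever layout is chosen.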
