Skip to content

Latest commit

 

History

History
109 lines (72 loc) · 3.66 KB

README.md

File metadata and controls

109 lines (72 loc) · 3.66 KB

Clojars Project

clj-mecab

Clojure wrapper for the Japanese Morphological Analyzer MeCab.

About

A minimal wrapper around the SWIG-generated Java bindings for MeCab. Currently tested with all varieties of UniDic and IPAdic, although other dictionaries are planned.

Prerequisites

clj-mecab requires you to have MeCab (0.996) installed (the mecab-config binary is used to find your MeCab configuration) and on your path.

Package manager

On Debian:

apt get install mecab mecab-utils libmecab-java libmecab-jni unidic-mecab

On MacOS:

brew install mecab mecab-unidic

Note that you will need to manually install Maven dependencies on MacOS (see next section).

Maven dependencies

You also need to have the Java JNI (SWIG) bindings for the version of MeCab you have installed on your system installed in your local Maven repository (~/.m2). This can be accomplished by:

mvn install:install-file -DgroupId=org.chasen -DartifactId=mecab -Dpackaging=jar -Dversion=0.996 -Dfile=/usr/share/java/mecab/MeCab.jar -DgeneratePom=true

Where /usr/share/java/mecab/MeCab.jar should point to the generated jar on your system.

You will also need to manually download cmecab-java and install it into your local Maven repo:

wget /~https://github.com/takscape/cmecab-java/releases/download/2.1.0/cmecab-java-2.1.0.tar.gz
tar xzf cmecab-java-2.1.0.tar.gz
mvn install:install-file -DgroupId=net.moraleboost.cmecab-java -DartifactId=cmecab-java -Dpackaging=jar -Dversion=2.1.0 -Dfile=cmecab-java-2.1.0/cmecab-java-2.1.0.jar -DgeneratePom=true

Manually building and installing MeCab

MeCab depends on CRF++, so first install that.

wget http://crfpp.googlecode.com/files/CRF%2B%2B-0.58.tar.gz
tar xzf CRF++-0.58.tar.gz
cd CRF++-0.58 && ./configure && make -j4 && make install && cd ..

Next, install MeCab.

wget http://mecab.googlecode.com/files/mecab-0.996.tar.gz
tar xzf mecab-0.996.tar.gz
cd mecab-0.996 && ./configure --with-charset=utf8 --enable-utf8-only && make -j4 && make install && cd ..

And at least one dictionary:

  • IPAdic:

    wget http://mecab.googlecode.com/files/mecab-ipadic-2.7.0-20070801.tar.gz
    tar xzf mecab-ipadic-2.7.0-20070801.tar.gz
    cd mecab-ipadic-2.7.0-20070801 && ./configure --with-charset=utf8 && make -j4 && make install && cd ..
  • UniDic:

    curl -O https://unidic.ninjal.ac.jp/unidic_archive/cwj/2.3.0/unidic-cwj-2.3.0.zip
    unzip -x unidic-cwj-2.3.0.zip
    cd unidic-cwj-2.3.0 && install -d $(mecab-config --dicdir)/unidic-cwj && install -m 644 dicrc *.bin *.dic $(mecab-config --dicdir)/unidic-cwj && cd ..

Usage

Interactive use:

(require '[clj-mecab.parse :as mecab])
(mecab/parse-sentence "こんにちは、世界!")

[{:orth "こんにちは", :f-type "*", :i-type "*", ...} {:orth "", :f-type "*", :i-type "*", ...} {:orth "世界", :f-type "*", :i-type "*", ...} ...]

BUGS

  • For some yet unknown reason, calling .getSurface on a Node object will not work (empty string) the first time, but will the second time. Currently this means that :orth is not generated when using IPAdic. UniDic provides the surface node in the features array and is unaffected. Probably same issue as taku910/mecab#26

License

Copyright © 2013-2020 Bor Hodošček

Distributed under the Eclipse Public License, the same as Clojure, as well as the 3-clause BSD license.