Check false matches by * in Japanese #28
Comments
The chance of false match increases when we use
require(quanteda)
#> Loading required package: quanteda
#> Package version: 1.4.4
#> Parallel computing: 2 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#>
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#>
#> View
require(newsmap)
#> Loading required package: newsmap
require(stringi)
#> Loading required package: stringi
lis <- as.list(data_dictionary_newsmap_ja, TRUE, 3) %>%
lapply(function(x) stri_replace_last_fixed(x[1], "*", ""))
# followed by kanji (country names as part of demonym)
people_fixed <- unlist(lis) %>%
paste0("人") %>%
tokens() %>%
tokens_lookup(dictionary(lis)) %>%
ntoken()
people_glob <- unlist(lis) %>%
paste0("人") %>%
tokens() %>%
tokens_lookup(data_dictionary_newsmap_ja) %>%
ntoken()
(missed_people <- names(lis)[people_glob > 0 & people_fixed == 0])
#> [1] "CD" "CG" "ST" "PM" "KG" "JP" "MP"
# followed by katakana (country names as adjectives)
team_fixed <- unlist(lis) %>%
paste0("チーム") %>%
tokens() %>%
tokens_lookup(dictionary(lis)) %>%
ntoken()
team_glob <- unlist(lis) %>%
paste0("チーム") %>%
tokens() %>%
tokens_lookup(data_dictionary_newsmap_ja) %>%
ntoken()
(missed_team <- names(lis)[team_glob > 0 & team_fixed == 0])
#> [1] "MG" "YT" "CD" "CG" "ST" "AI" "BQ" "PM" "KG" "GG" "MP" "NU" "TK"
union(missed_people, missed_team) # countries that need wildcard
#> [1] "CD" "CG" "ST" "PM" "KG" "JP" "MP" "MG" "YT" "AI" "BQ" "GG" "NU" "TK"
Interestingly, it is not only
@ClaudeGrasland here is the comparison between the new and old versions. There is a large increase for small island countries in the new version because I treated their names as phrases. The increase for Madagascar and Germany is due to wrong translations in the old version.
> diff["kh"]
kh
-0.000264131
> diff2["kh"]
kh
-0.2463005
I produced this plot in https://github.com/koheiw/newsmap/blob/issue-28/tests/misc/comapre-dictionaries.R
"タイ*" for Thailand produces a lot of false matches. For example, "タイヤ" (tire), "タイム" (time), "タイミング" (timing), "タイプ" (type), "タイトル" (title), "タイガー" (tiger).
This is a good reminder that we have to be careful about wildcards. We need to check the words for other countries too.
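As a rough illustration of the problem, the sketch below uses only base R to show which of the words listed above are caught by the glob pattern "タイ*" but not by an exact match on "タイ". It assumes each word arrives as a single token; "タイ人" is added here as an example of a match the wildcard is presumably meant to catch, and the variable names are made up for this sketch rather than taken from newsmap.

```r
# Words from the report above, plus "タイ" and "タイ人" (added for illustration).
words <- c("タイ", "タイ人", "タイヤ", "タイム", "タイミング",
           "タイプ", "タイトル", "タイガー")

# glob2rx() (utils) converts a glob pattern into an anchored regular expression.
glob_rx <- glob2rx("タイ*")

glob_hit  <- grepl(glob_rx, words)  # matched by the wildcard entry
fixed_hit <- words == "タイ"         # matched by the entry without the wildcard

# Words matched only because of the wildcard; these need manual review.
wildcard_only <- words[glob_hit & !fixed_hit]
print(wildcard_only)
```

Running the same kind of check against a list of frequent tokens from a real corpus would be one way to screen the wildcard entries for other countries.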