4.3.2. Other Languages

4.3.2.1. Multilingual NLP Frameworks

UDPipe is a trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files. It is written primarily in C++ and offers a fast and reliable solution for multilingual NLP.
NLP-Cube: Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing. A newer platform written in Python with Dynet 2.0. It offers standalone (CLI/Python bindings) and server functionality (REST API).
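Both frameworks above read or write CoNLL-U, a plain-text format with one token per line and ten tab-separated columns. The sketch below is a minimal standard-library reader for that layout, not the parser that UDPipe or NLP-Cube ships. The `parse_conllu` helper and the two-token `SAMPLE` sentence are hypothetical and written only for this illustration. It skips comment lines and does not treat multiword-token ranges or empty nodes specially.

```python
from typing import Dict, List

# The ten tab-separated CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

# Hypothetical two-token sample, made up for this illustration only.
SAMPLE = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are ignored here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        # Handle input that does not end with a blank line.
        sentences.append(current)
    return sentences


sentences = parse_conllu(SAMPLE)
for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"], token["head"], token["deprel"])
```

All values are kept as strings, so `head` must be converted with `int()` before it is used as an index into the sentence.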

4.3.2.2. NLP in Korean

4.3.2.2.1. Korean Libraries

KoNLPy - Python package for Korean natural language processing
Mecab (Korean) - C++ library for Korean NLP
KoalaNLP - Scala library for Korean natural language processing
KoNLP - R package for Korean natural language processing

4.3.2.2.2. Korean Blogs and Tutorials

4.3.2.2.3. Korean Datasets

KAIST Corpus - A corpus from the Korea Advanced Institute of Science and Technology in Korean.
Naver Sentiment Movie Corpus in Korean
Chosun Ilbo archive - dataset in Korean from one of the major newspapers in South Korea, the Chosun Ilbo.

4.3.2.3. NLP in Arabic

4.3.2.3.1. Arabic Libraries

goarabic - Go package for Arabic text processing
jsastem - JavaScript for Arabic stemming
PyArabic - Python libraries for Arabic

4.3.2.3.2. Arabic Datasets

Multidomain Datasets - Largest Available Multi-Domain Resources for Arabic Sentiment Analysis
LABR - LArge Arabic Book Reviews dataset
Arabic Stopwords - A list of Arabic stopwords from various resources

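4.3.2.4. NLP in Chinese

4.3.2.4.1. Chinese Word Segmentation Libraries

jieba - Python package for Chinese word segmentation utilities
SnowNLP - Python package for Chinese NLP
FudanNLP - Java library for Chinese text processing (deprecated, superseded by fastNLP)
fastNLP - Modular and extensible NLP framework. Currently still in incubation.
kcws - Deep learning Chinese word segmentation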
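Chinese text is written without spaces between words, which is why the libraries above focus on segmentation. As a rough illustration of the task, here is a toy forward maximum matching segmenter that uses only the standard library. It is a deliberately simple dictionary-based baseline and is not the algorithm used by jieba, kcws, or any other library listed here. The `forward_max_match` function and the `TOY_VOCAB` dictionary are hypothetical names introduced for this sketch.

```python
from typing import List, Set


def forward_max_match(text: str, vocab: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, made up for this illustration only.
TOY_VOCAB = {"我们", "喜欢", "自然", "语言", "自然语言", "处理"}

tokens = forward_max_match("我们喜欢自然语言处理", TOY_VOCAB)
print(tokens)
```

Greedy matching picks `自然语言` over the shorter `自然` because it tries longer candidates first. It cannot resolve ambiguous boundaries or handle words missing from the dictionary, which is the gap that statistical and neural segmenters aim to close.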

4.3.2.5. NLP in German

German-NLP - Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German

4.3.2.6. NLP in Spanish

4.3.2.6.1. Data

4.3.2.7. NLP in Indic Languages

4.3.2.7.1. Hindi

4.3.2.7.2. Data, Corpora and Treebanks

Hindi Dependency Treebank - A multi-representational multi-layered treebank for Hindi and Urdu
Universal Dependencies Treebank in Hindi
- Parallel Universal Dependencies Treebank in Hindi - A smaller part of the above-mentioned treebank.

4.3.2.8. NLP in Thai

4.3.2.8.1. Thai Libraries

PyThaiNLP - Thai NLP in Python Package
JTCC - A character cluster library in Java
CutKum - Word segmentation with deep learning in TensorFlow
Thai Language Toolkit - Based on a paper by Wirote Aroonmanakun in 2002 with included dataset
SynThai - Word segmentation and POS tagging using deep learning in Python

4.3.2.8.2. Thai Data

Inter-BEST - A text corpus with 5 million words with word segmentation
Prime Minister 29 - Dataset containing speeches of the current Prime Minister of Thailand

4.3.2.9. NLP in Danish

Named Entity Recognition for Danish

4.3.2.10. NLP in Vietnamese

4.3.2.10.1. Vietnamese Libraries

underthesea - Vietnamese NLP Toolkit
vn.vitk - A Vietnamese Text Processing Toolkit
VnCoreNLP - A Vietnamese natural language processing toolkit

4.3.2.10.2. Vietnamese Data

Vietnamese treebank - 10,000 sentences for the constituency parsing task
BKTreeBank - a Vietnamese Dependency Treebank
UD_Vietnamese - Vietnamese Universal Dependency Treebank
VIVOS - a free Vietnamese speech corpus consisting of 15 hours of recorded speech by AILab
VNTQcorpus(big).txt - 1.75 million sentences from news

4.3.2.11. NLP in Indonesian

4.3.2.11.1. Indonesian Datasets

Kompas and Tempo collections at ILPS
PANL10N for PoS tagging: 39K sentences and 900K word tokens
IDN for PoS tagging: This corpus contains 10K sentences and 250K word tokens
Indonesian Treebank and Universal Dependencies-Indonesian
IndoSum for both text summarization and classification
Wordnet-Bahasa - large, free, semantic dictionary

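4.3.2.11.2. Libraries and Embeddings

Natural language toolkit bahasa
Indonesian word embeddings
Pretrained Indonesian fastText text embeddings trained on Wikipedia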

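4.3.2.11.3. Other Languages

Russian: pymorphy2 - a good part-of-speech tagger for Russian
Asian Languages: Thai, Lao, Chinese, Japanese, and Korean ICU Tokenizer implementation in ElasticSearch
Ancient Languages: CLTK - The Classical Language Toolkit is a Python library and collection of texts for doing NLP in ancient languages
Dutch: python-frog - Python binding to Frog, an NLP suite for Dutch (PoS tagging, lemmatisation, dependency parsing, NER)
Hebrew: NLPH_Resources - A collection of papers, corpora and linguistic resources for NLP in Hebrew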