4.3.2. Other Languages
4.3.2.1. Multilingual NLP Frameworks
UDPipe - A trainable pipeline for tokenizing, tagging, lemmatizing, and parsing Universal Treebanks and other CoNLL-U files. Written primarily in C++, it offers a fast and reliable solution for multilingual NLP processing.
NLP-Cube - A natural language processing pipeline covering sentence splitting, tokenization, lemmatization, part-of-speech tagging, and dependency parsing. A newer platform written in Python with DyNet 2.0, it offers standalone use (CLI and Python bindings) and server functionality (REST API).
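Both frameworks above work with the CoNLL-U format, in which each token is a line of ten tab-separated columns, comment lines start with `#`, and a blank line ends a sentence. The following is a minimal sketch of reading that format with the Python standard library only. The names `parse_conllu`, `CONLLU_FIELDS`, and `SAMPLE` are illustrative assumptions and are not part of UDPipe or NLP-Cube.

```python
from typing import Dict, List

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts.

    Comment lines are skipped. Multiword-token ranges (ids like "1-2")
    and empty nodes (ids like "1.1") are kept as ordinary rows.
    """
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comment or metadata; ignored in this sketch.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        # Handle input that does not end with a blank line.
        sentences.append(current)
    return sentences


# A hand-written sample sentence, used only for illustration.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        for token in sentence:
            print(token["id"], token["form"], token["upos"], token["deprel"])
```

Every column is kept as a string here, so `head` and `id` would need explicit conversion before any tree processing.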
4.3.2.2. NLP in Korean
4.3.2.2.1. Korean Libraries
KoNLPy - Python package for Korean natural language processing.
Mecab (Korean) - C++ library for Korean NLP.
KoalaNLP - Scala library for Korean Natural Language Processing.
KoNLP - R package for Korean natural language processing.
4.3.2.2.2. Korean Blogs and Tutorials
4.3.2.2.3. Korean Datasets
KAIST Corpus - A corpus from the Korea Advanced Institute of Science and Technology in Korean.
Chosun Ilbo archive - A Korean-language dataset from the Chosun Ilbo, one of the major newspapers in South Korea.
4.3.2.3. NLP in Arabic
4.3.2.3.1. Arabic Libraries
4.3.2.3.2. Arabic Datasets
Multidomain Datasets - Largest Available Multi-Domain Resources for Arabic Sentiment Analysis
LABR - LArge Arabic Book Reviews dataset
Arabic Stopwords - A list of Arabic stopwords from various resources
4.3.2.4. NLP in Chinese
4.3.2.5. NLP in German
German-NLP - Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
4.3.2.7. NLP in Indic Languages
4.3.2.7.1. Hindi
4.3.2.7.2. Data, Corpora and Treebanks
Hindi Dependency Treebank - A multi-representational multi-layered treebank for Hindi and Urdu
-
Parallel Universal Dependencies Treebank in Hindi - A smaller part of the above-mentioned treebank.
4.3.2.8. NLP in Thai
4.3.2.8.1. Thai Libraries
4.3.2.8.2. Thai Data
Inter-BEST - A word-segmented text corpus of 5 million words
Prime Minister 29 - Dataset containing speeches of the current Prime Minister of Thailand
4.3.2.9. NLP in Danish
4.3.2.10. NLP in Vietnamese
4.3.2.10.1. Vietnamese Libraries
underthesea - Vietnamese NLP Toolkit
vn.vitk - A Vietnamese Text Processing Toolkit
VnCoreNLP - A Vietnamese natural language processing toolkit
4.3.2.10.2. Vietnamese Data
Vietnamese treebank - 10,000 sentences for the constituency parsing task
BKTreeBank - a Vietnamese Dependency Treebank
UD_Vietnamese - Vietnamese Universal Dependency Treebank
VIVOS - a free Vietnamese speech corpus consisting of 15 hours of recorded speech, by AILab
VNTQcorpus(big).txt - 1.75 million sentences from news text
4.3.2.11. NLP in Indonesian
4.3.2.11.1. Indonesian Datasets
Kompas and Tempo collections at ILPS
PANL10N for PoS tagging: 39K sentences and 900K word tokens
IDN for PoS tagging: 10K sentences and 250K word tokens
IndoSum for both text summarization and classification
Wordnet-Bahasa - large, free, semantic dictionary
4.3.2.11.2. Libraries and Embeddings
Natural language toolkit bahasa
Pre-trained Indonesian fastText text embeddings, trained on Wikipedia
4.3.2.11.3. Other Languages
Russian: pymorphy2 - a good part-of-speech tagger for Russian
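Asian languages: ICU Tokenizer implementation in ElasticSearch for Thai, Lao, Chinese, Japanese, and Korean
Ancient languages: CLTK - The Classical Language Toolkit is a Python library and collection of texts for doing NLP in ancient languages
Dutch: python-frog - Python binding to Frog, an NLP suite for Dutch (POS tagging, lemmatization, dependency parsing, NER)
Hebrew: NLPH_Resources - A collection of papers, corpora, and linguistic resources for NLP in Hebrew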