4.3.2. Other Languages

4.3.2.1. Multilingual NLP Frameworks

  • UDPipe - Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files. Written primarily in C++, it offers a fast and reliable solution for multilingual NLP.

  • NLP-Cube - Natural language processing pipeline covering sentence splitting, tokenization, lemmatization, part-of-speech tagging and dependency parsing. A new platform written in Python with DyNet 2.0, it offers standalone use (CLI or Python bindings) and server functionality (REST API).
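
Both frameworks above read or write CoNLL-U, a plain-text format with ten tab-separated columns per token and a blank line between sentences. The following is a minimal sketch of reading that format with the Python standard library only; it does not use UDPipe or NLP-Cube. The function name parse_conllu and the two-token sample are assumptions made for illustration.

```python
from typing import Dict, Iterator, List

# The ten tab-separated CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dicts."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comment such as "# text = ..."; skipped here.
            continue
        else:
            columns = line.split("\t")
            if len(columns) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(columns)}")
            sentence.append(dict(zip(FIELDS, columns)))
    if sentence:
        # Emit a final sentence that lacks a trailing blank line.
        yield sentence


# Hypothetical two-token sample written by hand for illustration.
SAMPLE = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

sentences = list(parse_conllu(SAMPLE))
for token in sentences[0]:
    print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```

All values are kept as strings, including "_" for empty fields, so that multiword-token IDs such as "1-2" pass through unchanged. A production reader would likely also keep the comment lines as sentence metadata.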

4.3.2.2. NLP in Korean

4.3.2.2.1. Korean Libraries

  • KoNLPy - Python package for Korean natural language processing.

  • Mecab (Korean) - C++ library for Korean NLP

  • KoalaNLP - Scala library for Korean natural language processing.

  • KoNLP - R package for Korean natural language processing.

4.3.2.2.3. Korean Datasets

4.3.2.3. NLP in Arabic

4.3.2.3.1. Arabic Libraries

  • goarabic - Go package for Arabic text processing

  • jsastem - JavaScript library for Arabic stemming

  • PyArabic - Python libraries for Arabic
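
A common first step in Arabic text processing is removing the optional vowel marks (tashkeel) so that vocalized and unvocalized spellings compare equal. The sketch below shows the idea with the standard library only. It is not the implementation of any library listed above; the function name strip_diacritics and the choice of the code point range U+064B to U+0652 are assumptions made for this example.

```python
# Arabic tashkeel marks: U+064B (fathatan) through U+0652 (sukun).
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0652 + 1)}


def strip_diacritics(text: str) -> str:
    """Return text with the tashkeel code points removed."""
    return "".join(ch for ch in text if ch not in TASHKEEL)


# The greeting "marhaban" with full vowel marks, written as escapes
# so that each base letter and mark is visible.
vocalized = "\u0645\u064E\u0631\u0652\u062D\u064E\u0628\u064B\u0627"
plain = strip_diacritics(vocalized)
print(plain)
```

Other marks, such as the superscript alef (U+0670) or Quranic annotation signs, fall outside this range and are left untouched; whether to strip them depends on the application.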

4.3.2.3.2. Arabic Datasets

  • Multidomain Datasets - Largest Available Multi-Domain Resources for Arabic Sentiment Analysis

  • LABR - LArge Arabic Book Reviews dataset

  • Arabic Stopwords - A list of Arabic stopwords from various resources

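4.3.2.4. NLP in Chinese

4.3.2.4.1. Chinese Word Segmentation Libraries

  • jieba - Python package for Chinese word segmentation

  • SnowNLP - Python package for Chinese NLP

  • FudanNLP - Java library for Chinese text processing (deprecated; superseded by fastNLP)

  • fastNLP - Modular and extensible NLP framework. Currently still in incubation.

  • kcws - Deep learning Chinese word segmentation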
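
Chinese text has no spaces between words, which is why segmentation is listed as its own task. To make the problem concrete, here is a minimal sketch of forward maximum matching, a textbook dictionary-based baseline. It is not the algorithm used by any library listed above, which rely on statistical or neural models. The function name forward_max_match, the max_len parameter and the toy dictionary are assumptions made for illustration.

```python
from typing import List, Set


def forward_max_match(text: str, vocab: Set[str], max_len: int = 4) -> List[str]:
    """Greedily take the longest dictionary word at each position."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; real segmenters ship far larger ones.
VOCAB = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}
tokens = forward_max_match("我喜欢自然语言处理", VOCAB)
print(tokens)  # ['我', '喜欢', '自然语言', '处理']
```

Because the match is greedy, the result depends entirely on dictionary coverage and cannot resolve ambiguous splits from context, which is the gap the statistical and deep learning segmenters aim to close.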

4.3.2.5. NLP in German

  • German-NLP - Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German

4.3.2.7. NLP in Indic Languages

4.3.2.7.1. Hindi

4.3.2.7.2. Data, Corpora and Treebanks

4.3.2.8. NLP in Thai

4.3.2.8.1. Thai Libraries

  • PyThaiNLP - Python package for Thai NLP

  • JTCC - A character cluster library in Java

  • CutKum - Word segmentation with deep learning in TensorFlow

  • Thai Language Toolkit - Based on a 2002 paper by Wirote Aroonmanakun, with an included dataset

  • SynThai - Word segmentation and POS tagging using deep learning in Python

4.3.2.8.2. Thai Data

  • Inter-BEST - A 5-million-word text corpus with word segmentation

  • Prime Minister 29 - Dataset containing speeches of the 29th Prime Minister of Thailand

4.3.2.9. NLP in Danish

4.3.2.10. NLP in Vietnamese

4.3.2.10.1. Vietnamese Libraries

  • underthesea - Vietnamese NLP Toolkit

  • vn.vitk - A Vietnamese Text Processing Toolkit

  • VnCoreNLP - A Vietnamese natural language processing toolkit

4.3.2.10.2. Vietnamese Data

4.3.2.11. NLP in Indonesian

4.3.2.11.1. Indonesian Datasets

4.3.2.11.2. Libraries and Embeddings

4.3.2.11.3. Other Languages

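  • Russian: pymorphy2 - A good POS tagger for Russian

  • Asian languages: ICU Tokenizer implementation in Elasticsearch for Thai, Lao, Chinese, Japanese and Korean

  • Ancient languages: CLTK - The Classical Language Toolkit is a Python library and collection of texts for doing NLP in ancient languages

  • Dutch: python-frog - Python binding to Frog, an NLP suite for Dutch (POS tagging, lemmatisation, dependency parsing, NER)

  • Hebrew: NLPH_Resources - A collection of papers, corpora and linguistic resources for NLP in Hebrew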