Other Languages
===============

Multilingual NLP Frameworks
---------------------------

-  `UDPipe <https://github.com/ufal/udpipe>`__ is a trainable pipeline
   for tokenizing, tagging, lemmatizing and parsing Universal Treebanks
   and other CoNLL-U files. Written primarily in C++, it offers a fast
   and reliable solution for multilingual NLP.
-  `NLP-Cube <https://github.com/adobe/NLP-Cube>`__ - Natural language
   processing pipeline: sentence splitting, tokenization,
   lemmatization, part-of-speech tagging and dependency parsing. A new
   platform, written in Python with Dynet 2.0. Offers standalone
   (CLI/Python bindings) and server (REST API) functionality.

NLP in Korean
-------------

Korean Libraries
~~~~~~~~~~~~~~~~

-  `KoNLPy <http://konlpy.org>`__ - Python package for Korean natural
   language processing.
-  `Mecab (Korean) <https://eunjeon.blogspot.com/>`__ - C++ library for
   Korean NLP
-  `KoalaNLP <https://koalanlp.github.io/koalanlp/>`__ - Scala library
   for Korean Natural Language Processing.
-  `KoNLP <https://cran.r-project.org/web/packages/KoNLP/index.html>`__
   - R package for Korean natural language processing

Korean Blogs and Tutorials
~~~~~~~~~~~~~~~~~~~~~~~~~~

-  `dsindex’s blog <https://dsindex.github.io/>`__
-  `Kangwon University’s NLP course in
   Korean <http://cs.kangwon.ac.kr/~leeck/NLP/>`__

Korean Datasets
~~~~~~~~~~~~~~~

-  `KAIST
   Corpus <http://semanticweb.kaist.ac.kr/home/index.php/KAIST_Corpus>`__
   - A corpus from the Korea Advanced Institute of Science and
   Technology in Korean.
-  `Naver Sentiment Movie Corpus in
   Korean <https://github.com/e9t/nsmc/>`__
-  `Chosun Ilbo archive <http://srchdb1.chosun.com/pdf/i_archive/>`__ -
   dataset in Korean from one of the major newspapers in South Korea,
   the Chosun Ilbo.

NLP in Arabic
-------------

Arabic Libraries
~~~~~~~~~~~~~~~~

-  `goarabic <https://github.com/01walid/goarabic>`__ - Go package for
   Arabic text processing
-  `jsastem <https://github.com/ejtaal/jsastem>`__ - JavaScript for
   Arabic stemming
-  `PyArabic <https://pypi.org/project/PyArabic/>`__ - Python libraries
   for Arabic

Arabic Datasets
~~~~~~~~~~~~~~~

-  `Multidomain
   Datasets <https://github.com/hadyelsahar/large-arabic-sentiment-analysis-resouces>`__
   - Largest Available Multi-Domain Resources for Arabic Sentiment
   Analysis
-  `LABR <https://github.com/mohamedadaly/labr>`__ - LArge Arabic Book
   Reviews dataset
-  `Arabic Stopwords <https://github.com/mohataher/arabic-stop-words>`__
   - A list of Arabic stopwords from various resources

NLP in Chinese
--------------

Chinese Word Segmentation Libraries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  `jieba <https://github.com/fxsjy/jieba#jieba-1>`__ - Python package
   for Chinese word segmentation utilities
-  `SnowNLP <https://github.com/isnowfy/snownlp>`__ - Python package for
   Chinese NLP
-  `FudanNLP <https://github.com/FudanNLP/fnlp>`__ - Java library for
   Chinese text processing (deprecated in favor of fastNLP)
-  `fastNLP <https://github.com/fastnlp/fastNLP>`__ - Modular and
   extensible NLP framework. Currently still in incubation.
-  `kcws <https://github.com/koth/kcws>`__ - Chinese word segmentation
   with deep learning

NLP in German
-------------

-  `German-NLP <https://github.com/adbar/German-NLP>`__ - Curated list
   of open-access/open-source/off-the-shelf resources and tools
   developed with a particular focus on German

NLP in Spanish
--------------

Data
~~~~

-  `Colombian Political Speeches <https://github.com/dav009/LatinamericanTextResources>`__
-  `Copenhagen Treebank <https://mbkromann.github.io/copenhagen-dependency-treebank/>`__
-  `Spanish Billion Words Corpus with Word2Vec
   embeddings <https://github.com/crscardellino/sbwce>`__

NLP in Indic Languages
----------------------

Hindi
~~~~~

Data, Corpora and Treebanks
~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  `Hindi Dependency Treebank <https://ltrc.iiit.ac.in/treebank_H2014/>`__ - A
   multi-representational multi-layered treebank for Hindi and Urdu
-  `Universal Dependencies Treebank in Hindi <https://universaldependencies.org/treebanks/hi_hdtb/index.html>`__

   -  `Parallel Universal Dependencies Treebank in Hindi <http://universaldependencies.org/treebanks/hi_pud/index.html>`__
      - A smaller part of the above-mentioned treebank.

NLP in Thai
-----------

Thai Libraries
~~~~~~~~~~~~~~

-  `PyThaiNLP <https://github.com/PyThaiNLP/pythainlp>`__ - Thai NLP
   package for Python
-  `JTCC <https://github.com/wittawatj/jtcc>`__ - A character cluster
   library in Java
-  `CutKum <https://github.com/pucktada/cutkum>`__ - Word segmentation
   with deep learning in TensorFlow
-  `Thai Language Toolkit <https://pypi.python.org/pypi/tltk/>`__ -
   Based on a paper by Wirote Aroonmanakun in 2002, with included
   dataset
-  `SynThai <https://github.com/KenjiroAI/SynThai>`__ - Word
   segmentation and POS tagging using deep learning in Python

Thai Data
~~~~~~~~~

-  `Inter-BEST <https://www.nectec.or.th/corpus/index.php?league=pm>`__
   - A 5-million-word text corpus with word segmentation
-  `Prime Minister
   29 <https://github.com/PyThaiNLP/lexicon-thai/tree/master/thai-corpus/Prime%20Minister%2029>`__
   - Dataset containing speeches of the current Prime Minister of
   Thailand

NLP in Danish
-------------

-  `Named Entity Recognition for Danish <https://github.com/ITUnlp/daner>`__

NLP in Vietnamese
-----------------

Vietnamese Libraries
~~~~~~~~~~~~~~~~~~~~

-  `underthesea <https://github.com/undertheseanlp/underthesea>`__ -
   Vietnamese NLP Toolkit
-  `vn.vitk <https://github.com/phuonglh/vn.vitk>`__ - A Vietnamese Text
   Processing Toolkit
-  `VnCoreNLP <https://github.com/vncorenlp/VnCoreNLP>`__ - A Vietnamese
   natural language processing toolkit

Vietnamese Data
~~~~~~~~~~~~~~~

-  `Vietnamese
   treebank <https://vlsp.hpda.vn/demo/?page=resources&lang=en>`__ -
   10,000 sentences for the constituency parsing task
-  `BKTreeBank <https://arxiv.org/pdf/1710.05519.pdf>`__ - a Vietnamese
   Dependency Treebank
-  `UD_Vietnamese <https://github.com/UniversalDependencies/UD_Vietnamese-VTB>`__
   - Vietnamese Universal Dependency Treebank
-  `VIVOS <https://ailab.hcmus.edu.vn/vivos/>`__ - a free Vietnamese
   speech corpus consisting of 15 hours of recorded speech, by AILab
-  `VNTQcorpus(big).txt <http://viet.jnlp.org/download-du-lieu-tu-vung-corpus>`__
   - 1.75 million sentences from news

NLP in Indonesian
-----------------

Indonesian Datasets
~~~~~~~~~~~~~~~~~~~

-  Kompas and Tempo collections at
   `ILPS <http://ilps.science.uva.nl/resources/bahasa/>`__
-  `PANL10N for PoS
   tagging <http://www.panl10n.net/english/outputs/Indonesia/UI/0802/UI-1M-tagged.zip>`__:
   39K sentences and 900K word tokens
-  `IDN for PoS
   tagging <https://github.com/famrashel/idn-tagged-corpus>`__: This
   corpus contains 10K sentences and 250K word tokens
-  `Indonesian Treebank <https://github.com/famrashel/idn-treebank>`__
   and `Universal
   Dependencies-Indonesian <https://github.com/UniversalDependencies/UD_Indonesian-GSD>`__
-  `IndoSum <https://github.com/kata-ai/indosum>`__ for both text
   summarization and classification
-  `Wordnet-Bahasa <http://wn-msa.sourceforge.net/>`__ - large, free,
   semantic dictionary

Libraries and Embeddings
~~~~~~~~~~~~~~~~~~~~~~~~

-  Natural language toolkit `bahasa <https://github.com/kangfend/bahasa>`__
-  `Indonesian word embedding <https://github.com/galuhsahid/indonesian-word-embedding>`__
-  Pretrained `Indonesian fastText text embedding <https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.id.zip>`__
   trained on Wikipedia

.. _其他语言-1:

Other Languages
~~~~~~~~~~~~~~~

-  Russian: `pymorphy2 <https://github.com/kmike/pymorphy2>`__ - a good
   POS tagger for Russian
-  Asian languages: Thai, Lao, Chinese, Japanese, and Korean `ICU
   Tokenizer <https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-tokenizer.html>`__
   implementation in Elasticsearch
-  Ancient languages: `CLTK <https://github.com/cltk/cltk>`__: The
   Classical Language Toolkit is a Python library and collection of
   texts for doing NLP in ancient languages
-  Dutch: `python-frog <https://github.com/proycon/python-frog>`__ -
   Python binding to Frog, an NLP suite for Dutch (POS tagging,
   lemmatisation, dependency parsing, NER)
-  Hebrew: `NLPH_Resources <https://github.com/NLPH/NLPH_Resources>`__
   - A collection of papers, corpora and linguistic resources for NLP
   in Hebrew