Other Languages
===============

Multilingual NLP Frameworks
---------------------------

- `UDPipe `__ is a trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files. Written primarily in C++, it offers a fast and reliable solution for multilingual NLP processing.
- `NLP-Cube `__: a natural language processing pipeline covering sentence splitting, tokenization, lemmatization, part-of-speech tagging and dependency parsing. A newer platform, written in Python with DyNet 2.0. Offers standalone (CLI/Python bindings) and server (REST API) functionality.

NLP in Korean
-------------

Korean Libraries
~~~~~~~~~~~~~~~~

- `KoNLPy `__ - Python package for Korean natural language processing.
- `Mecab (Korean) `__ - C++ library for Korean NLP.
- `KoalaNLP `__ - Scala library for Korean natural language processing.
- `KoNLP `__ - R package for Korean natural language processing.

Korean Blogs and Tutorials
~~~~~~~~~~~~~~~~~~~~~~~~~~

- `dsindex’s blog `__
- `Kangwon University’s NLP course in Korean `__

Korean Datasets
~~~~~~~~~~~~~~~

- `KAIST Corpus `__ - A corpus in Korean from the Korea Advanced Institute of Science and Technology.
- `Naver Sentiment Movie Corpus in Korean `__
- `Chosun Ilbo archive `__ - Dataset in Korean from the Chosun Ilbo, one of the major newspapers in South Korea.

NLP in Arabic
-------------

Arabic Libraries
~~~~~~~~~~~~~~~~

- `goarabic `__ - Go package for Arabic text processing
- `jsastem `__ - JavaScript library for Arabic stemming
- `PyArabic `__ - Python libraries for Arabic

Arabic Datasets
~~~~~~~~~~~~~~~

- `Multidomain Datasets `__ - Largest available multi-domain resources for Arabic sentiment analysis
- `LABR `__ - LArge Arabic Book Reviews dataset
- `Arabic Stopwords `__ - A list of Arabic stopwords from various resources

NLP in Chinese
--------------

Chinese Word Segmentation Libraries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- `jieba `__ - Python package for Chinese word segmentation utilities
- `SnowNLP `__ - Python package for Chinese NLP
- `FudanNLP `__ - Java library for Chinese text processing (deprecated in favor of fastNLP)
- `fastNLP `__ - A modular and extensible NLP framework. Currently still in incubation.
- `kcws `__ - Chinese word segmentation with deep learning

NLP in German
-------------

- `German-NLP `__ - Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German

NLP in Spanish
--------------

Data
~~~~

- `Colombian Political Speeches `__
- `Copenhagen Treebank `__
- `Spanish Billion Words Corpus with Word2Vec embeddings `__

NLP in Indic Languages
----------------------

Hindi
~~~~~

Data, Corpora and Treebanks
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- `Hindi Dependency Treebank `__ - A multi-representational multi-layered treebank for Hindi and Urdu
- `Universal Dependencies Treebank in Hindi `__
- `Parallel Universal Dependencies Treebank in Hindi `__ - A smaller part of the above-mentioned treebank.

NLP in Thai
-----------

Thai Libraries
~~~~~~~~~~~~~~

- `PyThaiNLP `__ - Thai NLP package in Python
- `JTCC `__ - A character cluster library in Java
- `CutKum `__ - Word segmentation with deep learning in TensorFlow
- `Thai Language Toolkit `__ - Based on a 2002 paper by Wirote Aroonmanakun, with dataset included
- `SynThai `__ - Word segmentation and POS tagging using deep learning in Python

Thai Data
~~~~~~~~~

- `Inter-BEST `__ - A text corpus of 5 million words with word segmentation
- `Prime Minister 29 `__ - Dataset containing speeches of the current Prime Minister of Thailand

NLP in Danish
-------------

- `Named Entity Recognition for Danish `__

NLP in Vietnamese
-----------------

Vietnamese Libraries
~~~~~~~~~~~~~~~~~~~~

- `underthesea `__ - Vietnamese NLP toolkit
- `vn.vitk `__ - A Vietnamese text processing toolkit
- `VnCoreNLP `__ - A Vietnamese natural language processing toolkit

Vietnamese Data
~~~~~~~~~~~~~~~

- `Vietnamese treebank `__ - 10,000 sentences for the constituency parsing task
- `BKTreeBank `__ - A Vietnamese dependency treebank
- `UD_Vietnamese `__ - Vietnamese Universal Dependency treebank
- `VIVOS `__ - A free Vietnamese speech corpus consisting of 15 hours of recorded speech, by AILab
- `VNTQcorpus(big).txt `__ - 1.75 million sentences from news

NLP in Indonesian
-----------------

Indonesian Datasets
~~~~~~~~~~~~~~~~~~~

- Kompas and Tempo collections at `ILPS `__
- `PANL10N for PoS tagging `__: 39K sentences and 900K word tokens
- `IDN for PoS tagging `__: This corpus contains 10K sentences and 250K word tokens
- `Indonesian Treebank `__ and `Universal Dependencies-Indonesian `__
- `IndoSum `__ for both text summarization and classification
- `Wordnet-Bahasa `__ - Large, free, semantic dictionary

Libraries and Embeddings
~~~~~~~~~~~~~~~~~~~~~~~~

- Natural language toolkit `bahasa `__
- `Indonesian word embeddings `__
- Pretrained `Indonesian fastText text embeddings `__ trained on Wikipedia

.. _其他语言-1:

Other Languages
~~~~~~~~~~~~~~~

- Russian: `pymorphy2 `__ - A good POS tagger for Russian
- Asian languages: Thai, Lao, Chinese, Japanese, and Korean `ICU Tokenizer `__ implementation in ElasticSearch
- Ancient languages: `CLTK `__: The Classical Language Toolkit is a Python library and collection of texts for doing NLP in ancient languages
- Dutch: `python-frog `__ - Python binding to Frog, an NLP suite for Dutch (POS tagging, lemmatisation, dependency parsing, NER)
- Hebrew: `NLPH_Resources `__ - A collection of papers, corpora and linguistic resources for NLP in Hebrew