rivertext
RiverText is an open-source library for modeling and training different incremental word embedding architectures proposed in the state-of-the-art literature.
It seeks to standardize many existing incremental word embedding algorithms into one unified framework, providing a standardized interface and facilitating the development of new methods.
RiverText provides two training paradigms:
- learn_one, which trains one instance at a time;
- learn_many, which trains a mini-batch of instances at a time.
This allows for more efficient training of text representation models from text data streams.
RiverText also provides an interface similar to the river package, so developers can use the library to train text representation models quickly and easily.
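The sketch below illustrates how the two paradigms might be used on a streaming corpus. It is only an assumption-laden example: the module paths, class and helper names (IWord2Vec, TweetStream), constructor arguments, and method signatures other than learn_one/learn_many are placeholders, not the confirmed API; consult the official documentation for the exact interface.

from torch.utils.data import DataLoader

# NOTE: the imports, class names, and arguments below are illustrative
# assumptions about the library's layout, not its documented API.
from rivertext.models import IWord2Vec   # assumed incremental word2vec model
from rivertext.utils import TweetStream  # assumed helper that streams one text per line

stream = TweetStream("tweets.txt")               # path to a text data stream
model = IWord2Vec(window_size=3, emb_size=100)   # assumed hyperparameter names

# learn_many: update the embeddings from mini-batches of instances.
for batch in DataLoader(stream, batch_size=32):
    model.learn_many(batch)

# learn_one: update the embeddings from a single incoming instance.
model.learn_one("a new tweet arriving from the stream")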
The official documentation can be found at this link.
Installation
RiverText is intended to work with Python 3.10 and above. Installation can be done with pip:
pip install rivertext
Requirements
These packages will be installed along with rivertext if they are not already installed:
- nltk
- numpy
- river
- scikit-learn
- scipy
- torch
- tqdm
- word-embeddings-benchmarks
Contributing
Development Requirements
Testing
All unit tests are located in the rivertext/tests folder. They are run with pytest as the testing framework.
To run the tests, execute:
pytest tests
To check the coverage, run:
pytest tests --cov-report xml:cov.xml --cov rivertext
and then:
coverage report -m
Build the documentation
The documentation is created using mkdocs and mkdocs-material. It can be found in the docs folder at the project root. First, you need to install:
pip install mkdocs
pip install "mkdocstrings[python]"
pip install mkdocs-material
Then, to compile the documentation, run:
mkdocs build
mkdocs serve
Changelog
Citation
Please cite the following paper if you use this package in an academic publication:
G. Iturra-Bocaz and F. Bravo-Marquez. RiverText: A Python Library for Training and Evaluating Incremental Word Embeddings from Text Data Streams. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), Taipei, Taiwan.
@inproceedings{10.1145/3539618.3591908,
  author = {Iturra-Bocaz, Gabriel and Bravo-Marquez, Felipe},
  title = {RiverText: A Python Library for Training and Evaluating Incremental Word Embeddings from Text Data Streams},
  year = {2023},
  isbn = {9781450394086},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3539618.3591908},
  doi = {10.1145/3539618.3591908},
  abstract = {Word embeddings have become essential components in various information retrieval and natural language processing tasks, such as ranking, document classification, and question answering. However, despite their widespread use, traditional word embedding models present a limitation in their static nature, which hampers their ability to adapt to the constantly evolving language patterns that emerge in sources such as social media and the web (e.g., new hashtags or brand names). To overcome this problem, incremental word embedding algorithms are introduced, capable of dynamically updating word representations in response to new language patterns and processing continuous data streams. This paper presents RiverText, a Python library for training and evaluating incremental word embeddings from text data streams. Our tool is a resource for the information retrieval and natural language processing communities that work with word embeddings in streaming scenarios, such as analyzing social media. The library implements different incremental word embedding techniques, such as Skip-gram, Continuous Bag of Words, and Word Context Matrix, in a standardized framework. In addition, it uses PyTorch as its backend for neural network training. We have implemented a module that adapts existing intrinsic static word embedding evaluation tasks for word similarity and word categorization to a streaming setting. Finally, we compare the implemented methods with different hyperparameter settings and discuss the results. Our open-source library is available at https://github.com/dccuchile/rivertext.},
  booktitle = {Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages = {3027–3036},
  numpages = {10},
  keywords = {data streams, word embeddings, incremental learning},
  location = {Taipei, Taiwan},
  series = {SIGIR '23}
}
Team
- Gabriel Iturra-Bocaz
- Felipe Bravo-Marquez
Contact
Please write to gabriel.e.iturrabocaz at uis.no for inquiries about the software. You are also welcome to make a pull request or open an issue in the rivertext repository on GitHub.
