Explainable Multimodal Fusion for Dementia Detection From Text and Speech
This repository contains the code for the TSD 2024 paper "Explainable Multimodal Fusion for Dementia Detection From Text and Speech". We tackle the multimodal dementia detection problem using both text and speech features.
Overview
We present the results of applying cross-modal attention to the dementia detection problem, and we also dissect the explainability of the text and speech modalities.
After converting the audio to log-Mel spectrograms, we encode the spectrograms with a Vision Transformer (ViT). The corresponding transcripts are encoded with RoBERTa.
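As a rough illustration of this pipeline, the sketch below converts a waveform to a log-Mel spectrogram, encodes it with a pretrained ViT, encodes the transcript with RoBERTa, and fuses the two representations with a single cross-attention layer. The checkpoint names, spectrogram parameters, input file, example sentence, and fusion direction (text tokens querying spectrogram patches) are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch only: checkpoints, spectrogram settings, and the
# text-queries-audio fusion direction are assumptions, not the exact setup.
import torch
import torch.nn as nn
import torchaudio
from transformers import RobertaTokenizer, RobertaModel, ViTModel

# 1) Audio -> log-Mel spectrogram ("sample.wav" is a hypothetical input file)
waveform, sr = torchaudio.load("sample.wav")
waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=128)(waveform)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel)  # (1, n_mels, time)

# 2) Shape the spectrogram like an image ViT accepts: (1, 3, 224, 224)
pixel_values = torch.nn.functional.interpolate(
    log_mel.unsqueeze(0), size=(224, 224), mode="bilinear", align_corners=False
)
pixel_values = pixel_values.repeat(1, 3, 1, 1)  # replicate mono channel to RGB

vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
audio_h = vit(pixel_values=pixel_values).last_hidden_state  # (1, 197, 768)

# 3) Transcript -> RoBERTa token embeddings
tok = RobertaTokenizer.from_pretrained("roberta-base")
roberta = RobertaModel.from_pretrained("roberta-base")
inputs = tok("uh the boy is um taking the cookies", return_tensors="pt")
text_h = roberta(**inputs).last_hidden_state  # (1, seq_len, 768)

# 4) Cross-attention fusion: text tokens attend over spectrogram patches
cross_attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
fused, _ = cross_attn(query=text_h, key=audio_h, value=audio_h)

# 5) Pool and classify (AD vs. control); untrained head, for shape only
logits = nn.Linear(768, 2)(fused.mean(dim=1))
```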
Setup
For both architectures, cd into the corresponding directory and run run.py.
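For example (the directory name below is a placeholder for whichever architecture you want to run):

```
cd <architecture-directory>   # placeholder: replace with the model's folder
python run.py
```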
Explainability
For the text explainability part, we used good old LIME on RoBERTa. For spectrogram explainability, we used the attention rollout method.
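For reference, a minimal LIME setup for a binary text classifier looks roughly like the following; the predict_proba wrapper around a fine-tuned RoBERTa classifier, the class names, and the example sentence are assumptions about how the model exposes its probabilities, not code from this repo.

```python
# Minimal LIME sketch for a binary text classifier; the predict_proba
# wrapper, class names, and example text are illustrative assumptions.
import torch
from lime.lime_text import LimeTextExplainer
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tok = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
model.eval()

def predict_proba(texts):
    """LIME passes in perturbed strings; return an (n, 2) probability array."""
    enc = tok(list(texts), return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1).numpy()

explainer = LimeTextExplainer(class_names=["control", "AD"])
exp = explainer.explain_instance("uh the the boy is um taking cookies",
                                 predict_proba, num_features=10)
print(exp.as_list())  # (token, weight) pairs; the sign indicates the class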
At a Glance
In the LIME results below, blue marks tokens indicative of the control group, while orange marks tokens used mainly by AD patients. The transcript and spectrogram shown belong to a control-group patient: the LIME visualization highlights repeated words and key content words, and ViT attends to both the voiced and the silent segments of the audio.
Citation
@InProceedings{10.1007/978-3-031-70566-3_21,
author="Altinok, Duygu",
editor="N{\"o}th, Elmar
and Hor{\'a}k, Ale{\v{s}}
and Sojka, Petr",
title="Explainable Multimodal Fusion for Dementia Detection From Text and Speech",
booktitle="Text, Speech, and Dialogue",
year="2024",
publisher="Springer Nature Switzerland",
address="Cham",
pages="236--251",
abstract="Alzheimer's dementia (AD) has significant negative impacts on patients, their families, and society as a whole, both psychologically and economically. Recent research has explored combining speech and transcript modalities to leverage linguistic and acoustic features. However, many existing multimodal studies simply combine speech and text representations, use majority voting, or average predictions from separately trained text and speech models. To overcome these limitations, our article focuses on explainability and investigates the fusion of speech and text modalities using cross-attention. We convert audio to Log-Mel spectrograms and utilize text and image transformers (RoBERTa and ViT) for processing transcripts and spectrograms, respectively. By incorporating a cross-attention layer, we analyze the impact on accuracy. Our multimodal fusion model achieves 90.01{\%} accuracy on the ADReSS Challenge dataset. Additionally, we explore the explainability of both modalities through transformer visualization techniques and an analysis of the vocabulary used by dementia and non-dementia classes.",
isbn="978-3-031-70566-3"
}
