CMTNet: A Collaborative Mamba-Transformer Network with Spatial-Temporal Cross-Fusion for Speech Emotion Recognition
Code by: Shihe Dong, Jiajun Wei, Junfeng Zhao, Yibing Zhu, Jiayi Zhou
CMTNET_FOR_SER
├── features_extraction
│ ├── database.py
│ ├── features_util.py
│ └── run_extract_features.py
├── models
│ ├── Mamba
│ │ ├── BiMamba.py
│ │ └── Spec_Mamba.py
│ ├── transformers_encoder
│ │ ├── Cross_Attention.py
│ │ ├── Embedding.py
│ │ └── position_embedding.py
│ └── ser_model.py
├── crossval_SER.py
├── data_utils.py
├── README.md
├── requirements.txt
└── train_ser.py
Python 3.11 is recommended.
pip install -r requirements.txt
You can download the WavLM-Large model from https://huggingface.co/microsoft/wavlm-large and modify the corresponding path in ./models/ser_model.py.
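A minimal sketch of pointing the code at a locally downloaded checkpoint is shown below; the path, variable name, and loading call are illustrative assumptions, not the repository's actual code, so check ./models/ser_model.py for the real usage:

```python
# Illustrative only: load a locally downloaded WavLM-Large checkpoint.
# WAVLM_PATH is a placeholder; set it to the directory downloaded from
# https://huggingface.co/microsoft/wavlm-large and mirror whatever variable
# ser_model.py actually uses.
from transformers import WavLMModel

WAVLM_PATH = "/path/to/wavlm-large"

wavlm = WavLMModel.from_pretrained(WAVLM_PATH)
wavlm.eval()  # used as a frozen feature extractor in this sketch
```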
Run the run_extract_features.py script, modifying the parameters in the parse_arguments(argv) function to extract features from different datasets. The extracted features are saved in .pkl format for subsequent training.
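As a quick sanity check, a generated .pkl file can be opened with the standard pickle module. This is only a generic sketch; the file name below is a placeholder, and the structure of the stored object is determined by run_extract_features.py and features_util.py:

```python
# Generic sanity check for an extracted feature file (the actual structure
# depends on features_util.py; the file name here is a placeholder).
import pickle

with open("features/IEMOCAP_features.pkl", "rb") as f:
    features = pickle.load(f)

print(type(features))
if isinstance(features, dict):
    # Show a few keys to see what was stored.
    print(list(features.keys())[:10])
```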
Run the crossval_SER.py script to start training. Various training parameters can be modified within this script.
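The parameter names below are hypothetical and only illustrate the kind of settings typically adjusted for cross-validated training; the actual variables and defaults are defined inside crossval_SER.py:

```python
# Hypothetical example of training settings; the real names and defaults
# live in crossval_SER.py and may differ.
config = {
    "features_file": "features/IEMOCAP_features.pkl",  # output of feature extraction
    "num_folds": 5,          # cross-validation folds
    "batch_size": 32,
    "learning_rate": 1e-4,
    "epochs": 50,
    "seed": 42,
}
```

Training is then started from the repository root with `python crossval_SER.py`.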
If you use this code in your research, please cite our paper:
@article{dong2026cmtnet,
  title={CMTNet: A Collaborative Mamba-Transformer Network with Spatial-Temporal Cross-Fusion for Speech Emotion Recognition},
  author={Dong, Shihe and Wei, Jiajun and Zhao, Junfeng and Zhu, Yibing and Zhou, Jiayi and Shao, Zhuhong and Niu, Mingyue and Tan, Xiaohui and Jiang, Yinan and Qin, Rongyin},
  journal={Pattern Recognition},
  pages={113159},
  year={2026},
  publisher={Elsevier}
}
We would like to thank the authors of https://github.com/Vincent-ZHQ/CA-MSER for their valuable insights and inspiration.