
fix(deps): update dependency transformers to v5 #317

Open

dreadnode-renovate-bot[bot] wants to merge 1 commit into main from renovate/transformers-5.x

Conversation

@dreadnode-renovate-bot (Contributor) commented Jan 28, 2026

ℹ️ Note

This PR body was truncated due to platform limits.

This PR contains the following updates:

| Package      | Change                                 |
| ------------ | -------------------------------------- |
| transformers | `>=4.41.0,<5.0.0` → `>=5.2.0,<5.3.0`   |

Release Notes

huggingface/transformers (transformers)

v5.2.0: GLM-5, Qwen3.5, Voxtral Realtime, VibeVoice Acoustic Tokenizer

Compare Source

New Model additions

VoxtralRealtime

VoxtralRealtime is a streaming speech-to-text model from Mistral AI, designed for real-time automatic speech recognition (ASR). Unlike the offline Voxtral model which processes complete audio files, VoxtralRealtime is architected for low-latency, incremental transcription by processing audio in chunks as they arrive.

The model combines an audio encoder with a Mistral-based language model decoder, using time conditioning embeddings and causal convolutions with padding caches to enable efficient streaming inference.
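
The streaming-specific classes are not spelled out in these notes, so here is a minimal sketch assuming the model is reachable through the standard automatic-speech-recognition pipeline; the checkpoint id is hypothetical:

```python
from transformers import pipeline

# A minimal sketch; the checkpoint id below is a guess, and the real
# streaming API may expose dedicated VoxtralRealtime classes instead.
asr = pipeline(
    "automatic-speech-recognition",
    model="mistralai/Voxtral-Realtime",  # hypothetical repo id
)

# chunk_length_s approximates the streaming use case by splitting long
# audio into windows that are transcribed as they arrive.
result = asr("meeting_recording.wav", chunk_length_s=10)
print(result["text"])
```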

GLM-5 - GlmMoeDsa

The zAI team launches GLM-5 and introduces it as follows:

GLM-5, targeting complex systems engineering and long-horizon agentic tasks. Scaling is still one of the most important ways to improve the intelligence efficiency of Artificial General Intelligence (AGI). Compared to GLM-4.5, GLM-5 scales from 355B parameters (32B active) to 744B parameters (40B active), and increases pre-training data from 23T to 28.5T tokens. GLM-5 also integrates DeepSeek Sparse Attention (DSA), largely reducing deployment cost while preserving long-context capacity.

Reinforcement learning aims to bridge the gap between competence and excellence in pre-trained models. However, deploying it at scale for LLMs is challenging due to RL training inefficiency. To this end, we developed slime, a novel asynchronous RL infrastructure that substantially improves training throughput and efficiency, enabling more fine-grained post-training iterations. With advances in both pre-training and post-training, GLM-5 delivers significant improvements over GLM-4.7 across a wide range of academic benchmarks and achieves best-in-class performance among all open-source models in the world on reasoning, coding, and agentic tasks, closing the gap with frontier models.
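
As a rough orientation, loading the checkpoint should follow the usual causal-LM path; the repo id below is an assumption, and a 744B-parameter MoE realistically needs multi-GPU or quantized serving:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zai-org/GLM-5"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,  # only ~40B of the 744B params are active per token
    device_map="auto",     # shard across available devices
)

inputs = tokenizer("Explain DeepSeek Sparse Attention briefly.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```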

Qwen3.5, Qwen3.5 Moe

The Qwen team launches Qwen 3.5 and introduces it as follows:

We are delighted to announce the official release of Qwen3.5, introducing the open weights of the first model in the Qwen3.5 series, namely Qwen3.5-397B-A17B. As a native vision-language model, Qwen3.5-397B-A17B demonstrates outstanding results across a full range of benchmark evaluations, including reasoning, coding, agent capabilities, and multimodal understanding, empowering developers and enterprises to achieve significantly greater productivity. Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability. We have also expanded our language and dialect support from 119 to 201, providing broader accessibility and enhanced support to users around the world.
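
Since Qwen3.5-397B-A17B is described as a native vision-language model, a plausible usage sketch goes through the image-text-to-text path; the repo id follows Qwen's naming convention but is assumed, not confirmed by these notes:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3.5-397B-A17B"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

# Multimodal chat message: one image plus a text instruction.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},
        {"type": "text", "text": "Summarize this chart."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0], skip_special_tokens=True))
```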

VibeVoice Acoustic Tokenizer

VibeVoice is a novel framework for synthesizing high-fidelity, long-form speech with multiple speakers by employing a next-token diffusion approach within a Large Language Model (LLM) structure. It's designed to capture the authentic conversational "vibe" and is particularly suited for generating audio content like podcasts and multi-participant audiobooks.

One key feature of VibeVoice is the use of two continuous audio tokenizers, one for extracting acoustic features and another for semantic features.
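
The notes do not name the tokenizer classes or a checkpoint, so the following sketch is purely illustrative: the repo id and the idea of pulling continuous latents via AutoModel are assumptions, meant only to show the "waveform to continuous acoustic latents" flow described above:

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

repo = "microsoft/VibeVoice-acoustic-tokenizer"  # hypothetical repo id

feature_extractor = AutoFeatureExtractor.from_pretrained(repo)
acoustic_tokenizer = AutoModel.from_pretrained(repo)

waveform = torch.randn(24_000)  # one second of dummy 24 kHz audio
inputs = feature_extractor(waveform.numpy(), sampling_rate=24_000, return_tensors="pt")

with torch.no_grad():
    # Continuous (not discrete) acoustic tokens, per the release notes.
    latents = acoustic_tokenizer(**inputs).last_hidden_state
print(latents.shape)
```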

Breaking changes

  • 🚨 [Attn] New attn mask interface everywhere (#​42848)
  • 🚨 Modify ModernBERT's default attention implementation to stop using FA (#​43764) - see the opt-in sketch below

🚨 This one is quite breaking for super super super old models: 🚨 🚨
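
If you relied on ModernBERT silently defaulting to FlashAttention, you can opt back in explicitly at load time. A minimal sketch using the standard attn_implementation argument with the public ModernBERT base checkpoint:

```python
from transformers import AutoModel

# Keep the old FlashAttention behavior (errors if flash-attn is not installed):
model = AutoModel.from_pretrained(
    "answerdotai/ModernBERT-base",
    attn_implementation="flash_attention_2",
)

# Or pin the portable PyTorch default explicitly:
model = AutoModel.from_pretrained(
    "answerdotai/ModernBERT-base",
    attn_implementation="sdpa",
)
```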

Bugfixes and improvements

Significant community contributions

The following contributors have made significant changes to the library over the last release:

v5.1.0: EXAONE-MoE, PP-DocLayoutV3, Youtu-LLM, GLM-OCR

Compare Source

New Model additions

EXAONE-MoE

K-EXAONE is a large-scale multilingual language model developed by LG AI Research. Built using a Mixture-of-Experts architecture, K-EXAONE features 236 billion total parameters, with 23 billion active during inference. Performance evaluations across various benchmarks demonstrate that K-EXAONE excels in reasoning, agentic capabilities, general knowledge, multilingual understanding, and long-context processing.

PP-DocLayoutV3

PP-DocLayoutV3 is a unified and high-efficiency model designed for comprehensive layout analysis. It addresses the challenges of complex physical distortions—such as skewing, curving, and adverse lighting—by integrating instance segmentation and reading order prediction into a single, end-to-end framework.

Youtu-LLM

Youtu-LLM is a new, small yet powerful LLM that contains only 1.96B parameters, supports a 128k-token context, and has native agentic abilities. On general evaluations, Youtu-LLM significantly outperforms SOTA LLMs of similar size in Commonsense, STEM, Coding, and Long Context capabilities; in agent-related testing, Youtu-LLM surpasses larger models and is truly capable of completing multiple end-to-end agent tasks.

GlmOcr

GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.
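
As a hedged usage sketch, assuming GLM-OCR is exposed through the generic image-to-text pipeline (it may instead ship a dedicated processor class) and guessing the repo id:

```python
from transformers import pipeline

# Both the task routing and the checkpoint id below are assumptions.
ocr = pipeline("image-to-text", model="zai-org/GLM-OCR")
print(ocr("scanned_invoice.png")[0]["generated_text"])
```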

Breaking changes

  • 🚨 T5Gemma2 model structure (#​43633) - Makes sure that the attn implementation is set on all sub-configs. The config.encoder.text_config was not getting its attn implementation set because it isn't passed to PreTrainedModel.__init__. Since the model structure can't change without breaking, a call to self.adjust_attn_implemetation was manually re-added in the modeling code.

  • 🚨 Generation cache preparation (#​43679) - Refactors cache initialization in generation so that sliding-window configurations are now properly respected. Previously, some models (like Afmoe) created caches without passing the model config, causing sliding-window limits to be ignored. This is breaking because models with sliding-window attention will now enforce their window size limits during generation, which may change generation behavior or require adjusting sequence lengths in existing code (see the sketch after this list).

  • 🚨 Delete duplicate code in backbone utils (#​43323) - This PR cleans up backbone utilities. Specifically, we currently have 5 different config attributes that decide which backbone to load, most of which can be merged into one and seem redundant.
    After this PR, there is only one config.backbone_config as a single source of truth. The models will load the backbone from_config and load pretrained weights only if the checkpoint has any weights saved. The overall idea is the same as in other composite models. A few config arguments are removed as a result.

  • 🚨 Refactor DETR to updated standards (#​41549) - Standardizes the DETR model to bring it closer to other vision models in the library.

  • 🚨 Fix floating-point precision in JanusImageProcessor resize (#​43187) - replaces an int() with round(); expect slight numerical differences (e.g., int(2.7) == 2 while round(2.7) == 3).

  • 🚨 Remove deprecated AnnotionFormat (#​42983) - removes a misnamed class in favour of AnnotationFormat.
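
To illustrate the generation cache preparation change above: a sketch, using a well-known sliding-window model as a stand-in (the checkpoint is illustrative, not one named by these notes), of how to check the window that is now enforced during generation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # a model that defines a sliding window
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# After v5.1, the cache is built from the model config, so this limit applies.
window = getattr(model.config, "sliding_window", None)
print(f"sliding_window = {window}")

inputs = tokenizer("Long prompt ...", return_tensors="pt").to(model.device)
# Cached keys/values older than `window` tokens are evicted, so sequences
# longer than the window may generate differently than under v5.0.
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```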

Bugfixes and improvements


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR has been generated by Renovate Bot.

dreadnode-renovate-bot bot added the area/python (Changes to Python package configuration and dependencies) and type/digest (Dependency digest updates) labels on Jan 28, 2026
dreadnode-renovate-bot bot force-pushed the renovate/transformers-5.x branch from 9f77e00 to 157a706 on February 11, 2026 00:31
| datasource | package      | from   | to    |
| ---------- | ------------ | ------ | ----- |
| pypi       | transformers | 4.57.1 | 5.2.0 |
dreadnode-renovate-bot bot force-pushed the renovate/transformers-5.x branch from 157a706 to b5dd341 on February 18, 2026 00:27