
Xiaohongshu's Open-Source Multimodal Model Challenges Industry Leaders

Chinese social media platform Xiaohongshu has entered the AI race with the release of dots.vlm1, its first self-developed multimodal large model. The open-source system pairs a 1.2B-parameter NaViT visual encoder with the DeepSeek V3 large language model, achieving performance comparable to proprietary models such as Google's Gemini 2.5 Pro.


Native Architecture Breaks New Ground

The model's standout feature is its completely self-developed architecture, trained from scratch rather than fine-tuned from existing models. The NaViT encoder supports dynamic resolution processing, allowing superior handling of real-world image variability. Through dual supervision combining pure visual and text-visual training, the system demonstrates exceptional capability with non-standard content including:

  • Tables and charts
  • Mathematical formulas
  • Document structures
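The dynamic-resolution idea behind NaViT-style encoders can be sketched in a few lines: instead of resizing every image to one fixed square, each image is split into however many fixed-size patches its native dimensions allow, producing variable-length token sequences that are packed together at training time. The sketch below is illustrative only, not dots.vlm1's actual code; the 16-pixel patch size and the crop-to-multiple strategy are assumptions.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an H x W x C image into a variable-length sequence of
    flattened patch vectors at its native resolution (NaViT-style:
    no resizing to a fixed square input)."""
    h, w, c = image.shape
    # Crop to a multiple of the patch size rather than distorting the image.
    h, w = h - h % patch, w - w % patch
    image = image[:h, :w]
    # Rearrange into (num_patches, patch*patch*C).
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch * patch * c)

# Images of different resolutions yield different sequence lengths,
# which a NaViT-style encoder packs into one batch instead of padding
# everything to a fixed grid.
wide = patchify(np.zeros((224, 448, 3)))  # 14 * 28 = 392 patches
tall = patchify(np.zeros((336, 224, 3)))  # 21 * 14 = 294 patches
print(wide.shape, tall.shape)
```

Preserving native aspect ratios in this way is what lets the encoder handle tables, formulas, and documents whose layout would be distorted by square resizing.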

"We rebuilt our entire training pipeline," explained the Hi Lab team. "From data collection using our dots.ocr tool for PDF processing to manual rewriting of web-sourced text, every component was optimized for cross-modal understanding."

Benchmark Performance Analysis

In testing across international evaluation sets, dots.vlm1 delivers strong results.

The model particularly shines in complex analytical tasks, solving Olympiad-level math problems and demonstrating strong STEM reasoning. While it trails slightly in advanced textual reasoning, its mathematical and coding performance matches that of leading LLMs.


Future Development Roadmap

The Hi Lab team outlined three key focus areas for future development:

  1. Data expansion: Scaling cross-modal training datasets
  2. Algorithm enhancement: Implementing reinforcement learning techniques
  3. Reasoning improvement: Boosting generalization capabilities

By open-sourcing dots.vlm1, Xiaohongshu aims to stimulate innovation in the multimodal AI space while establishing itself as a serious player in foundational model development.

Key Points:

  • First complete open-source multimodal model from Xiaohongshu
  • Self-developed NaViT encoder handles dynamic resolution natively
  • Matches proprietary models in 6/8 benchmark categories
  • Exceptional performance on STEM and analytical tasks
  • Planned enhancements through RL and data scaling

