
model · Hugging Face

VioletVision-3B

An open vision-language model for captioning and VQA.



repo: vlm-starter

paper: Vision-language pretraining at scale

paper: ViQ: Text-Aligned Visual Quantized Representations at Any Resolution

paper: Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models

paper: Ask, Solve, Generate: Self-Evolving Unified Multimodal Understanding and Generation via Self-Consistency Rewards

paper: Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA

paper: HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models

paper: TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference


tags: vision, multimodal