model · Hugging Face
VioletVision-3B
An open vision-language model for captioning and VQA.
paper: ViQ: Text-Aligned Visual Quantized Representations at Any Resolution
paper: Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models
paper: Ask, Solve, Generate: Self-Evolving Unified Multimodal Understanding and Generation via Self-Consistency Rewards
paper: Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA
paper: HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models
paper: TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference
paper: Vision-language pretraining at scale
repo: vlm-starter