- Text retrieval slightly beats images at Recall@1, but images match or even exceed text at deeper recall levels.
- Text and image methods fail on different queries, showing that a single modality has complementary blind spots.
- Multimodal hybrid retrieval fuses the strengths of both, lifting Recall@1 to 49% for the best overall results.
We spent weeks testing text vs. image retrieval for RAG. The winner? 𝗡𝗲𝗶𝘁𝗵𝗲𝗿.

Our recent publication, IRPAPERS, compares 𝘁𝗲𝘅𝘁-𝗯𝗮𝘀𝗲𝗱 𝗿𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 (OCR + vector, keyword, and hybrid search) and 𝗶𝗺𝗮𝗴𝗲-𝗯𝗮𝘀𝗲𝗱 𝗿𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 (multimodal late interaction with ColModernVBERT) to see which one ranks best for PDF search tasks. The benchmark tested retrieval over 3,230 pages from 166 scientific papers using 180 needle-in-the-haystack queries.

Here's what we found:

𝗧𝗲𝘅𝘁-𝗯𝗮𝘀𝗲𝗱 𝗿𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 (Arctic 2.0 + BM25 hybrid search):
• 46% Recall@1
• 78% Recall@5
• 91% Recall@20

𝗜𝗺𝗮𝗴𝗲-𝗯𝗮𝘀𝗲𝗱 𝗿𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 (ColModernVBERT multi-vector embeddings):
• 43% Recall@1
• 78% Recall@5
• 93% Recall@20

Text edges out images at the top rank, but images match or exceed text at deeper recall levels.

The most interesting thing we found, though, is that 𝘁𝗵𝗲𝘀𝗲 𝗮𝗽𝗽𝗿𝗼𝗮𝗰𝗵𝗲𝘀 𝗳𝗮𝗶𝗹 𝗼𝗻 𝗱𝗶𝗳𝗳𝗲𝗿𝗲𝗻𝘁 𝗾𝘂𝗲𝗿𝗶𝗲𝘀. At Recall@1:
• 22 queries succeeded with text but failed with images
• 18 queries succeeded with images but failed with text

This means neither 𝘢𝘭𝘰𝘯𝘦 is necessarily better than the other. Text excels at lexical precision and top-rank accuracy. Images preserve visual structure and spatial relationships that text transcription loses.

𝗠𝘂𝗹𝘁𝗶𝗺𝗼𝗱𝗮𝗹 𝗵𝘆𝗯𝗿𝗶𝗱 𝘀𝗲𝗮𝗿𝗰𝗵 fuses both text- and image-based methods together, giving the best results overall:
• 49% Recall@1 (+3 points over text alone)
• 81% Recall@5
• 95% Recall@20

Hybrid search gives you the best of both worlds: where one method fails, the other might succeed.

Watch the full video by
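To make the fusion idea concrete, here is a minimal sketch of one common way to combine two ranked lists: reciprocal rank fusion (RRF). This is an illustration of the general technique, not necessarily the exact fusion method used in IRPAPERS; the function names, the toy page IDs, and the k=60 smoothing constant are assumptions for the example.

```python
# Sketch: fusing text-based and image-based rankings with reciprocal
# rank fusion (RRF), then scoring Recall@k per query.
# Illustrative only -- not the paper's exact implementation.

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of page IDs: score(p) = sum over lists of 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] = scores.get(page_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

def recall_at_k(ranking, relevant_page, k):
    """1.0 if the single relevant page appears in the top-k results, else 0.0."""
    return 1.0 if relevant_page in ranking[:k] else 0.0

# Toy query where the relevant page is "p3": image retrieval ranks it
# first, text retrieval only second -- fusion surfaces it at rank 1.
text_ranking = ["p7", "p3", "p1", "p9"]
image_ranking = ["p3", "p4", "p7", "p2"]
fused = rrf_fuse([text_ranking, image_ranking])
print(fused[0])                      # "p3"
print(recall_at_k(fused, "p3", 1))   # 1.0
```

Because each retriever contributes 1/(k + rank) per document, a page that either modality ranks highly floats to the top of the fused list, which is exactly how hybrid search can recover queries that one modality alone misses. Averaging `recall_at_k` over all 180 queries would yield benchmark-style Recall@1/@5/@20 numbers.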