Scientific Literature Mining

AI for Scientific Literature Mining

We are redefining how AI interacts with scientific literature, moving from paper retrieval to autonomous scientific discovery along a 5-level cognitive hierarchy.

Our Vision

The 5-Level Cognitive Hierarchy

AI capabilities in scientific literature mining progress through five levels of increasing cognitive depth.

L1

Paper Search

Retrieval from massive literature pools (e.g., PaperScout).

L2

Element Interpretation

Parsing PDFs into structured, machine-readable data (e.g., ChemTable).

L3

Information Extraction

Understanding methods, tasks, and contributions (e.g., ScholarSum).

L4

Knowledge Comprehension

Reasoning and multi-paper synthesis (e.g., PaperArena, Mind2Report).

L5

Hypothesis Discovery

Autonomous discovery and research generation (Future Vision).

Research Projects

Advancing the Frontiers of Science

Our work spans the entire hierarchy, providing tools and benchmarks for the next generation of scientific AI.

L1 · Paper Search

PaperScout

An autonomous LLM-based agent that reformulates academic paper search as a multi-turn decision-making process, dynamically deciding when and how to invoke search and citation expansion tools.

Project Repo →
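
The multi-turn framing described for PaperScout can be illustrated with a minimal sketch. This is not PaperScout's implementation: the toy corpus, the `search` and `expand_citations` tools, and the rule-based `choose_action` (standing in for the LLM policy) are all hypothetical names invented for this example. It only shows the general shape of an agent that decides, turn by turn, whether to search, expand citations, or stop.

```python
from dataclasses import dataclass, field

# Toy corpus (made up): paper id -> (title, ids of the papers it cites).
CORPUS = {
    "p1": ("Agentic retrieval for papers", ["p2", "p3"]),
    "p2": ("Citation graph expansion", ["p3"]),
    "p3": ("Dense retrieval basics", []),
    "p4": ("Protein folding survey", []),
}


def search(query):
    """Hypothetical search tool: ids of papers whose title contains any query word."""
    words = query.lower().split()
    return [pid for pid, (title, _) in CORPUS.items()
            if any(w in title.lower() for w in words)]


def expand_citations(pid):
    """Hypothetical citation-expansion tool: ids of the papers cited by `pid`."""
    return list(CORPUS[pid][1])


@dataclass
class State:
    query: str
    found: list = field(default_factory=list)    # papers collected so far, in order
    expanded: set = field(default_factory=set)   # papers whose citations were followed
    searched: bool = False


def choose_action(state):
    """Stand-in for the LLM policy: a fixed rule picks the next tool call."""
    if not state.searched:
        return ("search", state.query)
    for pid in state.found:
        if pid not in state.expanded:
            return ("expand", pid)
    return ("stop", None)


def run_agent(query, max_turns=10):
    """Run the decision loop until the policy stops or the turn budget is spent."""
    state = State(query)
    for _ in range(max_turns):
        action, arg = choose_action(state)
        if action == "stop":
            break
        if action == "search":
            results = search(arg)
            state.searched = True
        else:
            results = expand_citations(arg)
            state.expanded.add(arg)
        # Keep only papers that have not been collected yet.
        for pid in results:
            if pid not in state.found:
                state.found.append(pid)
    return state.found
```

Calling `run_agent("agentic retrieval")` first searches, then follows citations from each collected paper until nothing new turns up. In a real agent the fixed rule in `choose_action` would be replaced by a learned policy that weighs the cost and expected value of each tool call.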

L2 · Element Interpretation

ChemTable

A large-scale benchmark for multimodal LLMs to recognize and understand complex chemical tables, combining symbolic chemical formulas, table structures, and visual molecule diagrams.

Project Repo →

L3 · Information Extraction

ScholarSum

Advancing scientific summarization through knowledge graph reasoning and reflective refinement, distilling complex research into structured, high-fidelity summaries.

OpenReview →

L4 · Knowledge Comprehension

PaperArena

An evaluation benchmark for tool-augmented agentic reasoning on scientific literature, requiring agents to integrate information across multiple papers with diverse tools.

Project Repo →

L4 · Knowledge Comprehension

Mind2Report

A cognitive deep research agent that emulates commercial analysts to synthesize expert-level reports from massive web sources through intent-driven search and iterative synthesis.

Project Repo →

L5 · Hypothesis Discovery

AI Scientist Vision

Our ultimate goal: building agents that not only understand literature but also autonomously discover laws, generate new hypotheses, and propose novel research directions.

Visionary Research

Timeline

Milestones in Scientific Literature Mining

Jun. 2025

ChemTable · Table Recognition Benchmark

A large-scale benchmark designed to test MLLMs on real-world chemical tables, addressing structural complexity and domain-specific semantics.

ArXiv →

Oct. 2025

ScholarSum · Student-Teacher Summarization

A novel Student-Teacher framework for scientific summarization, leveraging knowledge graph reasoning and reflective refinement to ensure structural coherence.

OpenReview →

Oct. 2025

PaperArena · Agentic Reasoning Benchmark

The first benchmark for tool-augmented agentic reasoning on scientific literature, evaluating cross-paper integration and multi-tool orchestration.

ArXiv →

Jan. 2026

PaperScout · Autonomous Paper Search

An adaptive agentic framework for academic paper search, optimized with Proximal Sequence Policy Optimization (PSPO) for multi-turn interactions.

ArXiv →

Jan. 2026

Mind2Report · Cognitive Deep Research

A cognitive deep research agent that emulates commercial analysts to synthesize expert-level reports via intent-driven search and iterative synthesis.

ArXiv →

Why AI for Scientific Literature Mining Matters

The volume of scientific literature now exceeds what any human researcher can read and absorb. We aim to build AI systems that can autonomously navigate, understand, and synthesize scientific knowledge, ultimately accelerating the pace of human discovery.

Autonomous Reasoning

Moving beyond keyword matching to deep semantic understanding and multi-hop reasoning across heterogeneous scientific documents.

Multimodal Synthesis

Integrating text, tables, figures, and equations to form a holistic understanding of scientific breakthroughs.

Hypothesis Generation

Discovering novel connections between disparate research areas to propose and validate new scientific hypotheses.