Feature/literature mcp #189
Conversation
…earch (skip pre-commit due to Pyright false positive on __getattr__)
…/Xyzen into feature/literature-MCP
Reviewer's Guide
Introduces a new literature search subsystem (models, base client, OpenAlex client, DOI normalization/deduplication, multi-source work distributor, and an MCP server entrypoint) along with extensive unit tests, while removing an obsolete built-in tools test. Several new test files contain unresolved merge conflict markers that must be fixed before merging.
Sequence diagram for the literature MCP search flow
sequenceDiagram
actor User
participant LLM
participant FastMCP_literature as FastMCP_literature_tool
participant WorkDistributor
participant OpenAlexClient
participant OpenAlexAPI as OpenAlex_API
User->>LLM: Ask for literature search
LLM->>FastMCP_literature: Call search_literature(parameters)
FastMCP_literature->>FastMCP_literature: Parse and normalize parameters
FastMCP_literature->>FastMCP_literature: Build SearchRequest
FastMCP_literature->>WorkDistributor: create(openalex_email)
FastMCP_literature->>WorkDistributor: search(SearchRequest)
WorkDistributor->>WorkDistributor: _register_clients()
WorkDistributor->>OpenAlexClient: search(SearchRequest)
OpenAlexClient->>OpenAlexClient: _build_query_params(...)
loop optional_id_resolution
OpenAlexClient->>OpenAlexClient: _resolve_author_id()
OpenAlexClient->>OpenAlexClient: _resolve_institution_id()
OpenAlexClient->>OpenAlexClient: _resolve_source_id()
end
loop pagination
OpenAlexClient->>OpenAlexClient: _fetch_all_pages()
loop per_page_request
OpenAlexClient->>OpenAlexClient: _request_with_retry(works_endpoint)
OpenAlexClient->>OpenAlexAPI: GET /works
OpenAlexAPI-->>OpenAlexClient: JSON page
end
end
OpenAlexClient->>OpenAlexClient: _transform_work() for each item
OpenAlexClient-->>WorkDistributor: list~LiteratureWork~, warnings
WorkDistributor->>WorkDistributor: deduplicate_by_doi()
WorkDistributor->>WorkDistributor: _sort_works()
WorkDistributor-->>FastMCP_literature: aggregated_result
FastMCP_literature->>FastMCP_literature: _format_search_result()
FastMCP_literature-->>LLM: Markdown report
LLM-->>User: Present formatted results
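To make the flow above concrete, here is a minimal usage sketch of how the MCP layer might drive the distributor. The module paths, the `SearchRequest` keyword arguments, and the `"works"` key of the aggregated result are assumptions taken from the diagrams, not verified against the PR.

```python
# Hypothetical end-to-end call following the sequence diagram above.
# Module paths and the shape of the aggregated result dict are assumptions.
import asyncio

from app.utils.literature.models import SearchRequest              # assumed module path
from app.utils.literature.work_distributor import WorkDistributor  # assumed module path


async def demo_search() -> None:
    request = SearchRequest(query="perovskite solar cells", year_from=2020, max_results=20)
    # The distributor registers its clients (currently OpenAlex), fans the request
    # out, then deduplicates by DOI and sorts before returning.
    async with WorkDistributor(openalex_email="reader@example.org") as distributor:
        result = await distributor.search(request)
    for work in result.get("works", []):
        print(work.title, work.doi, work.cited_by_count)


if __name__ == "__main__":
    asyncio.run(demo_search())
```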
Class diagram for the new literature search subsystem
classDiagram
class SearchRequest {
+str query
+str author
+str institution
+str source
+int year_from
+int year_to
+bool is_oa
+str work_type
+str language
+bool is_retracted
+bool has_abstract
+bool has_fulltext
+str sort_by
+int max_results
+list~str~ data_sources
}
class LiteratureWork {
+str id
+str doi
+str title
+list~dict~ authors
+int publication_year
+int cited_by_count
+str abstract
+str journal
+bool is_oa
+str oa_url
+str source
+dict raw_data
}
class BaseLiteratureClient {
<<abstract>>
+search(request SearchRequest) async tuple~list~LiteratureWork~~ list~str~~
}
class _RateLimiter {
-float _min_interval
-Lock _lock
-float _last_request
-Semaphore _semaphore
+__aenter__() async
+__aexit__(exc_type type~BaseException~, exc BaseException, tb Any) async
-_throttle() async
}
class OpenAlexClient {
+str BASE_URL
+int MAX_PER_PAGE
+int MAX_RETRIES
+float TIMEOUT
-str email
-int rate_limit
-_RateLimiter rate_limiter
-AsyncClient client
+OpenAlexClient(email str, rate_limit int, timeout float)
+property pool_type str
+search(request SearchRequest) async tuple~list~LiteratureWork~~ list~str~~
-_build_query_params(request SearchRequest, author_id str, institution_id str, source_id str) dict~str,str~
-_resolve_author_id(author_name str) async tuple~str,bool,str~
-_resolve_institution_id(institution_name str) async tuple~str,bool,str~
-_resolve_source_id(source_name str) async tuple~str,bool,str~
-_fetch_all_pages(params dict~str,str~, max_results int) async list~dict~str,Any~~
-_request_with_retry(url str, params dict~str,str~) async dict~str,Any~
-_transform_work(work dict~str,Any~) LiteratureWork
-_reconstruct_abstract(inverted_index dict~str,list~int~~) str
+close() async
+__aenter__() async OpenAlexClient
+__aexit__(exc_type type~BaseException~, exc BaseException, tb Any) async
}
class WorkWithDOI {
<<protocol>>
+str doi
+int cited_by_count
+int publication_year
}
class DoiCleanerModule {
+normalize_doi(doi str) str
+deduplicate_by_doi(works list~WorkWithDOI~) list~WorkWithDOI~
}
class WorkDistributor {
-dict~str,Any~ clients
-str openalex_email
+WorkDistributor(openalex_email str)
-_register_clients() void
+close() async
+__aenter__() async WorkDistributor
+__aexit__(exc_type type~BaseException~, exc BaseException, tb Any) async
+search(request SearchRequest) async dict~str,Any~
-_sort_works(works list~LiteratureWork~, sort_by str) list~LiteratureWork~
}
BaseLiteratureClient <|-- OpenAlexClient
WorkDistributor o--> OpenAlexClient
OpenAlexClient ..> _RateLimiter
WorkDistributor ..> SearchRequest
WorkDistributor ..> LiteratureWork
OpenAlexClient ..> SearchRequest
OpenAlexClient ..> LiteratureWork
DoiCleanerModule ..> WorkWithDOI
WorkDistributor ..> DoiCleanerModule
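For orientation, a hedged sketch of the DOI helpers named in the class diagram. The normalization rules and the deduplication tie-break (higher citation count, then newer year) are guesses based on the `WorkWithDOI` protocol fields, not the PR's verified behavior.

```python
# Sketch only: normalization rules and dedup policy are assumptions.
from typing import Protocol


class WorkWithDOI(Protocol):
    doi: str | None
    cited_by_count: int
    publication_year: int


def normalize_doi(doi: str) -> str:
    """Lowercase and strip common URL prefixes so equivalent DOIs compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def deduplicate_by_doi(works: list[WorkWithDOI]) -> list[WorkWithDOI]:
    """Keep one work per normalized DOI, preferring the more-cited, newer record."""
    best: dict[str, WorkWithDOI] = {}
    no_doi: list[WorkWithDOI] = []
    for work in works:
        if not work.doi:
            no_doi.append(work)  # works without a DOI cannot be merged safely
            continue
        key = normalize_doi(work.doi)
        current = best.get(key)
        if current is None or (work.cited_by_count, work.publication_year) > (
            current.cited_by_count,
            current.publication_year,
        ):
            best[key] = work
    return list(best.values()) + no_doi
```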
Hey - I've found 8 issues and left some high-level feedback:
- There are unresolved merge conflict markers (e.g., `<<<<<<< HEAD` / `>>>>>>>`) left in several new test files such as `test_openalex_client.py` and `test_base_client.py`; these need to be resolved and the intended assertions/code retained before merging.
- The tests add expectations for `BaseLiteratureClient.get_by_doi`/`get_by_id` and corresponding OpenAlexClient behavior, but the `BaseLiteratureClient` and `OpenAlexClient` implementations only define `search`; either remove/adjust these tests or add the corresponding abstract methods and concrete implementations (a sketch of what that extended interface could look like follows below).
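If the second option is taken, the extended abstract interface might look roughly like this. The signatures are inferred from the tests' expectations and the class diagram, not from the PR's code.

```python
# Hypothetical sketch only; signatures are inferred from the failing tests.
from abc import ABC, abstractmethod


class BaseLiteratureClient(ABC):
    @abstractmethod
    async def search(self, request: "SearchRequest") -> tuple[list["LiteratureWork"], list[str]]:
        """Run a search and return (works, warnings)."""

    @abstractmethod
    async def get_by_doi(self, doi: str) -> "LiteratureWork | None":
        """Fetch a single work by DOI, or None if not found."""

    @abstractmethod
    async def get_by_id(self, work_id: str) -> "LiteratureWork | None":
        """Fetch a single work by provider-specific ID, or None if not found."""
```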
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- There are unresolved merge conflict markers (e.g., `<<<<<<< HEAD` / `>>>>>>>`) left in several new test files such as `test_openalex_client.py` and `test_base_client.py`; these need to be resolved and the intended assertions/code retained before merging.
- The tests add expectations for `BaseLiteratureClient.get_by_doi`/`get_by_id` and corresponding OpenAlexClient behavior, but the `BaseLiteratureClient` and `OpenAlexClient` implementations only define `search`; either remove/adjust these tests or add the corresponding abstract methods and concrete implementations.
## Individual Comments
### Comment 1
<location> `service/app/utils/literature/openalex_client.py:266-275` </location>
<code_context>
+ sections.append("## Complete Works List (JSON)\n")
+ if include_abstract:
+ sections.append("The following JSON contains all works with full abstracts:\n")
+ else:
+ sections.append("The following JSON contains all works (abstracts excluded to save tokens):\n")
+ sections.append("```json")
</code_context>
<issue_to_address>
**issue (bug_risk):** 4xx errors (other than 403) are retried despite the comment saying they should not be.
Because `response.raise_for_status()` is wrapped by a generic `except Exception`, any non-403 4xx currently enters the retry loop, contradicting the “don’t retry on 4xx (except 403)” contract. Please either short‑circuit 4xx (e.g., `400 <= status_code < 500 and status_code != 403` → raise/return without retry) or narrow the `except` so `raise_for_status()` failures aren’t treated as transient errors.
</issue_to_address>
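One way to address this, sketched below under the assumption of an httpx-based client: catch `httpx.HTTPStatusError` separately so non-403 4xx responses fail fast while network errors still retry. The function name, retry count, and backoff are illustrative, not the PR's implementation.

```python
# Illustrative sketch only, not the PR's code: stop retrying on non-403 4xx
# responses while still retrying transient network failures.
import asyncio
from typing import Any

import httpx

MAX_RETRIES = 3


async def request_with_retry(client: httpx.AsyncClient, url: str, params: dict[str, str]) -> dict[str, Any]:
    last_error: Exception | None = None
    for attempt in range(MAX_RETRIES):
        try:
            response = await client.get(url, params=params)
            response.raise_for_status()
            return response.json()
        except httpx.HTTPStatusError as e:
            status = e.response.status_code
            if 400 <= status < 500 and status != 403:
                # Client errors other than 403 are not transient: fail fast.
                raise
            last_error = e
        except httpx.HTTPError as e:
            # Timeouts and connection errors are treated as transient.
            last_error = e
        await asyncio.sleep(2**attempt)
    raise last_error if last_error is not None else RuntimeError("retry loop exhausted")
```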
### Comment 2
<location> `service/app/utils/literature/work_distributor.py:90-95` </location>
<code_context>
+ """
+ # Clamp max_results to 50/1000 with warnings
+ all_warnings: list[str] = []
+ if request.max_results < 1:
+ all_warnings.append("⚠️ max_results < 1; using default 50")
+ request.max_results = 50
+ elif request.max_results > 1000:
+ all_warnings.append("⚠️ max_results > 1000; using 1000")
+ request.max_results = 1000
+
+ # Determine which data sources to use
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Mutating `request.max_results` in-place inside the distributor may have surprising side effects for callers.
Because the distributor writes back into the shared `SearchRequest` object, any caller that reuses this instance will see a different `max_results` than it originally set, which is subtle and hard to debug. Consider deriving a local `max_results` (or working on a shallow copy) instead of mutating the incoming request.
Suggested implementation:
```python
# Clamp max_results to 50/1000 with warnings, without mutating the request
all_warnings: list[str] = []
max_results = request.max_results
if max_results < 1:
    all_warnings.append("⚠️ max_results < 1; using default 50")
    max_results = 50
elif max_results > 1000:
    all_warnings.append("⚠️ max_results > 1000; using 1000")
    max_results = 1000
# Determine which data sources to use
sources = request.data_sources or ["openalex"]
```
Anywhere later in this method where `request.max_results` is used to control pagination or limit result counts should be updated to use the local `max_results` variable instead. For example, calls like `client.search(request, ...)` that rely on `request.max_results` should either (a) pass `max_results` explicitly as an argument, or (b) construct a shallow copy of `request` with the adjusted `max_results` to keep the original object immutable for callers.
</issue_to_address>
### Comment 3
<location> `service/app/utils/literature/openalex_client.py:274-283` </location>
<code_context>
+ except httpx.HTTPError as e:
+ logger.error(f"Literature search network error: {e}", exc_info=True)
+ return "❌ Network error while contacting literature sources. Please try again later."
+ except Exception as e:
+ logger.error(f"Literature search failed: {e}", exc_info=True)
+ return "❌ Unexpected error during search. Please retry or contact support."
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Page-fetch errors are swallowed, potentially returning partial results with no surfaced warning.
Currently, `_fetch_all_pages` logs exceptions and breaks the loop, returning only the pages fetched so far. To avoid silently treating partial results as a successful full search, either propagate the exception so callers can handle it explicitly, or return a structured result that includes both the works and an explicit partial-results/error indicator for the caller to surface.
</issue_to_address>
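One hedged way to do the latter is sketched below: return the fetched works together with explicit warnings so the distributor can surface partial results. The `(works, warnings)` return shape and the injected `get_json` helper are assumptions, not the PR's actual API.

```python
# Sketch assuming a get_json(path, params) coroutine exists; the tuple return
# shape is illustrative, not the PR's actual signature.
from typing import Any, Awaitable, Callable


async def fetch_all_pages(
    get_json: Callable[[str, dict[str, str]], Awaitable[dict[str, Any]]],
    params: dict[str, str],
    max_results: int,
) -> tuple[list[dict[str, Any]], list[str]]:
    works: list[dict[str, Any]] = []
    warnings: list[str] = []
    page = 1
    while len(works) < max_results:
        try:
            response = await get_json("works", {**params, "page": str(page)})
        except Exception as e:
            # Surface the failure instead of silently returning a truncated list.
            warnings.append(f"⚠️ Fetch stopped at page {page}: {e}; results may be partial.")
            break
        batch = response.get("results", [])
        if not batch:
            break
        works.extend(batch)
        if len(works) >= response.get("meta", {}).get("count", 0):
            break
        page += 1
    return works[:max_results], warnings
```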
### Comment 4
<location> `service/app/mcp/literature.py:261-262` </location>
<code_context>
+ conditions.append(f"- **Work Type**: {request.work_type}")
+ if request.language:
+ conditions.append(f"- **Language**: {request.language}")
+ if request.is_retracted is not None:
+ conditions.append(f"- **Exclude Retracted**: {'No' if request.is_retracted else 'Yes'}")
+ if request.has_abstract is not None:
+ conditions.append(f"- **Require Abstract**: {'Yes' if request.has_abstract else 'No'}")
</code_context>
<issue_to_address>
**suggestion:** The "Exclude Retracted" label is inverted/confusing relative to the `is_retracted` flag semantics.
Given the docstring, `is_retracted=True` means “only retracted” and `False` means “exclude retracted”. With the current rendering, `Exclude Retracted` will show `No` when `True` and `Yes` when `False`, which is logically consistent but hard to interpret. Consider renaming/rephrasing this line to describe the actual filter mode (e.g., `Retracted Filter: only retracted / exclude retracted`) rather than expressing it as an inverted “Exclude” flag.
```suggestion
        if request.is_retracted is not None:
            retracted_mode = "Only retracted" if request.is_retracted else "Exclude retracted"
            conditions.append(f"- **Retracted Filter**: {retracted_mode}")
```
</issue_to_address>
### Comment 5
<location> `service/tests/unit/test_literature/test_openalex_client.py:126-129` </location>
<code_context>
+
+ assert len(works) == 1
+ mock_resolve.assert_called_once_with("Jane Smith")
+<<<<<<< HEAD:service/tests/unit/test_literature/test_openalex_client.py
+ assert any("Author resolved" in msg for msg in warnings)
+=======
+>>>>>>> 1794485dfcc48ad6b089f2b31eb788a021eadea5:service/tests/unit/test_utils/test_openalex_client.py
+
+ @pytest.mark.asyncio
</code_context>
<issue_to_address>
**issue (bug_risk):** Resolve merge conflict markers in OpenAlex client tests to restore a valid test module.
There are unresolved merge conflict markers in this test file (around the author/institution/source filter tests and later in the file). In its current state the file will not parse and pytest will not collect these tests. Please resolve the conflicts (pick the intended assertions, e.g. the `assert any(... "Author resolved" ...)` checks, or restore any removed tests as needed) and ensure the file is syntactically valid before merging.
</issue_to_address>
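For reference, one possible resolution that keeps the HEAD-side assertion (assuming the client does emit an "Author resolved" warning) would leave the test body as:

```python
assert len(works) == 1
mock_resolve.assert_called_once_with("Jane Smith")
assert any("Author resolved" in msg for msg in warnings)
```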
### Comment 6
<location> `service/tests/unit/test_literature/test_base_client.py:16-25` </location>
<code_context>
+ """Dummy search implementation."""
+ return [], []
+
+<<<<<<< HEAD:service/tests/unit/test_literature/test_base_client.py
+=======
+ async def get_by_doi(self, doi: str) -> LiteratureWork | None:
+ """Dummy get_by_doi implementation."""
+ return None
+
+ async def get_by_id(self, work_id: str) -> LiteratureWork | None:
+ """Dummy get_by_id implementation."""
+ return None
+
+>>>>>>> 1794485dfcc48ad6b089f2b31eb788a021eadea5:service/tests/unit/test_utils/test_base_client.py
+
+class TestBaseLiteratureClientProtocol:
</code_context>
<issue_to_address>
**issue (testing):** Fix merge conflict in BaseLiteratureClient tests so they match the current abstract interface.
This file still has merge conflict markers and two conflicting `ConcreteClient` versions. The current `BaseLiteratureClient` only defines `search`, so tests relying on `get_by_doi`/`get_by_id` being abstract will no longer be valid. Please resolve the conflict by keeping a single `ConcreteClient` implementation aligned with the current base class and remove or adjust tests for methods that are no longer part of the abstraction.
</issue_to_address>
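A resolved test helper consistent with a search-only base class might look like the sketch below, assuming the `SearchRequest`/`LiteratureWork` types from the class diagram; the dummy body mirrors the HEAD side of the conflict shown above.

```python
# Sketch of the resolved helper, keeping only what the current abstraction needs.
class ConcreteClient(BaseLiteratureClient):
    async def search(self, request: SearchRequest) -> tuple[list[LiteratureWork], list[str]]:
        """Dummy search implementation."""
        return [], []
```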
### Comment 7
<location> `service/app/utils/literature/openalex_client.py:239` </location>
<code_context>
+
+ return params
+
+ async def _resolve_author_id(self, author_name: str) -> tuple[str | None, bool, str]:
+ """
+ Two-step lookup: author name -> author ID
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring the OpenAlex client to use shared helpers and data-driven mappings to remove repeated request and filter-building logic while keeping behavior identical.
You can reduce duplication quite a bit without changing behavior by introducing a couple of small helpers and some data‑driven mappings.
### 1. Unify `_resolve_*_id` into a generic helper
All three resolution methods share the same structure. You can factor out the common logic and keep thin, readable wrappers:
```python
async def _resolve_entity_id(
    self,
    *,
    kind: str,        # "Author" / "Institution" / "Source"
    endpoint: str,    # "authors" / "institutions" / "sources"
    name: str,
    not_found_suggestions: str,
) -> tuple[str | None, bool, str]:
    async with self.rate_limiter:
        try:
            url = f"{self.BASE_URL}/{endpoint}"
            params: dict[str, str] = {"search": name}
            if self.email:
                params["mailto"] = self.email
            response = await self._request_with_retry(url, params)
            if results := response.get("results", []):
                entity = results[0]
                entity_id = entity["id"].split("/")[-1]
                display = entity.get("display_name", name)
                logger.info(f"Resolved {kind.lower()} '{name}' -> {entity_id}")
                return entity_id, True, f"✓ {kind} resolved: '{name}' -> '{display}'"
            msg = f"⚠️ {kind} '{name}' not found. {not_found_suggestions}"
            logger.warning(msg)
            return None, False, msg
        except Exception as e:
            msg = f"⚠️ Failed to resolve {kind.lower()} '{name}': {e}"
            logger.warning(msg)
            return None, False, msg
```
Then your specific resolvers become small and self‑documenting:
```python
async def _resolve_author_id(self, author_name: str) -> tuple[str | None, bool, str]:
    return await self._resolve_entity_id(
        kind="Author",
        endpoint="authors",
        name=author_name,
        not_found_suggestions=(
            "Suggestions: (1) Try full name format like 'Smith, John' or 'John Smith', "
            "(2) Check spelling, (3) Try removing middle name/initial."
        ),
    )

async def _resolve_institution_id(self, institution_name: str) -> tuple[str | None, bool, str]:
    return await self._resolve_entity_id(
        kind="Institution",
        endpoint="institutions",
        name=institution_name,
        not_found_suggestions=(
            "Suggestions: (1) Use full official name (e.g., 'Harvard University' not 'Harvard'), "
            "(2) Try variations (e.g., 'MIT' vs 'Massachusetts Institute of Technology'), "
            "(3) Check spelling."
        ),
    )

async def _resolve_source_id(self, source_name: str) -> tuple[str | None, bool, str]:
    return await self._resolve_entity_id(
        kind="Source/Journal",
        endpoint="sources",
        name=source_name,
        not_found_suggestions=(
            "Suggestions: (1) Use full journal name (e.g., 'Nature' or 'Science'), "
            "(2) Try alternative names (e.g., 'JAMA' vs 'Journal of the American Medical Association'), "
            "(3) Check spelling."
        ),
    )
```
This keeps all messages and behavior intact while removing ~100 lines of repetition.
---
### 2. Hide rate limiting + retry behind a single `_get_json` helper
You already have `_request_with_retry`; wrapping the rate limiter once will simplify all call sites:
```python
async def _get_json(self, path: str, params: dict[str, str]) -> dict[str, Any]:
    async with self.rate_limiter:
        if self.email and "mailto" not in params:
            params = {**params, "mailto": self.email}
        url = f"{self.BASE_URL}/{path}"
        return await self._request_with_retry(url, params)
```
Then, for example, `_fetch_all_pages` becomes:
```python
async def _fetch_all_pages(self, params: dict[str, str], max_results: int) -> list[dict[str, Any]]:
    all_works: list[dict[str, Any]] = []
    page = 1
    while len(all_works) < max_results:
        try:
            page_params = {**params, "page": str(page)}
            response = await self._get_json("works", page_params)
            works = response.get("results", [])
            if not works:
                break
            all_works.extend(works)
            logger.info(f"Fetched page {page}: {len(works)} works")
            meta = response.get("meta", {})
            total_count = meta.get("count", 0)
            if len(all_works) >= total_count:
                break
            page += 1
        except Exception as e:
            logger.error(f"Error fetching page {page}: {e}")
            break
    return all_works[:max_results]
```
And the generic resolver above can call `_get_json(endpoint, params)` instead of constructing URLs and `async with self.rate_limiter:` itself.
---
### 3. Make `_build_query_params` more declarative
The filter construction can be made shorter and more obvious using small mappings. For example, boolean filters:
```python
bool_filters = {
    "is_oa": request.is_oa,
    "is_retracted": request.is_retracted,
    "has_abstract": request.has_abstract,
    "has_fulltext": request.has_fulltext,
}
for field, value in bool_filters.items():
    if value is not None:
        filters.append(f"{field}:{str(value).lower()}")
```
Likewise for simple scalar filters:
```python
scalar_filters = {
    "type": request.work_type,
    "language": request.language,
}
for field, value in scalar_filters.items():
    if value:
        filters.append(f"{field}:{value}")
```
And for the ID‑based filters:
```python
id_filters = {
    "authorships.author.id": author_id,
    "authorships.institutions.id": institution_id,
    "primary_location.source.id": source_id,
}
for field, value in id_filters.items():
    if value:
        filters.append(f"{field}:{value}")
```
Combined, `_build_query_params` keeps the same semantics but with fewer branches and less repetition, making it easier to adjust the mapping from `SearchRequest` fields to OpenAlex filters later.
</issue_to_address>
### Comment 8
<location> `service/app/mcp/literature.py:139` </location>
<code_context>
+ year_warning += f"year_to {year_to_int}→1700. "
+ year_to_int = 1700
+
+ # Convert is_oa to boolean
+ is_oa_bool: bool | None = None
+ if is_oa is not None:
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting shared helpers for parameter normalization, boolean parsing, and work-to-dict conversion to simplify `search_literature` and `_format_search_result` while keeping behavior unchanged.
You can reduce complexity and duplication without changing behavior by extracting a few small helpers.
### 1. Deduplicate string→bool conversions
You currently repeat the same pattern for `is_oa`, `is_retracted`, `has_abstract`, `has_fulltext`, and `include_abstract`. Extract a single utility:
```python
def _str_to_bool(value: str | bool | None) -> bool | None:
    if value is None:
        return None
    if isinstance(value, bool):
        return value
    return str(value).lower() in ("true", "1", "yes")
```
Then use it in `search_literature`:
```python
# Convert is_oa to boolean
is_oa_bool = _str_to_bool(is_oa)
# Convert is_retracted to boolean
is_retracted_bool = _str_to_bool(is_retracted)
# Convert has_abstract to boolean
has_abstract_bool = _str_to_bool(has_abstract)
# Convert has_fulltext to boolean
has_fulltext_bool = _str_to_bool(has_fulltext)
# Convert include_abstract to bool (keep default False)
include_abstract_bool = _str_to_bool(include_abstract) or False
```
This keeps all behavior but centralizes the conversion logic.
### 2. Isolate parameter normalization from orchestration
`search_literature` does both input normalization and search orchestration. You can pull the normalization into a helper and keep the top-level tool thin:
```python
def _normalize_search_params(
    query: str,
    mailto: str | None,
    author: str | None,
    institution: str | None,
    source: str | None,
    year_from: str | None,
    year_to: str | None,
    is_oa: str | None,
    work_type: str | None,
    language: str | None,
    is_retracted: str | None,
    has_abstract: str | None,
    has_fulltext: str | None,
    sort_by: str,
    max_results: str | int,
    data_sources: list[str] | None,
    include_abstract: str | bool,
) -> tuple[SearchRequest, str, bool, str | None]:
    # year parsing & clamping
    year_from_int = int(year_from) if year_from and str(year_from).strip() else None
    year_to_int = int(year_to) if year_to and str(year_to).strip() else None
    max_year = datetime.now().year + 1
    year_warning = ""
    if year_from_int is not None and year_from_int > max_year:
        year_warning += f"year_from {year_from_int}→{max_year}. "
        year_from_int = max_year
    if year_to_int is not None and year_to_int < 1700:
        year_warning += f"year_to {year_to_int}→1700. "
        year_to_int = 1700
    request = SearchRequest(
        query=query,
        author=author,
        institution=institution,
        source=source,
        year_from=year_from_int,
        year_to=year_to_int,
        is_oa=_str_to_bool(is_oa),
        work_type=work_type,
        language=language,
        is_retracted=_str_to_bool(is_retracted),
        has_abstract=_str_to_bool(has_abstract),
        has_fulltext=_str_to_bool(has_fulltext),
        sort_by=sort_by,
        max_results=int(max_results) if max_results else 50,
        data_sources=data_sources,
    )
    include_abstract_bool = _str_to_bool(include_abstract) or False
    openalex_email = mailto.strip() if mailto and str(mailto).strip() else None
    return request, year_warning, include_abstract_bool, openalex_email
```
Then `search_literature` becomes simpler:
```python
@mcp.tool()
async def search_literature(... ) -> str:
    try:
        request, year_warning, include_abstract_bool, openalex_email = _normalize_search_params(
            query,
            mailto,
            author,
            institution,
            source,
            year_from,
            year_to,
            is_oa,
            work_type,
            language,
            is_retracted,
            has_abstract,
            has_fulltext,
            sort_by,
            max_results,
            data_sources,
            include_abstract,
        )
        logger.info(
            "Literature search requested: query='%s', mailto=%s, max_results=%s",
            request.query,
            openalex_email,
            request.max_results,
        )
        async with WorkDistributor(openalex_email=openalex_email) as distributor:
            result = await distributor.search(request)
        if year_warning:
            result.setdefault("warnings", []).append(f"⚠️ Year adjusted: {year_warning.strip()}")
        return _format_search_result(request, result, include_abstract_bool)
    ...
```
This keeps behavior identical but separates concerns.
### 3. Extract work→dict conversion from `_format_search_result`
The JSON block mixes formatting with data-shape logic. You can pull the mapping into a small helper:
```python
def _works_to_public_dicts(works: list, include_abstract: bool) -> list[dict[str, Any]]:
    items: list[dict[str, Any]] = []
    for work in works:
        work_data: dict[str, Any] = {
            "id": work.id,
            "doi": work.doi,
            "title": work.title,
            "authors": work.authors[:5],
            "publication_year": work.publication_year,
            "cited_by_count": work.cited_by_count,
            "journal": work.journal,
            "is_oa": work.is_oa,
            "oa_url": work.oa_url,
            "source": work.source,
        }
        if include_abstract and work.abstract:
            work_data["abstract"] = work.abstract
        items.append(work_data)
    return items
```
Then in `_format_search_result`:
```python
sections.append("```json")
works_dict = _works_to_public_dicts(works, include_abstract)
sections.append(json.dumps(works_dict, indent=2, ensure_ascii=False))
sections.append("```")
```
This makes `_format_search_result` shorter and gives you a single place to evolve the JSON representation later.
</issue_to_address>
            else:
                msg = (
                    f"⚠️ Author '{author_name}' not found. "
                    f"Suggestions: (1) Try full name format like 'Smith, John' or 'John Smith', "
                    f"(2) Check spelling, (3) Try removing middle name/initial."
                )
                logger.warning(msg)
                return None, False, msg
        except Exception as e:
            msg = f"⚠️ Failed to resolve author '{author_name}': {e}"
issue (bug_risk): 4xx errors (other than 403) are retried despite the comment saying they should not be.
Because response.raise_for_status() is wrapped by a generic except Exception, any non-403 4xx currently enters the retry loop, contradicting the “don’t retry on 4xx (except 403)” contract. Please either short‑circuit 4xx (e.g., 400 <= status_code < 500 and status_code != 403 → raise/return without retry) or narrow the except so raise_for_status() failures aren’t treated as transient errors.
        if request.max_results < 1:
            all_warnings.append("⚠️ max_results < 1; using default 50")
            request.max_results = 50
        elif request.max_results > 1000:
            all_warnings.append("⚠️ max_results > 1000; using 1000")
            request.max_results = 1000
suggestion (bug_risk): Mutating request.max_results in-place inside the distributor may have surprising side effects for callers.
Because the distributor writes back into the shared SearchRequest object, any caller that reuses this instance will see a different max_results than it originally set, which is subtle and hard to debug. Consider deriving a local max_results (or working on a shallow copy) instead of mutating the incoming request.
Suggested implementation:
# Clamp max_results to 50/1000 with warnings, without mutating the request
all_warnings: list[str] = []
max_results = request.max_results
if max_results < 1:
    all_warnings.append("⚠️ max_results < 1; using default 50")
    max_results = 50
elif max_results > 1000:
    all_warnings.append("⚠️ max_results > 1000; using 1000")
    max_results = 1000

# Determine which data sources to use
sources = request.data_sources or ["openalex"]

Anywhere later in this method where `request.max_results` is used to control pagination or limit result counts should be updated to use the local `max_results` variable instead. For example, calls like `client.search(request, ...)` that rely on `request.max_results` should either (a) pass `max_results` explicitly as an argument, or (b) construct a shallow copy of `request` with the adjusted `max_results` to keep the original object immutable for callers.
        except Exception as e:
            msg = f"⚠️ Failed to resolve author '{author_name}': {e}"
            logger.warning(msg)
            return None, False, msg

    async def _resolve_institution_id(self, institution_name: str) -> tuple[str | None, bool, str]:
        """
        Two-step lookup: institution name -> institution ID

        Args:
suggestion (bug_risk): Page-fetch errors are swallowed, potentially returning partial results with no surfaced warning.
Currently, _fetch_all_pages logs exceptions and breaks the loop, returning only the pages fetched so far. To avoid silently treating partial results as a successful full search, either propagate the exception so callers can handle it explicitly, or return a structured result that includes both the works and an explicit partial-results/error indicator for the caller to surface.
        if request.is_retracted is not None:
            conditions.append(f"- **Exclude Retracted**: {'No' if request.is_retracted else 'Yes'}")
suggestion: The "Exclude Retracted" label is inverted/confusing relative to the is_retracted flag semantics.
Given the docstring, is_retracted=True means “only retracted” and False means “exclude retracted”. With the current rendering, Exclude Retracted will show No when True and Yes when False, which is logically consistent but hard to interpret. Consider renaming/rephrasing this line to describe the actual filter mode (e.g., Retracted Filter: only retracted / exclude retracted) rather than expressing it as an inverted “Exclude” flag.
Suggested change:
        if request.is_retracted is not None:
            retracted_mode = "Only retracted" if request.is_retracted else "Exclude retracted"
            conditions.append(f"- **Retracted Filter**: {retracted_mode}")
<<<<<<< HEAD:service/tests/unit/test_literature/test_openalex_client.py
        assert any("Author resolved" in msg for msg in warnings)
=======
>>>>>>> 1794485dfcc48ad6b089f2b31eb788a021eadea5:service/tests/unit/test_utils/test_openalex_client.py
issue (bug_risk): Resolve merge conflict markers in OpenAlex client tests to restore a valid test module.
There are unresolved merge conflict markers in this test file (around the author/institution/source filter tests and later in the file). In its current state the file will not parse and pytest will not collect these tests. Please resolve the conflicts (pick the intended assertions, e.g. the assert any(... "Author resolved" ...) checks, or restore any removed tests as needed) and ensure the file is syntactically valid before merging.
<<<<<<< HEAD:service/tests/unit/test_literature/test_base_client.py
=======
    async def get_by_doi(self, doi: str) -> LiteratureWork | None:
        """Dummy get_by_doi implementation."""
        return None

    async def get_by_id(self, work_id: str) -> LiteratureWork | None:
        """Dummy get_by_id implementation."""
        return None

>>>>>>> 1794485dfcc48ad6b089f2b31eb788a021eadea5:service/tests/unit/test_utils/test_base_client.py
issue (testing): Fix merge conflict in BaseLiteratureClient tests so they match the current abstract interface.
This file still has merge conflict markers and two conflicting ConcreteClient versions. The current BaseLiteratureClient only defines search, so tests relying on get_by_doi/get_by_id being abstract will no longer be valid. Please resolve the conflict by keeping a single ConcreteClient implementation aligned with the current base class and remove or adjust tests for methods that are no longer part of the abstraction.
refactor: Recover .gitignore, clean up test utilities and remove unused files
Summary by Sourcery
Add a new literature search subsystem with an OpenAlex API client, DOI normalization/deduplication helpers, a multi-source work distributor, and an MCP server entrypoint, along with comprehensive unit test coverage for these components.
New Features:
- `SearchRequest` and `LiteratureWork` models for representing literature search requests and results.
- `OpenAlexClient` with rate limiting and retries that converts OpenAlex responses into the shared models.
- `WorkDistributor` for aggregating, deduplicating, and sorting results from multiple literature data sources.
Enhancements:
Tests: