
Conversation


@SoberPizza SoberPizza commented Jan 20, 2026


Summary by Sourcery

Add a multi-source academic literature search subsystem, including an MCP server, OpenAlex API client, shared models, DOI normalization/deduplication utilities, and a work distributor, along with comprehensive unit tests and documentation.

New Features:

  • Introduce a Literature MCP server that exposes a unified tool for multi-source academic literature search with advanced filtering and formatted markdown output.
  • Add an OpenAlex API client implementing best-practice patterns for LLM usage, including rate limiting, retries, name-to-ID resolution, and abstract reconstruction.
  • Provide shared models for standardized search requests and literature work representations across data sources.
  • Implement a work distributor to orchestrate searches across one or more literature providers and aggregate, sort, and deduplicate results.
  • Add DOI normalization and DOI-based deduplication utilities to unify identifiers and prefer higher-quality records.

Enhancements:

  • Document the literature MCP architecture, capabilities, and best practices for using the search tools efficiently and safely.

Tests:

  • Add extensive unit test coverage for the OpenAlex client, work distributor behavior, DOI utilities, base client protocol, and shared models.
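
The pieces above are easiest to see end to end. Below is a minimal usage sketch, not code from this PR: the constructor and method signatures follow the class diagram in the reviewer's guide, while the import paths and the keys of the aggregated result dict are assumptions for illustration.

```python
# Minimal sketch (not the PR's code): wiring a SearchRequest through the WorkDistributor.
# Import paths and the "works" key of the aggregated result are assumptions.
import asyncio

from app.utils.literature.models import SearchRequest
from app.utils.literature.work_distributor import WorkDistributor


async def main() -> None:
    request = SearchRequest(
        query="retrieval augmented generation",
        year_from=2020,
        is_oa=True,
        max_results=25,
    )
    distributor = WorkDistributor(openalex_email="you@example.org")
    try:
        result = await distributor.search(request)  # dict[str, Any] per the class diagram
        for work in result.get("works", []):
            print(work.title, work.doi, work.cited_by_count)
    finally:
        await distributor.close()


asyncio.run(main())
```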

Copilot AI review requested due to automatic review settings January 20, 2026 16:27

sourcery-ai bot commented Jan 20, 2026


Reviewer's Guide

Introduces a new multi-source literature search subsystem centered around an OpenAlex client, a DOI normalization/deduplication utility, a work distributor, and an MCP server tool, along with comprehensive data models and tests for each component.

Sequence diagram for the search_literature MCP tool flow

sequenceDiagram
    actor User
    participant MCPClient
    participant LiteratureMCP as FastMCP_literature_server
    participant Tool as search_literature_tool
    participant Distributor as WorkDistributor
    participant OAClient as OpenAlexClient
    participant OpenAlexAPI

    User->>MCPClient: Configure and invoke search_literature
    MCPClient->>LiteratureMCP: JSON-RPC tool call
    LiteratureMCP->>Tool: search_literature(query, filters,...)

    Tool->>Tool: Parse and coerce parameters
    Tool->>Tool: Build SearchRequest

    Tool->>Distributor: create WorkDistributor(openalex_email)
    Tool->>Distributor: search(request: SearchRequest)

    loop For each data source
        alt openalex enabled
            Distributor->>OAClient: search(request)

            OAClient->>OAClient: _build_query_params(...)

            par Resolve optional IDs
                OAClient->>OpenAlexAPI: GET /authors?search=author
                OpenAlexAPI-->>OAClient: author results
                OAClient->>OpenAlexAPI: GET /institutions?search=institution
                OpenAlexAPI-->>OAClient: institution results
                OAClient->>OpenAlexAPI: GET /sources?search=source
                OpenAlexAPI-->>OAClient: source results
            end

            loop Paginated works fetch
                OAClient->>OpenAlexAPI: GET /works?page=n&per-page=200
                OpenAlexAPI-->>OAClient: works page
                OAClient->>OAClient: _request_with_retry(...) with backoff
            end

            OAClient->>Distributor: works, warnings
        else unsupported source
            Distributor->>Distributor: record warning
        end
    end

    Distributor->>Distributor: deduplicate_by_doi(all_works)
    Distributor->>Distributor: _sort_works(...)
    Distributor-->>Tool: aggregated_result

    Tool->>Tool: _format_search_result(...)
    Tool-->>LiteratureMCP: Markdown report string
    LiteratureMCP-->>MCPClient: Tool result
    MCPClient-->>User: Rendered report + JSON works

Class diagram for the new literature search subsystem

classDiagram

    class SearchRequest {
        +str query
        +str author
        +str institution
        +str source
        +int year_from
        +int year_to
        +bool is_oa
        +str work_type
        +str language
        +bool is_retracted
        +bool has_abstract
        +bool has_fulltext
        +str sort_by
        +int max_results
        +list~str~ data_sources
    }

    class LiteratureWork {
        +str id
        +str doi
        +str title
        +list~dict~ authors
        +int publication_year
        +int cited_by_count
        +str abstract
        +str journal
        +bool is_oa
        +str oa_url
        +str source
        +dict raw_data
    }

    class BaseLiteratureClient {
        <<abstract>>
        +search(request: SearchRequest) tuple~list~LiteratureWork~~ list~str~~
    }

    class OpenAlexClient {
        -str BASE_URL
        -int MAX_PER_PAGE
        -int MAX_RETRIES
        -float TIMEOUT
        -str email
        -int rate_limit
        -_RateLimiter rate_limiter
        -AsyncClient client
        +OpenAlexClient(email: str, rate_limit: int, timeout: float)
        +str pool_type
        +search(request: SearchRequest) tuple~list~LiteratureWork~~ list~str~~
        -_build_query_params(request: SearchRequest, author_id: str, institution_id: str, source_id: str) dict~str, str~
        -_resolve_author_id(author_name: str) tuple~str, bool, str~
        -_resolve_institution_id(institution_name: str) tuple~str, bool, str~
        -_resolve_source_id(source_name: str) tuple~str, bool, str~
        -_fetch_all_pages(params: dict~str, str~, max_results: int) list~dict~
        -_request_with_retry(url: str, params: dict~str, str~) dict~str, any~
        -_transform_work(work: dict~str, any~) LiteratureWork
        -_reconstruct_abstract(inverted_index: dict~str, list~int~~) str
        +close() None
    }

    class _RateLimiter {
        -float _min_interval
        -Lock _lock
        -float _last_request
        -Semaphore _semaphore
        +_RateLimiter(rate_per_second: float, max_concurrency: int)
        +__aenter__() None
        +__aexit__(exc_type: type, exc: BaseException, tb: any) None
        -_throttle() None
    }

    class WorkWithDOI {
        <<interface>>
        +str doi
        +int cited_by_count
        +int publication_year
    }

    class WorkDistributor {
        -dict~str, any~ clients
        -str openalex_email
        +WorkDistributor(openalex_email: str)
        +search(request: SearchRequest) dict~str, any~
        +close() None
        -_register_clients() None
        -_sort_works(works: list~LiteratureWork~, sort_by: str) list~LiteratureWork~
    }

    class doi_cleaner_utils {
        <<utility>>
        +normalize_doi(doi: str) str
        +deduplicate_by_doi(works: list~WorkWithDOI~) list~WorkWithDOI~
    }

    BaseLiteratureClient <|-- OpenAlexClient
    WorkWithDOI <|.. LiteratureWork

    WorkDistributor --> BaseLiteratureClient : holds clients
    WorkDistributor --> LiteratureWork : aggregates
    WorkDistributor --> SearchRequest : consumes

    OpenAlexClient --> SearchRequest : consumes
    OpenAlexClient --> LiteratureWork : produces

    doi_cleaner_utils ..> WorkWithDOI : operates_on
    WorkDistributor ..> doi_cleaner_utils : uses

File-Level Changes

Change Details Files
Add an OpenAlex literature client implementing rate-limited, retrying search and lookup with best-practice query construction and transformation into internal models.
  • Implement OpenAlexClient with configurable polite/default pools, global rate limiting, and async httpx client management
  • Add search workflow that resolves author/institution/source names to IDs, builds OpenAlex filters/sorting, paginates through works, and transforms responses to LiteratureWork objects
  • Provide helpers for reconstructing abstracts from inverted index, transforming raw works, and robust request-with-retry handling for timeouts, 4xx/5xx, and rate limits
service/app/utils/literature/openalex_client.py
service/tests/unit/test_utils/test_openalex_client.py
Introduce shared literature data models, base client abstraction, and a work distributor to coordinate multi-source searches with deduplication and sorting.
  • Define SearchRequest and LiteratureWork dataclasses as a unified schema across providers
  • Create BaseLiteratureClient abstract base with standard async search/get methods and tests via a concrete stub
  • Implement WorkDistributor to register provider clients (currently OpenAlex), clamp max_results, dispatch searches, aggregate results, deduplicate by DOI, sort, and expose async context/close semantics
service/app/utils/literature/models.py
service/app/utils/literature/base_client.py
service/app/utils/literature/work_distributor.py
service/tests/unit/test_utils/test_base_client.py
service/tests/unit/test_utils/test_work_distributor.py
Add DOI normalization and deduplication utilities that select best representative works and are used by the distributor.
  • Implement normalize_doi to strip common prefixes, validate DOI structure, and lowercase for consistent matching
  • Implement deduplicate_by_doi that groups works by normalized DOI, prefers higher citations and newer year, and preserves non-DOI works, plus extensive unit tests
service/app/utils/literature/doi_cleaner.py
service/tests/unit/test_utils/test_doi_cleaner.py
Expose the literature search capability as an MCP server tool that validates/coerces user parameters, invokes the distributor, and returns a structured markdown+JSON report.
  • Create FastMCP-based literature MCP with search_literature tool that parses string/primitive params (years, booleans, max_results), constructs SearchRequest, and wires WorkDistributor with optional OpenAlex polite-pool email
  • Format search results into a markdown report including conditions, stats, warnings, suggestions, and an embedded JSON list of works with optional abstracts, plus error handling for validation/network issues
service/app/mcp/literature.py
Document the literature MCP system and its usage, parameters, and OpenAlex best practices.
  • Add README describing multi-source literature MCP features, typical usage patterns, filter parameters, and OpenAlex-specific guidance and anti-patterns
service/app/utils/literature/README.md
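
To make the DOI handling described above concrete, here is a hedged sketch of the behavior the bullets describe (strip common prefixes, lowercase, keep one work per DOI preferring higher citation counts and newer years); the regex and function bodies are illustrative, not the PR's doi_cleaner.py.

```python
# Illustrative sketch of the described DOI behavior; not the PR's implementation.
import re

_DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")  # rough structural check (assumption)


def normalize_doi(doi: str | None) -> str | None:
    """Strip common prefixes and lowercase so DOIs compare consistently."""
    if not doi:
        return None
    cleaned = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    return cleaned if _DOI_RE.match(cleaned) else None


def deduplicate_by_doi(works: list) -> list:
    """Keep one work per normalized DOI, preferring higher citations, then newer year."""
    best = {}
    no_doi = []
    for work in works:
        key = normalize_doi(work.doi)
        if key is None:
            no_doi.append(work)  # works without a usable DOI are preserved
            continue
        kept = best.get(key)
        if kept is None or (
            (work.cited_by_count or 0, work.publication_year or 0)
            > (kept.cited_by_count or 0, kept.publication_year or 0)
        ):
            best[key] = work
    return list(best.values()) + no_doi
```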

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help


Copilot AI left a comment


Pull request overview

This pull request implements a multi-source academic literature search MCP (Model Context Protocol) service with OpenAlex integration. The implementation provides a unified interface for searching academic literature with advanced filtering capabilities including author, institution, journal, year range, and work type filters. The system includes intelligent DOI-based deduplication, rate limiting, retry logic with exponential backoff, and comprehensive test coverage.

Changes:

  • Added literature search utilities with OpenAlex API client implementation
  • Implemented DOI normalization and deduplication logic
  • Created MCP server endpoints for literature search
  • Added comprehensive unit tests for all components
  • Included detailed documentation in README

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
service/app/utils/literature/base_client.py Abstract base class defining the client protocol
service/app/utils/literature/models.py Data models for search requests and literature works
service/app/utils/literature/doi_cleaner.py DOI normalization and deduplication utilities
service/app/utils/literature/openalex_client.py OpenAlex API client with rate limiting and retry logic
service/app/utils/literature/work_distributor.py Work distributor coordinating multiple data sources
service/app/utils/literature/__init__.py Package initialization and exports
service/app/mcp/literature.py MCP server implementation with search endpoints
service/app/utils/literature/README.md Comprehensive documentation with examples
service/tests/unit/test_utils/test_base_client.py Unit tests for base client and models
service/tests/unit/test_utils/test_doi_cleaner.py Unit tests for DOI utilities
service/tests/unit/test_utils/test_openalex_client.py Unit tests for OpenAlex client
service/tests/unit/test_utils/test_work_distributor.py Unit tests for work distributor
.gitignore Updated to exclude MCP documentation and VS folder


source="openalex",
)

# DataclassesObjects with same values should be equal

Copilot AI Jan 20, 2026


The comment "# DataclassesObjects with same values should be equal" contains a typo. It should be "Dataclass objects" (two words, not one).

Suggested change
# DataclassesObjects with same values should be equal
# Dataclass objects with same values should be equal

Comment on lines +55 to +59
now = asyncio.get_running_loop().time()
wait_time = self._last_request + self._min_interval - now
if wait_time > 0:
await asyncio.sleep(wait_time)
self._last_request = asyncio.get_running_loop().time()

Copilot AI Jan 20, 2026


The rate limiter's _throttle method uses asyncio.get_running_loop().time() twice (lines 55 and 59), which could result in a slight drift in timing. Consider storing the time after sleep in a single call to ensure accuracy.

Suggested change
now = asyncio.get_running_loop().time()
wait_time = self._last_request + self._min_interval - now
if wait_time > 0:
await asyncio.sleep(wait_time)
self._last_request = asyncio.get_running_loop().time()
loop = asyncio.get_running_loop()
now = loop.time()
wait_time = self._last_request + self._min_interval - now
if wait_time > 0:
await asyncio.sleep(wait_time)
now = loop.time()
self._last_request = now

else:
raise

raise Exception(f"Failed after {self.MAX_RETRIES} retries")

Copilot AI Jan 20, 2026


The loop variable attempt starts at 0, so the log messages show "attempt 1" for index 0, "attempt 2" for index 1, and so on. When the final attempt fails, the error message says "Failed after {MAX_RETRIES} retries", i.e. 5, but only 5 attempts were made in total (indices 0-4), which is 4 retries after the first attempt. Consider clarifying the terminology or adjusting the logic to be more accurate.

Suggested change
raise Exception(f"Failed after {self.MAX_RETRIES} retries")
raise Exception(f"Failed after {self.MAX_RETRIES} attempts")

Comment on lines +130 to +134
max_year = datetime.now().year + 1
year_warning = ""
if year_from_int is not None and year_from_int > max_year:
year_warning += f"year_from {year_from_int}→{max_year}. "
year_from_int = max_year

Copilot AI Jan 20, 2026


The year validation allows year_from values up to datetime.now().year + 1, which means future years are accepted. While this may be intentional for pre-prints or advance publications, it creates a potential issue where searches can include dates in the future. Consider whether this is the desired behavior or if validation should be stricter.

Comment on lines +206 to +208
except Exception as e:
logger.error(f"Literature search failed: {e}", exc_info=True)
return "❌ Unexpected error during search. Please retry or contact support."

Copilot AI Jan 20, 2026


The error handling catches generic Exception and returns a user-friendly message, but this could mask important errors. Consider logging the full exception details while still providing a user-friendly message, or being more specific about which exceptions to catch.
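
A hedged sketch of one way to narrow this, assuming the handler wraps the distributor call inside the search_literature tool and that stdlib logging and httpx are in use (consistent with the quoted code); the helper name and message wording are illustrative:

```python
# Sketch only: catch transport and validation errors explicitly, and keep a
# last-resort handler that still records the full traceback.
import logging

import httpx

logger = logging.getLogger(__name__)


async def run_search(distributor, request) -> dict | str:
    try:
        return await distributor.search(request)
    except httpx.HTTPError as exc:  # covers timeouts, transport and status errors
        logger.warning(f"Literature search network error: {exc}")
        return "❌ Network error while querying literature sources. Please retry."
    except ValueError as exc:  # bad or un-coercible parameters
        logger.warning(f"Invalid search parameters: {exc}")
        return f"❌ Invalid parameters: {exc}"
    except Exception:  # unexpected: keep the traceback in the logs
        logger.exception("Literature search failed unexpectedly")
        return "❌ Unexpected error during search. Please retry or contact support."
```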

Comment on lines +62 to +559
class OpenAlexClient(BaseLiteratureClient):
"""
OpenAlex API client

Implements best practices from official API guide for LLMs:
https://docs.openalex.org/api-guide-for-llms
"""

BASE_URL = "https://api.openalex.org"
MAX_PER_PAGE = 200
MAX_RETRIES = 5
TIMEOUT = 30.0

def __init__(self, email: str | None, rate_limit: int | None = None, timeout: float = 30.0) -> None:
"""
Initialize OpenAlex client

Args:
email: Email for polite pool (10x rate limit increase). If None, use default pool.
rate_limit: Requests per second (default: 10 with email, 1 without email)
timeout: Request timeout in seconds (default: 30.0)
"""
self.email = email
self.rate_limit = rate_limit or (10 if self.email else 1)
max_concurrency = 10 if self.email else 1
self.rate_limiter = _RateLimiter(rate_per_second=self.rate_limit, max_concurrency=max_concurrency)
self.client = httpx.AsyncClient(timeout=timeout)
pool_type = "polite" if self.email else "default"
logger.info(
f"OpenAlex client initialized with pool={pool_type}, email={self.email}, rate_limit={self.rate_limit}/s"
)

@property
def pool_type(self) -> str:
"""Return pool type string."""
return "polite" if self.email else "default"

async def search(self, request: SearchRequest) -> tuple[list[LiteratureWork], list[str]]:
"""
Execute search and return results in standard format

Implementation steps:
1. Convert author name -> author ID (if specified)
2. Convert institution name -> institution ID (if specified)
3. Convert journal name -> source ID (if specified)
4. Build filter query
5. Paginate through results
6. Transform to standard format

Args:
request: Standardized search request

Returns:
Tuple of (works, warnings)
- works: List of literature works in standard format
- warnings: List of warning/info messages for LLM feedback
"""
logger.info(
f"OpenAlex search [{self.pool_type} @ {self.rate_limit}/s]: query='{request.query}', max_results={request.max_results}"
)

warnings: list[str] = []

# Step 1-3: Resolve IDs for names (two-step lookup pattern)
author_id = None
if request.author:
author_id, _success, msg = await self._resolve_author_id(request.author)
warnings.append(msg)

institution_id = None
if request.institution:
institution_id, _success, msg = await self._resolve_institution_id(request.institution)
warnings.append(msg)

source_id = None
if request.source:
source_id, _success, msg = await self._resolve_source_id(request.source)
warnings.append(msg)

# Step 4: Build query parameters
params = self._build_query_params(request, author_id, institution_id, source_id)

# Step 5: Fetch all pages
works = await self._fetch_all_pages(params, request.max_results)

# Step 6: Transform to standard format
return [self._transform_work(w) for w in works], warnings

def _build_query_params(
self,
request: SearchRequest,
author_id: str | None,
institution_id: str | None,
source_id: str | None,
) -> dict[str, str]:
"""
Build OpenAlex query parameters

Args:
request: Search request
author_id: Resolved author ID (if any)
institution_id: Resolved institution ID (if any)
source_id: Resolved source ID (if any)

Returns:
Dictionary of query parameters
"""
params: dict[str, str] = {
"per-page": str(self.MAX_PER_PAGE),
}

if self.email:
params["mailto"] = self.email

# Search keywords
if request.query:
params["search"] = request.query

# Build filters
filters: list[str] = []

if author_id:
filters.append(f"authorships.author.id:{author_id}")

if institution_id:
filters.append(f"authorships.institutions.id:{institution_id}")

if source_id:
filters.append(f"primary_location.source.id:{source_id}")

# Year range
if request.year_from or request.year_to:
if request.year_from and request.year_to:
filters.append(f"publication_year:{request.year_from}-{request.year_to}")
elif request.year_from:
filters.append(f"publication_year:>{request.year_from - 1}")
elif request.year_to:
filters.append(f"publication_year:<{request.year_to + 1}")

# Open access filter
if request.is_oa is not None:
filters.append(f"is_oa:{str(request.is_oa).lower()}")

# Work type filter
if request.work_type:
filters.append(f"type:{request.work_type}")

# Language filter
if request.language:
filters.append(f"language:{request.language}")

# Retracted filter
if request.is_retracted is not None:
filters.append(f"is_retracted:{str(request.is_retracted).lower()}")

# Abstract filter
if request.has_abstract is not None:
filters.append(f"has_abstract:{str(request.has_abstract).lower()}")

# Fulltext filter
if request.has_fulltext is not None:
filters.append(f"has_fulltext:{str(request.has_fulltext).lower()}")

if filters:
params["filter"] = ",".join(filters)

# Sorting
sort_map = {
"relevance": None, # Default sorting by relevance
"cited_by_count": "cited_by_count:desc",
"publication_date": "publication_date:desc",
}
if sort := sort_map.get(request.sort_by):
params["sort"] = sort

return params

async def _resolve_author_id(self, author_name: str) -> tuple[str | None, bool, str]:
"""
Two-step lookup: author name -> author ID

Args:
author_name: Author name to search

Returns:
Tuple of (author_id, success, message)
- author_id: Author ID (e.g., "A5023888391") or None if not found
- success: Whether resolution was successful
- message: Status message for LLM feedback
"""
async with self.rate_limiter:
try:
url = f"{self.BASE_URL}/authors"
params: dict[str, str] = {"search": author_name}
if self.email:
params["mailto"] = self.email
response = await self._request_with_retry(url, params)

if results := response.get("results", []):
# Return first result's ID in short format
author_id = results[0]["id"].split("/")[-1]
author_display = results[0].get("display_name", author_name)
logger.info(f"Resolved author '{author_name}' -> {author_id}")
return author_id, True, f"✓ Author resolved: '{author_name}' -> '{author_display}'"
else:
msg = (
f"⚠️ Author '{author_name}' not found. "
f"Suggestions: (1) Try full name format like 'Smith, John' or 'John Smith', "
f"(2) Check spelling, (3) Try removing middle name/initial."
)
logger.warning(msg)
return None, False, msg
except Exception as e:
msg = f"⚠️ Failed to resolve author '{author_name}': {e}"
logger.warning(msg)
return None, False, msg

async def _resolve_institution_id(self, institution_name: str) -> tuple[str | None, bool, str]:
"""
Two-step lookup: institution name -> institution ID

Args:
institution_name: Institution name to search

Returns:
Tuple of (institution_id, success, message)
- institution_id: Institution ID (e.g., "I136199984") or None if not found
- success: Whether resolution was successful
- message: Status message for LLM feedback
"""
async with self.rate_limiter:
try:
url = f"{self.BASE_URL}/institutions"
params: dict[str, str] = {"search": institution_name}
if self.email:
params["mailto"] = self.email
response = await self._request_with_retry(url, params)

if results := response.get("results", []):
institution_id = results[0]["id"].split("/")[-1]
inst_display = results[0].get("display_name", institution_name)
logger.info(f"Resolved institution '{institution_name}' -> {institution_id}")
return institution_id, True, f"✓ Institution resolved: '{institution_name}' -> '{inst_display}'"
else:
msg = (
f"⚠️ Institution '{institution_name}' not found. "
f"Suggestions: (1) Use full official name (e.g., 'Harvard University' not 'Harvard'), "
f"(2) Try variations (e.g., 'MIT' vs 'Massachusetts Institute of Technology'), "
f"(3) Check spelling."
)
logger.warning(msg)
return None, False, msg
except Exception as e:
msg = f"⚠️ Failed to resolve institution '{institution_name}': {e}"
logger.warning(msg)
return None, False, msg

async def _resolve_source_id(self, source_name: str) -> tuple[str | None, bool, str]:
"""
Two-step lookup: source name -> source ID

Args:
source_name: Journal/conference name to search

Returns:
Tuple of (source_id, success, message)
- source_id: Source ID (e.g., "S137773608") or None if not found
- success: Whether resolution was successful
- message: Status message for LLM feedback
"""
async with self.rate_limiter:
try:
url = f"{self.BASE_URL}/sources"
params: dict[str, str] = {"search": source_name}
if self.email:
params["mailto"] = self.email
response = await self._request_with_retry(url, params)

if results := response.get("results", []):
source_id = results[0]["id"].split("/")[-1]
source_display = results[0].get("display_name", source_name)
logger.info(f"Resolved source '{source_name}' -> {source_id}")
return source_id, True, f"✓ Source resolved: '{source_name}' -> '{source_display}'"
else:
msg = (
f"⚠️ Source/Journal '{source_name}' not found. "
f"Suggestions: (1) Use full journal name (e.g., 'Nature' or 'Science'), "
f"(2) Try alternative names (e.g., 'JAMA' vs 'Journal of the American Medical Association'), "
f"(3) Check spelling."
)
logger.warning(msg)
return None, False, msg
except Exception as e:
msg = f"⚠️ Failed to resolve source '{source_name}': {e}"
logger.warning(msg)
return None, False, msg

async def _fetch_all_pages(self, params: dict[str, str], max_results: int) -> list[dict[str, Any]]:
"""
Paginate through all results up to max_results

Args:
params: Base query parameters
max_results: Maximum number of results to fetch

Returns:
List of work objects from API
"""
all_works: list[dict[str, Any]] = []
page = 1

while len(all_works) < max_results:
async with self.rate_limiter:
try:
url = f"{self.BASE_URL}/works"
page_params = {**params, "page": str(page)}
response = await self._request_with_retry(url, page_params)

works = response.get("results", [])
if not works:
break

all_works.extend(works)
logger.info(f"Fetched page {page}: {len(works)} works")

# Check if there are more pages
meta = response.get("meta", {})
total_count = meta.get("count", 0)
if len(all_works) >= total_count:
break

page += 1

except Exception as e:
logger.error(f"Error fetching page {page}: {e}")
break

return all_works[:max_results]

async def _request_with_retry(self, url: str, params: dict[str, str]) -> dict[str, Any]:
"""
HTTP request with exponential backoff retry

Implements best practices:
- Retry on 403 (rate limit) with exponential backoff
- Retry on 5xx (server error) with exponential backoff
- Don't retry on 4xx (except 403)
- Retry on timeout

Args:
url: Request URL
params: Query parameters

Returns:
JSON response

Raises:
Exception: If all retries fail
"""
for attempt in range(self.MAX_RETRIES):
try:
response = await self.client.get(url, params=params)

if response.status_code == 200:
return response.json()
elif response.status_code == 403:
# Rate limited
wait_time = 2**attempt
logger.warning(f"Rate limited (403), waiting {wait_time}s... (attempt {attempt + 1})")
await asyncio.sleep(wait_time)
elif response.status_code >= 500:
# Server error
wait_time = 2**attempt
logger.warning(
f"Server error ({response.status_code}), waiting {wait_time}s... (attempt {attempt + 1})"
)
await asyncio.sleep(wait_time)
else:
# Other error, don't retry
response.raise_for_status()

except httpx.TimeoutException:
if attempt < self.MAX_RETRIES - 1:
wait_time = 2**attempt
logger.warning(f"Timeout, retrying in {wait_time}s... (attempt {attempt + 1})")
await asyncio.sleep(wait_time)
else:
raise
except Exception as e:
logger.error(f"Request failed: {e}")
if attempt < self.MAX_RETRIES - 1:
wait_time = 2**attempt
await asyncio.sleep(wait_time)
else:
raise

raise Exception(f"Failed after {self.MAX_RETRIES} retries")

def _transform_work(self, work: dict[str, Any]) -> LiteratureWork:
"""
Transform OpenAlex work data to standard format

Args:
work: Raw work object from OpenAlex API

Returns:
Standardized LiteratureWork object
"""
# Extract authors
authors: list[dict[str, str | None]] = []
for authorship in work.get("authorships", []):
author = authorship.get("author", {})
authors.append(
{
"name": author.get("display_name", "Unknown"),
"id": author.get("id", "").split("/")[-1] if author.get("id") else None,
}
)

# Extract journal/source
journal = None
if primary_location := work.get("primary_location"):
if source := primary_location.get("source"):
journal = source.get("display_name")

# Extract open access info
oa_info = work.get("open_access", {})
is_oa = oa_info.get("is_oa", False)
oa_url = oa_info.get("oa_url")

# Extract abstract (reconstruct from inverted index)
abstract = self._reconstruct_abstract(work.get("abstract_inverted_index"))

# Extract DOI (remove prefix)
doi = None
if doi_raw := work.get("doi"):
doi = doi_raw.replace("https://doi.org/", "")

return LiteratureWork(
id=work["id"].split("/")[-1],
doi=doi,
title=work.get("title", "Untitled"),
authors=authors,
publication_year=work.get("publication_year"),
cited_by_count=work.get("cited_by_count", 0),
abstract=abstract,
journal=journal,
is_oa=is_oa,
oa_url=oa_url,
source="openalex",
raw_data=work,
)

def _reconstruct_abstract(self, inverted_index: dict[str, list[int]] | None) -> str | None:
"""
Reconstruct abstract from inverted index

OpenAlex stores abstracts as inverted index for efficiency.
Format: {"word": [position1, position2, ...], ...}

Args:
inverted_index: Inverted index from OpenAlex

Returns:
Reconstructed abstract text or None

Examples:
>>> index = {"Hello": [0], "world": [1], "!": [2]}
>>> _reconstruct_abstract(index)
"Hello world !"
"""
if not inverted_index:
return None

# Expand inverted index to (position, word) pairs
word_positions: list[tuple[int, str]] = []
for word, positions in inverted_index.items():
for pos in positions:
word_positions.append((pos, word))

# Sort by position and join
word_positions.sort()
return " ".join(word for _, word in word_positions)

async def close(self) -> None:
"""Close the HTTP client"""
await self.client.aclose()

async def __aenter__(self) -> "OpenAlexClient":
return self

async def __aexit__(
self,
exc_type: type[BaseException] | None,
exc: BaseException | None,
tb: Any | None,
) -> None:
await self.close()

Copilot AI Jan 20, 2026


The OpenAlexClient class is missing implementations for get_by_doi and get_by_id methods that are tested in the test file. These methods need to be implemented to fulfill the BaseLiteratureClient protocol.
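
If the team decides to implement them, the following is a hypothetical sketch of methods that could be added to OpenAlexClient; it assumes OpenAlex's documented single-work lookups (`/works/{openalex_id}` and `/works/doi:{doi}`) and reuses the `_request_with_retry`, `rate_limiter`, and `_transform_work` members shown in the diff, but it is not code from this PR.

```python
# Hypothetical additions to OpenAlexClient (not in the PR); error handling is illustrative.
async def get_by_doi(self, doi: str) -> LiteratureWork | None:
    """Look up a single work by DOI via OpenAlex's external-ID endpoint."""
    params: dict[str, str] = {"mailto": self.email} if self.email else {}
    async with self.rate_limiter:
        try:
            data = await self._request_with_retry(f"{self.BASE_URL}/works/doi:{doi}", params)
        except Exception:
            return None
    return self._transform_work(data)


async def get_by_id(self, work_id: str) -> LiteratureWork | None:
    """Look up a single work by its OpenAlex ID (e.g. 'W2741809807')."""
    params: dict[str, str] = {"mailto": self.email} if self.email else {}
    async with self.rate_limiter:
        try:
            data = await self._request_with_retry(f"{self.BASE_URL}/works/{work_id}", params)
        except Exception:
            return None
    return self._transform_work(data)
```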

if request.year_from and request.year_to:
filters.append(f"publication_year:{request.year_from}-{request.year_to}")
elif request.year_from:
filters.append(f"publication_year:>{request.year_from - 1}")

Copilot AI Jan 20, 2026


In the year_from condition on line 197, when only year_from is specified, the filter uses >{year_from - 1} which would include the year_from itself. However, this creates an off-by-one risk if the OpenAlex API interprets this as "strictly greater than". Consider using >={year_from} for clarity and correctness, or verify the API documentation confirms that >{year_from - 1} is the intended approach.

Suggested change
filters.append(f"publication_year:>{request.year_from - 1}")
filters.append(f"publication_year:>={request.year_from}")

elif request.year_from:
filters.append(f"publication_year:>{request.year_from - 1}")
elif request.year_to:
filters.append(f"publication_year:<{request.year_to + 1}")

Copilot AI Jan 20, 2026


Similarly, in the year_to condition on line 199, the filter uses <{year_to + 1} which may be confusing. Consider using <={year_to} for clarity if the API supports it.

Suggested change
filters.append(f"publication_year:<{request.year_to + 1}")
filters.append(f"publication_year:<={request.year_to}")

Comment on lines +135 to +137
if year_to_int is not None and year_to_int < 1700:
year_warning += f"year_to {year_to_int}→1700. "
year_to_int = 1700

Copilot AI Jan 20, 2026


The year_to validation sets a minimum of 1700, but this magic number should be documented or made a constant. Additionally, consider whether 1700 is the appropriate minimum year for academic literature searches.
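
One possible shape for this, as a hedged sketch (the constant names and the helper are illustrative, not from the PR):

```python
from datetime import datetime

# Illustrative constants and helper; names are not from the PR.
MIN_PUBLICATION_YEAR = 1700  # earliest year the tool accepts in year filters
FUTURE_YEAR_SLACK = 1        # allow current year + 1 for in-press / early-access works


def clamp_year(year: int) -> tuple[int, str]:
    """Clamp a requested year into [MIN_PUBLICATION_YEAR, current year + slack].

    Returns the clamped year and a warning fragment ("" if unchanged).
    """
    max_year = datetime.now().year + FUTURE_YEAR_SLACK
    if year < MIN_PUBLICATION_YEAR:
        return MIN_PUBLICATION_YEAR, f"{year}→{MIN_PUBLICATION_YEAR}. "
    if year > max_year:
        return max_year, f"{year}→{max_year}. "
    return year, ""
```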

Comment on lines +195 to +196
result.setdefault("warnings", []).append(f"⚠️ Year adjusted: {year_warning.strip()}")


Copilot AI Jan 20, 2026


The warning message formats year adjustments but doesn't indicate which parameter was adjusted when both year_from and year_to are adjusted. Consider making the warning message clearer about which specific parameter was changed.

Suggested change
result.setdefault("warnings", []).append(f"⚠️ Year adjusted: {year_warning.strip()}")
year_adjustments: list[str] = []
if year_from is not None and year_from_int is not None and str(year_from) != str(year_from_int):
year_adjustments.append(f"year_from adjusted from {year_from} to {year_from_int}")
if year_to is not None and year_to_int is not None and str(year_to) != str(year_to_int):
year_adjustments.append(f"year_to adjusted from {year_to} to {year_to_int}")
if year_adjustments:
warning_message = f"⚠️ Year adjusted: {'; '.join(year_adjustments)}"
else:
warning_message = f"⚠️ Year adjusted: {year_warning.strip()}"
result.setdefault("warnings", []).append(warning_message)


@sourcery-ai sourcery-ai bot left a comment


Hey - I found 5 issues and left some high-level feedback:

  • In OpenAlexClient._request_with_retry, 4xx responses other than 403 go through response.raise_for_status() and are then caught by the generic except Exception, so they end up being retried, even though the comment says 4xx (other than 403) should not be retried; consider handling httpx.HTTPStatusError explicitly for non-retryable status codes so they fail fast.
  • The tests and ConcreteClient assume every literature client exposes a common get_by_doi/get_by_id interface, but BaseLiteratureClient only declares search as abstract and OpenAlexClient does not implement these methods; a cleaner approach is either to declare them as abstract methods on BaseLiteratureClient and implement them in OpenAlexClient, or to adjust the tests and helpers to rely only on the shared interface.
Prompt for AI Agents
Please apply the changes from the review comments below:

## Overall Comments
- In `OpenAlexClient._request_with_retry`, 4xx responses other than 403 go through `response.raise_for_status()` and are then caught by the generic `except Exception`, so they end up being retried, even though the comment says 4xx (other than 403) should not be retried; consider handling `httpx.HTTPStatusError` explicitly for non-retryable status codes so they fail fast.
- The tests and `ConcreteClient` assume every literature client exposes a common `get_by_doi`/`get_by_id` interface, but `BaseLiteratureClient` only declares `search` as abstract and `OpenAlexClient` does not implement these methods; a cleaner approach is either to declare them as abstract methods on `BaseLiteratureClient` and implement them in `OpenAlexClient`, or to adjust the tests and helpers to rely only on the shared interface.

## Individual Comments

### Comment 1
<location> `service/app/utils/literature/openalex_client.py:266-275` </location>
<code_context>
+    sections.append("## Complete Works List (JSON)\n")
+    if include_abstract:
+        sections.append("The following JSON contains all works with full abstracts:\n")
+    else:
+        sections.append("The following JSON contains all works (abstracts excluded to save tokens):\n")
+    sections.append("```json")
</code_context>

<issue_to_address>
**issue (bug_risk):** 4xx errors other than 403 are retried, contradicting the documented behavior

In `_request_with_retry`, a non-403 4xx response triggers `response.raise_for_status()`, which raises `httpx.HTTPStatusError`; that exception is then caught by the broad `except Exception`, so the request keeps being retried while `attempt < self.MAX_RETRIES - 1`. This contradicts the intended behavior of not retrying on 4xx (other than 403) and wastes quota and adds latency. Consider handling `httpx.HTTPStatusError` explicitly, re-raising immediately for 4xx (other than 403), and applying backoff retries only to 5xx, 403, and timeouts.
</issue_to_address>
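
A hedged sketch of the fail-fast behavior described above; the status-code branching mirrors the quoted `_request_with_retry`, and only the explicit `httpx.HTTPStatusError` handling is new and illustrative:

```python
# Sketch only: re-raise non-retryable 4xx immediately, keep backoff for 403/5xx/timeouts.
for attempt in range(self.MAX_RETRIES):
    try:
        response = await self.client.get(url, params=params)
        if response.status_code == 200:
            return response.json()
        if response.status_code == 403 or response.status_code >= 500:
            await asyncio.sleep(2**attempt)  # retryable: rate limit or server error
            continue
        response.raise_for_status()  # any other 4xx raises HTTPStatusError
    except httpx.HTTPStatusError:
        raise  # fail fast: no retry on 4xx other than 403
    except httpx.TimeoutException:
        if attempt == self.MAX_RETRIES - 1:
            raise
        await asyncio.sleep(2**attempt)
raise Exception(f"Failed after {self.MAX_RETRIES} attempts")
```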

### Comment 2
<location> `service/app/utils/literature/openalex_client.py:377-378` </location>
<code_context>
+            async with self.rate_limiter:
+                try:
+                    url = f"{self.BASE_URL}/works"
+                    page_params = {**params, "page": str(page)}
+                    response = await self._request_with_retry(url, page_params)
+
+                    works = response.get("results", [])
</code_context>

<issue_to_address>
**suggestion (performance):** Page size is always MAX_PER_PAGE even when `max_results` is small, causing unnecessary data transfer

In `_fetch_all_pages`, every request uses `per-page=MAX_PER_PAGE` (200), so even when `max_results` is small (e.g., 10) you fetch 200 records and then slice. To avoid unnecessary bandwidth and latency, consider setting `per-page` to `min(self.MAX_PER_PAGE, max_results - len(all_works))` on each request so the last (or only) page fetches just the amount of data needed.

Suggested implementation:

```python
        all_works: list[dict[str, Any]] = []
        page = 1

        while len(all_works) < max_results:
            remaining_results = max_results - len(all_works)
            per_page = min(self.MAX_PER_PAGE, remaining_results)

            async with self.rate_limiter:

```

```python
                    url = f"{self.BASE_URL}/works"
                    page_params = {
                        **params,
                        "page": str(page),
                        "per-page": str(per_page),
                    }

```

If `params` is built earlier in the method (or by the caller) with a fixed `"per-page"` value, make sure that either:
1. the fixed `"per-page"` is removed from the base `params` so the override here is the only source, or
2. you genuinely want the per-request `"per-page"` here to dynamically override the base value (the code above assumes that overriding is acceptable).
</issue_to_address>

### Comment 3
<location> `service/tests/unit/test_utils/test_openalex_client.py:338-352` </location>
<code_context>
+        assert params["per-page"] == "200"
+        assert params["mailto"] == "test@example.com"
+
+    def test_build_query_params_with_filters(self, client: OpenAlexClient) -> None:
+        """Test building query parameters with filters."""
+        request = SearchRequest(
+            query="machine learning",
+            year_from=2015,
+            year_to=2021,
+            is_oa=True,
+            work_type="journal-article",
+        )
+        params = client._build_query_params(request, None, None, None)
+
+        assert "filter" in params
+        assert "publication_year:2015-2021" in params["filter"]
+        assert "is_oa:true" in params["filter"]
+        assert "type:journal-article" in params["filter"]
+
+    def test_build_query_params_with_resolved_ids(self, client: OpenAlexClient) -> None:
</code_context>

<issue_to_address>
**suggestion (testing):** Add test coverage for the language, retraction, abstract, and fulltext filters in `_build_query_params`

`_build_query_params` already handles `language`, `is_retracted`, `has_abstract`, and `has_fulltext`, but the tests currently only cover `publication_year`, `is_oa`, and `work_type`. Please add or extend a test that sets these four fields on `SearchRequest` and asserts that the `filter` string contains `language:<code>`, `is_retracted:<bool>`, `has_abstract:<bool>`, and `has_fulltext:<bool>` to guard against regressions.

```suggestion
    def test_build_query_params_with_filters(self, client: OpenAlexClient) -> None:
        """Test building query parameters with filters."""
        request = SearchRequest(
            query="machine learning",
            year_from=2015,
            year_to=2021,
            is_oa=True,
            work_type="journal-article",
            language="en",
            is_retracted=False,
            has_abstract=True,
            has_fulltext=False,
        )
        params = client._build_query_params(request, None, None, None)

        assert "filter" in params
        assert "publication_year:2015-2021" in params["filter"]
        assert "is_oa:true" in params["filter"]
        assert "type:journal-article" in params["filter"]
        assert "language:en" in params["filter"]
        assert "is_retracted:false" in params["filter"]
        assert "has_abstract:true" in params["filter"]
        assert "has_fulltext:false" in params["filter"]
```
</issue_to_address>

### Comment 4
<location> `service/tests/unit/test_utils/test_openalex_client.py:49-58` </location>
<code_context>
+class TestOpenAlexClientSearch:
</code_context>

<issue_to_address>
**suggestion (testing):** Consider adding tests for the pagination behavior and termination conditions of `_fetch_all_pages`

The current tests mock `_request_with_retry` to return a single page of results, so the pagination logic in `_fetch_all_pages` is not really covered. Please add tests that:
- mock multi-page results and assert the total number of works returned;
- verify the early-termination path when the accumulated results reach `meta.count`;
- cover the error path where the loop exits when `_request_with_retry` raises.
These tests can target `_fetch_all_pages` directly (by mocking `_request_with_retry`) or go through `search` with suitable mocks.

Suggested implementation:

```python
class TestOpenAlexClientSearch:
    """Test OpenAlex search functionality."""

```

Add the following tests inside `class TestOpenAlexClientSearch`, after the existing fixtures and the `search`-related tests:

```python
    def test_fetch_all_pages_multiple_pages(self, client, mocker) -> None:
        """_fetch_all_pages returns combined results from multiple pages."""
        # Two pages of 2 results each
        page1 = {
            "results": [{"id": "W1"}, {"id": "W2"}],
            "meta": {"count": 4, "page": 1, "per_page": 2},
        }
        page2 = {
            "results": [{"id": "W3"}, {"id": "W4"}],
            "meta": {"count": 4, "page": 2, "per_page": 2},
        }

        mock_request = mocker.patch.object(
            client,
            "_request_with_retry",
            side_effect=[page1, page2],
        )

        works = list(
            client._fetch_all_pages(
                endpoint="works",
                params={"search": "test"},
                per_page=2,
                max_results=10,
            )
        )

        # All 4 works from 2 pages should be returned
        assert [w["id"] for w in works] == ["W1", "W2", "W3", "W4"]
        assert mock_request.call_count == 2

    def test_fetch_all_pages_terminates_at_meta_count(self, client, mocker) -> None:
        """_fetch_all_pages stops when accumulated results reach meta.count."""
        # meta.count indicates only 3 items even if pages could return more
        page1 = {
            "results": [{"id": "W1"}, {"id": "W2"}],
            "meta": {"count": 3, "page": 1, "per_page": 2},
        }
        page2 = {
            "results": [{"id": "W3"}, {"id": "W4"}],
            "meta": {"count": 3, "page": 2, "per_page": 2},
        }

        mock_request = mocker.patch.object(
            client,
            "_request_with_retry",
            side_effect=[page1, page2],
        )

        works = list(
            client._fetch_all_pages(
                endpoint="works",
                params={"search": "test"},
                per_page=2,
                max_results=10,
            )
        )

        # Only first 3 results should be yielded even though second page has 2
        assert [w["id"] for w in works] == ["W1", "W2", "W3"]
        assert mock_request.call_count == 2

    def test_fetch_all_pages_stops_on_request_error(self, client, mocker) -> None:
        """_fetch_all_pages exits when a request error occurs."""
        page1 = {
            "results": [{"id": "W1"}],
            "meta": {"count": 10, "page": 1, "per_page": 1},
        }

        def side_effect(*args, **kwargs):
            # First call returns a valid page, second call raises
            if side_effect.call_count == 0:
                side_effect.call_count += 1
                return page1
            raise RuntimeError("API error")

        side_effect.call_count = 0

        mock_request = mocker.patch.object(
            client,
            "_request_with_retry",
            side_effect=side_effect,
        )

        # If _fetch_all_pages propagates the error, this asserts the error path.
        # If instead it swallows the error and returns partial results, change this
        # to assert the returned list and that call_count == 2.
        with pytest.raises(RuntimeError):
            list(
                client._fetch_all_pages(
                    endpoint="works",
                    params={"search": "test"},
                    per_page=1,
                    max_results=10,
                )
            )

        assert mock_request.call_count == 2
```

A few points to check against the existing implementation:

1. **Function signature**: these tests assume `_fetch_all_pages` has a signature similar to:
   ```python
   def _fetch_all_pages(self, endpoint: str, params: dict, per_page: int, max_results: int | None = None) -> Iterable[dict]:
       ...
   ```
   If the actual signature differs (e.g., parameter names or positions), adjust the calls accordingly.

2. **Error-handling semantics**:
   - If `_fetch_all_pages` swallows exceptions raised by `_request_with_retry` and returns partial results, replace `pytest.raises` with an assertion on the partial `works` list (e.g., containing only `["W1"]`) and keep the `call_count` assertion;
   - If it always propagates the exception immediately without making a second call, change the expected `call_count` to `1`.
3. **Test tooling**:
   - These tests use the `mocker` fixture from `pytest-mock`, which is commonly used with pytest. If the project uses `unittest.mock` directly, replace `mocker.patch.object(...)` with `with mock.patch.object(...)` or an equivalent.

Once integrated against the real behavior and signature of `_fetch_all_pages`, these tests will:
- cover multi-page result aggregation;
- verify early termination when `meta.count` is reached;
- cover the error path when `_request_with_retry` raises.
</issue_to_address>

### Comment 5
<location> `service/tests/unit/test_utils/test_openalex_client.py:484-493` </location>
<code_context>
+class TestOpenAlexClientRequestWithRetry:
</code_context>

<issue_to_address>
**suggestion (testing):** Extend the `_request_with_retry` tests to cover non-retryable 4xx responses and retry exhaustion

Please also add the following tests:
- for non-403 4xx responses (e.g., 400/404), assert that `response.raise_for_status()` is called and that no retry happens;
- for retryable errors (e.g., consecutive 500s or timeouts), retry until `MAX_RETRIES` is exhausted and assert that `client.get` is called exactly `MAX_RETRIES` times before the method raises.
These tests pin down the expected retry semantics and guard against future regressions.

Suggested implementation:

```python
class TestOpenAlexClientRequestWithRetry:
    """Test OpenAlex client request retry logic."""

    @pytest.fixture
    def client(self) -> OpenAlexClient:
        """Create an OpenAlex client for testing."""
        return OpenAlexClient(email="test@example.com")

    @pytest.mark.asyncio
    async def test_request_with_retry_success(
        self,
        client: OpenAlexClient,
        monkeypatch: pytest.MonkeyPatch,
    ) -> None:
        """Test successful request without retry."""
        calls: list[None] = []

        async def mock_get(url: str, params: dict | None = None, timeout: float | None = None):
            calls.append(None)

            class MockResponse:
                status_code = 200

                def raise_for_status(self) -> None:
                    return None

            return MockResponse()

        # Patch underlying httpx.AsyncClient.get
        monkeypatch.setattr(client._client, "get", mock_get)

        response = await client._request_with_retry("/works")

        assert response.status_code == 200
        # Ensure we don't retry on success
        assert len(calls) == 1

    @pytest.mark.asyncio
    async def test_request_with_retry_non_retryable_4xx(
        self,
        client: OpenAlexClient,
        monkeypatch: pytest.MonkeyPatch,
    ) -> None:
        """
        Non-403 4xx responses should not be retried and should raise immediately
        via response.raise_for_status().
        """
        calls: list[None] = []

        async def mock_get(url: str, params: dict | None = None, timeout: float | None = None):
            calls.append(None)
            request = httpx.Request("GET", "https://api.openalex.org/works")
            response = httpx.Response(400, request=request)

            def raise_for_status() -> None:
                raise httpx.HTTPStatusError(
                    "Bad Request",
                    request=request,
                    response=response,
                )

            # Attach the custom raise_for_status to the response instance
            response.raise_for_status = raise_for_status  # type: ignore[assignment]
            return response

        monkeypatch.setattr(client._client, "get", mock_get)

        with pytest.raises(httpx.HTTPStatusError):
            await client._request_with_retry("/works")

        # Ensure a non-retriable 4xx results in no retries
        assert len(calls) == 1

    @pytest.mark.asyncio
    async def test_request_with_retry_exhausts_retries_on_5xx(
        self,
        client: OpenAlexClient,
        monkeypatch: pytest.MonkeyPatch,
    ) -> None:
        """
        Retryable errors (e.g., 5xx) should be retried up to MAX_RETRIES
        and then the method should raise.
        """
        max_retries = getattr(client, "MAX_RETRIES", 3)
        call_count = 0

        async def mock_get(url: str, params: dict | None = None, timeout: float | None = None):
            nonlocal call_count
            call_count += 1
            request = httpx.Request("GET", "https://api.openalex.org/works")
            response = httpx.Response(500, request=request)

            def raise_for_status() -> None:
                raise httpx.HTTPStatusError(
                    "Server error",
                    request=request,
                    response=response,
                )

            response.raise_for_status = raise_for_status  # type: ignore[assignment]
            return response

        monkeypatch.setattr(client._client, "get", mock_get)

        with pytest.raises(httpx.HTTPStatusError):
            await client._request_with_retry("/works")

        # Should attempt exactly MAX_RETRIES times before giving up
        assert call_count == max_retries

```

1. Ensure `httpx` is imported at the top of `service/tests/unit/test_utils/test_openalex_client.py`, for example: `import httpx`.
2. If `OpenAlexClient` exposes the retry count under an attribute or constant name other than `MAX_RETRIES` (e.g., `_MAX_RETRIES` or `RETRY_ATTEMPTS`), update `getattr(client, "MAX_RETRIES", 3)` and the related assertions to use the correct attribute so the tests match the implementation.
3. The signature of `_request_with_retry` is assumed to accept a path like `"/works"`. If the actual API differs (e.g., it expects a full URL or additional required parameters), adapt the calls in the tests accordingly.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve future reviews.
Original comment in English

Hey - I've found 5 issues, and left some high level feedback:

  • In OpenAlexClient._request_with_retry, 4xx responses other than 403 end up going through response.raise_for_status() and are then caught by the generic except Exception block, causing retries despite the docstring saying 4xx (except 403) should not be retried—consider handling httpx.HTTPStatusError for non-retryable status codes explicitly so they fail fast.
  • The tests and ConcreteClient assume a common get_by_doi/get_by_id interface on literature clients, but BaseLiteratureClient only defines search as abstract and OpenAlexClient does not implement those methods; it would be clearer to either add these as abstract methods on BaseLiteratureClient and implement them in OpenAlexClient, or adjust tests and helpers to rely only on the shared interface (a minimal sketch of the first option follows this list).
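
Below is a minimal sketch of that first option, assuming the `BaseLiteratureClient`, `SearchRequest`, and `Work` names used elsewhere in this PR; the exact module layout and method signatures in `service/app/utils/literature` may differ.

```python
# Hypothetical sketch: promote get_by_doi/get_by_id to the shared abstract interface
# so tests and helpers can rely on one protocol. Names follow the review text.
from abc import ABC, abstractmethod


class BaseLiteratureClient(ABC):
    """Shared interface implemented by every literature provider client."""

    @abstractmethod
    async def search(self, request: "SearchRequest") -> list["Work"]:
        """Run a provider search and return normalized works."""

    @abstractmethod
    async def get_by_doi(self, doi: str) -> "Work | None":
        """Fetch a single work by normalized DOI, or None if not found."""

    @abstractmethod
    async def get_by_id(self, work_id: str) -> "Work | None":
        """Fetch a single work by the provider's native identifier."""
```

With this in place, `OpenAlexClient` (and any future provider) would implement all three methods, and the `ConcreteClient` used in the tests could simply subclass the shared base.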
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `OpenAlexClient._request_with_retry`, 4xx responses other than 403 end up going through `response.raise_for_status()` and are then caught by the generic `except Exception` block, causing retries despite the docstring saying 4xx (except 403) should not be retried—consider handling `httpx.HTTPStatusError` for non-retryable status codes explicitly so they fail fast.
- The tests and `ConcreteClient` assume a common `get_by_doi`/`get_by_id` interface on literature clients, but `BaseLiteratureClient` only defines `search` as abstract and `OpenAlexClient` does not implement those methods; it would be clearer to either add these as abstract methods on `BaseLiteratureClient` and implement them in `OpenAlexClient`, or adjust tests and helpers to rely only on the shared interface.

## Individual Comments

### Comment 1
<location> `service/app/utils/literature/openalex_client.py:266-275` </location>
<code_context>
+    sections.append("## Complete Works List (JSON)\n")
+    if include_abstract:
+        sections.append("The following JSON contains all works with full abstracts:\n")
+    else:
+        sections.append("The following JSON contains all works (abstracts excluded to save tokens):\n")
+    sections.append("```json")
</code_context>

<issue_to_address>
**issue (bug_risk):** 4xx errors (other than 403) are retried despite the comment saying they should not be

In `_request_with_retry`, non-403 4xx responses trigger `response.raise_for_status()`, which raises `httpx.HTTPStatusError` and is then caught by the broad `except Exception`, causing retries until `attempt < self.MAX_RETRIES - 1`. This contradicts the "Don't retry on 4xx (except 403)" behavior and can waste quota / add latency. Consider handling `httpx.HTTPStatusError` explicitly and re-raising immediately for 4xx (except 403), only applying backoff for 5xx/403/timeouts.
</issue_to_address>

### Comment 2
<location> `service/app/utils/literature/openalex_client.py:377-378` </location>
<code_context>
+            async with self.rate_limiter:
+                try:
+                    url = f"{self.BASE_URL}/works"
+                    page_params = {**params, "page": str(page)}
+                    response = await self._request_with_retry(url, page_params)
+
+                    works = response.get("results", [])
</code_context>

<issue_to_address>
**suggestion (performance):** Page size is always MAX_PER_PAGE even when max_results is small, causing unnecessary data transfer

In `_fetch_all_pages`, each request always uses `per-page=MAX_PER_PAGE` (200), so even when `max_results` is small (e.g., 10) you fetch 200 items and slice later. To avoid unnecessary bandwidth and latency, adjust `per-page` for each request to `min(self.MAX_PER_PAGE, max_results - len(all_works))`, so the last (or only) page only fetches what’s needed.

Suggested implementation:

```python
        all_works: list[dict[str, Any]] = []
        page = 1

        while len(all_works) < max_results:
            remaining_results = max_results - len(all_works)
            per_page = min(self.MAX_PER_PAGE, remaining_results)

            async with self.rate_limiter:

```

```python
                    url = f"{self.BASE_URL}/works"
                    page_params = {
                        **params,
                        "page": str(page),
                        "per-page": str(per_page),
                    }

```

If `params` is constructed earlier in this method (or by the caller) with a fixed `"per-page"` value, ensure that:
1. Either that fixed `"per-page"` is removed from the base `params`, relying on this override.
2. Or you are comfortable with this dynamic `"per-page"` in `page_params` intentionally overriding the base value on a per-request basis (the current code above assumes overriding is acceptable).
</issue_to_address>

### Comment 3
<location> `service/tests/unit/test_utils/test_openalex_client.py:338-352` </location>
<code_context>
+        assert params["per-page"] == "200"
+        assert params["mailto"] == "test@example.com"
+
+    def test_build_query_params_with_filters(self, client: OpenAlexClient) -> None:
+        """Test building query parameters with filters."""
+        request = SearchRequest(
+            query="machine learning",
+            year_from=2015,
+            year_to=2021,
+            is_oa=True,
+            work_type="journal-article",
+        )
+        params = client._build_query_params(request, None, None, None)
+
+        assert "filter" in params
+        assert "publication_year:2015-2021" in params["filter"]
+        assert "is_oa:true" in params["filter"]
+        assert "type:journal-article" in params["filter"]
+
+    def test_build_query_params_with_resolved_ids(self, client: OpenAlexClient) -> None:
</code_context>

<issue_to_address>
**suggestion (testing):** Add coverage for language, retraction, abstract, and fulltext filters in `_build_query_params`.

`_build_query_params` already handles `language`, `is_retracted`, `has_abstract`, and `has_fulltext`, but the tests only cover `publication_year`, `is_oa`, and `work_type`. Please add or extend a test that sets all four of these fields on `SearchRequest` and asserts that `language:<code>`, `is_retracted:<bool>`, `has_abstract:<bool>`, and `has_fulltext:<bool>` appear in the `filter` string to guard against regressions.

```suggestion
    def test_build_query_params_with_filters(self, client: OpenAlexClient) -> None:
        """Test building query parameters with filters."""
        request = SearchRequest(
            query="machine learning",
            year_from=2015,
            year_to=2021,
            is_oa=True,
            work_type="journal-article",
            language="en",
            is_retracted=False,
            has_abstract=True,
            has_fulltext=False,
        )
        params = client._build_query_params(request, None, None, None)

        assert "filter" in params
        assert "publication_year:2015-2021" in params["filter"]
        assert "is_oa:true" in params["filter"]
        assert "type:journal-article" in params["filter"]
        assert "language:en" in params["filter"]
        assert "is_retracted:false" in params["filter"]
        assert "has_abstract:true" in params["filter"]
        assert "has_fulltext:false" in params["filter"]
```
</issue_to_address>

### Comment 4
<location> `service/tests/unit/test_utils/test_openalex_client.py:49-58` </location>
<code_context>
+class TestOpenAlexClientSearch:
</code_context>

<issue_to_address>
**suggestion (testing):** Consider adding tests for `_fetch_all_pages` pagination behavior and termination conditions.

Current tests mock `_request_with_retry` to return a single page, so `_fetch_all_pages`’s pagination logic isn’t exercised. Please add tests that:
- simulate multiple pages and assert the total number of works returned
- verify the early-termination path when accumulated results reach `meta.count`
- cover the error path where `_request_with_retry` raises and the loop exits
These can target `_fetch_all_pages` directly (with a mocked `_request_with_retry`) or go through `search` with an appropriate mock setup.

Suggested implementation:

```python
class TestOpenAlexClientSearch:
    """Test OpenAlex search functionality."""

```

Please add the following tests inside `class TestOpenAlexClientSearch`, after the existing fixtures and existing `search`-related tests:

```python
    def test_fetch_all_pages_multiple_pages(self, client, mocker) -> None:
        """_fetch_all_pages returns combined results from multiple pages."""
        # Two pages of 2 results each
        page1 = {
            "results": [{"id": "W1"}, {"id": "W2"}],
            "meta": {"count": 4, "page": 1, "per_page": 2},
        }
        page2 = {
            "results": [{"id": "W3"}, {"id": "W4"}],
            "meta": {"count": 4, "page": 2, "per_page": 2},
        }

        mock_request = mocker.patch.object(
            client,
            "_request_with_retry",
            side_effect=[page1, page2],
        )

        works = list(
            client._fetch_all_pages(
                endpoint="works",
                params={"search": "test"},
                per_page=2,
                max_results=10,
            )
        )

        # All 4 works from 2 pages should be returned
        assert [w["id"] for w in works] == ["W1", "W2", "W3", "W4"]
        assert mock_request.call_count == 2

    def test_fetch_all_pages_terminates_at_meta_count(self, client, mocker) -> None:
        """_fetch_all_pages stops when accumulated results reach meta.count."""
        # meta.count indicates only 3 items even if pages could return more
        page1 = {
            "results": [{"id": "W1"}, {"id": "W2"}],
            "meta": {"count": 3, "page": 1, "per_page": 2},
        }
        page2 = {
            "results": [{"id": "W3"}, {"id": "W4"}],
            "meta": {"count": 3, "page": 2, "per_page": 2},
        }

        mock_request = mocker.patch.object(
            client,
            "_request_with_retry",
            side_effect=[page1, page2],
        )

        works = list(
            client._fetch_all_pages(
                endpoint="works",
                params={"search": "test"},
                per_page=2,
                max_results=10,
            )
        )

        # Only first 3 results should be yielded even though second page has 2
        assert [w["id"] for w in works] == ["W1", "W2", "W3"]
        assert mock_request.call_count == 2

    def test_fetch_all_pages_stops_on_request_error(self, client, mocker) -> None:
        """_fetch_all_pages exits when a request error occurs."""
        page1 = {
            "results": [{"id": "W1"}],
            "meta": {"count": 10, "page": 1, "per_page": 1},
        }

        def side_effect(*args, **kwargs):
            # First call returns a valid page, second call raises
            if side_effect.call_count == 0:
                side_effect.call_count += 1
                return page1
            raise RuntimeError("API error")

        side_effect.call_count = 0

        mock_request = mocker.patch.object(
            client,
            "_request_with_retry",
            side_effect=side_effect,
        )

        # If _fetch_all_pages propagates the error, this asserts the error path.
        # If instead it swallows the error and returns partial results, change this
        # to assert the returned list and that call_count == 2.
        with pytest.raises(RuntimeError):
            list(
                client._fetch_all_pages(
                    endpoint="works",
                    params={"search": "test"},
                    per_page=1,
                    max_results=10,
                )
            )

        assert mock_request.call_count == 2
```

Notes / things to verify against the existing implementation:

1. **Signature**: These tests assume `_fetch_all_pages` has a signature like:
   ```python
   def _fetch_all_pages(self, endpoint: str, params: dict, per_page: int, max_results: int | None = None) -> Iterable[dict]:
       ...
   ```
   If it differs (e.g., parameter names or positions), adjust the calls accordingly.

2. **Error handling semantics**: 
   - If `_fetch_all_pages` *swallows* exceptions from `_request_with_retry` and returns partial results, replace the `pytest.raises` block with assertions on the partial `works` list (e.g., only `["W1"]`) and keep the `call_count` assertion.
   - If it *always* propagates exceptions immediately without a second call, adjust the expected `call_count` to `1`.

3. **Test tools**:
   - These tests use the `mocker` fixture from `pytest-mock`, which is commonly used alongside pytest. If the project uses `unittest.mock` directly instead, convert the `mocker.patch.object(...)` calls to `with mock.patch.object(...)` or an equivalent pattern.

Once integrated with the correct `_fetch_all_pages` behavior and signature, these tests will:
- Exercise multi-page aggregation,
- Verify early termination when `meta.count` is reached,
- And cover the error path when `_request_with_retry` raises.
</issue_to_address>

### Comment 5
<location> `service/tests/unit/test_utils/test_openalex_client.py:484-493` </location>
<code_context>
+class TestOpenAlexClientRequestWithRetry:
</code_context>

<issue_to_address>
**suggestion (testing):** Extend `_request_with_retry` tests to cover non-retriable 4xx responses and max-retries exhaustion.

Please also add tests for:
- A non-403 4xx response (e.g., 400/404) to assert `response.raise_for_status()` is invoked and no retries occur.
- Exhaustion of retryable errors (e.g., repeated 500s or timeouts) to assert the method raises after `MAX_RETRIES` and `client.get` is called exactly `MAX_RETRIES` times.
These will lock in the intended retry semantics for future changes.

Suggested implementation:

```python
class TestOpenAlexClientRequestWithRetry:
    """Test OpenAlex client request retry logic."""

    @pytest.fixture
    def client(self) -> OpenAlexClient:
        """Create an OpenAlex client for testing."""
        return OpenAlexClient(email="test@example.com")

    @pytest.mark.asyncio
    async def test_request_with_retry_success(
        self,
        client: OpenAlexClient,
        monkeypatch: pytest.MonkeyPatch,
    ) -> None:
        """Test successful request without retry."""
        calls: list[None] = []

        async def mock_get(url: str, params: dict | None = None, timeout: float | None = None):
            calls.append(None)

            class MockResponse:
                status_code = 200

                def raise_for_status(self) -> None:
                    return None

            return MockResponse()

        # Patch underlying httpx.AsyncClient.get
        monkeypatch.setattr(client._client, "get", mock_get)

        response = await client._request_with_retry("/works")

        assert response.status_code == 200
        # Ensure we don't retry on success
        assert len(calls) == 1

    @pytest.mark.asyncio
    async def test_request_with_retry_non_retryable_4xx(
        self,
        client: OpenAlexClient,
        monkeypatch: pytest.MonkeyPatch,
    ) -> None:
        """
        Non-403 4xx responses should not be retried and should raise immediately
        via response.raise_for_status().
        """
        calls: list[None] = []

        async def mock_get(url: str, params: dict | None = None, timeout: float | None = None):
            calls.append(None)
            request = httpx.Request("GET", "https://api.openalex.org/works")
            response = httpx.Response(400, request=request)

            def raise_for_status() -> None:
                raise httpx.HTTPStatusError(
                    "Bad Request",
                    request=request,
                    response=response,
                )

            # Attach the custom raise_for_status to the response instance
            response.raise_for_status = raise_for_status  # type: ignore[assignment]
            return response

        monkeypatch.setattr(client._client, "get", mock_get)

        with pytest.raises(httpx.HTTPStatusError):
            await client._request_with_retry("/works")

        # Ensure a non-retriable 4xx results in no retries
        assert len(calls) == 1

    @pytest.mark.asyncio
    async def test_request_with_retry_exhausts_retries_on_5xx(
        self,
        client: OpenAlexClient,
        monkeypatch: pytest.MonkeyPatch,
    ) -> None:
        """
        Retryable errors (e.g., 5xx) should be retried up to MAX_RETRIES
        and then the method should raise.
        """
        max_retries = getattr(client, "MAX_RETRIES", 3)
        call_count = 0

        async def mock_get(url: str, params: dict | None = None, timeout: float | None = None):
            nonlocal call_count
            call_count += 1
            request = httpx.Request("GET", "https://api.openalex.org/works")
            response = httpx.Response(500, request=request)

            def raise_for_status() -> None:
                raise httpx.HTTPStatusError(
                    "Server error",
                    request=request,
                    response=response,
                )

            response.raise_for_status = raise_for_status  # type: ignore[assignment]
            return response

        monkeypatch.setattr(client._client, "get", mock_get)

        with pytest.raises(httpx.HTTPStatusError):
            await client._request_with_retry("/works")

        # Should attempt exactly MAX_RETRIES times before giving up
        assert call_count == max_retries

```

1. Ensure `httpx` is imported at the top of `service/tests/unit/test_utils/test_openalex_client.py`, for example:
   `import httpx`.
2. If `OpenAlexClient` exposes the retry count via a different attribute or constant name than `MAX_RETRIES`
   (e.g., `_MAX_RETRIES` or `RETRY_ATTEMPTS`), adjust `getattr(client, "MAX_RETRIES", 3)` and the assertion
   to use the correct attribute so the test matches the implementation.
3. The signature of `_request_with_retry` is assumed to accept a path like `"/works"`. If the actual API differs
   (e.g., it expects a full URL or additional required parameters), adapt the calls in the tests accordingly.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +266 to +275
```python
            else:
                msg = (
                    f"⚠️ Author '{author_name}' not found. "
                    f"Suggestions: (1) Try full name format like 'Smith, John' or 'John Smith', "
                    f"(2) Check spelling, (3) Try removing middle name/initial."
                )
                logger.warning(msg)
                return None, False, msg
        except Exception as e:
            msg = f"⚠️ Failed to resolve author '{author_name}': {e}"
```

issue (bug_risk): 4xx errors (other than 403) are retried despite the comment saying they should not be

In _request_with_retry, non-403 4xx responses trigger response.raise_for_status(), which raises httpx.HTTPStatusError and is then caught by the broad except Exception, causing retries until attempt < self.MAX_RETRIES - 1. This contradicts the intended "don't retry on 4xx (except 403)" behavior and can waste quota / add latency. Consider handling httpx.HTTPStatusError explicitly, re-raising immediately for 4xx (except 403), and applying backoff only for 5xx/403/timeouts.

Original comment in English

issue (bug_risk): 4xx errors (other than 403) are retried despite the comment saying they should not be

In _request_with_retry, non-403 4xx responses trigger response.raise_for_status(), which raises httpx.HTTPStatusError and is then caught by the broad except Exception, causing retries until attempt < self.MAX_RETRIES - 1. This contradicts the "Don't retry on 4xx (except 403)" behavior and can waste quota / add latency. Consider handling httpx.HTTPStatusError explicitly and re-raising immediately for 4xx (except 403), only applying backoff for 5xx/403/timeouts.

Comment on lines +377 to +378
```python
                    page_params = {**params, "page": str(page)}
                    response = await self._request_with_retry(url, page_params)
```

suggestion (performance): Page size is always MAX_PER_PAGE even when max_results is small, causing unnecessary data transfer

In _fetch_all_pages, every request uses per-page=MAX_PER_PAGE (200), so even when max_results is small (e.g., 10) you fetch 200 items and slice later. To avoid unnecessary bandwidth and latency, adjust per-page for each request to min(self.MAX_PER_PAGE, max_results - len(all_works)) so the last (or only) page only fetches what is needed.

Suggested implementation:

```python
        all_works: list[dict[str, Any]] = []
        page = 1

        while len(all_works) < max_results:
            remaining_results = max_results - len(all_works)
            per_page = min(self.MAX_PER_PAGE, remaining_results)

            async with self.rate_limiter:
```

```python
                    url = f"{self.BASE_URL}/works"
                    page_params = {
                        **params,
                        "page": str(page),
                        "per-page": str(per_page),
                    }
```

If params is constructed earlier in this method (or by the caller) with a fixed "per-page" value, ensure that:

  1. Either that fixed "per-page" is removed from the base params, relying on this override.
  2. Or you are comfortable with this dynamic "per-page" in page_params intentionally overriding the base value on a per-request basis (the code above assumes overriding is acceptable).
Original comment in English

suggestion (performance): Page size is always MAX_PER_PAGE even when max_results is small, causing unnecessary data transfer

In _fetch_all_pages, each request always uses per-page=MAX_PER_PAGE (200), so even when max_results is small (e.g., 10) you fetch 200 items and slice later. To avoid unnecessary bandwidth and latency, adjust per-page for each request to min(self.MAX_PER_PAGE, max_results - len(all_works)), so the last (or only) page only fetches what’s needed.

Suggested implementation:

```python
        all_works: list[dict[str, Any]] = []
        page = 1

        while len(all_works) < max_results:
            remaining_results = max_results - len(all_works)
            per_page = min(self.MAX_PER_PAGE, remaining_results)

            async with self.rate_limiter:
```

```python
                    url = f"{self.BASE_URL}/works"
                    page_params = {
                        **params,
                        "page": str(page),
                        "per-page": str(per_page),
                    }
```

If params is constructed earlier in this method (or by the caller) with a fixed "per-page" value, ensure that:

  1. Either that fixed "per-page" is removed from the base params, relying on this override.
  2. Or you are comfortable with this dynamic "per-page" in page_params intentionally overriding the base value on a per-request basis (the current code above assumes overriding is acceptable).
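
For context, a self-contained sketch of the loop with the dynamic `per-page` applied might look like the following; `MAX_PER_PAGE`, the `meta.count` early exit, and a `_request_with_retry` call that returns parsed JSON are assumptions based on the review discussion, not the exact code in `openalex_client.py`.

```python
# Hypothetical pagination helper showing per-page sized to the remaining budget.
from typing import Any

MAX_PER_PAGE = 200


async def fetch_all_pages(client: Any, base_url: str, params: dict[str, str], max_results: int) -> list[dict[str, Any]]:
    all_works: list[dict[str, Any]] = []
    page = 1
    while len(all_works) < max_results:
        remaining = max_results - len(all_works)
        per_page = min(MAX_PER_PAGE, remaining)  # only request what is still needed
        page_params = {**params, "page": str(page), "per-page": str(per_page)}
        data = await client._request_with_retry(f"{base_url}/works", page_params)
        works = data.get("results", [])
        if not works:  # provider has no more results
            break
        all_works.extend(works)
        total = data.get("meta", {}).get("count", 0)
        if total and len(all_works) >= total:  # reached the total the API reports
            break
        page += 1
    return all_works[:max_results]
```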

Comment on lines +338 to +352
```python
    def test_build_query_params_with_filters(self, client: OpenAlexClient) -> None:
        """Test building query parameters with filters."""
        request = SearchRequest(
            query="machine learning",
            year_from=2015,
            year_to=2021,
            is_oa=True,
            work_type="journal-article",
        )
        params = client._build_query_params(request, None, None, None)

        assert "filter" in params
        assert "publication_year:2015-2021" in params["filter"]
        assert "is_oa:true" in params["filter"]
        assert "type:journal-article" in params["filter"]
```

suggestion (testing):_build_query_params 中的语言、撤稿、摘要和全文过滤器增加测试覆盖。

_build_query_params 已经处理了 languageis_retractedhas_abstracthas_fulltext,但测试目前只覆盖了 publication_yearis_oawork_type。请新增或扩展一个测试,在 SearchRequest 上设置这四个字段,并断言 filter 字符串中包含 language:<code>is_retracted:<bool>has_abstract:<bool>has_fulltext:<bool>,以避免回归。

Suggested change
```python
    def test_build_query_params_with_filters(self, client: OpenAlexClient) -> None:
        """Test building query parameters with filters."""
        request = SearchRequest(
            query="machine learning",
            year_from=2015,
            year_to=2021,
            is_oa=True,
            work_type="journal-article",
        )
        params = client._build_query_params(request, None, None, None)

        assert "filter" in params
        assert "publication_year:2015-2021" in params["filter"]
        assert "is_oa:true" in params["filter"]
        assert "type:journal-article" in params["filter"]
```

```python
    def test_build_query_params_with_filters(self, client: OpenAlexClient) -> None:
        """Test building query parameters with filters."""
        request = SearchRequest(
            query="machine learning",
            year_from=2015,
            year_to=2021,
            is_oa=True,
            work_type="journal-article",
            language="en",
            is_retracted=False,
            has_abstract=True,
            has_fulltext=False,
        )
        params = client._build_query_params(request, None, None, None)

        assert "filter" in params
        assert "publication_year:2015-2021" in params["filter"]
        assert "is_oa:true" in params["filter"]
        assert "type:journal-article" in params["filter"]
        assert "language:en" in params["filter"]
        assert "is_retracted:false" in params["filter"]
        assert "has_abstract:true" in params["filter"]
        assert "has_fulltext:false" in params["filter"]
```
Original comment in English

suggestion (testing): Add coverage for language, retraction, abstract, and fulltext filters in _build_query_params.

_build_query_params already handles language, is_retracted, has_abstract, and has_fulltext, but the tests only cover publication_year, is_oa, and work_type. Please add or extend a test that sets all four of these fields on SearchRequest and asserts that language:<code>, is_retracted:<bool>, has_abstract:<bool>, and has_fulltext:<bool> appear in the filter string to guard against regressions.

Suggested change
```python
    def test_build_query_params_with_filters(self, client: OpenAlexClient) -> None:
        """Test building query parameters with filters."""
        request = SearchRequest(
            query="machine learning",
            year_from=2015,
            year_to=2021,
            is_oa=True,
            work_type="journal-article",
        )
        params = client._build_query_params(request, None, None, None)

        assert "filter" in params
        assert "publication_year:2015-2021" in params["filter"]
        assert "is_oa:true" in params["filter"]
        assert "type:journal-article" in params["filter"]
```

```python
    def test_build_query_params_with_filters(self, client: OpenAlexClient) -> None:
        """Test building query parameters with filters."""
        request = SearchRequest(
            query="machine learning",
            year_from=2015,
            year_to=2021,
            is_oa=True,
            work_type="journal-article",
            language="en",
            is_retracted=False,
            has_abstract=True,
            has_fulltext=False,
        )
        params = client._build_query_params(request, None, None, None)

        assert "filter" in params
        assert "publication_year:2015-2021" in params["filter"]
        assert "is_oa:true" in params["filter"]
        assert "type:journal-article" in params["filter"]
        assert "language:en" in params["filter"]
        assert "is_retracted:false" in params["filter"]
        assert "has_abstract:true" in params["filter"]
        assert "has_fulltext:false" in params["filter"]
```

@SoberPizza SoberPizza closed this Jan 21, 2026