feat: Implement Literature MCP for multi-source academic literature search #186
Conversation
…earch (skip pre-commit due to Pyright false positive on __getattr__)
Reviewer's Guide

Introduces a new multi-source literature search subsystem centered around an OpenAlex client, a DOI normalization/deduplication utility, a work distributor, and an MCP server tool, along with comprehensive data models and tests for each component.

Sequence diagram for the search_literature MCP tool flow:

```mermaid
sequenceDiagram
actor User
participant MCPClient
participant LiteratureMCP as FastMCP_literature_server
participant Tool as search_literature_tool
participant Distributor as WorkDistributor
participant OAClient as OpenAlexClient
participant OpenAlexAPI
User->>MCPClient: Configure and invoke search_literature
MCPClient->>LiteratureMCP: JSON-RPC tool call
LiteratureMCP->>Tool: search_literature(query, filters,...)
Tool->>Tool: Parse and coerce parameters
Tool->>Tool: Build SearchRequest
Tool->>Distributor: create WorkDistributor(openalex_email)
Tool->>Distributor: search(request: SearchRequest)
loop For each data source
alt openalex enabled
Distributor->>OAClient: search(request)
OAClient->>OAClient: _build_query_params(...)
par Resolve optional IDs
OAClient->>OpenAlexAPI: GET /authors?search=author
OpenAlexAPI-->>OAClient: author results
OAClient->>OpenAlexAPI: GET /institutions?search=institution
OpenAlexAPI-->>OAClient: institution results
OAClient->>OpenAlexAPI: GET /sources?search=source
OpenAlexAPI-->>OAClient: source results
end
loop Paginated works fetch
OAClient->>OpenAlexAPI: GET /works?page=n&per-page=200
OpenAlexAPI-->>OAClient: works page
OAClient->>OAClient: _request_with_retry(...) with backoff
end
OAClient->>Distributor: works, warnings
else unsupported source
Distributor->>Distributor: record warning
end
end
Distributor->>Distributor: deduplicate_by_doi(all_works)
Distributor->>Distributor: _sort_works(...)
Distributor-->>Tool: aggregated_result
Tool->>Tool: _format_search_result(...)
Tool-->>LiteratureMCP: Markdown report string
LiteratureMCP-->>MCPClient: Tool result
MCPClient-->>User: Rendered report + JSON works
```

Class diagram for the new literature search subsystem:

```mermaid
classDiagram
class SearchRequest {
+str query
+str author
+str institution
+str source
+int year_from
+int year_to
+bool is_oa
+str work_type
+str language
+bool is_retracted
+bool has_abstract
+bool has_fulltext
+str sort_by
+int max_results
+list~str~ data_sources
}
class LiteratureWork {
+str id
+str doi
+str title
+list~dict~ authors
+int publication_year
+int cited_by_count
+str abstract
+str journal
+bool is_oa
+str oa_url
+str source
+dict raw_data
}
class BaseLiteratureClient {
<<abstract>>
+search(request: SearchRequest) tuple~list~LiteratureWork~~ list~str~~
}
class OpenAlexClient {
-str BASE_URL
-int MAX_PER_PAGE
-int MAX_RETRIES
-float TIMEOUT
-str email
-int rate_limit
-_RateLimiter rate_limiter
-AsyncClient client
+OpenAlexClient(email: str, rate_limit: int, timeout: float)
+str pool_type
+search(request: SearchRequest) tuple~list~LiteratureWork~~ list~str~~
-_build_query_params(request: SearchRequest, author_id: str, institution_id: str, source_id: str) dict~str, str~
-_resolve_author_id(author_name: str) tuple~str, bool, str~
-_resolve_institution_id(institution_name: str) tuple~str, bool, str~
-_resolve_source_id(source_name: str) tuple~str, bool, str~
-_fetch_all_pages(params: dict~str, str~, max_results: int) list~dict~
-_request_with_retry(url: str, params: dict~str, str~) dict~str, any~
-_transform_work(work: dict~str, any~) LiteratureWork
-_reconstruct_abstract(inverted_index: dict~str, list~int~~) str
+close() None
}
class _RateLimiter {
-float _min_interval
-Lock _lock
-float _last_request
-Semaphore _semaphore
+_RateLimiter(rate_per_second: float, max_concurrency: int)
+__aenter__() None
+__aexit__(exc_type: type, exc: BaseException, tb: any) None
-_throttle() None
}
class WorkWithDOI {
<<interface>>
+str doi
+int cited_by_count
+int publication_year
}
class WorkDistributor {
-dict~str, any~ clients
-str openalex_email
+WorkDistributor(openalex_email: str)
+search(request: SearchRequest) dict~str, any~
+close() None
-_register_clients() None
-_sort_works(works: list~LiteratureWork~, sort_by: str) list~LiteratureWork~
}
class doi_cleaner_utils {
<<utility>>
+normalize_doi(doi: str) str
+deduplicate_by_doi(works: list~WorkWithDOI~) list~WorkWithDOI~
}
BaseLiteratureClient <|-- OpenAlexClient
WorkWithDOI <|.. LiteratureWork
WorkDistributor --> BaseLiteratureClient : holds clients
WorkDistributor --> LiteratureWork : aggregates
WorkDistributor --> SearchRequest : consumes
OpenAlexClient --> SearchRequest : consumes
OpenAlexClient --> LiteratureWork : produces
doi_cleaner_utils ..> WorkWithDOI : operates_on
WorkDistributor ..> doi_cleaner_utils : uses
```
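To make the doi_cleaner_utils box above concrete, here is a minimal sketch of what the two helpers could look like; the exact normalization rules and the tie-breaking policy are assumptions, not necessarily what doi_cleaner.py implements.

```python
from typing import Protocol


class HasDOI(Protocol):
    """Structural type matching the WorkWithDOI protocol in the diagram."""
    doi: str | None
    cited_by_count: int


def normalize_doi(doi: str | None) -> str | None:
    """Lowercase and strip common prefixes; assumed rules, not necessarily the PR's."""
    if not doi:
        return None
    cleaned = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    return cleaned or None


def deduplicate_by_doi(works: list[HasDOI]) -> list[HasDOI]:
    """Keep one work per normalized DOI, preferring the higher citation count (assumed tie-break)."""
    best: dict[str, HasDOI] = {}
    without_doi: list[HasDOI] = []
    for work in works:
        key = normalize_doi(work.doi)
        if key is None:
            without_doi.append(work)  # works lacking a DOI cannot be deduplicated
        elif key not in best or work.cited_by_count > best[key].cited_by_count:
            best[key] = work
    return list(best.values()) + without_doi
```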
Pull request overview
This pull request implements a multi-source academic literature search MCP (Model Context Protocol) service with OpenAlex integration. The implementation provides a unified interface for searching academic literature with advanced filtering capabilities including author, institution, journal, year range, and work type filters. The system includes intelligent DOI-based deduplication, rate limiting, retry logic with exponential backoff, and comprehensive test coverage.
Changes:
- Added literature search utilities with OpenAlex API client implementation
- Implemented DOI normalization and deduplication logic
- Created MCP server endpoints for literature search
- Added comprehensive unit tests for all components
- Included detailed documentation in README
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 13 comments.
Summary per file:
| File | Description |
|---|---|
| service/app/utils/literature/base_client.py | Abstract base class defining the client protocol |
| service/app/utils/literature/models.py | Data models for search requests and literature works |
| service/app/utils/literature/doi_cleaner.py | DOI normalization and deduplication utilities |
| service/app/utils/literature/openalex_client.py | OpenAlex API client with rate limiting and retry logic |
| service/app/utils/literature/work_distributor.py | Work distributor coordinating multiple data sources |
| service/app/utils/literature/__init__.py | Package initialization and exports |
| service/app/mcp/literature.py | MCP server implementation with search endpoints |
| service/app/utils/literature/README.md | Comprehensive documentation with examples |
| service/tests/unit/test_utils/test_base_client.py | Unit tests for base client and models |
| service/tests/unit/test_utils/test_doi_cleaner.py | Unit tests for DOI utilities |
| service/tests/unit/test_utils/test_openalex_client.py | Unit tests for OpenAlex client |
| service/tests/unit/test_utils/test_work_distributor.py | Unit tests for work distributor |
| .gitignore | Updated to exclude MCP documentation and VS folder |
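To show how these files might fit together at runtime, a hedged usage sketch follows; the import paths, constructor arguments, default field values on SearchRequest, and whether close() is async are assumptions based on the class diagram above, not the PR's actual API.

```python
from app.utils.literature.models import SearchRequest          # assumed module path
from app.utils.literature.work_distributor import WorkDistributor


async def search_literature_example(query: str, openalex_email: str | None) -> dict:
    """Illustrative only: build a request, fan out via the distributor, clean up."""
    request = SearchRequest(
        query=query,
        year_from=2020,
        max_results=50,
        data_sources=["openalex"],
    )
    distributor = WorkDistributor(openalex_email=openalex_email)
    try:
        # The distributor calls each registered client, deduplicates by DOI,
        # and sorts the merged list before returning an aggregated result.
        return await distributor.search(request)
    finally:
        await distributor.close()  # assumed to be async, mirroring the clients
```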
| source="openalex", | ||
| ) | ||
|
|
||
| # DataclassesObjects with same values should be equal |
Copilot AI · Jan 20, 2026
The comment "# DataclassesObjects with same values should be equal" contains a typo. It should be "Dataclass objects" (two words, not one).
| # DataclassesObjects with same values should be equal | |
| # Dataclass objects with same values should be equal |
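For readers unfamiliar with the behavior that test exercises, a minimal, self-contained illustration of dataclass value equality (an illustrative field set, not the real LiteratureWork definition):

```python
from dataclasses import dataclass


@dataclass
class Work:
    doi: str
    title: str
    cited_by_count: int = 0


# Dataclass objects with the same field values compare equal,
# because @dataclass generates __eq__ from the declared fields.
assert Work("10.1/abc", "A title") == Work("10.1/abc", "A title")
assert Work("10.1/abc", "A title") != Work("10.1/xyz", "A title")
```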
| now = asyncio.get_running_loop().time() | ||
| wait_time = self._last_request + self._min_interval - now | ||
| if wait_time > 0: | ||
| await asyncio.sleep(wait_time) | ||
| self._last_request = asyncio.get_running_loop().time() |
Copilot AI · Jan 20, 2026
The rate limiter's _throttle method uses asyncio.get_running_loop().time() twice (lines 55 and 59), which could result in a slight drift in timing. Consider storing the time after sleep in a single call to ensure accuracy.
| now = asyncio.get_running_loop().time() | |
| wait_time = self._last_request + self._min_interval - now | |
| if wait_time > 0: | |
| await asyncio.sleep(wait_time) | |
| self._last_request = asyncio.get_running_loop().time() | |
| loop = asyncio.get_running_loop() | |
| now = loop.time() | |
| wait_time = self._last_request + self._min_interval - now | |
| if wait_time > 0: | |
| await asyncio.sleep(wait_time) | |
| now = loop.time() | |
| self._last_request = now |
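For context, a self-contained limiter in the spirit of that suggestion, reading loop.time() once more only after sleeping and combining the minimum-interval throttle with a concurrency semaphore as the _RateLimiter description implies; this is a sketch, not the PR's code:

```python
import asyncio


class IntervalRateLimiter:
    """Async context manager: at most rate_per_second acquisitions, max_concurrency in flight."""

    def __init__(self, rate_per_second: float, max_concurrency: int) -> None:
        self._min_interval = 1.0 / rate_per_second
        self._lock = asyncio.Lock()
        self._last_request = 0.0
        self._semaphore = asyncio.Semaphore(max_concurrency)

    async def __aenter__(self) -> None:
        await self._semaphore.acquire()
        async with self._lock:
            loop = asyncio.get_running_loop()
            now = loop.time()
            wait_time = self._last_request + self._min_interval - now
            if wait_time > 0:
                await asyncio.sleep(wait_time)
                now = loop.time()  # re-read once after sleeping, as the review suggests
            self._last_request = now

    async def __aexit__(self, exc_type, exc, tb) -> None:
        self._semaphore.release()
```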
| else: | ||
| raise | ||
|
|
||
| raise Exception(f"Failed after {self.MAX_RETRIES} retries") |
Copilot AI · Jan 20, 2026
The loop variable attempt is used but starts at 0, which means the log messages will show "attempt 1" for index 0, "attempt 2" for index 1, etc. When the final retry fails, the error message says "Failed after {MAX_RETRIES} retries" which would be 5, but only 5 attempts were made (indices 0-4), not 5 retries after the first attempt. Consider clarifying the terminology or adjusting the logic to be more accurate.
| raise Exception(f"Failed after {self.MAX_RETRIES} retries") | |
| raise Exception(f"Failed after {self.MAX_RETRIES} attempts") |
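One way to keep the terminology accurate is to count attempts explicitly; a hedged sketch of such a loop (not the PR's _request_with_retry):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

MAX_ATTEMPTS = 5


async def call_with_backoff(fn):
    """Run fn() up to MAX_ATTEMPTS times with exponential backoff between attempts."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return await fn()
        except Exception as exc:
            if attempt == MAX_ATTEMPTS:
                raise RuntimeError(f"Failed after {MAX_ATTEMPTS} attempts") from exc
            wait_time = 2 ** (attempt - 1)
            logger.warning("Attempt %d/%d failed (%s), retrying in %ss",
                           attempt, MAX_ATTEMPTS, exc, wait_time)
            await asyncio.sleep(wait_time)
```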
| max_year = datetime.now().year + 1 | ||
| year_warning = "" | ||
| if year_from_int is not None and year_from_int > max_year: | ||
| year_warning += f"year_from {year_from_int}→{max_year}. " | ||
| year_from_int = max_year |
Copilot AI · Jan 20, 2026
The year validation allows year_from values up to datetime.now().year + 1, which means future years are accepted. While this may be intentional for pre-prints or advance publications, it creates a potential issue where searches can include dates in the future. Consider whether this is the desired behavior or if validation should be stricter.
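If stricter behavior is wanted, the clamp could be made explicit and opt-in; a sketch assuming plain int inputs (the PR's parameter coercion may differ):

```python
from datetime import datetime


def clamp_future_year(year: int | None, *, allow_next_year: bool = True) -> tuple[int | None, str | None]:
    """Clamp a future year to the current year (optionally +1); return (value, warning-or-None)."""
    if year is None:
        return None, None
    max_year = datetime.now().year + (1 if allow_next_year else 0)
    if year > max_year:
        return max_year, f"year {year} is in the future, clamped to {max_year}"
    return year, None
```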
| except Exception as e: | ||
| logger.error(f"Literature search failed: {e}", exc_info=True) | ||
| return "❌ Unexpected error during search. Please retry or contact support." |
Copilot AI · Jan 20, 2026
The error handling catches generic Exception and returns a user-friendly message, but this could mask important errors. Consider logging the full exception details while still providing a user-friendly message, or being more specific about which exceptions to catch.
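A sketch of the narrower handling the comment suggests, separating transport failures from truly unexpected errors while still logging full details; the exception classes are httpx's, but the surrounding tool code is assumed:

```python
import logging

import httpx

logger = logging.getLogger(__name__)


async def run_search_safely(do_search) -> str:
    """Wrap a search coroutine, mapping known failure modes to user-facing messages."""
    try:
        return await do_search()
    except httpx.TimeoutException:
        logger.warning("Literature search timed out", exc_info=True)
        return "⚠️ The upstream API timed out. Please retry with a narrower query."
    except httpx.HTTPError as e:
        logger.error(f"Literature search HTTP failure: {e}", exc_info=True)
        return "❌ The upstream API returned an error. Please retry later."
    except Exception as e:
        logger.error(f"Literature search failed unexpectedly: {e}", exc_info=True)
        return "❌ Unexpected error during search. Please retry or contact support."
```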
| class OpenAlexClient(BaseLiteratureClient): | ||
| """ | ||
| OpenAlex API client | ||
|
|
||
| Implements best practices from official API guide for LLMs: | ||
| https://docs.openalex.org/api-guide-for-llms | ||
| """ | ||
|
|
||
| BASE_URL = "https://api.openalex.org" | ||
| MAX_PER_PAGE = 200 | ||
| MAX_RETRIES = 5 | ||
| TIMEOUT = 30.0 | ||
|
|
||
| def __init__(self, email: str | None, rate_limit: int | None = None, timeout: float = 30.0) -> None: | ||
| """ | ||
| Initialize OpenAlex client | ||
|
|
||
| Args: | ||
| email: Email for polite pool (10x rate limit increase). If None, use default pool. | ||
| rate_limit: Requests per second (default: 10 with email, 1 without email) | ||
| timeout: Request timeout in seconds (default: 30.0) | ||
| """ | ||
| self.email = email | ||
| self.rate_limit = rate_limit or (10 if self.email else 1) | ||
| max_concurrency = 10 if self.email else 1 | ||
| self.rate_limiter = _RateLimiter(rate_per_second=self.rate_limit, max_concurrency=max_concurrency) | ||
| self.client = httpx.AsyncClient(timeout=timeout) | ||
| pool_type = "polite" if self.email else "default" | ||
| logger.info( | ||
| f"OpenAlex client initialized with pool={pool_type}, email={self.email}, rate_limit={self.rate_limit}/s" | ||
| ) | ||
|
|
||
| @property | ||
| def pool_type(self) -> str: | ||
| """Return pool type string.""" | ||
| return "polite" if self.email else "default" | ||
|
|
||
| async def search(self, request: SearchRequest) -> tuple[list[LiteratureWork], list[str]]: | ||
| """ | ||
| Execute search and return results in standard format | ||
|
|
||
| Implementation steps: | ||
| 1. Convert author name -> author ID (if specified) | ||
| 2. Convert institution name -> institution ID (if specified) | ||
| 3. Convert journal name -> source ID (if specified) | ||
| 4. Build filter query | ||
| 5. Paginate through results | ||
| 6. Transform to standard format | ||
|
|
||
| Args: | ||
| request: Standardized search request | ||
|
|
||
| Returns: | ||
| Tuple of (works, warnings) | ||
| - works: List of literature works in standard format | ||
| - warnings: List of warning/info messages for LLM feedback | ||
| """ | ||
| logger.info( | ||
| f"OpenAlex search [{self.pool_type} @ {self.rate_limit}/s]: query='{request.query}', max_results={request.max_results}" | ||
| ) | ||
|
|
||
| warnings: list[str] = [] | ||
|
|
||
| # Step 1-3: Resolve IDs for names (two-step lookup pattern) | ||
| author_id = None | ||
| if request.author: | ||
| author_id, _success, msg = await self._resolve_author_id(request.author) | ||
| warnings.append(msg) | ||
|
|
||
| institution_id = None | ||
| if request.institution: | ||
| institution_id, _success, msg = await self._resolve_institution_id(request.institution) | ||
| warnings.append(msg) | ||
|
|
||
| source_id = None | ||
| if request.source: | ||
| source_id, _success, msg = await self._resolve_source_id(request.source) | ||
| warnings.append(msg) | ||
|
|
||
| # Step 4: Build query parameters | ||
| params = self._build_query_params(request, author_id, institution_id, source_id) | ||
|
|
||
| # Step 5: Fetch all pages | ||
| works = await self._fetch_all_pages(params, request.max_results) | ||
|
|
||
| # Step 6: Transform to standard format | ||
| return [self._transform_work(w) for w in works], warnings | ||
|
|
||
| def _build_query_params( | ||
| self, | ||
| request: SearchRequest, | ||
| author_id: str | None, | ||
| institution_id: str | None, | ||
| source_id: str | None, | ||
| ) -> dict[str, str]: | ||
| """ | ||
| Build OpenAlex query parameters | ||
|
|
||
| Args: | ||
| request: Search request | ||
| author_id: Resolved author ID (if any) | ||
| institution_id: Resolved institution ID (if any) | ||
| source_id: Resolved source ID (if any) | ||
|
|
||
| Returns: | ||
| Dictionary of query parameters | ||
| """ | ||
| params: dict[str, str] = { | ||
| "per-page": str(self.MAX_PER_PAGE), | ||
| } | ||
|
|
||
| if self.email: | ||
| params["mailto"] = self.email | ||
|
|
||
| # Search keywords | ||
| if request.query: | ||
| params["search"] = request.query | ||
|
|
||
| # Build filters | ||
| filters: list[str] = [] | ||
|
|
||
| if author_id: | ||
| filters.append(f"authorships.author.id:{author_id}") | ||
|
|
||
| if institution_id: | ||
| filters.append(f"authorships.institutions.id:{institution_id}") | ||
|
|
||
| if source_id: | ||
| filters.append(f"primary_location.source.id:{source_id}") | ||
|
|
||
| # Year range | ||
| if request.year_from or request.year_to: | ||
| if request.year_from and request.year_to: | ||
| filters.append(f"publication_year:{request.year_from}-{request.year_to}") | ||
| elif request.year_from: | ||
| filters.append(f"publication_year:>{request.year_from - 1}") | ||
| elif request.year_to: | ||
| filters.append(f"publication_year:<{request.year_to + 1}") | ||
|
|
||
| # Open access filter | ||
| if request.is_oa is not None: | ||
| filters.append(f"is_oa:{str(request.is_oa).lower()}") | ||
|
|
||
| # Work type filter | ||
| if request.work_type: | ||
| filters.append(f"type:{request.work_type}") | ||
|
|
||
| # Language filter | ||
| if request.language: | ||
| filters.append(f"language:{request.language}") | ||
|
|
||
| # Retracted filter | ||
| if request.is_retracted is not None: | ||
| filters.append(f"is_retracted:{str(request.is_retracted).lower()}") | ||
|
|
||
| # Abstract filter | ||
| if request.has_abstract is not None: | ||
| filters.append(f"has_abstract:{str(request.has_abstract).lower()}") | ||
|
|
||
| # Fulltext filter | ||
| if request.has_fulltext is not None: | ||
| filters.append(f"has_fulltext:{str(request.has_fulltext).lower()}") | ||
|
|
||
| if filters: | ||
| params["filter"] = ",".join(filters) | ||
|
|
||
| # Sorting | ||
| sort_map = { | ||
| "relevance": None, # Default sorting by relevance | ||
| "cited_by_count": "cited_by_count:desc", | ||
| "publication_date": "publication_date:desc", | ||
| } | ||
| if sort := sort_map.get(request.sort_by): | ||
| params["sort"] = sort | ||
|
|
||
| return params | ||
|
|
||
| async def _resolve_author_id(self, author_name: str) -> tuple[str | None, bool, str]: | ||
| """ | ||
| Two-step lookup: author name -> author ID | ||
|
|
||
| Args: | ||
| author_name: Author name to search | ||
|
|
||
| Returns: | ||
| Tuple of (author_id, success, message) | ||
| - author_id: Author ID (e.g., "A5023888391") or None if not found | ||
| - success: Whether resolution was successful | ||
| - message: Status message for LLM feedback | ||
| """ | ||
| async with self.rate_limiter: | ||
| try: | ||
| url = f"{self.BASE_URL}/authors" | ||
| params: dict[str, str] = {"search": author_name} | ||
| if self.email: | ||
| params["mailto"] = self.email | ||
| response = await self._request_with_retry(url, params) | ||
|
|
||
| if results := response.get("results", []): | ||
| # Return first result's ID in short format | ||
| author_id = results[0]["id"].split("/")[-1] | ||
| author_display = results[0].get("display_name", author_name) | ||
| logger.info(f"Resolved author '{author_name}' -> {author_id}") | ||
| return author_id, True, f"✓ Author resolved: '{author_name}' -> '{author_display}'" | ||
| else: | ||
| msg = ( | ||
| f"⚠️ Author '{author_name}' not found. " | ||
| f"Suggestions: (1) Try full name format like 'Smith, John' or 'John Smith', " | ||
| f"(2) Check spelling, (3) Try removing middle name/initial." | ||
| ) | ||
| logger.warning(msg) | ||
| return None, False, msg | ||
| except Exception as e: | ||
| msg = f"⚠️ Failed to resolve author '{author_name}': {e}" | ||
| logger.warning(msg) | ||
| return None, False, msg | ||
|
|
||
| async def _resolve_institution_id(self, institution_name: str) -> tuple[str | None, bool, str]: | ||
| """ | ||
| Two-step lookup: institution name -> institution ID | ||
|
|
||
| Args: | ||
| institution_name: Institution name to search | ||
|
|
||
| Returns: | ||
| Tuple of (institution_id, success, message) | ||
| - institution_id: Institution ID (e.g., "I136199984") or None if not found | ||
| - success: Whether resolution was successful | ||
| - message: Status message for LLM feedback | ||
| """ | ||
| async with self.rate_limiter: | ||
| try: | ||
| url = f"{self.BASE_URL}/institutions" | ||
| params: dict[str, str] = {"search": institution_name} | ||
| if self.email: | ||
| params["mailto"] = self.email | ||
| response = await self._request_with_retry(url, params) | ||
|
|
||
| if results := response.get("results", []): | ||
| institution_id = results[0]["id"].split("/")[-1] | ||
| inst_display = results[0].get("display_name", institution_name) | ||
| logger.info(f"Resolved institution '{institution_name}' -> {institution_id}") | ||
| return institution_id, True, f"✓ Institution resolved: '{institution_name}' -> '{inst_display}'" | ||
| else: | ||
| msg = ( | ||
| f"⚠️ Institution '{institution_name}' not found. " | ||
| f"Suggestions: (1) Use full official name (e.g., 'Harvard University' not 'Harvard'), " | ||
| f"(2) Try variations (e.g., 'MIT' vs 'Massachusetts Institute of Technology'), " | ||
| f"(3) Check spelling." | ||
| ) | ||
| logger.warning(msg) | ||
| return None, False, msg | ||
| except Exception as e: | ||
| msg = f"⚠️ Failed to resolve institution '{institution_name}': {e}" | ||
| logger.warning(msg) | ||
| return None, False, msg | ||
|
|
||
| async def _resolve_source_id(self, source_name: str) -> tuple[str | None, bool, str]: | ||
| """ | ||
| Two-step lookup: source name -> source ID | ||
|
|
||
| Args: | ||
| source_name: Journal/conference name to search | ||
|
|
||
| Returns: | ||
| Tuple of (source_id, success, message) | ||
| - source_id: Source ID (e.g., "S137773608") or None if not found | ||
| - success: Whether resolution was successful | ||
| - message: Status message for LLM feedback | ||
| """ | ||
| async with self.rate_limiter: | ||
| try: | ||
| url = f"{self.BASE_URL}/sources" | ||
| params: dict[str, str] = {"search": source_name} | ||
| if self.email: | ||
| params["mailto"] = self.email | ||
| response = await self._request_with_retry(url, params) | ||
|
|
||
| if results := response.get("results", []): | ||
| source_id = results[0]["id"].split("/")[-1] | ||
| source_display = results[0].get("display_name", source_name) | ||
| logger.info(f"Resolved source '{source_name}' -> {source_id}") | ||
| return source_id, True, f"✓ Source resolved: '{source_name}' -> '{source_display}'" | ||
| else: | ||
| msg = ( | ||
| f"⚠️ Source/Journal '{source_name}' not found. " | ||
| f"Suggestions: (1) Use full journal name (e.g., 'Nature' or 'Science'), " | ||
| f"(2) Try alternative names (e.g., 'JAMA' vs 'Journal of the American Medical Association'), " | ||
| f"(3) Check spelling." | ||
| ) | ||
| logger.warning(msg) | ||
| return None, False, msg | ||
| except Exception as e: | ||
| msg = f"⚠️ Failed to resolve source '{source_name}': {e}" | ||
| logger.warning(msg) | ||
| return None, False, msg | ||
|
|
||
| async def _fetch_all_pages(self, params: dict[str, str], max_results: int) -> list[dict[str, Any]]: | ||
| """ | ||
| Paginate through all results up to max_results | ||
|
|
||
| Args: | ||
| params: Base query parameters | ||
| max_results: Maximum number of results to fetch | ||
|
|
||
| Returns: | ||
| List of work objects from API | ||
| """ | ||
| all_works: list[dict[str, Any]] = [] | ||
| page = 1 | ||
|
|
||
| while len(all_works) < max_results: | ||
| async with self.rate_limiter: | ||
| try: | ||
| url = f"{self.BASE_URL}/works" | ||
| page_params = {**params, "page": str(page)} | ||
| response = await self._request_with_retry(url, page_params) | ||
|
|
||
| works = response.get("results", []) | ||
| if not works: | ||
| break | ||
|
|
||
| all_works.extend(works) | ||
| logger.info(f"Fetched page {page}: {len(works)} works") | ||
|
|
||
| # Check if there are more pages | ||
| meta = response.get("meta", {}) | ||
| total_count = meta.get("count", 0) | ||
| if len(all_works) >= total_count: | ||
| break | ||
|
|
||
| page += 1 | ||
|
|
||
| except Exception as e: | ||
| logger.error(f"Error fetching page {page}: {e}") | ||
| break | ||
|
|
||
| return all_works[:max_results] | ||
|
|
||
| async def _request_with_retry(self, url: str, params: dict[str, str]) -> dict[str, Any]: | ||
| """ | ||
| HTTP request with exponential backoff retry | ||
|
|
||
| Implements best practices: | ||
| - Retry on 403 (rate limit) with exponential backoff | ||
| - Retry on 5xx (server error) with exponential backoff | ||
| - Don't retry on 4xx (except 403) | ||
| - Retry on timeout | ||
|
|
||
| Args: | ||
| url: Request URL | ||
| params: Query parameters | ||
|
|
||
| Returns: | ||
| JSON response | ||
|
|
||
| Raises: | ||
| Exception: If all retries fail | ||
| """ | ||
| for attempt in range(self.MAX_RETRIES): | ||
| try: | ||
| response = await self.client.get(url, params=params) | ||
|
|
||
| if response.status_code == 200: | ||
| return response.json() | ||
| elif response.status_code == 403: | ||
| # Rate limited | ||
| wait_time = 2**attempt | ||
| logger.warning(f"Rate limited (403), waiting {wait_time}s... (attempt {attempt + 1})") | ||
| await asyncio.sleep(wait_time) | ||
| elif response.status_code >= 500: | ||
| # Server error | ||
| wait_time = 2**attempt | ||
| logger.warning( | ||
| f"Server error ({response.status_code}), waiting {wait_time}s... (attempt {attempt + 1})" | ||
| ) | ||
| await asyncio.sleep(wait_time) | ||
| else: | ||
| # Other error, don't retry | ||
| response.raise_for_status() | ||
|
|
||
| except httpx.TimeoutException: | ||
| if attempt < self.MAX_RETRIES - 1: | ||
| wait_time = 2**attempt | ||
| logger.warning(f"Timeout, retrying in {wait_time}s... (attempt {attempt + 1})") | ||
| await asyncio.sleep(wait_time) | ||
| else: | ||
| raise | ||
| except Exception as e: | ||
| logger.error(f"Request failed: {e}") | ||
| if attempt < self.MAX_RETRIES - 1: | ||
| wait_time = 2**attempt | ||
| await asyncio.sleep(wait_time) | ||
| else: | ||
| raise | ||
|
|
||
| raise Exception(f"Failed after {self.MAX_RETRIES} retries") | ||
|
|
||
| def _transform_work(self, work: dict[str, Any]) -> LiteratureWork: | ||
| """ | ||
| Transform OpenAlex work data to standard format | ||
|
|
||
| Args: | ||
| work: Raw work object from OpenAlex API | ||
|
|
||
| Returns: | ||
| Standardized LiteratureWork object | ||
| """ | ||
| # Extract authors | ||
| authors: list[dict[str, str | None]] = [] | ||
| for authorship in work.get("authorships", []): | ||
| author = authorship.get("author", {}) | ||
| authors.append( | ||
| { | ||
| "name": author.get("display_name", "Unknown"), | ||
| "id": author.get("id", "").split("/")[-1] if author.get("id") else None, | ||
| } | ||
| ) | ||
|
|
||
| # Extract journal/source | ||
| journal = None | ||
| if primary_location := work.get("primary_location"): | ||
| if source := primary_location.get("source"): | ||
| journal = source.get("display_name") | ||
|
|
||
| # Extract open access info | ||
| oa_info = work.get("open_access", {}) | ||
| is_oa = oa_info.get("is_oa", False) | ||
| oa_url = oa_info.get("oa_url") | ||
|
|
||
| # Extract abstract (reconstruct from inverted index) | ||
| abstract = self._reconstruct_abstract(work.get("abstract_inverted_index")) | ||
|
|
||
| # Extract DOI (remove prefix) | ||
| doi = None | ||
| if doi_raw := work.get("doi"): | ||
| doi = doi_raw.replace("https://doi.org/", "") | ||
|
|
||
| return LiteratureWork( | ||
| id=work["id"].split("/")[-1], | ||
| doi=doi, | ||
| title=work.get("title", "Untitled"), | ||
| authors=authors, | ||
| publication_year=work.get("publication_year"), | ||
| cited_by_count=work.get("cited_by_count", 0), | ||
| abstract=abstract, | ||
| journal=journal, | ||
| is_oa=is_oa, | ||
| oa_url=oa_url, | ||
| source="openalex", | ||
| raw_data=work, | ||
| ) | ||
|
|
||
| def _reconstruct_abstract(self, inverted_index: dict[str, list[int]] | None) -> str | None: | ||
| """ | ||
| Reconstruct abstract from inverted index | ||
|
|
||
| OpenAlex stores abstracts as inverted index for efficiency. | ||
| Format: {"word": [position1, position2, ...], ...} | ||
|
|
||
| Args: | ||
| inverted_index: Inverted index from OpenAlex | ||
|
|
||
| Returns: | ||
| Reconstructed abstract text or None | ||
|
|
||
| Examples: | ||
| >>> index = {"Hello": [0], "world": [1], "!": [2]} | ||
| >>> _reconstruct_abstract(index) | ||
| "Hello world !" | ||
| """ | ||
| if not inverted_index: | ||
| return None | ||
|
|
||
| # Expand inverted index to (position, word) pairs | ||
| word_positions: list[tuple[int, str]] = [] | ||
| for word, positions in inverted_index.items(): | ||
| for pos in positions: | ||
| word_positions.append((pos, word)) | ||
|
|
||
| # Sort by position and join | ||
| word_positions.sort() | ||
| return " ".join(word for _, word in word_positions) | ||
|
|
||
| async def close(self) -> None: | ||
| """Close the HTTP client""" | ||
| await self.client.aclose() | ||
|
|
||
| async def __aenter__(self) -> "OpenAlexClient": | ||
| return self | ||
|
|
||
| async def __aexit__( | ||
| self, | ||
| exc_type: type[BaseException] | None, | ||
| exc: BaseException | None, | ||
| tb: Any | None, | ||
| ) -> None: | ||
| await self.close() |
Copilot AI · Jan 20, 2026
The OpenAlexClient class is missing implementations for get_by_doi and get_by_id methods that are tested in the test file. These methods need to be implemented to fulfill the BaseLiteratureClient protocol.
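If those methods are added, they could reuse the client's existing retry and transform helpers; a hypothetical sketch follows (OpenAlex serves single records at /works/{id}, where the id can be a W-identifier or a doi:-prefixed DOI; the method names and return types here are taken from the review comment, and the subclass is purely illustrative):

```python
# Hypothetical extension: how get_by_id / get_by_doi could be added to OpenAlexClient,
# reusing the retry, rate-limiting, and transform helpers shown in the diff above.
# OpenAlexClient and LiteratureWork are the classes from this PR (imports omitted).
class OpenAlexClientWithLookups(OpenAlexClient):
    async def get_by_id(self, work_id: str) -> LiteratureWork | None:
        """Fetch one work by OpenAlex ID (e.g. 'W2741809807'); None if the request fails."""
        params: dict[str, str] = {"mailto": self.email} if self.email else {}
        async with self.rate_limiter:
            try:
                data = await self._request_with_retry(f"{self.BASE_URL}/works/{work_id}", params)
            except Exception:
                return None
        return self._transform_work(data)

    async def get_by_doi(self, doi: str) -> LiteratureWork | None:
        """Fetch one work by DOI via OpenAlex's doi: namespace."""
        return await self.get_by_id(f"doi:{doi.removeprefix('https://doi.org/')}")
```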
| if request.year_from and request.year_to: | ||
| filters.append(f"publication_year:{request.year_from}-{request.year_to}") | ||
| elif request.year_from: | ||
| filters.append(f"publication_year:>{request.year_from - 1}") |
Copilot AI · Jan 20, 2026
In the year_from condition on line 197, when only year_from is specified, the filter uses >{year_from - 1} which would include the year_from itself. However, this creates an off-by-one risk if the OpenAlex API interprets this as "strictly greater than". Consider using >={year_from} for clarity and correctness, or verify the API documentation confirms that >{year_from - 1} is the intended approach.
| filters.append(f"publication_year:>{request.year_from - 1}") | |
| filters.append(f"publication_year:>={request.year_from}") |
| elif request.year_from: | ||
| filters.append(f"publication_year:>{request.year_from - 1}") | ||
| elif request.year_to: | ||
| filters.append(f"publication_year:<{request.year_to + 1}") |
Copilot AI · Jan 20, 2026
Similarly, in the year_to condition on line 199, the filter uses <{year_to + 1} which may be confusing. Consider using <={year_to} for clarity if the API supports it.
| filters.append(f"publication_year:<{request.year_to + 1}") | |
| filters.append(f"publication_year:<={request.year_to}") |
| if year_to_int is not None and year_to_int < 1700: | ||
| year_warning += f"year_to {year_to_int}→1700. " | ||
| year_to_int = 1700 |
Copilot AI · Jan 20, 2026
The year_to validation sets a minimum of 1700, but this magic number should be documented or made a constant. Additionally, consider whether 1700 is the appropriate minimum year for academic literature searches.
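A small illustration of naming and documenting that constant; the value, the rationale text, and the example variables are placeholders, not a recommendation:

```python
# Oldest publication year the search tool will accept; earlier values are treated
# as typos and clamped. Adjust if genuinely historical corpora are in scope.
MIN_PUBLICATION_YEAR = 1700

year_to_int: int | None = 1500   # example input
year_warning = ""
if year_to_int is not None and year_to_int < MIN_PUBLICATION_YEAR:
    year_warning += f"year_to {year_to_int}→{MIN_PUBLICATION_YEAR}. "
    year_to_int = MIN_PUBLICATION_YEAR
```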
| result.setdefault("warnings", []).append(f"⚠️ Year adjusted: {year_warning.strip()}") | ||
|
|
Copilot AI · Jan 20, 2026
The warning message formats year adjustments but doesn't indicate which parameter was adjusted when both year_from and year_to are adjusted. Consider making the warning message clearer about which specific parameter was changed.
| result.setdefault("warnings", []).append(f"⚠️ Year adjusted: {year_warning.strip()}") | |
| year_adjustments: list[str] = [] | |
| if year_from is not None and year_from_int is not None and str(year_from) != str(year_from_int): | |
| year_adjustments.append(f"year_from adjusted from {year_from} to {year_from_int}") | |
| if year_to is not None and year_to_int is not None and str(year_to) != str(year_to_int): | |
| year_adjustments.append(f"year_to adjusted from {year_to} to {year_to_int}") | |
| if year_adjustments: | |
| warning_message = f"⚠️ Year adjusted: {'; '.join(year_adjustments)}" | |
| else: | |
| warning_message = f"⚠️ Year adjusted: {year_warning.strip()}" | |
| result.setdefault("warnings", []).append(warning_message) |
Hey - I've found 5 issues, and left some high level feedback:
- In `OpenAlexClient._request_with_retry`, 4xx responses other than 403 end up going through `response.raise_for_status()` and are then caught by the generic `except Exception` block, causing retries despite the docstring saying 4xx (except 403) should not be retried—consider handling `httpx.HTTPStatusError` for non-retryable status codes explicitly so they fail fast.
- The tests and `ConcreteClient` assume a common `get_by_doi`/`get_by_id` interface on literature clients, but `BaseLiteratureClient` only defines `search` as abstract and `OpenAlexClient` does not implement those methods; it would be clearer to either add these as abstract methods on `BaseLiteratureClient` and implement them in `OpenAlexClient`, or adjust tests and helpers to rely only on the shared interface.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `OpenAlexClient._request_with_retry`, 4xx responses other than 403 end up going through `response.raise_for_status()` and are then caught by the generic `except Exception` block, causing retries despite the docstring saying 4xx (except 403) should not be retried—consider handling `httpx.HTTPStatusError` for non-retryable status codes explicitly so they fail fast.
- The tests and `ConcreteClient` assume a common `get_by_doi`/`get_by_id` interface on literature clients, but `BaseLiteratureClient` only defines `search` as abstract and `OpenAlexClient` does not implement those methods; it would be clearer to either add these as abstract methods on `BaseLiteratureClient` and implement them in `OpenAlexClient`, or adjust tests and helpers to rely only on the shared interface.
## Individual Comments
### Comment 1
<location> `service/app/utils/literature/openalex_client.py:266-275` </location>
<code_context>
+ sections.append("## Complete Works List (JSON)\n")
+ if include_abstract:
+ sections.append("The following JSON contains all works with full abstracts:\n")
+ else:
+ sections.append("The following JSON contains all works (abstracts excluded to save tokens):\n")
+ sections.append("```json")
</code_context>
<issue_to_address>
**issue (bug_risk):** 4xx errors (other than 403) are retried despite the comment saying they should not be
In `_request_with_retry`, non-403 4xx responses trigger `response.raise_for_status()`, which raises `httpx.HTTPStatusError` and is then caught by the broad `except Exception`, causing retries until `attempt < self.MAX_RETRIES - 1`. This contradicts the "Don't retry on 4xx (except 403)" behavior and can waste quota / add latency. Consider handling `httpx.HTTPStatusError` explicitly and re-raising immediately for 4xx (except 403), only applying backoff for 5xx/403/timeouts.
</issue_to_address>
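For illustration, a fail-fast shape matching that description, backing off only on 403, 5xx, and timeouts while letting other 4xx raise immediately; this is a sketch under those assumptions, not the PR's implementation:

```python
import asyncio

import httpx


async def request_with_retry(client: httpx.AsyncClient, url: str, params: dict[str, str],
                             max_attempts: int = 5) -> dict:
    """GET with exponential backoff on 403/5xx/timeouts; other 4xx raise immediately."""
    for attempt in range(max_attempts):
        try:
            response = await client.get(url, params=params)
            if response.status_code == 200:
                return response.json()
            if response.status_code == 403 or response.status_code >= 500:
                await asyncio.sleep(2 ** attempt)   # retryable: rate limit or server error
                continue
            response.raise_for_status()             # non-retryable 4xx: fail fast
        except httpx.TimeoutException:
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(2 ** attempt)
    raise RuntimeError(f"Failed after {max_attempts} attempts")
```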
### Comment 2
<location> `service/app/utils/literature/openalex_client.py:377-378` </location>
<code_context>
+ async with self.rate_limiter:
+ try:
+ url = f"{self.BASE_URL}/works"
+ page_params = {**params, "page": str(page)}
+ response = await self._request_with_retry(url, page_params)
+
+ works = response.get("results", [])
</code_context>
<issue_to_address>
**suggestion (performance):** Page size is always MAX_PER_PAGE even when max_results is small, causing unnecessary data transfer
In `_fetch_all_pages`, each request always uses `per-page=MAX_PER_PAGE` (200), so even when `max_results` is small (e.g., 10) you fetch 200 items and slice later. To avoid unnecessary bandwidth and latency, adjust `per-page` for each request to `min(self.MAX_PER_PAGE, max_results - len(all_works))`, so the last (or only) page only fetches what's needed.
Suggested implementation:
```python
all_works: list[dict[str, Any]] = []
page = 1
while len(all_works) < max_results:
remaining_results = max_results - len(all_works)
per_page = min(self.MAX_PER_PAGE, remaining_results)
async with self.rate_limiter:
```
```python
url = f"{self.BASE_URL}/works"
page_params = {
**params,
"page": str(page),
"per-page": str(per_page),
}
```
If `params` is constructed earlier in this method (or by the caller) with a fixed `"per-page"` value, ensure that:
1. Either that fixed `"per-page"` is removed from the base `params`, relying on this override.
2. Or you are comfortable with this dynamic `"per-page"` in `page_params` intentionally overriding the base value on a per-request basis (the current code above assumes overriding is acceptable).
</issue_to_address>
### Comment 3
<location> `service/tests/unit/test_utils/test_openalex_client.py:338-352` </location>
<code_context>
+ assert params["per-page"] == "200"
+ assert params["mailto"] == "test@example.com"
+
+ def test_build_query_params_with_filters(self, client: OpenAlexClient) -> None:
+ """Test building query parameters with filters."""
+ request = SearchRequest(
+ query="machine learning",
+ year_from=2015,
+ year_to=2021,
+ is_oa=True,
+ work_type="journal-article",
+ )
+ params = client._build_query_params(request, None, None, None)
+
+ assert "filter" in params
+ assert "publication_year:2015-2021" in params["filter"]
+ assert "is_oa:true" in params["filter"]
+ assert "type:journal-article" in params["filter"]
+
+ def test_build_query_params_with_resolved_ids(self, client: OpenAlexClient) -> None:
</code_context>
<issue_to_address>
**suggestion (testing):** Add coverage for language, retraction, abstract, and fulltext filters in `_build_query_params`.
`_build_query_params` already handles `language`, `is_retracted`, `has_abstract`, and `has_fulltext`, but the tests only cover `publication_year`, `is_oa`, and `work_type`. Please add or extend a test that sets all four of these fields on `SearchRequest` and asserts that `language:<code>`, `is_retracted:<bool>`, `has_abstract:<bool>`, and `has_fulltext:<bool>` appear in the `filter` string to guard against regressions.
```suggestion
def test_build_query_params_with_filters(self, client: OpenAlexClient) -> None:
"""Test building query parameters with filters."""
request = SearchRequest(
query="machine learning",
year_from=2015,
year_to=2021,
is_oa=True,
work_type="journal-article",
language="en",
is_retracted=False,
has_abstract=True,
has_fulltext=False,
)
params = client._build_query_params(request, None, None, None)
assert "filter" in params
assert "publication_year:2015-2021" in params["filter"]
assert "is_oa:true" in params["filter"]
assert "type:journal-article" in params["filter"]
assert "language:en" in params["filter"]
assert "is_retracted:false" in params["filter"]
assert "has_abstract:true" in params["filter"]
assert "has_fulltext:false" in params["filter"]
```
</issue_to_address>
### Comment 4
<location> `service/tests/unit/test_utils/test_openalex_client.py:49-58` </location>
<code_context>
+class TestOpenAlexClientSearch:
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding tests for `_fetch_all_pages` pagination behavior and termination conditions.
Current tests mock `_request_with_retry` to return a single page, so `_fetch_all_pages`'s pagination logic isn't exercised. Please add tests that:
- simulate multiple pages and assert the total number of works returned
- verify the early-termination path when accumulated results reach `meta.count`
- cover the error path where `_request_with_retry` raises and the loop exits
These can target `_fetch_all_pages` directly (with a mocked `_request_with_retry`) or go through `search` with an appropriate mock setup.
Suggested implementation:
```python
class TestOpenAlexClientSearch:
"""Test OpenAlex search functionality."""
```
Please add the following tests inside `class TestOpenAlexClientSearch`, after the existing fixtures and existing `search`-related tests:
```python
def test_fetch_all_pages_multiple_pages(self, client, mocker) -> None:
"""_fetch_all_pages returns combined results from multiple pages."""
# Two pages of 2 results each
page1 = {
"results": [{"id": "W1"}, {"id": "W2"}],
"meta": {"count": 4, "page": 1, "per_page": 2},
}
page2 = {
"results": [{"id": "W3"}, {"id": "W4"}],
"meta": {"count": 4, "page": 2, "per_page": 2},
}
mock_request = mocker.patch.object(
client,
"_request_with_retry",
side_effect=[page1, page2],
)
works = list(
client._fetch_all_pages(
endpoint="works",
params={"search": "test"},
per_page=2,
max_results=10,
)
)
# All 4 works from 2 pages should be returned
assert [w["id"] for w in works] == ["W1", "W2", "W3", "W4"]
assert mock_request.call_count == 2
def test_fetch_all_pages_terminates_at_meta_count(self, client, mocker) -> None:
"""_fetch_all_pages stops when accumulated results reach meta.count."""
# meta.count indicates only 3 items even if pages could return more
page1 = {
"results": [{"id": "W1"}, {"id": "W2"}],
"meta": {"count": 3, "page": 1, "per_page": 2},
}
page2 = {
"results": [{"id": "W3"}, {"id": "W4"}],
"meta": {"count": 3, "page": 2, "per_page": 2},
}
mock_request = mocker.patch.object(
client,
"_request_with_retry",
side_effect=[page1, page2],
)
works = list(
client._fetch_all_pages(
endpoint="works",
params={"search": "test"},
per_page=2,
max_results=10,
)
)
# Only first 3 results should be yielded even though second page has 2
assert [w["id"] for w in works] == ["W1", "W2", "W3"]
assert mock_request.call_count == 2
def test_fetch_all_pages_stops_on_request_error(self, client, mocker) -> None:
"""_fetch_all_pages exits when a request error occurs."""
page1 = {
"results": [{"id": "W1"}],
"meta": {"count": 10, "page": 1, "per_page": 1},
}
def side_effect(*args, **kwargs):
# First call returns a valid page, second call raises
if side_effect.call_count == 0:
side_effect.call_count += 1
return page1
raise RuntimeError("API error")
side_effect.call_count = 0
mock_request = mocker.patch.object(
client,
"_request_with_retry",
side_effect=side_effect,
)
# If _fetch_all_pages propagates the error, this asserts the error path.
# If instead it swallows the error and returns partial results, change this
# to assert the returned list and that call_count == 2.
with pytest.raises(RuntimeError):
list(
client._fetch_all_pages(
endpoint="works",
params={"search": "test"},
per_page=1,
max_results=10,
)
)
assert mock_request.call_count == 2
```
Notes / things to verify against the existing implementation:
1. **Signature**: These tests assume `_fetch_all_pages` has a signature like:
```python
def _fetch_all_pages(self, endpoint: str, params: dict, per_page: int, max_results: int | None = None) -> Iterable[dict]:
...
```
If it differs (e.g., parameter names or positions), adjust the calls accordingly.
2. **Error handling semantics**:
- If `_fetch_all_pages` *swallows* exceptions from `_request_with_retry` and returns partial results, replace the `pytest.raises` block with assertions on the partial `works` list (e.g., only `["W1"]`) and keep the `call_count` assertion.
- If it always propagates the exception immediately without making a second call, change the expected `call_count` to `1`.
3. **Test tooling**:
- These tests use the `mocker` fixture from `pytest-mock`, which is commonly paired with pytest. If the project uses `unittest.mock` directly, replace `mocker.patch.object(...)` with `with mock.patch.object(...)` or an equivalent.
Once integrated against the real behavior and signature of `_fetch_all_pages`, these tests will:
- cover aggregation of results across multiple pages
- verify early termination when `meta.count` is reached
- cover the error path where `_request_with_retry` raises
</issue_to_address>
### Comment 5
<location> `service/tests/unit/test_utils/test_openalex_client.py:484-493` </location>
<code_context>
+class TestOpenAlexClientRequestWithRetry:
</code_context>
<issue_to_address>
**suggestion (testing):** Extend the `_request_with_retry` tests to cover non-retryable 4xx responses and retry exhaustion.
Please additionally add tests that:
- for non-403 4xx responses (e.g., 400/404), assert that `response.raise_for_status()` is called and no retry happens;
- for retryable errors (e.g., consecutive 500s or timeouts), retry until `MAX_RETRIES` is exhausted, and assert that `client.get` is called exactly `MAX_RETRIES` times before the method raises.
These tests pin down the intended retry semantics and protect against regressions from future changes.
Suggested implementation:
```python
class TestOpenAlexClientRequestWithRetry:
"""Test OpenAlex client request retry logic."""
@pytest.fixture
def client(self) -> OpenAlexClient:
"""Create an OpenAlex client for testing."""
return OpenAlexClient(email="test@example.com")
@pytest.mark.asyncio
async def test_request_with_retry_success(
self,
client: OpenAlexClient,
monkeypatch: pytest.MonkeyPatch,
) -> None:
"""Test successful request without retry."""
calls: list[None] = []
async def mock_get(url: str, params: dict | None = None, timeout: float | None = None):
calls.append(None)
class MockResponse:
status_code = 200
def raise_for_status(self) -> None:
return None
return MockResponse()
# Patch underlying httpx.AsyncClient.get
monkeypatch.setattr(client._client, "get", mock_get)
response = await client._request_with_retry("/works")
assert response.status_code == 200
# Ensure we don't retry on success
assert len(calls) == 1
@pytest.mark.asyncio
async def test_request_with_retry_non_retryable_4xx(
self,
client: OpenAlexClient,
monkeypatch: pytest.MonkeyPatch,
) -> None:
"""
Non-403 4xx responses should not be retried and should raise immediately
via response.raise_for_status().
"""
calls: list[None] = []
async def mock_get(url: str, params: dict | None = None, timeout: float | None = None):
calls.append(None)
request = httpx.Request("GET", "https://api.openalex.org/works")
response = httpx.Response(400, request=request)
def raise_for_status() -> None:
raise httpx.HTTPStatusError(
"Bad Request",
request=request,
response=response,
)
# Attach the custom raise_for_status to the response instance
response.raise_for_status = raise_for_status # type: ignore[assignment]
return response
monkeypatch.setattr(client._client, "get", mock_get)
with pytest.raises(httpx.HTTPStatusError):
await client._request_with_retry("/works")
# Ensure a non-retriable 4xx results in no retries
assert len(calls) == 1
@pytest.mark.asyncio
async def test_request_with_retry_exhausts_retries_on_5xx(
self,
client: OpenAlexClient,
monkeypatch: pytest.MonkeyPatch,
) -> None:
"""
Retryable errors (e.g., 5xx) should be retried up to MAX_RETRIES
and then the method should raise.
"""
max_retries = getattr(client, "MAX_RETRIES", 3)
call_count = 0
async def mock_get(url: str, params: dict | None = None, timeout: float | None = None):
nonlocal call_count
call_count += 1
request = httpx.Request("GET", "https://api.openalex.org/works")
response = httpx.Response(500, request=request)
def raise_for_status() -> None:
raise httpx.HTTPStatusError(
"Server error",
request=request,
response=response,
)
response.raise_for_status = raise_for_status # type: ignore[assignment]
return response
monkeypatch.setattr(client._client, "get", mock_get)
with pytest.raises(httpx.HTTPStatusError):
await client._request_with_retry("/works")
# Should attempt exactly MAX_RETRIES times before giving up
assert call_count == max_retries
```
1. Make sure `httpx` is imported at the top of `service/tests/unit/test_utils/test_openalex_client.py`, e.g. `import httpx`.
2. If the attribute or constant that `OpenAlexClient` exposes for the retry count is not named `MAX_RETRIES` (e.g. `_MAX_RETRIES` or `RETRY_ATTEMPTS`), change `getattr(client, "MAX_RETRIES", 3)` and the related assertions to use the correct attribute so the tests stay in sync with the implementation.
3. The signature of `_request_with_retry` is currently assumed to accept a path-like argument such as `"/works"`. If the actual API differs (e.g. it requires a full URL or additional required parameters), adjust the calls in the tests accordingly.
</issue_to_address>
Help me be more useful! Please react with 👍 or 👎 on each comment and I'll use that feedback to improve future reviews.
Original comment in English
Hey - I've found 5 issues, and left some high level feedback:
- In
OpenAlexClient._request_with_retry, 4xx responses other than 403 end up going throughresponse.raise_for_status()and are then caught by the genericexcept Exceptionblock, causing retries despite the docstring saying 4xx (except 403) should not be retried—consider handlinghttpx.HTTPStatusErrorfor non-retryable status codes explicitly so they fail fast. - The tests and
ConcreteClientassume a commonget_by_doi/get_by_idinterface on literature clients, butBaseLiteratureClientonly definessearchas abstract andOpenAlexClientdoes not implement those methods; it would be clearer to either add these as abstract methods onBaseLiteratureClientand implement them inOpenAlexClient, or adjust tests and helpers to rely only on the shared interface.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `OpenAlexClient._request_with_retry`, 4xx responses other than 403 end up going through `response.raise_for_status()` and are then caught by the generic `except Exception` block, causing retries despite the docstring saying 4xx (except 403) should not be retried—consider handling `httpx.HTTPStatusError` for non-retryable status codes explicitly so they fail fast.
- The tests and `ConcreteClient` assume a common `get_by_doi`/`get_by_id` interface on literature clients, but `BaseLiteratureClient` only defines `search` as abstract and `OpenAlexClient` does not implement those methods; it would be clearer to either add these as abstract methods on `BaseLiteratureClient` and implement them in `OpenAlexClient`, or adjust tests and helpers to rely only on the shared interface.
## Individual Comments
### Comment 1
<location> `service/app/utils/literature/openalex_client.py:266-275` </location>
<code_context>
+ sections.append("## Complete Works List (JSON)\n")
+ if include_abstract:
+ sections.append("The following JSON contains all works with full abstracts:\n")
+ else:
+ sections.append("The following JSON contains all works (abstracts excluded to save tokens):\n")
+ sections.append("```json")
</code_context>
<issue_to_address>
**issue (bug_risk):** 4xx errors (other than 403) are retried despite the comment saying they should not be
In `_request_with_retry`, non-403 4xx responses trigger `response.raise_for_status()`, which raises `httpx.HTTPStatusError` and is then caught by the broad `except Exception`, causing retries until `attempt < self.MAX_RETRIES - 1`. This contradicts the "Don't retry on 4xx (except 403)" behavior and can waste quota / add latency. Consider handling `httpx.HTTPStatusError` explicitly and re-raising immediately for 4xx (except 403), only applying backoff for 5xx/403/timeouts.
</issue_to_address>
### Comment 2
<location> `service/app/utils/literature/openalex_client.py:377-378` </location>
<code_context>
+ async with self.rate_limiter:
+ try:
+ url = f"{self.BASE_URL}/works"
+ page_params = {**params, "page": str(page)}
+ response = await self._request_with_retry(url, page_params)
+
+ works = response.get("results", [])
</code_context>
<issue_to_address>
**suggestion (performance):** Page size is always MAX_PER_PAGE even when max_results is small, causing unnecessary data transfer
In `_fetch_all_pages`, each request always uses `per-page=MAX_PER_PAGE` (200), so even when `max_results` is small (e.g., 10) you fetch 200 items and slice later. To avoid unnecessary bandwidth and latency, adjust `per-page` for each request to `min(self.MAX_PER_PAGE, max_results - len(all_works))`, so the last (or only) page only fetches what’s needed.
Suggested implementation:
```python
all_works: list[dict[str, Any]] = []
page = 1
while len(all_works) < max_results:
remaining_results = max_results - len(all_works)
per_page = min(self.MAX_PER_PAGE, remaining_results)
async with self.rate_limiter:
```
```python
url = f"{self.BASE_URL}/works"
page_params = {
**params,
"page": str(page),
"per-page": str(per_page),
}
```
If `params` is constructed earlier in this method (or by the caller) with a fixed `"per-page"` value, ensure that:
1. Either that fixed `"per-page"` is removed from the base `params`, relying on this override.
2. Or you are comfortable with this dynamic `"per-page"` in `page_params` intentionally overriding the base value on a per-request basis (the current code above assumes overriding is acceptable).
</issue_to_address>
### Comment 3
<location> `service/tests/unit/test_utils/test_openalex_client.py:338-352` </location>
<code_context>
+ assert params["per-page"] == "200"
+ assert params["mailto"] == "test@example.com"
+
+ def test_build_query_params_with_filters(self, client: OpenAlexClient) -> None:
+ """Test building query parameters with filters."""
+ request = SearchRequest(
+ query="machine learning",
+ year_from=2015,
+ year_to=2021,
+ is_oa=True,
+ work_type="journal-article",
+ )
+ params = client._build_query_params(request, None, None, None)
+
+ assert "filter" in params
+ assert "publication_year:2015-2021" in params["filter"]
+ assert "is_oa:true" in params["filter"]
+ assert "type:journal-article" in params["filter"]
+
+ def test_build_query_params_with_resolved_ids(self, client: OpenAlexClient) -> None:
</code_context>
<issue_to_address>
**suggestion (testing):** Add coverage for language, retraction, abstract, and fulltext filters in `_build_query_params`.
`_build_query_params` already handles `language`, `is_retracted`, `has_abstract`, and `has_fulltext`, but the tests only cover `publication_year`, `is_oa`, and `work_type`. Please add or extend a test that sets all four of these fields on `SearchRequest` and asserts that `language:<code>`, `is_retracted:<bool>`, `has_abstract:<bool>`, and `has_fulltext:<bool>` appear in the `filter` string to guard against regressions.
```suggestion
def test_build_query_params_with_filters(self, client: OpenAlexClient) -> None:
"""Test building query parameters with filters."""
request = SearchRequest(
query="machine learning",
year_from=2015,
year_to=2021,
is_oa=True,
work_type="journal-article",
language="en",
is_retracted=False,
has_abstract=True,
has_fulltext=False,
)
params = client._build_query_params(request, None, None, None)
assert "filter" in params
assert "publication_year:2015-2021" in params["filter"]
assert "is_oa:true" in params["filter"]
assert "type:journal-article" in params["filter"]
assert "language:en" in params["filter"]
assert "is_retracted:false" in params["filter"]
assert "has_abstract:true" in params["filter"]
assert "has_fulltext:false" in params["filter"]
```
</issue_to_address>
### Comment 4
<location> `service/tests/unit/test_utils/test_openalex_client.py:49-58` </location>
<code_context>
+class TestOpenAlexClientSearch:
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding tests for `_fetch_all_pages` pagination behavior and termination conditions.
Current tests mock `_request_with_retry` to return a single page, so `_fetch_all_pages`’s pagination logic isn’t exercised. Please add tests that:
- simulate multiple pages and assert the total number of works returned
- verify the early-termination path when accumulated results reach `meta.count`
- cover the error path where `_request_with_retry` raises and the loop exits
These can target `_fetch_all_pages` directly (with a mocked `_request_with_retry`) or go through `search` with an appropriate mock setup.
Suggested implementation:
```python
class TestOpenAlexClientSearch:
"""Test OpenAlex search functionality."""
```
Please add the following tests inside `class TestOpenAlexClientSearch`, after the existing fixtures and existing `search`-related tests:
```python
def test_fetch_all_pages_multiple_pages(self, client, mocker) -> None:
    """_fetch_all_pages returns combined results from multiple pages."""
    # Two pages of 2 results each
    page1 = {
        "results": [{"id": "W1"}, {"id": "W2"}],
        "meta": {"count": 4, "page": 1, "per_page": 2},
    }
    page2 = {
        "results": [{"id": "W3"}, {"id": "W4"}],
        "meta": {"count": 4, "page": 2, "per_page": 2},
    }
    mock_request = mocker.patch.object(
        client,
        "_request_with_retry",
        side_effect=[page1, page2],
    )

    works = list(
        client._fetch_all_pages(
            endpoint="works",
            params={"search": "test"},
            per_page=2,
            max_results=10,
        )
    )

    # All 4 works from 2 pages should be returned
    assert [w["id"] for w in works] == ["W1", "W2", "W3", "W4"]
    assert mock_request.call_count == 2


def test_fetch_all_pages_terminates_at_meta_count(self, client, mocker) -> None:
    """_fetch_all_pages stops when accumulated results reach meta.count."""
    # meta.count indicates only 3 items even if pages could return more
    page1 = {
        "results": [{"id": "W1"}, {"id": "W2"}],
        "meta": {"count": 3, "page": 1, "per_page": 2},
    }
    page2 = {
        "results": [{"id": "W3"}, {"id": "W4"}],
        "meta": {"count": 3, "page": 2, "per_page": 2},
    }
    mock_request = mocker.patch.object(
        client,
        "_request_with_retry",
        side_effect=[page1, page2],
    )

    works = list(
        client._fetch_all_pages(
            endpoint="works",
            params={"search": "test"},
            per_page=2,
            max_results=10,
        )
    )

    # Only first 3 results should be yielded even though second page has 2
    assert [w["id"] for w in works] == ["W1", "W2", "W3"]
    assert mock_request.call_count == 2


def test_fetch_all_pages_stops_on_request_error(self, client, mocker) -> None:
    """_fetch_all_pages exits when a request error occurs."""
    page1 = {
        "results": [{"id": "W1"}],
        "meta": {"count": 10, "page": 1, "per_page": 1},
    }

    def side_effect(*args, **kwargs):
        # First call returns a valid page, second call raises
        if side_effect.call_count == 0:
            side_effect.call_count += 1
            return page1
        raise RuntimeError("API error")

    side_effect.call_count = 0
    mock_request = mocker.patch.object(
        client,
        "_request_with_retry",
        side_effect=side_effect,
    )

    # If _fetch_all_pages propagates the error, this asserts the error path.
    # If instead it swallows the error and returns partial results, change this
    # to assert the returned list and that call_count == 2.
    with pytest.raises(RuntimeError):
        list(
            client._fetch_all_pages(
                endpoint="works",
                params={"search": "test"},
                per_page=1,
                max_results=10,
            )
        )
    assert mock_request.call_count == 2
```
Notes / things to verify against the existing implementation:
1. **Signature**: These tests assume `_fetch_all_pages` has a signature like:
```python
def _fetch_all_pages(self, endpoint: str, params: dict, per_page: int, max_results: int | None = None) -> Iterable[dict]:
    ...
```
If it differs (e.g., parameter names or positions), adjust the calls accordingly.
2. **Error handling semantics**:
- If `_fetch_all_pages` *swallows* exceptions from `_request_with_retry` and returns partial results, replace the `pytest.raises` block with assertions on the partial `works` list (e.g., only `["W1"]`) and keep the `call_count` assertion (see the sketch after this list).
- If it *always* propagates exceptions immediately without a second call, adjust the expected `call_count` to `1`.
3. **Test tools**:
- These tests use the `mocker` fixture from `pytest-mock`, which is commonly used alongside pytest. If the project uses `unittest.mock` directly instead, convert the `mocker.patch.object(...)` calls to `with mock.patch.object(...)` or an equivalent pattern.
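For the partial-results variant mentioned in note 2, a sketch using `unittest.mock` rather than `pytest-mock` (the test name, the swallowed-error behavior, and the `_fetch_all_pages` signature are all assumptions to be checked against the implementation):
```python
from unittest import mock


def test_fetch_all_pages_returns_partial_results_on_error(self, client) -> None:
    """If _fetch_all_pages swallows request errors, partial results are returned."""
    page1 = {
        "results": [{"id": "W1"}],
        "meta": {"count": 10, "page": 1, "per_page": 1},
    }
    # An exception instance inside side_effect makes the second call raise.
    with mock.patch.object(
        client,
        "_request_with_retry",
        side_effect=[page1, RuntimeError("API error")],
    ) as mock_request:
        works = list(
            client._fetch_all_pages(
                endpoint="works",
                params={"search": "test"},
                per_page=1,
                max_results=10,
            )
        )

    # Only the first page made it through before the error was swallowed.
    assert [w["id"] for w in works] == ["W1"]
    assert mock_request.call_count == 2
```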
Once integrated with the correct `_fetch_all_pages` behavior and signature, these tests will:
- Exercise multi-page aggregation,
- Verify early termination when `meta.count` is reached,
- And cover the error path when `_request_with_retry` raises.
</issue_to_address>
### Comment 5
<location> `service/tests/unit/test_utils/test_openalex_client.py:484-493` </location>
<code_context>
+class TestOpenAlexClientRequestWithRetry:
</code_context>
<issue_to_address>
**suggestion (testing):** Extend `_request_with_retry` tests to cover non-retriable 4xx responses and max-retries exhaustion.
Please also add tests for:
- A non-403 4xx response (e.g., 400/404) to assert `response.raise_for_status()` is invoked and no retries occur.
- Exhaustion of retryable errors (e.g., repeated 500s or timeouts) to assert the method raises after `MAX_RETRIES` and `client.get` is called exactly `MAX_RETRIES` times.
These will lock in the intended retry semantics for future changes.
Suggested implementation:
```python
class TestOpenAlexClientRequestWithRetry:
    """Test OpenAlex client request retry logic."""

    @pytest.fixture
    def client(self) -> OpenAlexClient:
        """Create an OpenAlex client for testing."""
        return OpenAlexClient(email="test@example.com")

    @pytest.mark.asyncio
    async def test_request_with_retry_success(
        self,
        client: OpenAlexClient,
        monkeypatch: pytest.MonkeyPatch,
    ) -> None:
        """Test successful request without retry."""
        calls: list[None] = []

        async def mock_get(url: str, params: dict | None = None, timeout: float | None = None):
            calls.append(None)

            class MockResponse:
                status_code = 200

                def raise_for_status(self) -> None:
                    return None

            return MockResponse()

        # Patch underlying httpx.AsyncClient.get
        monkeypatch.setattr(client._client, "get", mock_get)

        response = await client._request_with_retry("/works")

        assert response.status_code == 200
        # Ensure we don't retry on success
        assert len(calls) == 1

    @pytest.mark.asyncio
    async def test_request_with_retry_non_retryable_4xx(
        self,
        client: OpenAlexClient,
        monkeypatch: pytest.MonkeyPatch,
    ) -> None:
        """
        Non-403 4xx responses should not be retried and should raise immediately
        via response.raise_for_status().
        """
        calls: list[None] = []

        async def mock_get(url: str, params: dict | None = None, timeout: float | None = None):
            calls.append(None)
            request = httpx.Request("GET", "https://api.openalex.org/works")
            response = httpx.Response(400, request=request)

            def raise_for_status() -> None:
                raise httpx.HTTPStatusError(
                    "Bad Request",
                    request=request,
                    response=response,
                )

            # Attach the custom raise_for_status to the response instance
            response.raise_for_status = raise_for_status  # type: ignore[assignment]
            return response

        monkeypatch.setattr(client._client, "get", mock_get)

        with pytest.raises(httpx.HTTPStatusError):
            await client._request_with_retry("/works")
        # Ensure a non-retriable 4xx results in no retries
        assert len(calls) == 1

    @pytest.mark.asyncio
    async def test_request_with_retry_exhausts_retries_on_5xx(
        self,
        client: OpenAlexClient,
        monkeypatch: pytest.MonkeyPatch,
    ) -> None:
        """
        Retryable errors (e.g., 5xx) should be retried up to MAX_RETRIES
        and then the method should raise.
        """
        max_retries = getattr(client, "MAX_RETRIES", 3)
        call_count = 0

        async def mock_get(url: str, params: dict | None = None, timeout: float | None = None):
            nonlocal call_count
            call_count += 1
            request = httpx.Request("GET", "https://api.openalex.org/works")
            response = httpx.Response(500, request=request)

            def raise_for_status() -> None:
                raise httpx.HTTPStatusError(
                    "Server error",
                    request=request,
                    response=response,
                )

            response.raise_for_status = raise_for_status  # type: ignore[assignment]
            return response

        monkeypatch.setattr(client._client, "get", mock_get)

        with pytest.raises(httpx.HTTPStatusError):
            await client._request_with_retry("/works")
        # Should attempt exactly MAX_RETRIES times before giving up
        assert call_count == max_retries
```
1. Ensure `httpx` is imported at the top of `service/tests/unit/test_utils/test_openalex_client.py`, for example:
`import httpx`.
2. If `OpenAlexClient` exposes the retry count via a different attribute or constant name than `MAX_RETRIES`
(e.g., `_MAX_RETRIES` or `RETRY_ATTEMPTS`), adjust `getattr(client, "MAX_RETRIES", 3)` and the assertion
to use the correct attribute so the test matches the implementation.
3. The signature of `_request_with_retry` is assumed to accept a path like `"/works"`. If the actual API differs
(e.g., it expects a full URL or additional required parameters), adapt the calls in the tests accordingly.
</issue_to_address>
```python
            else:
                msg = (
                    f"⚠️ Author '{author_name}' not found. "
                    f"Suggestions: (1) Try full name format like 'Smith, John' or 'John Smith', "
                    f"(2) Check spelling, (3) Try removing middle name/initial."
                )
                logger.warning(msg)
                return None, False, msg
        except Exception as e:
            msg = f"⚠️ Failed to resolve author '{author_name}': {e}"
```
**issue (bug_risk):** 4xx errors (other than 403) are retried despite the comment saying they should not be

In `_request_with_retry`, non-403 4xx responses trigger `response.raise_for_status()`, which raises `httpx.HTTPStatusError` and is then caught by the broad `except Exception`, causing retries until `attempt < self.MAX_RETRIES - 1`. This contradicts the "Don't retry on 4xx (except 403)" behavior and can waste quota / add latency. Consider handling `httpx.HTTPStatusError` explicitly and re-raising immediately for 4xx (except 403), only applying backoff for 5xx/403/timeouts.
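A minimal sketch of that retry shape (assumptions: the loop iterates `attempt` over `range(self.MAX_RETRIES)`, the underlying `httpx.AsyncClient` is reachable as `self.client`, and rate limiting is omitted for brevity; none of this is taken from the PR's actual implementation):
```python
import asyncio

import httpx


async def _request_with_retry(self, url: str, params: dict[str, str] | None = None) -> httpx.Response:
    """Sketch: retry 403/5xx/transport errors with backoff, fail fast on other 4xx."""
    last_exc: Exception | None = None
    for attempt in range(self.MAX_RETRIES):
        try:
            response = await self.client.get(url, params=params)  # assumed attribute name
            response.raise_for_status()
            return response
        except httpx.HTTPStatusError as exc:
            status = exc.response.status_code
            if 400 <= status < 500 and status != 403:
                raise  # non-retriable client error: surface it immediately
            last_exc = exc  # 403 and 5xx fall through to the backoff below
        except httpx.TransportError as exc:  # timeouts, connection errors
            last_exc = exc
        if attempt < self.MAX_RETRIES - 1:
            await asyncio.sleep(2**attempt)  # exponential backoff between attempts
    assert last_exc is not None
    raise last_exc
```
Catching `httpx.TransportError` keeps timeouts and connection failures on the retry path while other 4xx responses surface on the first attempt.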
```python
                page_params = {**params, "page": str(page)}
                response = await self._request_with_retry(url, page_params)
```
**suggestion (performance):** Page size is always `MAX_PER_PAGE` even when `max_results` is small, causing unnecessary data transfer

In `_fetch_all_pages`, each request always uses `per-page=MAX_PER_PAGE` (200), so even when `max_results` is small (e.g., 10) you fetch 200 items and slice later. To avoid unnecessary bandwidth and latency, adjust `per-page` for each request to `min(self.MAX_PER_PAGE, max_results - len(all_works))`, so the last (or only) page only fetches what's needed.

Suggested implementation:

```python
all_works: list[dict[str, Any]] = []
page = 1
while len(all_works) < max_results:
    remaining_results = max_results - len(all_works)
    per_page = min(self.MAX_PER_PAGE, remaining_results)
    async with self.rate_limiter:
        url = f"{self.BASE_URL}/works"
        page_params = {
            **params,
            "page": str(page),
            "per-page": str(per_page),
        }
```

If `params` is constructed earlier in this method (or by the caller) with a fixed `"per-page"` value, ensure that:

- Either that fixed `"per-page"` is removed from the base `params`, relying on this override.
- Or you are comfortable with this dynamic `"per-page"` in `page_params` intentionally overriding the base value on a per-request basis (the current code above assumes overriding is acceptable).
```python
    def test_build_query_params_with_filters(self, client: OpenAlexClient) -> None:
        """Test building query parameters with filters."""
        request = SearchRequest(
            query="machine learning",
            year_from=2015,
            year_to=2021,
            is_oa=True,
            work_type="journal-article",
        )
        params = client._build_query_params(request, None, None, None)

        assert "filter" in params
        assert "publication_year:2015-2021" in params["filter"]
        assert "is_oa:true" in params["filter"]
        assert "type:journal-article" in params["filter"]
```
**suggestion (testing):** Add coverage for language, retraction, abstract, and fulltext filters in `_build_query_params`.

`_build_query_params` already handles `language`, `is_retracted`, `has_abstract`, and `has_fulltext`, but the tests only cover `publication_year`, `is_oa`, and `work_type`. Please add or extend a test that sets all four of these fields on `SearchRequest` and asserts that `language:<code>`, `is_retracted:<bool>`, `has_abstract:<bool>`, and `has_fulltext:<bool>` appear in the `filter` string to guard against regressions.
```suggestion
    def test_build_query_params_with_filters(self, client: OpenAlexClient) -> None:
        """Test building query parameters with filters."""
        request = SearchRequest(
            query="machine learning",
            year_from=2015,
            year_to=2021,
            is_oa=True,
            work_type="journal-article",
            language="en",
            is_retracted=False,
            has_abstract=True,
            has_fulltext=False,
        )
        params = client._build_query_params(request, None, None, None)

        assert "filter" in params
        assert "publication_year:2015-2021" in params["filter"]
        assert "is_oa:true" in params["filter"]
        assert "type:journal-article" in params["filter"]
        assert "language:en" in params["filter"]
        assert "is_retracted:false" in params["filter"]
        assert "has_abstract:true" in params["filter"]
        assert "has_fulltext:false" in params["filter"]
```
Summary by Sourcery
Add a multi-source academic literature search subsystem, including an MCP server, OpenAlex API client, shared models, DOI normalization/deduplication utilities, and a work distributor, along with comprehensive unit tests and documentation.
New Features:
Enhancements:
Tests: