Daily Papers API #2554

hlky · 2024-09-19T00:57:01Z

This PR introduces list_papers, which uses the Daily Papers API; search_papers, which uses the papers/search endpoint; and get_paper, which uses the papers/{paper_id} endpoint.

We add a DailyPaper dataclass, containing a Paper and associated metadata.

We add a Paper dataclass, containing metadata about the paper itself.

We add a PaperAuthor dataclass, containing metadata about one of the paper's authors.

PaperAuthor's user and DailyPaper's submitted_by use the existing User dataclass. The API returns fewer fields for them than User defines, so they could have their own dataclasses instead.

We add list_papers to HfApi. It accepts date as a str in YYYY-MM-DD format; it could also accept a datetime. The endpoint itself also accepts a full datetime in the format %Y-%m-%dT%H:%M:%S.%fZ. Invalid dates return HTTP 400.
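To illustrate the two date formats mentioned above, here is a minimal sketch of how a client could normalize a str, date or datetime before sending it. The helper name normalize_papers_date is hypothetical and is not part of this PR, and treating naive datetimes as UTC is an assumption of the sketch.

```python
from datetime import date, datetime, timezone
from typing import Union

DATE_FORMAT = "%Y-%m-%d"  # the YYYY-MM-DD form expected by list_papers
DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # the full form the endpoint also accepts


def normalize_papers_date(value: Union[str, date, datetime]) -> str:
    """Hypothetical helper: return a date string in one of the two formats above."""
    # Check datetime first, because datetime is a subclass of date.
    if isinstance(value, datetime):
        if value.tzinfo is not None:
            value = value.astimezone(timezone.utc)
        # Assumption: naive datetimes are already expressed in UTC.
        return value.strftime(DATETIME_FORMAT)
    if isinstance(value, date):
        return value.strftime(DATE_FORMAT)
    # Validate strings locally so that a bad date fails before the request
    # is sent. Raises ValueError on anything that is not a calendar date.
    return datetime.strptime(value, DATE_FORMAT).strftime(DATE_FORMAT)
```

Validating on the client is optional, since the endpoint already rejects invalid dates with HTTP 400; it only moves the failure earlier.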

We add a PaperSearchInfo dataclass, containing the minimal metadata returned by search_papers.

We add search_papers to HfApi. It accepts query as a str, which can be a text query or an arXiv paper ID.

We add get_paper to HfApi, which accepts either paper_id as a str or a PaperSearchInfo object as paper_search. Because the data returned by papers/{paper_id} differs slightly from the Daily Papers endpoint, we add a static method from_get_paper to DailyPaper. Some fields, namely thumbnail and numComments, are unavailable from papers/{paper_id}; when a PaperSearchInfo is provided, we copy its thumbnail into the DailyPaper object.
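The thumbnail copy described above can be sketched roughly as follows. This is not the code in the PR: the two dataclasses are simplified, hypothetical stand-ins with only a few fields, and fill_thumbnail is an illustrative name.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class PaperSearchInfoSketch:
    """Simplified stand-in for PaperSearchInfo (hypothetical, not the real class)."""
    paper_id: str
    title: str
    thumbnail: str


@dataclass
class DailyPaperSketch:
    """Simplified stand-in for DailyPaper (hypothetical, not the real class)."""
    paper_id: str
    title: str
    thumbnail: str = ""  # unavailable from papers/{paper_id}
    comments: int = 0  # numComments is unavailable from papers/{paper_id}


def fill_thumbnail(
    paper: DailyPaperSketch,
    paper_search: Optional[PaperSearchInfoSketch] = None,
) -> DailyPaperSketch:
    """Copy the thumbnail from a search result when one was provided."""
    # Without a search result, or with one for a different paper,
    # there is nothing safe to copy.
    if paper_search is None or paper_search.paper_id != paper.paper_id:
        return paper
    return replace(paper, thumbnail=paper_search.thumbnail)
```

Returning a new object with dataclasses.replace keeps the value built from papers/{paper_id} unchanged, which makes the difference between the two call styles easy to see.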

We add the tests test_papers_by_date, test_search_papers, test_get_paper_by_id and test_get_paper_by_paper_search_info under DailyPaperApiTest.

hlky · 2024-09-19T01:25:05Z

Here's an example DailyPaper:

DailyPaper(
    paper=Paper(
        paper_id="2409.11340",
        authors=[
            PaperAuthor(
                author_id="66ea3b25353c1b9b84254825",
                user=User(
                    username="Shitao",
                    fullname="Xiao",
                    avatar_url="/avatars/c0675d05a52192ee14e9ab1633353956.svg",
                    details=None,
                    is_following=None,
                    is_pro=False,
                    num_models=None,
                    num_datasets=None,
                    num_spaces=None,
                    num_discussions=None,
                    num_papers=None,
                    num_upvotes=None,
                    num_likes=None,
                    num_following=None,
                    num_followers=None,
                    orgs=[],
                ),
                name="Shitao Xiao",
                status="claimed_verified",
                status_changed_at=datetime.datetime(
                    2024, 9, 18, 7, 1, 29, 215000, tzinfo=datetime.timezone.utc
                ),
                hidden=False,
            ),
            PaperAuthor(
                author_id="66ea3b25353c1b9b84254826",
                user=None,
                name="Yueze Wang",
                status="",
                status_changed_at=None,
                hidden=False,
            ),
            PaperAuthor(
                author_id="66ea3b25353c1b9b84254827",
                user=User(
                    username="JUNJIE99",
                    fullname="JUNJIE ZHOU",
                    avatar_url="/avatars/42f09356a1282896573ccb44830cd327.svg",
                    details=None,
                    is_following=None,
                    is_pro=False,
                    num_models=None,
                    num_datasets=None,
                    num_spaces=None,
                    num_discussions=None,
                    num_papers=None,
                    num_upvotes=None,
                    num_likes=None,
                    num_following=None,
                    num_followers=None,
                    orgs=[],
                ),
                name="Junjie Zhou",
                status="claimed_verified",
                status_changed_at=datetime.datetime(
                    2024, 9, 18, 7, 1, 31, 41000, tzinfo=datetime.timezone.utc
                ),
                hidden=False,
            ),
            PaperAuthor(
                author_id="66ea3b25353c1b9b84254828",
                user=User(
                    username="avery00",
                    fullname="huaying Yuan",
                    avatar_url="/avatars/2537cee66afecc2d999e05b01c78d319.svg",
                    details=None,
                    is_following=None,
                    is_pro=False,
                    num_models=None,
                    num_datasets=None,
                    num_spaces=None,
                    num_discussions=None,
                    num_papers=None,
                    num_upvotes=None,
                    num_likes=None,
                    num_following=None,
                    num_followers=None,
                    orgs=[],
                ),
                name="Huaying Yuan",
                status="admin_assigned",
                status_changed_at=datetime.datetime(
                    2024, 9, 18, 7, 12, 24, 40000, tzinfo=datetime.timezone.utc
                ),
                hidden=False,
            ),
            PaperAuthor(
                author_id="66ea3b25353c1b9b84254829",
                user=None,
                name="Xingrun Xing",
                status="",
                status_changed_at=None,
                hidden=False,
            ),
            PaperAuthor(
                author_id="66ea3b25353c1b9b8425482a",
                user=User(
                    username="Ruiran",
                    fullname="Ruiran Yan",
                    avatar_url="/avatars/26aef5944759c2e4366a71eb8c7fc50a.svg",
                    details=None,
                    is_following=None,
                    is_pro=False,
                    num_models=None,
                    num_datasets=None,
                    num_spaces=None,
                    num_discussions=None,
                    num_papers=None,
                    num_upvotes=None,
                    num_likes=None,
                    num_following=None,
                    num_followers=None,
                    orgs=[],
                ),
                name="Ruiran Yan",
                status="admin_assigned",
                status_changed_at=datetime.datetime(
                    2024, 9, 18, 7, 12, 36, 909000, tzinfo=datetime.timezone.utc
                ),
                hidden=False,
            ),
            PaperAuthor(
                author_id="66ea3b25353c1b9b8425482b",
                user=User(
                    username="stingw",
                    fullname="Shu-Ting Wang",
                    avatar_url="/avatars/3486af06cc2c1562e09b04bb03360912.svg",
                    details=None,
                    is_following=None,
                    is_pro=False,
                    num_models=None,
                    num_datasets=None,
                    num_spaces=None,
                    num_discussions=None,
                    num_papers=None,
                    num_upvotes=None,
                    num_likes=None,
                    num_following=None,
                    num_followers=None,
                    orgs=[],
                ),
                name="Shuting Wang",
                status="admin_assigned",
                status_changed_at=datetime.datetime(
                    2024, 9, 18, 7, 12, 43, 24000, tzinfo=datetime.timezone.utc
                ),
                hidden=False,
            ),
            PaperAuthor(
                author_id="66ea3b25353c1b9b8425482c",
                user=None,
                name="Tiejun Huang",
                status="",
                status_changed_at=None,
                hidden=False,
            ),
            PaperAuthor(
                author_id="66ea3b25353c1b9b8425482d",
                user=User(
                    username="zl101",
                    fullname="zhengliu",
                    avatar_url="/avatars/ef13dc7ce243819bc0da9b04e778b432.svg",
                    details=None,
                    is_following=None,
                    is_pro=False,
                    num_models=None,
                    num_datasets=None,
                    num_spaces=None,
                    num_discussions=None,
                    num_papers=None,
                    num_upvotes=None,
                    num_likes=None,
                    num_following=None,
                    num_followers=None,
                    orgs=[],
                ),
                name="Zheng Liu",
                status="extracted_pending",
                status_changed_at=datetime.datetime(
                    2024, 9, 18, 2, 30, 1, 852000, tzinfo=datetime.timezone.utc
                ),
                hidden=False,
            ),
        ],
        published_at=datetime.datetime(
            2024, 9, 17, 16, 42, 46, tzinfo=datetime.timezone.utc
        ),
        title="OmniGen: Unified Image Generation",
        summary="In this work, we introduce OmniGen, a new diffusion model for unified image\ngeneration. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen\nno longer requires additional modules such as ControlNet or IP-Adapter to\nprocess diverse control conditions. OmniGenis characterized by the following\nfeatures: 1) Unification: OmniGen not only demonstrates text-to-image\ngeneration capabilities but also inherently supports other downstream tasks,\nsuch as image editing, subject-driven generation, and visual-conditional\ngeneration. Additionally, OmniGen can handle classical computer vision tasks by\ntransforming them into image generation tasks, such as edge detection and human\npose recognition. 2) Simplicity: The architecture of OmniGen is highly\nsimplified, eliminating the need for additional text encoders. Moreover, it is\nmore user-friendly compared to existing diffusion models, enabling complex\ntasks to be accomplished through instructions without the need for extra\npreprocessing steps (e.g., human pose estimation), thereby significantly\nsimplifying the workflow of image generation. 3) Knowledge Transfer: Through\nlearning in a unified format, OmniGen effectively transfers knowledge across\ndifferent tasks, manages unseen tasks and domains, and exhibits novel\ncapabilities. We also explore the model's reasoning capabilities and potential\napplications of chain-of-thought mechanism. This work represents the first\nattempt at a general-purpose image generation model, and there remain several\nunresolved issues. We will open-source the related resources at\nhttps://github.com/VectorSpaceLab/OmniGen to foster advancements in this field.",
        upvotes=38,
        discussion_id="66ea3b29353c1b9b842549ac",
    ),
    published_at=datetime.datetime(
        2024, 9, 18, 1, 0, 6, 728000, tzinfo=datetime.timezone.utc
    ),
    title="OmniGen: Unified Image Generation",
    thumbnail="https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2409.11340.png",
    comments=3,
    submitted_by=User(
        username="",
        fullname="AK",
        avatar_url="https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg",
        details=None,
        is_following=None,
        is_pro=False,
        num_models=None,
        num_datasets=None,
        num_spaces=None,
        num_discussions=None,
        num_papers=None,
        num_upvotes=None,
        num_likes=None,
        num_following=None,
        num_followers=None,
        orgs=[],
    ),
)

Although the API does not return them, we could add links to the arXiv page and the PDF: https://arxiv.org/abs/{paper_id} and https://arxiv.org/pdf/{paper_id} respectively.

Also, the API doesn't appear to allow retrieval of the paper's discussion.
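A minimal sketch of deriving those two links from a paper ID, using the URL patterns given above. The function name arxiv_links is hypothetical and is not part of this PR.

```python
from typing import Dict


def arxiv_links(paper_id: str) -> Dict[str, str]:
    """Hypothetical helper: build the arXiv abstract and PDF URLs for a paper ID."""
    return {
        "abs": f"https://arxiv.org/abs/{paper_id}",
        "pdf": f"https://arxiv.org/pdf/{paper_id}",
    }
```

On the dataclass these could equally be exposed as two read-only properties, which avoids storing values that are fully determined by paper_id.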

hlky · 2024-09-21T00:21:59Z

Example PaperSearchInfos from search_papers:

[
    PaperSearchInfo(
        paper_id="2409.07146",
        title="Gated Slot Attention for Efficient Linear-Time Sequence Modeling",
        thumbnail="https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2409.07146.png",
        source="hf",
    ),
    PaperSearchInfo(
        paper_id="2409.03752",
        title="Attention Heads of Large Language Models: A Survey",
        thumbnail="https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2409.03752.png",
        source="hf",
    ),
    ...
]

Example DailyPaper from get_paper with paper_id:

DailyPaper(
    paper=Paper(
        paper_id="2409.11074",
        authors=[
            PaperAuthor(
                author_id="66ead57361228b02f8144cdf",
                user=None,
                name="Adrian Cosma",
                status="",
                status_changed_at=None,
                hidden=False,
            ),
            PaperAuthor(
                author_id="66ead57361228b02f8144ce0",
                user=None,
                name="Ana-Maria Bucur",
                status="",
                status_changed_at=None,
                hidden=False,
            ),
            PaperAuthor(
                author_id="66ead57361228b02f8144ce1",
                user=None,
                name="Emilian Radoi",
                status="",
                status_changed_at=None,
                hidden=False,
            ),
        ],
        published_at=datetime.datetime(
            2024, 9, 17, 11, 3, 46, tzinfo=datetime.timezone.utc
        ),
        title="RoMath: A Mathematical Reasoning Benchmark in Romanian",
        summary="Mathematics has long been conveyed through natural language, primarily for\nhuman understanding. With the rise of mechanized mathematics and proof\nassistants, there is a growing need to understand informal mathematical text,\nyet most existing benchmarks focus solely on English, overlooking other\nlanguages. This paper introduces RoMath, a Romanian mathematical reasoning\nbenchmark suite comprising three datasets: RoMath-Baccalaureate,\nRoMath-Competitions and RoMath-Synthetic, which cover a range of mathematical\ndomains and difficulty levels, aiming to improve non-English language models\nand promote multilingual AI development. By focusing on Romanian, a\nlow-resource language with unique linguistic features, RoMath addresses the\nlimitations of Anglo-centric models and emphasizes the need for dedicated\nresources beyond simple automatic translation. We benchmark several open-weight\nlanguage models, highlighting the importance of creating resources for\nunderrepresented languages. We make the code and dataset available.",
        upvotes=1,
        discussion_id="66ead57461228b02f8144d31",
    ),
    published_at=datetime.datetime(
        2024, 9, 19, 17, 17, 31, 279000, tzinfo=datetime.timezone.utc
    ),
    title="RoMath: A Mathematical Reasoning Benchmark in Romanian",
    thumbnail="",
    comments=0,
    submitted_by=User(
        username="IAMJB",
        fullname="JB D.",
        avatar_url="/avatars/1208629f14f010dbc2cd94f3c30f9baf.svg",
        details=None,
        is_following=None,
        is_pro=False,
        num_models=None,
        num_datasets=None,
        num_spaces=None,
        num_discussions=None,
        num_papers=None,
        num_upvotes=None,
        num_likes=None,
        num_following=None,
        num_followers=None,
        orgs=[],
    ),
)

Example DailyPaper from get_paper with paper_search:

DailyPaper(
    paper=Paper(
        paper_id="2409.11074",
        authors=[
            PaperAuthor(
                author_id="66ead57361228b02f8144cdf",
                user=None,
                name="Adrian Cosma",
                status="",
                status_changed_at=None,
                hidden=False,
            ),
            PaperAuthor(
                author_id="66ead57361228b02f8144ce0",
                user=None,
                name="Ana-Maria Bucur",
                status="",
                status_changed_at=None,
                hidden=False,
            ),
            PaperAuthor(
                author_id="66ead57361228b02f8144ce1",
                user=None,
                name="Emilian Radoi",
                status="",
                status_changed_at=None,
                hidden=False,
            ),
        ],
        published_at=datetime.datetime(
            2024, 9, 17, 11, 3, 46, tzinfo=datetime.timezone.utc
        ),
        title="RoMath: A Mathematical Reasoning Benchmark in Romanian",
        summary="Mathematics has long been conveyed through natural language, primarily for\nhuman understanding. With the rise of mechanized mathematics and proof\nassistants, there is a growing need to understand informal mathematical text,\nyet most existing benchmarks focus solely on English, overlooking other\nlanguages. This paper introduces RoMath, a Romanian mathematical reasoning\nbenchmark suite comprising three datasets: RoMath-Baccalaureate,\nRoMath-Competitions and RoMath-Synthetic, which cover a range of mathematical\ndomains and difficulty levels, aiming to improve non-English language models\nand promote multilingual AI development. By focusing on Romanian, a\nlow-resource language with unique linguistic features, RoMath addresses the\nlimitations of Anglo-centric models and emphasizes the need for dedicated\nresources beyond simple automatic translation. We benchmark several open-weight\nlanguage models, highlighting the importance of creating resources for\nunderrepresented languages. We make the code and dataset available.",
        upvotes=1,
        discussion_id="66ead57461228b02f8144d31",
    ),
    published_at=datetime.datetime(
        2024, 9, 19, 17, 17, 31, 279000, tzinfo=datetime.timezone.utc
    ),
    title="RoMath: A Mathematical Reasoning Benchmark in Romanian",
    thumbnail="https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2409.11074.png",
    comments=0,
    submitted_by=User(
        username="IAMJB",
        fullname="JB D.",
        avatar_url="/avatars/1208629f14f010dbc2cd94f3c30f9baf.svg",
        details=None,
        is_following=None,
        is_pro=False,
        num_models=None,
        num_datasets=None,
        num_spaces=None,
        num_discussions=None,
        num_papers=None,
        num_upvotes=None,
        num_likes=None,
        num_following=None,
        num_followers=None,
        orgs=[],
    ),
)

HuggingFaceDocBuilderDev · 2024-10-02T16:48:42Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

hanouticelina

Hi @hlky, thanks a lot for working on this 🤗 I left a couple of comments to keep the design of the API minimal and consistent.

src/huggingface_hub/hf_api.py

hlky · 2024-10-02T22:57:21Z

Thanks for the review. I've made the required changes.

hanouticelina

Hi @hlky! thanks for this iteration! Apart from the comments below, I think we are almost there :)

src/huggingface_hub/hf_api.py

hlky · 2024-10-03T11:04:10Z

Thanks again. I've made the requested changes.

hanouticelina

Thanks @hlky for this iteration! Sorry for this back and forth review 😄

src/huggingface_hub/hf_api.py

hlky · 2024-10-03T13:27:43Z

Looks like the CI endpoint has no papers so the tests are failing.

hanouticelina · 2024-10-03T15:53:21Z

@hlky I pushed a fix where HfApi() is initialized directly in each test.
Patching setUpClass() was sufficient to run test_papers_by_date and test_papers_by_query but not for test_get_paper_by_id somehow.
Failing tests are unrelated, let's wait for a final review from @Wauplin 🙂
thanks again for working on this! 🤗

Wauplin

Hi there! Sorry been late on the feedback / review. I've checked the API and discussions above and I think we should settle on supporting only /api/papers/search which supports only the "q" parameter and drop support for /api/daily_papers. If we want to be able to search by date in the future, we will update the backend. The server-side API is not consistent (yet) so let's start small client-side and expand once the API has evolved. Sorry if that changes (again) the spec of this PR 🙈 Please see the details below
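For reference, the request that this narrower scope implies is a GET to /api/papers/search with a single "q" parameter. The sketch below only builds the URL; the helper name papers_search_url is hypothetical, and the default base URL https://huggingface.co is an assumption rather than something stated in this thread.

```python
from urllib.parse import urlencode

# Assumption: the public Hub endpoint. Pass another base URL to override it.
HF_ENDPOINT = "https://huggingface.co"


def papers_search_url(query: str, endpoint: str = HF_ENDPOINT) -> str:
    """Hypothetical helper: build the papers search URL with its "q" parameter."""
    # urlencode escapes spaces and other reserved characters in the query.
    return f"{endpoint}/api/papers/search?{urlencode({'q': query})}"
```

The same function covers both a free-text query and an arXiv paper ID, since both are sent as the value of "q".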

tests/test_hf_api.py

src/huggingface_hub/hf_api.py

tests/test_hf_api.py

Wauplin · 2024-10-08T12:31:03Z

src/huggingface_hub/hf_api.py

+list_papers = api.list_papers
+paper_info = api.paper_info


These two must be added at the root of huggingface_hub package. To do so, you need to add them to this list and run make style which will make sure alphabetical order is respected + add a type checking annotation. You can then commit the changes.

src/huggingface_hub/hf_api.py

hlky · 2024-10-09T10:20:56Z

Thanks for the review. I've made the requested changes.

hanouticelina

hi @hlky,
looks good to me, thanks a lot! I pushed the additional test suggested by @Wauplin here and fixed the docstring in 22a5ad5. 🤗

Wauplin

Looks good! Thanks for adding this 🤗

hlky force-pushed the papers-api branch from 8b41f7f to f095085 on September 21, 2024 00:06

hanouticelina self-requested a review October 2, 2024 16:44

hanouticelina requested changes Oct 2, 2024

hanouticelina reviewed Oct 2, 2024

hlky force-pushed the papers-api branch 2 times, most recently from 14557f2 to 90bf50c on October 2, 2024 22:56

hanouticelina self-requested a review October 3, 2024 09:21

hanouticelina requested changes Oct 3, 2024

hlky force-pushed the papers-api branch from c3cc316 to afadc22 on October 3, 2024 11:03

hanouticelina requested changes Oct 3, 2024

Wauplin reviewed Oct 8, 2024

hlky force-pushed the papers-api branch from a2c84fd to a4826e4 on October 9, 2024 10:20

hlky and others added 7 commits October 9, 2024 11:20

Daily Papers API

ea8cf80

Apply suggestions from code review

4729dc8

Co-authored-by: Celina Hanouti <hanouticelina@gmail.com>

Apply suggestions from code review

e566c9b

Co-authored-by: Celina Hanouti <hanouticelina@gmail.com>

Fix tests

403b45f

Run papers API tests independently

ef6d225

Apply suggestions from code review

f0d7875

Co-authored-by: Lucain <lucainp@gmail.com>

Remove date

a84ffe0

hlky force-pushed the papers-api branch from a4826e4 to a84ffe0 on October 9, 2024 10:20

additional test and update docstring

22a5ad5

hanouticelina approved these changes Oct 9, 2024

Wauplin approved these changes Oct 9, 2024

hanouticelina merged commit 2c7c19d into huggingface:main Oct 9, 2024
16 checks passed
