Improve JF2 references #20

aciccarello · 2023-04-28T22:09:48Z

Is your feature request related to a problem?

I've noticed that the data returned by references isn't as normalized as I'd like: it contains lots of extra properties and is often missing author properties. When I compare it with the output of https://xray.p3k.app/ for the same URLs, X-Ray returns a clean, normalized entry while Indiekit's output is less organized.

Example 1: https://jamesg.blog/2023/04/18/source-code-folder-names/

{
  "url": "https://jamesg.blog/2023/04/18/source-code-folder-names/",
  "children": [
    {
      "type": "card",
      "name": "James' Coffee Blog ☕",
      "url": "https://jamesg.blog"
    },
    {
      "type": "entry",
      "name": "My source code root folder name",
      "published": "2023-04-18T00:00:00",
      "category": "Coding",
      "content": {
        "html": "<p>I like seeing what people call the root folder in which they store their source code. This is the folder where all — or a lot of — your projects are stored. In my case, my programming projects go in a folder called <code>src</code>. (Although I have a strange habit of nesting personal projects that are related to each other. I believe my source code files are in need of a spring clean.)</p>\n<p>That long parenthetical notwithstanding, I find the name <code>src</code> cool. It’s a short way of saying source code; apt, simple, easy to type. Furthermore, <code>src</code> is different to the names of the other folders in my root directory, which makes autocomplete a breeze when I’m tying in my terminal to navigate to a source code folder.</p>\n<p>A common example I have seen is <code>Code</code>, or variants thereof. I’m curious: if you code, what do you call the root folder in which you store your source code?</p>",
        "text": "I like seeing what people call the root folder in which they store their source code. This is the folder where all — or a lot of — your projects are stored. In my case, my programming projects go in a folder called src. (Although I have a strange habit of nesting personal projects that are related to each other. I believe my source code files are in need of a spring clean.)\nThat long parenthetical notwithstanding, I find the name src cool. It’s a short way of saying source code; apt, simple, easy to type. Furthermore, src is different to the names of the other folders in my root directory, which makes autocomplete a breeze when I’m tying in my terminal to navigate to a source code folder.\nA common example I have seen is Code, or variants thereof. I’m curious: if you code, what do you call the root folder in which you store your source code?"
      }
    }
  ]
}

Example 2: https://aaronparecki.com/2023/04/24/8/lawyer

{
  "url": "https://aaronparecki.com/2023/04/24/8/lawyer",
  "children": [
    {
      "type": "item"
    },
    {
      "type": "item"
    },
    {
      "type": "item"
    },
    {
      "type": "item"
    },
    {
      "type": "item"
    },
    {
      "type": "entry",
      "author": {
        "type": "card",
        "url": "https://aaronparecki.com/",
        "photo": [
          {
            "alt": "Aaron Parecki",
            "url": "https://aaronparecki.com/images/profile.jpg"
          }
        ],
        "name": "Aaron Parecki"
      },
      "content": {
        "html": "In retrospect, I probably didn't need to include \"but I am not a lawyer\" in an email to our lawyers",
        "text": "In retrospect, I probably didn't need to include \"but I am not a lawyer\" in an email to our lawyers"
      },
      "location": {
        "type": "adr",
        "locality": "Portland",
        "region": "Oregon",
        "country": "USA"
      },
      "url": "https://aaronparecki.com/2023/04/24/8/lawyer",
      "published": "2023-04-24T14:12:20-07:00",
      "syndication": [
        "at://did:plc:s2koow7r6t7tozgd4slc3dsg/app.bsky.feed.post/3ju5hvccis32q",
        "https://micro.blog/aaronpk/18625298"
      ],
      "pk-num-likes": "15",
      "pk-num-reposts": "1",
      "pk-num-replies": "3",
      "like": {
        "children": [
          {
            "type": "cite",
            "url": [
              "https://emacs.ch/users/skybert#likes/56653",
              "https://emacs.ch/users/skybert"
            ],
            "author": {
              "type": "card",
              "photo": "https://aaronparecki.com/assets/images/no-profile-photo.png",
              "name": ""
            },
            "name": "15 of these cite elements"
          }
        ]
      },
      "repost": {
        "type": "cite",
        "url": [
          "https://tdd.social/users/CodingItWrong/statuses/110256727388551917/activity",
          "https://tdd.social/users/CodingItWrong"
        ],
        "author": {
          "type": "card",
          "photo": "https://aaronparecki.com/assets/images/no-profile-photo.png",
          "name": ""
        },
        "name": "Josh Justice"
      },
      "comment": {
        "children": [
          {
            "type": "cite",
            "author": {
              "type": "card",
              "photo": "https://aaronparecki.com/assets/images/no-profile-photo.png",
              "name": "dominikhoecht",
              "url": "https://micro.blog/dominikhoecht"
            },
            "content": {
              "html": "<p><a href=\"https://micro.blog/aaronpk\" rel=\"nofollow\">@aaronpk</a> 😂</p>",
              "text": "@aaronpk 😂"
            },
            "url": "https://micro.blog/dominikhoecht/18694091",
            "published": "2023-04-27T15:41:21+00:00"
          },
          {
            "type": "cite",
            "author": {
              "type": "card",
              "photo": "https://aaronparecki.com/assets/images/no-profile-photo.png",
              "name": "carpetbomberz",
              "url": "https://mastodon.online/users/carpetbomberz"
            },
            "content": {
              "html": "<p><span class=\"h-card\"><a href=\"https://aaronparecki.com/aaronpk\" class=\"u-url\">@<span>aaronpk</span></a></span> In your defense you are most authoritative on the many subjects upon which you expound. I'm thinking back to the episode on ContentID fer' instance. 😄</p>",
              "text": "@aaronpk In your defense you are most authoritative on the many subjects upon which you expound. I'm thinking back to the episode on ContentID fer' instance. 😄"
            },
            "url": "https://mastodon.online/@carpetbomberz/110256111311133962",
            "published": "2023-04-24T15:19:05-07:00",
            "children": [
              {
                "type": "card",
                "url": "https://aaronparecki.com/aaronpk",
                "name": "@aaronpk"
              }
            ]
          },
          {
            "type": "cite",
            "author": {
              "type": "card",
              "photo": "https://aaronparecki.com/assets/images/no-profile-photo.png",
              "name": "lmika",
              "url": "https://micro.blog/lmika"
            },
            "content": {
              "html": "<p><a href=\"https://micro.blog/aaronpk\" rel=\"nofollow\">@aaronpk</a> Just hope that they don’t reply with “I’m not a lawyer either”. 😀</p>",
              "text": "@aaronpk Just hope that they don’t reply with “I’m not a lawyer either”. 😀"
            },
            "url": "https://micro.blog/lmika/18625632",
            "published": "2023-04-24T21:45:09+00:00"
          }
        ]
      }
    },
    {
      "type": "card",
      "url": "https://aaronparecki.com/",
      "uid": "https://aaronparecki.com/",
      "photo": "https://aaronparecki.com/images/profile.jpg",
      "note": "Hi, I'm Aaron Parecki, Senior Security Architect at Okta, and co-founder of\nIndieWebCamp.\nI maintain oauth.net, write and consult about OAuth, and\nparticipate in the OAuth Working Group at the IETF. I also help people learn about video production and livestreaming and dabble in product design.\n\nI've been tracking my location since 2008 and I wrote 100 songs in 100 days.\nI've spoken at conferences around the world about\nowning your data,\nOAuth,\nquantified self,\nand explained why R is a vowel. Read more.",
      "name": "Aaron Parecki",
      "bday": "--12-28",
      "street-address": "PO Box 12433",
      "locality": "Portland",
      "region": "Oregon",
      "country-name": "USA",
      "postal-code": "97212",
      "org": {
        "children": [
          {
            "type": "card",
            "photo": "https://aaronparecki.com/images/okta.png",
            "role": "Security Architect",
            "url": "https://developer.okta.com/",
            "name": "Okta"
          },
          {
            "type": "card",
            "photo": "https://aaronparecki.com/images/indiewebcamp.png",
            "url": "https://indieweb.org/",
            "name": "IndieWebCamp",
            "role": "Founder"
          }
        ]
      }
    }
  ]
}

Describe the solution you’d like

I'd like the references to show a much simpler model, including the entry at the top level with author data included.
I'm guessing the solution probably rests in the mf2tojf2 package.

X-Ray Output 1: https://jamesg.blog/2023/04/18/source-code-folder-names/

{
    "data": {
        "type": "entry",
        "published": "2023-04-18T00:00:00",
        "category": [
            "Coding"
        ],
        "name": "My source code root folder name",
        "content": {
            "text": "I like seeing what people call the root folder in which they store their source code. This is the folder where all \u2014 or a lot of \u2014 your projects are stored. In my case, my programming projects go in a folder called src. (Although I have a strange habit of nesting personal projects that are related to each other. I believe my source code files are in need of a spring clean.)\nThat long parenthetical notwithstanding, I find the name src cool. It\u2019s a short way of saying source code; apt, simple, easy to type. Furthermore, src is different to the names of the other folders in my root directory, which makes autocomplete a breeze when I\u2019m tying in my terminal to navigate to a source code folder.\nA common example I have seen is Code, or variants thereof. I\u2019m curious: if you code, what do you call the root folder in which you store your source code?",
            "html": "<p>I like seeing what people call the root folder in which they store their source code. This is the folder where all \u2014 or a lot of \u2014 your projects are stored. In my case, my programming projects go in a folder called <code>src</code>. (Although I have a strange habit of nesting personal projects that are related to each other. I believe my source code files are in need of a spring clean.)</p>\n<p>That long parenthetical notwithstanding, I find the name <code>src</code> cool. It\u2019s a short way of saying source code; apt, simple, easy to type. Furthermore, <code>src</code> is different to the names of the other folders in my root directory, which makes autocomplete a breeze when I\u2019m tying in my terminal to navigate to a source code folder.</p>\n<p>A common example I have seen is <code>Code</code>, or variants thereof. I\u2019m curious: if you code, what do you call the root folder in which you store your source code?</p>"
        },
        "author": {
            "type": "card",
            "name": "James' Coffee Blog \u2615",
            "url": "https://jamesg.blog",
            "photo": null
        },
        "post-type": "article"
    },
    "url": "https://jamesg.blog/2023/04/18/source-code-folder-names/",
    "code": 200,
    "source-format": "mf2+html"
}

X-Ray Output 2: https://aaronparecki.com/2023/04/24/8/lawyer

{
    "data": {
        "type": "entry",
        "published": "2023-04-24T14:12:20-07:00",
        "url": "https://aaronparecki.com/2023/04/24/8/lawyer",
        "syndication": [
            "https://micro.blog/aaronpk/18625298"
        ],
        "content": {
            "text": "In retrospect, I probably didn't need to include \"but I am not a lawyer\" in an email to our lawyers"
        },
        "author": {
            "type": "card",
            "name": "Aaron Parecki",
            "url": "https://aaronparecki.com/",
            "photo": "https://aaronparecki.com/images/profile.jpg"
        },
        "post-type": "note"
    },
    "url": "https://aaronparecki.com/2023/04/24/8/lawyer",
    "code": 200,
    "source-format": "mf2+json"
}

Describe alternatives you’ve considered

I'm currently trying to normalize the input in my post template function but I think it would be helpful to the community to have shared logic.

Additional context

No response

paulrobertlloyd · 2023-05-20T00:10:21Z

Just had a look at the code underlying Xray and… wow. 1008 lines of code to parse and massage an incoming feed to generate the results you are seeing here. 🤯

This project started as a straightforward implementation of mf2tojf2.py, and for incoming well-structured MF2 objects, it works. But when given a page of unknown microformatted markup, it’s going to struggle to produce well-formed data.

I’d like to say this is something that I can look to improve, but not at the cost of working on Indiekit – even more so given including references is an option that’s disabled by default.

Perhaps there’s a way of breaking this apart and looking to make smaller, incremental improvements (the list of empty children with only { type: "item" } seems like something that shouldn’t happen, for example).
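As a rough illustration of that kind of incremental fix, here is a minimal sketch that strips children containing nothing but { type: "item" }. The helper names isEmptyItem and dropEmptyItems are hypothetical and not part of mf2tojf2; the sketch assumes the JF2 shape shown in Example 2 above.

```javascript
// Hypothetical helpers for illustration; not part of mf2tojf2's API.

// An object counts as "empty" when its only property is `type: "item"`.
const isEmptyItem = (value) =>
  value !== null &&
  typeof value === "object" &&
  !Array.isArray(value) &&
  Object.keys(value).length === 1 &&
  value.type === "item";

// Recursively copy a JF2 value, dropping empty items from every
// `children` array. The input object is not mutated.
function dropEmptyItems(value) {
  if (Array.isArray(value)) {
    return value.map(dropEmptyItems);
  }
  if (value === null || typeof value !== "object") {
    return value;
  }
  const result = {};
  for (const [key, child] of Object.entries(value)) {
    result[key] =
      key === "children" && Array.isArray(child)
        ? child.filter((item) => !isEmptyItem(item)).map(dropEmptyItems)
        : dropEmptyItems(child);
  }
  return result;
}
```

Applied to Example 2, this would leave only the entry and the card in the top-level children array. It does not address why the empty items are produced in the first place.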

Open to suggestions… maybe this is something to put to the IndieWeb community to see if anyone would like to contribute parsing improvements?

aciccarello · 2023-05-20T21:43:59Z

I agree that this is much more of a nice-to-have than some of the key Indiekit work. I imagine that aiming to normalize all messy content would be an impossible task. I'd need to look more at xray and see if there is any set of agreed-upon parsing specs to come up with a list of target improvements.

I'll probably keep iterating on the massaging logic I'm using with my Indiekit instance. I'd also love to include some Metaformats logic to parse meta tags on sites without mf2. So far the main things I've added are finding the main h-entry and author, but I'm sure I'll discover more as I reply to more sites with Indiekit.
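For anyone curious what that massaging might look like, here is a minimal sketch, assuming the reference shape from the examples above (a url plus a children array). The function name pickMainEntry is hypothetical, and the logic is a deliberately simplified stand-in: it is not the authorship spec, and falling back to the first top-level card is an assumption that will be wrong on some pages.

```javascript
// Hypothetical helper; a simplified stand-in for the main-entry and
// authorship logic, not an implementation of the authorship spec.
function pickMainEntry(reference) {
  const children = Array.isArray(reference.children) ? reference.children : [];
  const entries = children.filter((child) => child.type === "entry");

  // Prefer the entry whose `url` matches the page URL, else the first entry.
  const entry =
    entries.find((child) => child.url === reference.url) ?? entries[0];

  // Nothing to hoist: return the reference unchanged.
  if (!entry) {
    return reference;
  }

  // Keep the entry's own author if it has one; otherwise fall back to
  // the first top-level card (an assumption, not a rule from any spec).
  const card = children.find((child) => child.type === "card");
  const author = entry.author ?? card;

  return {
    ...entry,
    url: entry.url ?? reference.url,
    ...(author ? { author } : {}),
  };
}
```

With Example 1 this would return the entry with the "James' Coffee Blog ☕" card attached as author; with Example 2 the entry keeps the author it already has.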

aciccarello · 2023-05-20T22:04:17Z

I split out a couple more specific tasks. However, if you think these should be handled externally, I can look at adding this kind of logic to a different library.

paulrobertlloyd · 2023-05-21T16:13:03Z

Looking at the authorship spec you linked to in #22, I spotted mf-obj, a Node.js package that seems to cover some of the requirements here.

It hasn’t been updated for 7 years, and is unfortunately written in TypeScript, but maybe that could be used, or adapted for use here?

aciccarello · 2023-05-23T08:19:48Z

Before I push for or try to contribute features to this library, I want to check that mf2tojf2 would be the best place for some of this functionality. Ideally we'd avoid different libraries downloading and parsing pages multiple times, but some reworking of MF2 objects by different libraries could be composed. Like you mentioned earlier, I would like to see a little more collaboration and coordination of efforts around node libraries, but I also wouldn't want to tie things to another package that no one has capacity to maintain.

Relevant Libraries

| Library | Last Release | Focus | Input | Output | Notes |
| --- | --- | --- | --- | --- | --- |
| @paulrobertlloyd/mf2tojf2 | 2022-11 | Convert to jf2 | MF2 | JF2 | This library. Also loads reference URLs |
| microformat-node | 2016-10 | Parsing | URL | MF2 | Recommended but no recent activity |
| mf-obj | 2016-06 | Utils | URL | MF2 | Uses microformat-node. Implements authorship algorithm |
| microformats-parser | 2022-01 | Parsing | HTML | MF2 | Used by mf2tojf2 |
| mf2utiljs | 2022-02 | Utils | URL or MF2 | MF2 | Uses microformats-parser. Port of mf2util. Implements authorship algorithm |
| ~~representative-h-card~~ | ~~2021-06~~ | ~~Util~~ | ~~MF2~~ | ~~MF2~~ | REPO ARCHIVED |

Microformat parsing features

| Feature | Input | Output | Notes |
| --- | --- | --- | --- |
| References | MF2 to get URL | MF2 | Already implemented here. Requires parsing but fetches different URL |
| Authorship | MF2 | MF2 | Implemented in mf2utiljs |
| Main Entry | MF2 | MF2 | Implemented in mf2utiljs? |
| Metaformats | HTML | MF2 | This probably should be included in lib that does initial parsing |

Opportunities to reuse logic

Let me know what you think but I'd love to see this type of functionality we're discussing pushed to other libraries and used more flexibly by the node community.

mf2utiljs for cleaning up microformats

Turns out there is more on npm than I initially thought. I hadn't seen mf2utiljs before. Since it ports the well-used Python package, I think it has a lot of potential for being a really useful package. I would probably want to check with the maintainer to see if they are up for more community involvement. But assuming it is a reliable library, I could see a microformats-parser > mf2utiljs > mf2tojf2 combo working well.

Leave metaformats to initial parser

Something like Metaformats might be better as a feature of microformats-parser, since it requires fetching and parsing raw HTML to get meta tags. Implementing it in another library would duplicate that fetching work. I don't think that should be enabled by default, but microformats-parser already has a set of experimentalOptions flags.

paulrobertlloyd · 2023-05-23T11:50:23Z

If you wanted to submit a PR that used mf2utiljs to clean up incoming Microformats to use in references, I think that would be really useful, and potentially solve this issue!

I wonder if it's a case of parsing the Microformats returned here with mf2utiljs:

mf2tojf2/lib/fetch-references.js

Line 37 in 3a0817d

const mf2 = await fetchMf2(url);
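
One way to read that suggestion, sketched under assumptions: the fetched MF2 passes through a clean-up step before it is converted. Both fetchMf2 and cleanUp are injected here so the sketch makes no claims about the real signatures in fetch-references.js or mf2utiljs; fetchCleanMf2 itself is a hypothetical name.

```javascript
// Hypothetical wrapper; `fetchMf2` and `cleanUp` are passed in so this
// sketch does not assume the real APIs of mf2tojf2 or mf2utiljs.
async function fetchCleanMf2(url, { fetchMf2, cleanUp = (mf2) => mf2 }) {
  // Same call as the referenced line in fetch-references.js.
  const mf2 = await fetchMf2(url);
  // Clean-up step (for example main entry or authorship handling)
  // applied before the MF2 is converted to JF2. Defaults to a no-op.
  return cleanUp(mf2, url);
}
```

Keeping the clean-up step optional would mean existing behaviour stays the same unless a caller opts in.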

aciccarello · 2023-08-13T04:53:19Z

From what I can tell, mf2utiljs was a one-off personal project. I haven't gotten any response about being open to community involvement. I think we might need to implement the authorship and main entry algorithms separately. I'm considering creating a library but would prefer not to create a separate package if possible.

paulrobertlloyd · 2023-08-14T09:44:13Z

If you’d like to contribute a PR to add them to this project, I think that could work. These algorithms do seem to fall into the category of converting mf2 to JF2.

At some point I also think it would make sense to ask about moving this project to the @microformats organisation, much like the new Node Microformats parser was, meaning this project can live alongside that project and mf2tojf2.py.

aciccarello added the enhancement New feature or request label Apr 28, 2023

paulrobertlloyd transferred this issue from getindiekit/indiekit Apr 28, 2023

paulrobertlloyd added the help wanted Extra attention is needed label May 20, 2023

This was referenced May 20, 2023

Find the main entry on the page #21

Open

Implement authorship spec #22

Open

This was referenced May 25, 2023

Support metaformats fallback option microformats/microformats-parser#224

Closed

Open to community involvement? drivet/mf2utiljs#1

Open

aciccarello mentioned this issue Aug 6, 2023

add metaformats fallback to fetch-references #23

Merged

aciccarello added this to @aciccarello's IndieWeb Tasks Nov 16, 2023
