Using pup to modify html #51

bramp · 2015-08-26T17:01:01Z

I have a use case where I want to add a class="table" to all table tags, that don't already have that class specified. Currently I use a hacky sed to do it, but I was wondering if pup could be a more robust way of doing that.
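For reference, the requested transformation can be sketched outside pup with Python's standard `html.parser`. This is only an illustration of the use case, not something pup provides; the names `AddTableClass` and `add_table_class` are made up here, and the sketch assumes "don't already have that class" means the `class` attribute does not already list `table`.

```python
from html import escape
from html.parser import HTMLParser


class AddTableClass(HTMLParser):
    """Echo the input as written, rewriting only <table> start tags."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep &amp; etc. untouched
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != "table":
            self.out.append(self.get_starttag_text())
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "table" not in classes:
            classes.append("table")
        attrs["class"] = " ".join(classes)
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs.items()
        )
        self.out.append(f"<table{rendered}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def add_table_class(html_text):
    parser = AddTableClass()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Unlike a sed one-liner, this leaves everything except the `<table>` start tags as it was written, and it appends to an existing `class` list instead of adding a second attribute.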


ericchiang · 2015-08-30T19:02:13Z

Would love this, but it seems like this would need its own language to describe the modification (like s/foo/bar/ for sed). Know of any tools that do something similar?

bramp · 2015-08-30T19:41:13Z

I'm not aware of any easy/simple similar tool to do modifications. There is XSLT, which is more complex, but would support this kind of transformation.

np · 2015-12-20T14:38:28Z

What about a flag that emits the original file, except that for each selected part the tool prints what the display function returns?

While not complete this sounds like a really good start.

np · 2015-12-20T15:02:07Z

This actually relies on having a slightly more powerful set of display functions. Following jq, a good start could be to have: (1) a way to emit arbitrary text using a template string, and (2) a way to emit HTML tags.

ericchiang · 2015-12-20T17:29:46Z

@np these seem like inverse selectors rather than transformations. I actually like the idea of having a --template flag of some sort, but I think these are new issues.

np · 2015-12-22T21:31:24Z

I was thinking of combining these templates with a global flag that would print back the context, namely what surrounds the selected parts.

ericchiang · 2015-12-22T21:46:44Z

@np, could you provide an example of what that would look like? I'm having trouble understanding how this relates to transformations.

np · 2015-12-24T21:59:55Z

OK, this would work as follows. Let's say we have a new flag, --transform for instance.

First, running pup with a selector but no display function would be pretty useless as this would print the original page. So pup --transform <SOME_SELECTOR> would do the same as just pup.

Now, combined with a display function, one can transform/edit the page. For instance, assuming a --template flag, we could write something such as:

pup --transform --template '<div class=foo>%s</div>' a

This command would wrap all a tags with a div tag.
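The semantics described here (print the page back unchanged, but replace each selected element with the template applied to it) can be sketched in Python as follows. Neither --transform nor --template exists in pup; `WrapMatches` and `transform` are hypothetical names, the "selector" is reduced to a bare tag name, and the selected tag is assumed to have explicit end tags.

```python
from html.parser import HTMLParser


class WrapMatches(HTMLParser):
    """Echo the input, replacing each selected element with template % element."""

    def __init__(self, tag, template):
        super().__init__(convert_charrefs=False)  # keep &amp; etc. as written
        self.tag, self.template = tag, template
        self.out = []    # finished output
        self.buf = []    # markup of the element currently being captured
        self.depth = 0   # how many selected elements we are nested inside

    def emit(self, text):
        (self.buf if self.depth else self.out).append(text)

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
        self.emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.emit(f"</{tag}>")
        if tag == self.tag and self.depth:
            self.depth -= 1
            if self.depth == 0:  # outermost match closed: apply the template
                self.out.append(self.template % "".join(self.buf))
                self.buf = []

    def handle_data(self, data):
        self.emit(data)

    def handle_entityref(self, name):
        self.emit(f"&{name};")

    def handle_charref(self, name):
        self.emit(f"&#{name};")

    def handle_comment(self, data):
        self.emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.emit(f"<!{decl}>")


def transform(html_text, tag, template):
    parser = WrapMatches(tag, template)
    parser.feed(html_text)
    parser.close()
    # An unclosed match is passed through untouched rather than dropped.
    return "".join(parser.out + parser.buf)


print(transform('<p><a href="/x">x</a></p>', "a", "<div class=foo>%s</div>"))
```

When nothing matches, the input comes back unchanged, which mirrors the observation that --transform with a selector and no display function would just print the original page.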

tehmoon · 2019-05-30T02:59:25Z

I was going to open an issue but ended up here instead. I have that need too, and using sed is incredibly annoying for html.

Since you guys added json{}, is there a way to import JSON instead? I feel like everything could be done in jq if it were possible to do json -> html. Alas, I did not find any tool to do so. Open to suggestions!!

Also thank you for that incredible piece of software.

Hrxn · 2019-05-30T04:27:48Z

For converting any markup formats, or similar, pandoc is probably the best idea.

https://github.com/jgm/pandoc
https://pandoc.org/

Conversion from/to JSON and any other supported format should also be possible.

tehmoon · 2019-05-30T16:10:28Z

Thank you! This looks nice indeed. I was looking for a way to safely replace stuff inside of the html, but I might have found a way: using pup -i 0 to format everything nicely and then just sed to match the line. This way it is actually a little bit easier and safer to regex using sed.

b0o · 2020-07-29T05:07:54Z

Here's an idea for how to accomplish this with pup, elaborating on @tehmoon's statement:

Add a new output display function json-full{} which returns an object containing the matched elements along with the full HTML tree:

$ < input.html pup 'p json-full{}'

{
  "match": [
    {
      "tag": "p",
      "text": "This is my website."
    },
    {
      "tag": "p",
      "text": "I hope you like it."
    },
    {
      "tag": "p",
      "text": "Ok cya later."
    }
  ],
  "tree": {
    "children": [
      {
        "children": [
          {
            "children": [
              {
                "tag": "title",
                "text": "Hello World"
              }
            ],
            "tag": "head"
          },
          {
            "children": [
              {
                "children": [
                  {
                    "alt": "logo",
                    "src": "https://example.com/logo.png",
                    "tag": "img"
                  },
                  {
                    "tag": "h1",
                    "text": "Hello World"
                  }
                ],
                "tag": "header"
              },
              {
                "children": [
                  {
                    "match": 0
                  },
                  {
                    "match": 1
                  },
                  {
                    "match": 2
                  }
                ],
                "tag": "div"
              }
            ],
            "tag": "body"
          }
        ],
        "tag": "html"
      }
    ],
    "tag": ""
  }
}

Then jq could be used to do the actual mutation:

$ < input.html pup 'p json-full{}' | jq '.match = (.match | map(.class = "foobar"))'

{
  "match": [
    {
      "tag": "p",
      "text": "This is my website.",
      "class": "foobar"
    },
    {
      "tag": "p",
      "text": "I hope you like it.",
      "class": "foobar"
    },
    {
      "tag": "p",
      "text": "Ok cya later.",
      "class": "foobar"
    }
  ],
  "tree": {
    "children": [
      {
        "children": [
          {
            "children": [
              {
                "tag": "title",
                "text": "Hello World"
              }
            ],
            "tag": "head"
          },
          {
            "children": [
              {
                "children": [
                  {
                    "alt": "logo",
                    "src": "https://example.com/logo.png",
                    "tag": "img"
                  },
                  {
                    "tag": "h1",
                    "text": "Hello World"
                  }
                ],
                "tag": "header"
              },
              {
                "children": [
                  {
                    "match": 0
                  },
                  {
                    "match": 1
                  },
                  {
                    "match": 2
                  }
                ],
                "tag": "div"
              }
            ],
            "tag": "body"
          }
        ],
        "tag": "html"
      }
    ],
    "tag": ""
  }
}

And finally pup could convert this back into HTML:

$ < input.html pup 'p json-full{}' | jq '.match = (.match | map(.class = "foobar"))' | pup --from-json

<html>
  <head>
    <title>Hello World</title>
  </head>
  <body>
    <header>
      <img src="https://example.com/logo.png" alt="logo">
      <h1>Hello World</h1>
    </header>
    <div>
      <p class="foobar">This is my website.</p>
      <p class="foobar">I hope you like it.</p>
      <p class="foobar">Ok cya later.</p>
    </div>
  </body>
</html>

This assumes that the tree output by pup contains all of the information necessary to reconstruct the original HTML (semantically, at least). This seems to be mostly true, but notably the doctype seems to be omitted by pup '* json{}'.
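The reverse step (the proposed --from-json) could look roughly like the following Python sketch. It is an illustration under assumptions, not pup code: the `render` function, the `VOID` set, and the rule that `text` is emitted before `children` are all choices made here, and the node shape (`tag`, `text`, `children`, every other key an attribute, `{"match": i}` as a placeholder) is taken from the example output above.

```python
from html import escape

# Elements that never take an end tag (partial list, enough for the example).
VOID = {"br", "hr", "img", "input", "link", "meta"}
RESERVED = {"tag", "text", "children"}


def render(node, matches):
    """Turn a json-full{}-style node back into an HTML string."""
    if set(node) == {"match"}:  # placeholder: substitute the matched node
        node = matches[node["match"]]
    tag = node["tag"]
    inner = escape(node.get("text", ""), quote=False)
    inner += "".join(render(child, matches) for child in node.get("children", []))
    if tag == "":  # the synthetic root has no markup of its own
        return inner
    attrs = "".join(
        f' {key}="{escape(str(value), quote=True)}"'
        for key, value in node.items()
        if key not in RESERVED
    )
    if tag in VOID:
        return f"<{tag}{attrs}>"
    return f"<{tag}{attrs}>{inner}</{tag}>"


doc = {
    "match": [
        {"tag": "p", "text": "This is my website."},
        {"tag": "p", "text": "Ok cya later."},
    ],
    "tree": {"tag": "", "children": [{"tag": "html", "children": [
        {"tag": "body", "children": [
            {"tag": "img", "src": "https://example.com/logo.png", "alt": "logo"},
            {"tag": "div", "children": [{"match": 0}, {"match": 1}]},
        ]},
    ]}]},
}

# Same mutation as the jq filter: add class="foobar" to every match.
for match in doc["match"]:
    match["class"] = "foobar"

print(render(doc["tree"], doc["match"]))
```

The sketch emits no indentation and cannot restore anything the JSON never recorded, such as the doctype or the relative order of text and child elements.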

The exact behavior would need to be worked out:

should matched nodes be included in both a matches array and their original position in the DOM tree, or should they only appear in the matches array?
should the matches array contain the path to the matched node?

Implementing this shouldn't be too hard. All that would be needed:

The json-full{} (pending a better name) display function:

the parsed tree should already be available
annotate the parsed tree with the match index on matched nodes
construct the json output with the matches + the annotated tree
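The three steps above can be sketched in Python over an already-parsed json{}-style tree. This is a hypothetical illustration, not pup's implementation: `json_full` is a made-up name, the selector is stood in for by a predicate function, and matched nodes are moved into the matches array with a `{"match": i}` placeholder left behind, which is one of the possible answers to the open questions above.

```python
def json_full(tree, is_match):
    """Build {"match": [...], "tree": ...} from a json{}-style tree.

    Matched nodes are collected in document order and replaced in the
    tree by {"match": index}. Descendants of a match stay inside it and
    are not searched again. The input tree is not modified.
    """
    matches = []

    def annotate(node):
        if is_match(node):
            matches.append(node)
            return {"match": len(matches) - 1}
        copy = dict(node)
        if "children" in node:
            copy["children"] = [annotate(child) for child in node["children"]]
        return copy

    annotated = annotate(tree)  # fills `matches` as a side effect
    return {"match": matches, "tree": annotated}


tree = {"tag": "", "children": [{"tag": "html", "children": [
    {"tag": "p", "text": "a"},
    {"tag": "div", "children": [{"tag": "p", "text": "b"}]},
]}]}

result = json_full(tree, lambda node: node.get("tag") == "p")
print(result["match"])
```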

The ability to reverse the JSON output back to HTML. I haven't looked into pup's source but I would imagine some of the existing code for parsing the HTML could be used.

The biggest benefit of this approach is that it obviates the need to implement some sort of mutation/templating DSL inside pup, since many other utilities like jq have already done that and done it well. This leaves pup to do one thing, per the unix philosophy: parse HTML.

Unfortunately, the project seems largely unmaintained, so I don't feel super comfortable attempting to implement this if the PR would just sit unnoticed for months. It's a shame because I think pup fills an important niche in the world of unix utilities. I think it could be even more synergistic with jq and other unix utilities if something like this were implemented.

Editing to say that, upon closer inspection of the issues, several bugs/inconsistencies in the JSON output would need to be fixed before this would work dependably:

Provide a more faithful json mapping of html sibling nodes of mixed types #110 (has PR - Preserving sibling relationship of all node types #111)
Support unknown tag selectors #107 / Doesnt handle tags with "-" (hyphen) in it #114 (has PR - Support unknown tag selectors #107)
It seems that with multiple selectors separated by commas, if a single node is matched by more than one selector, it appears in the output once for each matching selector, as opposed to the way JavaScript's document.querySelectorAll() reports matching elements, not matches. This would be an issue.
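The difference between reporting matches and reporting elements in the last point can be shown with a small Python sketch. The function names are hypothetical and selectors are modelled as plain predicates over json{}-style nodes; this is not how pup evaluates selectors internally.

```python
def walk(node):
    """Yield every node of a json{}-style tree in document order."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def select_per_selector(tree, selectors):
    """One pass per selector: a node appears once per selector it satisfies."""
    return [node for sel in selectors for node in walk(tree) if sel(node)]


def select_elements(tree, selectors):
    """One pass over the tree: each node appears at most once, in document order."""
    return [node for node in walk(tree) if any(sel(node) for sel in selectors)]


tree = {"tag": "div", "children": [
    {"tag": "p", "class": "intro", "text": "hi"},
    {"tag": "p", "text": "bye"},
]}
selectors = [
    lambda node: node.get("tag") == "p",          # stands in for "p"
    lambda node: node.get("class") == "intro",    # stands in for ".intro"
]

print(len(select_per_selector(tree, selectors)), len(select_elements(tree, selectors)))
```

A placeholder scheme like `{"match": i}` needs the second behavior, since a node can only be replaced by one index.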

Really, although this feature itself would be simple, the prerequisite would be that pup can properly parse all valid HTML, preserving all information necessary to reconstruct it in the JSON output, which is no small feat.

I can think of a few alternative but similar strategies that might mitigate some of these issues - e.g. rather than expecting pup to be able to properly parse the whole tree, the input HTML could be passed to both invocations of pup, and for the replacement you would pass pup a mutated version of the JSON output from the first invocation.

Alternatively, pup could support a '--mutate-cmd' option that accepts a command, which pup would run on the matched JSON and whose output it would use to update the HTML. This could behave similarly to how xargs works. An added benefit over the previous suggestions would be that only one invocation of pup would be necessary.
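A rough Python sketch of what such a '--mutate-cmd' hook might do internally: serialize the matches, pipe them through the external command, and read the mutated JSON back. The flag does not exist in pup, `mutate` and `ADD_CLASS_CMD` are names invented for this example, and a small Python one-liner stands in for the jq filter so the sketch runs without extra tools.

```python
import json
import subprocess
import sys


def mutate(matches, cmd):
    """Pipe the matched nodes through an external command as JSON."""
    proc = subprocess.run(
        cmd,
        input=json.dumps(matches),
        capture_output=True,
        text=True,
        check=True,  # fail loudly if the command exits non-zero
    )
    return json.loads(proc.stdout)


# Stand-in for: jq 'map(.class = "foobar")'
ADD_CLASS_CMD = [
    sys.executable,
    "-c",
    "import json, sys; ms = json.load(sys.stdin); "
    "[m.update({'class': 'foobar'}) for m in ms]; "
    "json.dump(ms, sys.stdout)",
]

matches = [{"tag": "p", "text": "This is my website."}]
print(mutate(matches, ADD_CLASS_CMD))
```

The caller would then write the returned nodes back into their original positions, which still depends on the JSON form preserving everything needed to rebuild the HTML.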

gizlu · 2022-09-20T22:33:41Z

Sketch of an imaginary pup/hx/xmlstarlet-like tool that is able to modify HTML (I'm dropping it here as my idea of how a tool like this could work):

cat file.html | hu 'selectors' [command] command_arg...

Commands

sel [-c]
  extract elements matching selector. If multiple elements are matched, they are concatenated together
    -c print content only. Without -c start/end tags are printed as well
del
  remove matched elements from html
set sth
  replace each match with the supplied string. Selectors might use pseudo-classes :before and :after
setf fmt ...
  set, but printf-like. Post-fmt args can use funcs like match() or file()
move dest-sel
  move match into the supplied destination. dest-sel must use a pseudo-class like :before or :after
aset key value
  set attribute of each matched element

Examples

# move charset meta tag to top
hu 'meta[charset]' move 'head::before'
# remove empty links
hu 'a:empty' del
# extract all embedded css, process it with an external tool (hypothetical cssmin), and paste it back
css=$(hu 'style' sel -c <in.html | cssmin)
hu 'style' del <in.html | hu 'head:before' setf '<style>%s</style>' "$css" >out.html

Other ideas

Add "proprietary" pseudo-selectors that insert stuff before/after the match. Yes, there are standard :before and :after, but they don't really do what you would expect from their names (they insert stuff at the beginning and end of the match)
aget [-c] [keys] command - get attributes of the matched element
count command - print the count of occurrences of each selector (possible use case: removing dead css)
a command that would spawn another program, supply the match to it, and replace the match with its output (it would make the cssmin example simpler)
