Add option to extract individual values raw #3

Closed
hoshsadiq opened this issue Mar 12, 2018 · 25 comments

Comments

@hoshsadiq

hoshsadiq commented Mar 12, 2018

something like this:

<p><a href="/home">some url</a></p>
$ cascadia --in my.html --out --css 'p > a' --raw-text
some url
$ cascadia --in my.html --out --css 'p > a' --raw-attr 'href'
/home
@suntong
Owner

suntong commented Mar 12, 2018

I don't think CSS selection can select based on attributes, though.

Do you want to give another example instead?

@suntong
Owner

suntong commented Mar 12, 2018

If you just want the raw CSS selection value, the workaround is to use one -p after -c:

  -p, --piece           sub CSS selectors within -css to split that block up into pieces
                        format: PieceName=[RAW:]selector_string

e.g.,

$ echo '<p><a href="/home">some url</a></p>' | cascadia -i -o -c 'p' -p 'ATag=a'
ATag
some url

$ echo '<p><a href="/home">some url</a></p>' | cascadia -i -o -c 'p' -p 'ATag=RAW:a'
ATag
<a href="/home">some url</a>

It's not perfect, but cascadia was built for quick hacks, not to be perfect.
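
To make the -p format concrete, here is a minimal Go sketch of how a spec such as ATag=RAW:a could be split into its parts. It is not cascadia's actual parsing code: parsePiece and pieceSpec are hypothetical names, and only the PieceName=[RAW:]selector_string shape is taken from the help text above.

```go
package main

import (
	"fmt"
	"strings"
)

// pieceSpec is a hypothetical holder for one parsed -p argument.
type pieceSpec struct {
	Name     string // column header, e.g. "ATag"
	Raw      bool   // true when the selector was prefixed with "RAW:"
	Selector string // sub CSS selector applied inside each -c block
}

// parsePiece splits "PieceName=[RAW:]selector_string" into its parts.
// It returns an error when the name or the selector is missing.
func parsePiece(arg string) (pieceSpec, error) {
	i := strings.Index(arg, "=")
	if i <= 0 {
		return pieceSpec{}, fmt.Errorf("piece %q: want PieceName=[RAW:]selector_string", arg)
	}
	spec := pieceSpec{Name: arg[:i], Selector: arg[i+1:]}
	if strings.HasPrefix(spec.Selector, "RAW:") {
		spec.Raw = true
		spec.Selector = strings.TrimPrefix(spec.Selector, "RAW:")
	}
	if spec.Selector == "" {
		return pieceSpec{}, fmt.Errorf("piece %q: empty selector", arg)
	}
	return spec, nil
}

func main() {
	// The two -p values used in the examples above.
	for _, arg := range []string{"ATag=a", "ATag=RAW:a"} {
		spec, err := parsePiece(arg)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("name=%s raw=%t selector=%s\n", spec.Name, spec.Raw, spec.Selector)
	}
}
```

The name becomes the column header, which is why -p output always starts with a header line.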

@hoshsadiq
Author

hoshsadiq commented Mar 12, 2018

The issue with using -p is that it creates columns (with headers). My proposal is to get attributes for an individual element. In my case I'm downloading the latest version of a file automatically, and there's no other way to get the "latest" version. Example:

<html>
<head></head>
<body>
...
<div>
<a href="/files/plugin-name/download?version=1.25">1.25</a>
</div>
...
</body>
</html>

I want to be able to retrieve only the value of href with nothing else, as well as innerText of that anchor.

@hoshsadiq
Author

Perhaps an alternative would be adding a --no-header option for --piece, that doesn't print out the headers?

@suntong
Owner

suntong commented Mar 12, 2018

Gotcha. Let me think it over...

@hoshsadiq
Author

Happy to raise a PR if needed

@suntong
Owner

suntong commented Mar 14, 2018

Uhm... I gave it careful thought, and was about to turn it down because of my belief in the "Unix philosophy" -- write programs that

  • do one thing and do it well
  • work together
  • handle text streams

I.e., attribute selection is impossible with CSS and thus should be out of scope for cascadia; and --no-header can be solved simply with sed 1d:

$ echo '<p><a href="/home">some url</a></p>' | cascadia -i -o -c 'p' -p 'ATag=a' | sed 1d
some url

I.e., it'd be against my principles to complicate my code base for something so simple to solve. However, I do see a need in your request, and you offered a PR. So I'm OK with the PR, iff you do it the correct way -- i.e., starting from cascadia.yaml, and using wireframe for the code gen.

If that doesn't deter you, then go ahead. :-)

Thx for your contribution.

suntong closed this as completed Apr 19, 2020
@suntong
Owner

suntong commented Apr 19, 2020

Closed for lack of activity; please reopen if there is more input...

@mazznoer

@hoshsadiq

You can do it using pup.

pup -f file.html 'body div a attr{href}'

@0xdevalias
Contributor

0xdevalias commented Jun 14, 2023

Just stumbled onto this issue as I was attempting to extract all of the src attributes for the script tags from a page, and it sounds like that should be possible with --piece, yet it didn't work for me:

curl --silent https://example.com/somethingwithscripts | cascadia --in --out --css 'html > head > script' --piece url='attr[src]'

I also tried several variations of this, none of which seemed to work:

curl --silent https://example.com/somethingwithscripts | cascadia --in --out --css 'html > head > script' --piece url='attr[src]:script'

curl --silent https://example.com/somethingwithscripts | cascadia --in --out --css 'html > head > script' --piece url='attr[src]:*'

curl --silent https://example.com/somethingwithscripts | cascadia --in --out --css 'html > head' --piece url='attr[src]:script'

Yet with pup, it was not only a far simpler syntax, but also just worked the first time I tried it:

⇒ curl --silent https://example.com/somethingwithscripts | pup 'html > head > script attr{src}'

/_next/static/chunks/polyfills-c67a75d1b6f99dc8.js
/_next/static/chunks/webpack-1eeae5c7aedde088.js
/_next/static/chunks/framework-e23f030857e925d4.js
/_next/static/chunks/main-35ce5aa6f4f7a906.js
/_next/static/chunks/pages/_app-0df67bf7d9e6e732.js
/_next/static/chunks/1f110208-cda4026aba1898fb.js
/_next/static/chunks/012ff928-bcfa62e3ac82441c.js
/_next/static/chunks/68a27ff6-a453fd719d5bf767.js
/_next/static/chunks/bd26816a-981e1ddc27b37cc6.js
/_next/static/chunks/692-a1e5a91f2cd1f1d0.js
/_next/static/chunks/434-6f11f27f549beeab.js
/_next/static/chunks/97-536ee884c863676e.js
/_next/static/chunks/734-30d5c00c7bdf11c1.js
/_next/static/chunks/pages/share/%5B%5B...shareParams%5D%5D-44619ef92ec8f3b5.js
/_next/static/a3Jc7aP-UMfeR9s4-iLvW/_buildManifest.js
/_next/static/a3Jc7aP-UMfeR9s4-iLvW/_ssgManifest.js

@suntong
Owner

suntong commented Jun 14, 2023

Hmm... please try giving a minimal reproducible example. Else I won't be able to guess what the problem is.

@suntong
Owner

suntong commented Jun 14, 2023

Also, you didn't say which version you're using. Are you using v1.2.7?

Sorry, trying to rush out some reply before getting back to my burning issue at hand...

@0xdevalias
Contributor

0xdevalias commented Jun 16, 2023

Hmm... please try giving a minimal reproducible example. Else I won't be able to guess what the problem is.

<html>
<head>
  <script src="foo.js"></script>
  <script src="bar.js"></script>
  <script src="baz.js"></script>
</head>
</html>
⇒ pbpaste | cascadia --in --out --css 'html > head > script' --piece url='attr[src]'
url




⇒ pbpaste | cascadia --in --out --css 'html > head > script' --piece url='attr[src]:script'
url




⇒ pbpaste | cascadia --in --out --css 'html > head' --piece url='attr[src]:script'
url
foo.js

Expected outcome:

url
foo.js
bar.js
baz.js

Also, you didn't say which version you're using. Are you using v1.2.7?

Version 1.2.7 built on 2023-01-08

@suntong
Owner

suntong commented Jun 16, 2023

Indeed, it looks like a bug. Will look into it when I have some time. Meanwhile,

CC: @himcc, I've replicated the problem @0xdevalias reported; it seems to be a logic problem with attr selection. Do you have some time to look into it, please?

Hope the final interface would be:
--css 'html > head > script' --piece url='attr[src]', which looks more straightforward than the latter two...

suntong reopened this Jun 16, 2023
@0xdevalias
Contributor

0xdevalias commented Jun 16, 2023

Hope the final interface would be:
--css 'html > head > script' --piece url='attr[src]', which looks more straightforward than the latter two...

While it would sort of be a breaking change, in a sense it feels like the attr prefix is redundant as well, particularly given that in cascadia it's used in a separate context (--piece) rather than in the main --css (whereas with pup it needs to differentiate itself, since it all appears in the one query).

The initial thing I would have expected/tried was just being able to use standard CSS attribute selector syntax within --piece, e.g. [src] (or perhaps also script[src] if you wanted to support the full syntax there as well).

That way the --piece syntax would sort of end up being much closer to just standard CSS selector usage. And I could do something like:

⇒ pbpaste | cascadia --in --out --css 'html > head > script' --piece url='[src]'

⇒ pbpaste | cascadia --in --out --css 'html > head' --piece url='script[src]'

# etc

Skimming the codebase, the following areas look like they would be relevant/related to these changes:

@0xdevalias
Contributor

@suntong @himcc I haven't looked too deeply at the code/tested this assumption, but from a quick skim I noticed that what appears to be the section handling --piece hardcodes operating on cssa[0]:

cascadia/cascadia_main.go

Lines 165 to 200 in 4b56cde

} else {
	// have sub CSS selectors within -css -- block selection mode
	// fmt.Printf("%v\n", piece)
	// https://godoc.org/github.com/PuerkitoBio/goquery
	// for debug
	//doc, err := goquery.NewDocumentFromReader(strings.NewReader(testhtml))
	doc, err := goquery.NewDocumentFromReader(bi)
	abortOn("Input", err)
	// Print csv headers
	for _, key := range piece.Keys {
		fmt.Fprintf(bw, "%s%s", key, deli)
	}
	fmt.Fprintf(bw, "\n")
	// Process each item block
	doc.Find(cssa[0]).Each(func(index int, item *goquery.Selection) {
		//fmt.Printf("] #%d: %s\n", index, item.Text())
		for _, key := range piece.Keys {
			//fmt.Printf("] %s: %s\n", key, piece.Values[key])
			switch piece.OutputStyles[key] {
			case OutputStyleRAW:
				html.Render(bw, item.Find(piece.Values[key]).Get(0))
				fmt.Fprintf(bw, deli)
			case OutputStyleATTR:
				fmt.Fprintf(bw, "%s%s",
					item.Find(piece.Values[key]).AttrOr(piece.AttrName[key], ""), deli)
			case OutputStyleTEXT:
				fmt.Fprintf(bw, "%s%s",
					item.Find(piece.Values[key]).Contents().Text(), deli)
			}
		}
		fmt.Fprintf(bw, "\n")
	})
}

Whereas the code that seemingly handles the case without --piece uses for _, css := range cssa:

cascadia/cascadia_main.go

Lines 127 to 165 in 4b56cde

if len(piece.Values) == 0 {
	// no sub CSS selectors -- none-block selection mode
	if textOut {
		doc, err := goquery.NewDocumentFromReader(bi)
		abortOn("Input", err)
		for _, css := range cssa {
			// Process each item block
			doc.Find(css).Each(func(index int, item *goquery.Selection) {
				//fmt.Printf("] #%d: %s\n", index, item.Text())
				if textRaw {
					fmt.Fprintf(bw, "%s%s",
						item.Text(), deli)
				} else {
					fmt.Fprintf(bw, "%s%s",
						strings.TrimSpace(item.Text()), deli)
				}
				fmt.Fprintf(bw, "\n")
			})
		}
	} else {
		doc, err := html.Parse(bi)
		abortOn("Input", err)
		for _, css := range cssa {
			c, err := cascadia.Compile(css)
			abortOn("CSS Selector string "+css, err)
			// https://godoc.org/github.com/andybalholm/cascadia
			ns := c.MatchAll(doc)
			if !beQuiet {
				fmt.Fprintf(os.Stderr, "%d elements for '%s':\n", len(ns), css)
			}
			for _, n := range ns {
				html.Render(bw, n)
				fmt.Fprintf(bw, "\n")
			}
		}
	}
} else {

Based on that, and the fact that in my above pbpaste | cascadia --in --out --css 'html > head' --piece url='attr[src]:script' example (Ref) it only output the first item; I suspect that this is where the current bug exists. My assumption is that it would be fixed by using code similar to for _, css := range cssa here too, instead of cssa[0].

(Though personally, I still think it would be useful to simplify and improve the --piece syntax as well if we can)
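
One more thing that may be worth checking alongside the cssa[0] theory: goquery documents Attr and AttrOr as returning the value from the first element of a selection. With a single html > head block, and assuming the piece selector resolves to script, item.Find would match three elements while AttrOr reports only the first, which would also give the foo.js-only output. Which of the two is the real cause was not verified here. The sketch below is a toy model of that difference in plain Go; it does not use goquery or cascadia, and element, firstAttrOr and allAttrs are made-up names.

```go
package main

import "fmt"

// element is a toy stand-in for an HTML element: just its attributes.
type element map[string]string

// firstAttrOr models "read the attribute from the first match only",
// falling back to def when there is no match or no such attribute.
func firstAttrOr(matches []element, name, def string) string {
	if len(matches) == 0 {
		return def
	}
	if v, ok := matches[0][name]; ok {
		return v
	}
	return def
}

// allAttrs visits every match and collects the attribute where present.
func allAttrs(matches []element, name string) []string {
	var out []string
	for _, el := range matches {
		if v, ok := el[name]; ok {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	// The three script tags from the minimal reproducible example.
	scripts := []element{{"src": "foo.js"}, {"src": "bar.js"}, {"src": "baz.js"}}

	fmt.Println(firstAttrOr(scripts, "src", "")) // first match only
	for _, src := range allAttrs(scripts, "src") { // one line per match
		fmt.Println(src)
	}
}
```

If this reading holds, selecting each script as its own block (--css 'html > head > script') and reading the attribute from the block itself would sidestep the first-match behaviour.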

@suntong
Owner

suntong commented Jun 16, 2023

it only output the first item; I suspect that this is where the current bug exists

Yep, I thought so too. Thanks for investigating.

That way the --piece syntax would sort of end up being much closer to just standard CSS selector usage.

That's a brilliant idea. I've searched for CSS attribute selectors many times before, but every conclusion was that it is not supported. Yeah, I fully agree with you that we should use the CSS attribute selector syntax instead.

@suntong
Owner

suntong commented Jun 16, 2023

Oops,

I've searched for CSS attribute selectors many times before, but every conclusion was that it is not supported

I didn't look into the URL closely, but having checked it out again just now, I found that a[title] in CSS attribute selectors means <a> elements with a title attribute, i.e., how to select the <a> elements, while here we need a syntax to return attributes, which is (still) not supported by CSS selectors.

Will think it over...

@0xdevalias
Contributor

I didn't look into the URL closely, but having checked it out again just now, I found that a[title] in CSS attribute selectors means <a> elements with a title attribute, i.e., how to select the <a> elements, while here we need a syntax to return attributes, which is (still) not supported by CSS selectors.

@suntong Yup, that is how it works in normal CSS selector usage, and is how it would need to work (and, I believe, how it already works) in the --css part of cascadia. What I was proposing above is that we could leverage the same familiar semantics of that syntax, but when it is used within --piece, have it output the actual attribute being described.

So basically:

  • --css:
    • [foo] would return all elements that have an attribute named foo
    • a[foo] would return all a elements that have an attribute named foo
    • etc
  • --piece
    • [foo] would return the attribute named foo from all elements that have it
    • a[foo] would return the attribute named foo from all a elements that have it
    • etc
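
As a rough illustration of the --piece half of this proposal, the Go sketch below splits a value such as script[src] or [src] into the element selector and the attribute whose value would be printed. This is only a sketch of the proposed syntax, not something cascadia implements: splitAttrPiece is a hypothetical helper, and it assumes a single trailing [name] with no value matching such as [src=x].

```go
package main

import (
	"fmt"
	"strings"
)

// splitAttrPiece splits a proposed --piece value like "script[src]" or
// "[src]" into the element selector ("script" or "") and the attribute
// name ("src"). Only one trailing [name] is supported in this sketch.
func splitAttrPiece(piece string) (selector, attr string, err error) {
	open := strings.LastIndex(piece, "[")
	if open < 0 || !strings.HasSuffix(piece, "]") {
		return "", "", fmt.Errorf("piece %q: want selector[attr]", piece)
	}
	attr = piece[open+1 : len(piece)-1]
	if attr == "" || strings.ContainsAny(attr, "=]") {
		return "", "", fmt.Errorf("piece %q: want a bare attribute name in brackets", piece)
	}
	return piece[:open], attr, nil
}

func main() {
	// The two --piece values from the proposal.
	for _, p := range []string{"[src]", "script[src]"} {
		sel, attr, err := splitAttrPiece(p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("piece=%q selector=%q attr=%q\n", p, sel, attr)
	}
}
```

An empty selector would mean "read the attribute from the block element itself", which matches the --css 'html > head > script' --piece url='[src]' usage.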

So then combined, you could use these like I described above:

That way the --piece syntax would sort of end up being much closer to just standard CSS selector usage. And I could do something like:

⇒ pbpaste | cascadia --in --out --css 'html > head > script' --piece url='[src]'

⇒ pbpaste | cascadia --in --out --css 'html > head' --piece url='script[src]'

# etc

Originally posted by @0xdevalias in #3 (comment)

@0xdevalias
Contributor

Another alternative would be to re-consider how --piece works in terms of the 'prior art' from pup, and how it has various 'Display Functions' (as they call them)

They have:

  • text{}
  • attr{attrkey}
  • json{}

Personally I don't think json{} makes a lot of sense here (you could just run the HTML through an XML -> JSON tool).

I'm pretty sure the --text mode already covers what text{} would do.

So it basically seems to just leave the attr{attrkey} version.

Though personally I like the idea of keeping the CSS selector in --css as 'pure selectors' (unlike pup, which also adds the 'Display Functions' there).

@suntong
Owner

suntong commented Jun 22, 2023

Thanks for the great input. Makes sense.
I might not be able to look into it for 2~3 weeks, but I will...

@0xdevalias
Contributor

0xdevalias commented Jun 22, 2023

Thanks for the great input. Makes sense.
I might not be able to look into it for 2~3 weeks, but I will...

@suntong No worries, I appreciate it :)

@0xdevalias
Contributor

Another source of 'prior art': xq implemented this recently; you can see their approach in this comment (and in the commits linked later in the timeline):

@suntong
Owner

suntong commented Jun 29, 2023

pbpaste | cascadia --in --out --css 'html > head > script' --piece url='attr[src]'

See

"-i", "opt_piece_script.html", "-o", "-c", "html > head > script", "-p", "SourceJS=ATTR:src",

SourceJS
foo.js
bar.js
baz.js

@0xdevalias
Contributor

Awesome! Will have to check it out once a new release is made! Thanks :)
