opengraph.Fetch returns nothing for a few domains #33

Open
Pancham97 opened this issue Aug 26, 2023 · 6 comments

@Pancham97

I have been using this package to fetch Open Graph info about websites and articles, but for a few websites, e.g. FastCompany, the Fetch() method returns nothing. After some research, I found that some websites block bots from scraping their content. However, Raycast's preview and even the macOS preview successfully fetch the metadata, including the image and title. How can I achieve that? Here's my code:

package api

import (
	"net/http"
	"mypackage/read-it-later/structs"
	"mypackage/read-it-later/utils"

	"github.com/gin-gonic/gin"
	"github.com/oklog/ulid/v2"
	"github.com/otiai10/opengraph"
)

func StoreEntity(c *gin.Context) {
	var requestBody structs.RequestURL
	if err := c.BindJSON(&requestBody); err != nil {
		return
	}

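	// Note: c.Header sets a header on the response sent back to this API's caller.
	// It does not change the outgoing request that opengraph.Fetch makes.
	c.Header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36")

	ogp, err := opengraph.Fetch(requestBody.URL)
	if err != nil {
		c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
		// Stop here so the nil ogp is never dereferenced below.
		return
	}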

	c.JSON(http.StatusOK, gin.H{
		"id":          ulid.Make(),
		"description": ogp.Description,
		"favicon":     ogp.Favicon,
		"image":       utils.FetchImageURL(ogp.Image, ogp.Favicon, ogp.URL.Host, ogp.URL.Scheme),
		"siteName":    ogp.SiteName,
		"title":       ogp.Title,
		"type":        ogp.Type,
		"URL":         ogp.URL,
		"all_info":    ogp,
	})
}
@otiai10
Owner

otiai10 commented Aug 28, 2023

Thank you, @Pancham97

  1. Please share the actual URLs you are referring to.
  2. What makes you think the websites are blocking bots?

@Pancham97
Author

Hey @otiai10, sorry, I forgot to add them. Here are a couple that I couldn't get working:

  1. https://www.fastcompany.com/90945102/ai-chatbots-health-medicine-chatgpt-webmd-self-diagnosis-misinformation
  2. https://www.nplusonemag.com/issue-25/on-the-fringe/uncanny-valley/

What makes you think the websites are blocking bots?

I am not sure. Maybe they don't want unnecessary website scraping or something. Plus, a few websites serve content via JavaScript, and that could be an issue too? 🤷

@otiai10
Owner

otiai10 commented Aug 28, 2023

This works:

package main

import (
	"compress/gzip"
	"encoding/json"
	"log"
	"net/http"
	"os"

	"github.com/otiai10/opengraph"
)

func main() {

	target := "https://www.fastcompany.com/90945102/ai-chatbots-health-medicine-chatgpt-webmd-self-diagnosis-misinformation"

	// 1) Necessary headers
	headers := map[string]string{
		"Accept":          "text/html",
		"Accept-Encoding": "gzip",
		"User-Agent":      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
	}

	req, _ := http.NewRequest("GET", target, nil)
	for k, v := range headers {
		req.Header.Set(k, v)
	}

	// 2) Necessary cookies (set by geo.captcha-delivery.com)
	req.AddCookie(&http.Cookie{
		Name:  "datadome", // See your browser's cookie with this name
		Value: "2ZnfSBOvZs1C2ZURicdpZAkZ-86xXY_RyRG-D6E8CjiNpgopXq7byBj5KmkCtLmcjRGjeGpzkBmP0JvFmKwUxazBMrGTkpY8-K9mJdGxD8WYobZ5QmI76Uqdhgf6Wvdi",
	})

	res, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Println(1001, err)
		return
	}
	defer res.Body.Close()

	if res.StatusCode != 200 {
		log.Println(1005, "Status code is not 200")
		log.Println("Status:", res.StatusCode)
		log.Println("Content-Type:", res.Header.Get("Content-Type"))
		log.Println("Content-Encoding:", res.Header.Get("Content-Encoding"))
		return
	}

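	// Accept-Encoding is set explicitly above, so net/http does not
	// decompress the body automatically; wrap it in a gzip reader.
	reader, err := gzip.NewReader(res.Body)
	if err != nil {
		log.Println(1002, err)
		// Return here: reader is nil on error and must not be used or closed.
		return
	}
	defer reader.Close()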

	// Use "Parse" for the io.Reader
	ogp := opengraph.New(target)
	if err := ogp.Parse(reader); err != nil {
		log.Println(1004, err)
		return
	}

	// Then let's check it out!
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	enc.Encode(ogp)
}

@otiai10
Owner

otiai10 commented Aug 28, 2023

There are various reasons why this package (opengraph) might fail to fetch information.

(not exhaustive)

  1. Just the way they are:
    a. Just missing OGP 😝
    b. Client-side rendering
    c. etc...
  2. Content control for security and reliability reasons
    a. User-Agent
    b. Human auth (e.g., captcha)
    c. etc...

In your case with fastcompany.com, items 2-a and 2-b of the list above are the ones that matter.
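
To make the list above a bit more concrete, here is a minimal sketch of how a caller might tell the two groups apart from the HTTP status and the served HTML. The `classify` helper is hypothetical (it is not part of this package), the choice of status codes is an assumption, and the substring check for `property="og:` is deliberately crude: it would miss tags written with single quotes or other attribute orderings.

```go
package main

import (
	"fmt"
	"strings"
)

// classify is a hypothetical helper, not part of otiai10/opengraph.
// It maps an HTTP status code and response body to a rough failure category.
func classify(status int, body string) string {
	switch {
	case status == 401 || status == 403 || status == 429:
		// Group 2 in the list: content control (User-Agent checks, captcha, ...).
		return "blocked: content control"
	case status < 200 || status > 299:
		return "failed: unexpected HTTP status"
	case !strings.Contains(body, `property="og:`):
		// Group 1 in the list: OGP is missing, or only rendered client-side.
		return "no OGP tags in served HTML"
	default:
		return "ok: OGP tags present"
	}
}

func main() {
	fmt.Println(classify(403, "<html>captcha page</html>"))
	fmt.Println(classify(200, `<div id="root"></div>`))
	fmt.Println(classify(200, `<meta property="og:title" content="Hello">`))
}
```

A check like this only narrows things down. A 200 response without OGP tags cannot distinguish "missing OGP" from "client-side rendering" without actually running the page's JavaScript.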

@Pancham97
Author

Hey @otiai10, thanks, but for some reason, I can't seem to get it working. I am a bit new to Go so might be missing something obvious, but I am getting the error 1005. I have replaced the value of the datadome cookie with my browser cookie, and yet it does not work. Please help.

2023/08/29 23:24:11 1005 Status code is not 200
2023/08/29 23:24:11 Status: 403
2023/08/29 23:24:11 Content-Type: text/html;charset=utf-8
2023/08/29 23:24:11 Content-Encoding:

@otiai10
Owner

otiai10 commented Aug 30, 2023

  1. Check the status text of the response
  2. Check the body of the 403 response, if there is one
  3. Tweak the headers further
  4. Tweak the cookies further
