This repository was archived by the owner on May 30, 2023. It is now read-only.

Have PhantomJS return the status code #10185

Closed
fdb opened this issue Aug 4, 2011 · 29 comments

Comments

@fdb

fdb commented Aug 4, 2011

frede...@burocrazy.com commented:

1.2.0 (development)

It would be useful to see the HTTP status code of a page request. status is currently a string that is 'success' whenever the request completes, even if the web server encountered an error or the page was not found.

Here's how this might look:

page.onLoadFinished = function(status) {
  if (status === 200) {
    console.log('Page found.');
  } else if (status === 404) {
    console.log('Page not found.');
  } else {
    console.log('Other error: ' + status);
  }
};

Disclaimer:
This issue was migrated on 2013-03-15 from the project's former issue tracker on Google Code, Issue #185.
🌟   21 people had starred this issue at the time of migration.

@bchallenor

b...@challenor.org commented:

This would be useful for me too.

@ariya
Owner

ariya commented Aug 7, 2011

faort...@gmail.com commented:

I understand, but each resource has its own status code: an image, a CSS style sheet, the main HTML file, and so on. Which of these codes do you need?

@fdb
Author

fdb commented Aug 7, 2011

frede...@burocrazy.com commented:

I'd like the HTTP status code for the requested object, i.e. the URL sent to page.open().

I need this to check if the requested URL actually exists (200) or was not found (404). As far as I know, there's no other way to check this.

@bchallenor

b...@challenor.org commented:

I'm writing a web crawler of sorts, so for me the status code for each resource would be best. But the status code for the requested URL would be a good start. Thanks.

@ariya
Owner

ariya commented Aug 7, 2011

faort...@gmail.com commented:

You can use something like this:

var page = new WebPage();

// Register the handler before opening the page.
page.onResourceReceived = function(resource) {
  console.log('Url: ' + resource.url);
  console.log('Code: ' + resource.status);
};

page.open(phantom.args[0]);
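For the crawler use case mentioned earlier, the per-resource statuses could be collected rather than printed. This is only a sketch: `createStatusLog` and its methods are made-up names, and it assumes the resource object carries the `url`, `status` and `stage` fields that PhantomJS documents for onResourceReceived (the handler can fire once with stage 'start' and once with stage 'end' for the same resource).

```javascript
// Hypothetical helper (not part of PhantomJS): remembers one status per URL.
function createStatusLog() {
  var statuses = {};
  return {
    // Intended to be called from page.onResourceReceived.
    record: function (resource) {
      // Skip the 'start' notification so each resource is recorded once.
      if (resource.stage && resource.stage !== 'end') {
        return;
      }
      statuses[resource.url] = resource.status;
    },
    // Status recorded for a URL, or null if nothing was recorded for it.
    statusOf: function (url) {
      return Object.prototype.hasOwnProperty.call(statuses, url) ? statuses[url] : null;
    },
    // URLs whose recorded status is a 4xx or 5xx code.
    failures: function () {
      return Object.keys(statuses).filter(function (url) {
        return statuses[url] >= 400;
      });
    }
  };
}

// Wiring in a PhantomJS script would look roughly like:
//   var log = createStatusLog();
//   page.onResourceReceived = log.record;
```

Note that the status recorded for a redirected URL is still the 301/302 of the redirect itself.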

@bchallenor

b...@challenor.org commented:

Ah, sorry, I hadn't got that far in the docs yet. That will do for me, thanks. Maybe the OP would still like the page status code for simplicity though.

@fdb
Author

fdb commented Aug 8, 2011

frede...@burocrazy.com commented:

I've tried using onResourceReceived, but when redirects happen it returns 301/302 as the status code, not the final status code. (It might redirect to a 404 page, for example).

Here's some test code to demonstrate this:

var theUrl = 'http://google.com/';
var page = new WebPage();
page.open(theUrl);
var theStatusCode = null;

page.onLoadFinished = function(status) {
  if (status === 'success') {
    var html = page.evaluate(function() {
      return document.body.innerHTML;
    });
    console.log(theStatusCode + '\n' + html);
    phantom.exit();
  }
};

page.onResourceReceived = function(resource) {
  if (resource.url == theUrl) {
    theStatusCode = resource.status;
  }
};

@ariya
Owner

ariya commented Aug 8, 2011

faort...@gmail.com commented:

But I think this behavior is right, because "http://google.com" redirects me (in my case) to "http://www.google.cl", so "http://google.com" has status 302 and "http://www.google.cl" has status 200.

@fdb
Author

fdb commented Aug 8, 2011

frede...@burocrazy.com commented:

It is right to have this status code for onResourceReceived.

However, I want to have a "final" status code that gives me the page status code after all the redirects are done. Writing this using onResourceReceived might be possible, but is not trivial.

Here's an example of how this could be implemented in QtWebkit: http://stackoverflow.com/questions/4330274/qtwebkit-how-to-check-http-status-code

@ariya
Owner

ariya commented Aug 10, 2011

faort...@gmail.com commented:

OK, but "http://www.google.com" and "http://www.google.cl" are two different resources and therefore have different status codes.
Try the Web Inspector on "http://www.google.com" and you will see two different resources: "http://www.google.com" with status code 302 and "http://www.google.cl" with status code 200. If you check the response headers of "http://www.google.com", you will see a field named "Location" with the URL you are redirected to.

"http://www.google.com" and "http://www.google.cl" aren't the same resource with two URLs.
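The "Location" chain described here can be followed in script to get from the requested URL to the final response. A rough sketch, assuming responses have been stored by URL and are shaped like the objects passed to onResourceReceived (`url`, `status`, and a `headers` array of name/value pairs). `findHeader` and `resolveFinalResponse` are hypothetical names, and relative Location values are not handled.

```javascript
// Hypothetical helper: case-insensitive lookup in an array of
// { name: ..., value: ... } header objects. Returns null if absent.
function findHeader(headers, name) {
  var list = headers || [];
  var wanted = name.toLowerCase();
  for (var i = 0; i < list.length; i++) {
    if (list[i].name.toLowerCase() === wanted) {
      return list[i].value;
    }
  }
  return null;
}

// Hypothetical helper: start at startUrl and follow 3xx responses through
// their Location headers. Returns the last response reached, or null if
// nothing was recorded for startUrl.
function resolveFinalResponse(responsesByUrl, startUrl, maxHops) {
  var url = startUrl;
  var response = null;
  var limit = maxHops || 10; // guard against redirect loops
  for (var hop = 0; hop <= limit; hop++) {
    if (!Object.prototype.hasOwnProperty.call(responsesByUrl, url)) {
      return response; // chain ends at a URL we have no response for
    }
    response = responsesByUrl[url];
    var location = findHeader(response.headers, 'Location');
    var isRedirect = response.status >= 300 && response.status < 400;
    if (!isRedirect || !location) {
      return response;
    }
    url = location; // assumption: Location holds an absolute URL
  }
  return response;
}
```

With the google.com example above, the lookup starts at the 302 response and ends at the 200 response for the .cl address.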

@fdb
Author

fdb commented Aug 16, 2011

frede...@burocrazy.com commented:

I understand that this is a difficult issue with many unknowns. I also respect if you think my proposal is too "single-purpose", and that you might be looking for a more general-purpose solution.

Still, I think this particular use case is quite difficult to write by hand, and it comes up frequently enough that some built-in support would be useful. If you have any ideas on how to solve this in a general way, I'd be happy to explore them.

@ariya
Owner

ariya commented Aug 16, 2011

ariya.hi...@gmail.com commented:

Not possible technically right now (limitation of QtWebKit API). Will keep it for future.

 
Metadata Updates

  • Label(s) removed:
    • Type-Defect
  • Label(s) added:
    • Type-Enhancement
  • Milestone updated: FutureRelease (was: ---)

@ariya
Owner

ariya commented Aug 22, 2011

ariya.hi...@gmail.com commented:

We probably can cache the HTTP status for the main resource and pass that along to the callback.
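In script form, that idea might look roughly like the following. It is a sketch only: `openWithStatus` is a made-up wrapper, not an existing PhantomJS function. It relies on page.open(url, callback) and page.onResourceReceived, matches the URL exactly, and does not follow redirects.

```javascript
// Hypothetical wrapper: remembers the HTTP status of the response whose URL
// equals the requested one and passes it to the callback as a second
// argument, so existing callbacks that only read `status` keep working.
function openWithStatus(page, url, callback) {
  var httpStatus = null;
  page.onResourceReceived = function (resource) {
    if (resource.url === url) {
      httpStatus = resource.status;
    }
  };
  page.open(url, function (status) {
    callback(status, httpStatus);
  });
}

// Usage in a PhantomJS script would look roughly like:
//   openWithStatus(page, 'http://example.com/', function (status, httpStatus) {
//     console.log(status + ' / ' + httpStatus);
//     phantom.exit();
//   });
```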

@ariya
Owner

ariya commented Aug 25, 2011

james.cr...@gmail.com commented:

I'd prefer a solution that covered a wider range of page load error cases than just non-200 http error codes.

If the requested page fails to load, was it because of:

  • A transient network error (so I should retry later)
  • An http error code returned by the server (maybe retry depending on the error code)
  • An error in phantomjs (it's a bug that needs reporting and fixing)

Would something like jquery's ajax function work for the phantomjs open function?: http://api.jquery.com/jQuery.ajax

This lets you have success and error function callbacks, with the error function being passed:

  • status: "timeout", "error", "abort", and "parsererror"
  • error: textual portion of the HTTP status, such as "Not Found" or "Internal Server Error."
  • request.status: http status code

$.ajax({
  type: "post", url: "http://google.com",
  success: function (data, text) {
    //...
  },
  error: function (request, status, error) {
    alert(request.responseText);
  }
});

So, we could have:

phantomjs.open('http://google.com', {
  success: function(resource) { ... },
  error: function(request, status, error) { ... }
});
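Until something like that exists, the three cases could be approximated in user script from the load status string plus an HTTP status captured separately (for example in onResourceReceived). The sketch below is illustrative only: `classifyLoad`, the `kind` labels and the retry policy are all assumptions, not a PhantomJS API.

```javascript
// Hypothetical classifier. loadStatus is the string passed to the load
// callback ('success' or 'fail'); httpStatus is the main resource's HTTP
// status code, or null if no response was seen.
function classifyLoad(loadStatus, httpStatus) {
  if (typeof httpStatus !== 'number') {
    // No HTTP response at all: probably a network-level problem.
    return { kind: 'network-error', retry: true, httpStatus: null };
  }
  if (httpStatus >= 500) {
    // Server-side error: assumed here to be worth retrying later.
    return { kind: 'http-error', retry: true, httpStatus: httpStatus };
  }
  if (httpStatus >= 400) {
    // Client error such as 404: retrying is unlikely to help.
    return { kind: 'http-error', retry: false, httpStatus: httpStatus };
  }
  if (loadStatus !== 'success') {
    // A response arrived but the page still failed to load.
    return { kind: 'load-error', retry: false, httpStatus: httpStatus };
  }
  return { kind: 'success', retry: false, httpStatus: httpStatus };
}
```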

@ariya
Owner

ariya commented Sep 7, 2011

bren...@liftopia.com commented:

I would highly recommend upping the priority on this, as it is a pivotal feature to making phantomjs a real player for web testing frameworks out there. I know I need this for my own use of phantomjs, and I'm sure there are many applications that would benefit greatly.

If the solution in #c13 is viable, then I would suggest just doing it. However if you're looking for a longer-term solution, the statement in #c12 is actually incorrect. This StackOverflow question shows how it is actually possible to get the HTTP status code: http://stackoverflow.com/questions/4330274/qtwebkit-how-to-check-http-status-code

The key code is really this:

httpStatus = reply->attribute(QNetworkRequest::HttpStatusCodeAttribute).toInt();

Sending that to the callback function would be an easy way to achieve this. And if you don't want to break backwards compatibility, then just add it as a second argument.

@ariya
Owner

ariya commented Sep 15, 2011

ariya.hi...@gmail.com commented:

Getting the status of any network reply is easy enough with the Qt API. However, my comment was mostly about getting the status for the reply associated with the loaded URL, because there is no API to detect that (hence the "limitation"). A web page with lots of resources produces multiple replies, and we need to manually determine which one is the main page's reply.

@ariya
Owner

ariya commented Sep 16, 2011

ariya.hi...@gmail.com commented:

Well, caching the resource associated with the loading URL is exactly like in comment #7.

However, the challenge is to handle redirection as well, just like in comment #8.

There is also issue 218.

@terramapserver

cont...@wildpeaks.fr commented:

If you want to detect redirections, this is not perfect (e.g. the magic number 200 milliseconds), but you could use a timeout + onLoadStarted to detect redirections if all you're interested in is the final destination:

var timeout = false;

page.onLoadStarted = function() {
  if (timeout) {
    console.log('Redirection detected');
    window.clearTimeout(timeout);
    timeout = false;
  }
};

page.onLoadFinished = function(status) {
  if (status !== "success") {
    phantom.exit(0);
  } else {
    timeout = window.setTimeout(function() {
      var title = page.evaluate(function() {
        return document.title;
      });
      console.log(title);
      phantom.exit(1);
    }, 200);
  }
};

page.open("http://www.example.com");

@xprudhomme

xavier.p...@gmail.com commented:

Instead of having PhantomJS return only the main resource's status code, why not simply return the main resource's full HTTP response headers?

Based on what I've just been reading here, my understanding is that there are some limitations due to the QtWebKit API. However, something as basic and widely used as the HTTP response headers should be accessible to all of us. In my case, I need to retrieve not only the HTTP status code but also the response content type and other HTTP response header fields. I'd rather open a new issue as an enhancement request for this...

@witsch

witsch commented Jun 7, 2012

a...@pyfidelity.com commented:

The following variation of the code in comment 7 delivers the final status code:

var theUrl = 'http://google.com/';
var page = new WebPage();
page.open(theUrl);
var theStatusCode = null;

page.onLoadFinished = function(status) {
  if (status === 'success') {
    var html = page.evaluate(function() {
      return document.body.innerHTML;
    });
    console.log(theStatusCode + '\n' + html);
    phantom.exit();
  }
};

page.onResourceReceived = function(resource) {
  if (resource.url == theUrl || (theStatusCode >= 300 && theStatusCode < 400)) {
    theStatusCode = resource.status;
  }
};

The diff(erence) is:

--- first-status-code.js 2012-06-07 01:20:47.000000000 +0200
+++ final-status-code.js 2012-06-07 01:16:57.000000000 +0200
@@ -14,7 +14,7 @@
 };

 page.onResourceReceived = function(resource) {
-  if (resource.url == theUrl) {
+  if (resource.url == theUrl || (theStatusCode >= 300 && theStatusCode < 400)) {
     theStatusCode = resource.status;
   }
 };

@ariya
Owner

ariya commented Sep 20, 2012

igo...@gmail.com commented:

if you want to detect the status code of the final resource (after redirects) you can set a handler on onUrlChanged; the final url is the one you want to check the status code for.

in my case, i was running into problems with urls with query strings. i would send a url with, for instance, '%24' in the query string. but the onUrlChanged handler would get the url with those escape codes replaced back with '/'. the onResourceReceived handler would continue getting urls with %24.

this might belong as a separate issue, but it made it impossible for me to figure out the final status code in many cases.
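One possible workaround for the mismatch is to compare the two URLs after decoding their percent-escapes instead of comparing the raw strings. This is only a sketch with hypothetical helper names (`decodeSafely`, `sameUrl`). Decoding can also make two genuinely different URLs look equal (an escaped and a literal slash, for instance), so treat a match as a hint rather than proof.

```javascript
// Hypothetical helper: decode percent-escapes, falling back to the raw
// string when the URL contains a malformed escape sequence.
function decodeSafely(url) {
  try {
    return decodeURIComponent(url);
  } catch (e) {
    return url;
  }
}

// Hypothetical helper: true when both URLs are equal after decoding.
function sameUrl(a, b) {
  return decodeSafely(a) === decodeSafely(b);
}

// In an onResourceReceived handler this could replace a strict comparison:
//   if (sameUrl(resource.url, finalUrl)) { theStatusCode = resource.status; }
```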

@JamesMGreene
Collaborator

james.m....@gmail.com commented:

Igor:
Could you please file a separate issue for the query string encoding differences in these handlers? Thank you!

@JamesMGreene
Collaborator

james.m....@gmail.com commented:

Issue 334 has been merged into this issue.

@tommoor

tommoor commented Feb 1, 2013

tom.moor@gmail.com commented:

Hey guys, any thoughts on progress of this feature - it would be great to know whether the webpage has loaded successfully or not.

@mbrio

mbrio commented Mar 27, 2014

Looking at what has been posted above, couldn't you keep track of all the responses and url changes to determine what the final page is?

var page = require('webpage').create(),
    url = 'http://www.google.com',
    pageResponses = {},
    finalUrl = url;

page.onResourceReceived = function(response) {
  pageResponses[response.url] = response;
};

page.onUrlChanged = function(targetUrl) {
  finalUrl = targetUrl;
};

page.open(url, function(status) {
  // Response recorded for the URL the page ended up on (undefined if none matched).
  var pageResponse = pageResponses[finalUrl];
});

kevinschaul added a commit to kevinschaul/depict that referenced this issue Sep 25, 2014
This is currently very annoying in phantomjs. For more information, see
this issue: ariya/phantomjs#10185
@zackw zackw removed this from the FutureRelease milestone Apr 19, 2015
@zackw
Contributor

zackw commented Apr 19, 2015

I have this problem too -- my crawler script ( https://github.com/zackw/tbbscraper/blob/master/scripts/pj-trace-redir.js ) goes to considerable length to figure out the "right" status for the top-level page, and I'm still not sure it's 100% accurate.

@zackw
Contributor

zackw commented Apr 20, 2015

Issue #10188, which I've just closed as a duplicate of this one, proposes to attach the resource object for the current page (that is, the information passed to onResourceReceived) to the WebPage object. I think that's enough design that someone could sit down and try to crank out some code. It might even be me, if no one gets to it first.

@Zertz

Zertz commented May 9, 2015

@mbrio Clever! Seems to work great in my own testing.

@ghost ghost removed 2.0 labels Jan 10, 2018
@ariya
Owner

ariya commented Dec 25, 2019

Due to our very limited maintenance capacity (see #14541 for more details), we need to prioritize our development focus on other tasks. Therefore, this issue will be closed. In the future, if we see the need to attend to this issue again, then it will be reopened.
Thank you for your contribution!

@ariya ariya closed this as completed Dec 25, 2019