Messed up Debugger using Scrapy #238

Closed
oli06 opened this issue May 12, 2020 · 3 comments
Labels
bug Something isn't working

Comments


oli06 commented May 12, 2020

Environment data

  • VS Code version: 1.45.0
  • Extension version (available under the Extensions sidebar): 2020.4.76186
  • OS and version: macOS Catalina 10.15.4
  • Python version (& distribution if applicable, e.g. Anaconda): 3.7.6
  • Type of virtual environment used (N/A | venv | virtualenv | conda | ...): N/A
  • Relevant/affected Python packages and their versions: Scrapy 1.6.0
  • Relevant/affected Python-related VS Code extensions and their versions: XXX
  • Jedi or Language Server? (i.e. what is "python.jediEnabled" set to; more info How to update the language server to the latest stable version vscode-python#3977): XXX
  • Value of the python.languageServer setting: Microsoft Python Language Server version 0.5.45.0

Expected behaviour

Debug as usual.

Actual behaviour

The debugger is messed up: lines of code appear to run one or two steps later than they should, or the Python interpreter crashes with errors that make no sense at all.

The interesting thing is that if I add a print("foo"), the debugger works fine again. If I remove the print(), the debugger gets confused again.
I'm not 100 percent sure that this issue is VS Code related; it may also be caused by a mistake in my own code.

Steps to reproduce:

  1. Create a new Scrapy spider (a minimal skeleton is sketched after this list)
  2. Add 'https://www.tagesschau.de/wirtschaft/coronavirus-fleischbetrieb-103.html' to the urls list.
  3. Add code below to the parse method
def parse(self, response):
    url = response.request.url

    named_references = {}
    text = ""

    # `content` was undefined in the original snippet; `response` is presumably meant here
    text_div = response.css('div.storywrapper div.sectionZ div.con div.modCon div.modParagraph')
    for tag in text_div:
        nodes = tag.xpath('./node()')
        nodes_name = nodes.xpath('name()').get()

        for node in nodes:
            child_nodes = node.xpath('./node()')
            for c in child_nodes:
                # print('debugger works again, if I uncomment this')
                # most of the time node_name is executed one line later than it should be
                node_name = c.xpath('name()').get()
                if node_name == 'a':
                    href = c.xpath('@href').get()  # relative href
                    # if the debugger / Python interpreter crashes completely, it happens here
                    # save the link in the dict, keyed by the link text (if it exists)
                    link_text = c.xpath('text()').get()
                    named_references[link_text.strip('\n') if link_text is not None else 'unknown'] = href
                    yield response.follow(c, callback=self.parseArticle)
                elif node_name is None:
                    text += c.get().strip('\n')

    yield {'url': url, 'named_references': named_references, 'text': text}
  4. Run the spider with scrapy crawl __spidername__
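
For completeness, a minimal spider along the following lines should be enough for the repro. This is only a sketch: the class name, the spider name, and the parseArticle stub are assumptions for illustration, not taken from the original report.

import scrapy


class TagesschauSpider(scrapy.Spider):
    # hypothetical name; substitute it for __spidername__ in `scrapy crawl`
    name = "tagesschau"
    start_urls = [
        'https://www.tagesschau.de/wirtschaft/coronavirus-fleischbetrieb-103.html',
    ]

    def parse(self, response):
        # paste the parse() body from step 3 here
        ...

    def parseArticle(self, response):
        # callback used by response.follow() in parse()
        ...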

Logs

Here is an example of the messed-up debugger behaviour:

for x in article_item['named_references']: #line 113, setting x to a key
    ref_url = article_item['named_references'][x]            
    yield response.follow(article_item['named_references'][x], callback=self.parseArticle) #this is line 115

File ".../spiders/tagesschau_spider.py", line 115, in parseArticle yield response.follow(article_item['named_references'][x], callback=self.parseArticle) UnboundLocalError: local variable 'x' referenced before assignment

karthiknadig transferred this issue from microsoft/vscode-python May 12, 2020

int19h commented Jun 16, 2020

Where does the second code snippet come from? (or where is it supposed to be placed for the repro?)

int19h added the bug label Jun 16, 2020

oli06 commented Jun 18, 2020

It was written by me.
named_references is a dictionary containing hyperlinks from articles:

named_references = {
    "The german football Bundesliga": "https://www.bundesliga.com/de/bundesliga",
    "German champion 2020 is Bayern Munich": "https://www.faz.net/aktuell/sport/fussball/bundesliga/fussball-bundesliga-bayern-muenchen-ist-deutscher-meister-2020-16809640.html",
}

I then iterate over each key-value pair and crawl the URL stored in each value.
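
For reference, here is a sketch of how the second snippet presumably sits inside the parseArticle callback. Everything except the loop itself (and the example dictionary entry above) is an assumption filled in for illustration.

def parseArticle(self, response):
    # hypothetical reconstruction: in the real spider, article_item is built from the
    # followed article page, the same way parse() builds its item
    article_item = {
        'url': response.request.url,
        'named_references': {
            # example entry taken from the comment above
            "The german football Bundesliga": "https://www.bundesliga.com/de/bundesliga",
        },
        'text': '',
    }

    for x in article_item['named_references']:  # line 113 in the report: x is set to a key
        ref_url = article_item['named_references'][x]
        # line 115 in the report, where the spurious UnboundLocalError is raised
        yield response.follow(ref_url, callback=self.parseArticle)

    yield article_item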


int19h commented Sep 23, 2020

I believe this was the same as #348, and is now fixed. Please re-open if it still repros with the most recent version of the debugger.

int19h closed this as completed Sep 23, 2020