Assignment 2 #6
Comments
On trying out the scrapy shell on the homepage 'https://www.pepperfry.com/', i.e. running "scrapy shell https://www.pepperfry.com/ --nolog", I get a 403 Access Denied response. As suggested both in the scrapy docs and on Stack Overflow, this is probably due to anti-scraping measures taken by the site. Can anyone please confirm whether they are facing the same issue? I assume getting around such measures wasn't the intention of the assignment. P.S. To make sure there was nothing wrong with my scrapy setup, I tried the same on other websites, which gave no such problems.
@VarunSrivastavaIITD Add the corresponding user agent, and it will work while scraping from the shell.
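A minimal sketch of that fix, assuming a standard scrapy project (the user-agent string below is only an example browser UA, not anything specific to the assignment):

```python
# settings.py -- example only; any realistic browser user-agent string should do
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
)
```

From the shell, the same setting can be passed directly: scrapy shell "https://www.pepperfry.com/" -s USER_AGENT="<the same string>" --nolog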
@ankursharma-iitd Thanks a lot, it works now.
Please reach out to the TAs as well if you have issues. The metadata can be just text, or it can be in the form <tag: text> if tags are available.
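For illustration, a metadata.txt in that form might look like the following (the tags here are hypothetical; use whatever fields the item page actually exposes):

```
<title: example item title>
<price: example price>
<material: example material>
<url: https://www.pepperfry.com/...>
```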
Do we have to download just two images per object, or as many as there are?
@Maxaravind @VinayKyatham could you please tell us the deadline (time and date) for this assignment? |
Hi all, the deadline for submitting the assignment is 11:59 pm, 12th September. Please make your submission as a tarball/zip file named Assignment_2_ELL881_. Please make the subject of your email Assignment_2_ELL881. Thanks
Is it mandatory to use scrapy? Can we not use bs4 or maybe selenium? |
@utkarsh1097 You are free to choose any tool you like. The only restriction is that you should be able to create a Python script that does the task automatically; you are not supposed to use GUI-based tools for scraping. You have to submit the Python script used for collecting the data as part of the assignment. So any Python library is OK, but we highly recommend that you go with scrapy.
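For anyone going the bs4 route, a minimal sketch of fetching a category page and collecting candidate item links (the category URL and the link filter are placeholders; inspect the real markup to narrow them down):

```python
import requests
from bs4 import BeautifulSoup

CATEGORY_URL = "https://www.pepperfry.com/..."   # placeholder category URL
HEADERS = {"User-Agent": "Mozilla/5.0"}          # same 403 workaround as discussed above

resp = requests.get(CATEGORY_URL, headers=HEADERS)
soup = BeautifulSoup(resp.text, "html.parser")

# Every absolute link on the page; narrow this down with a selector for the item cards.
item_links = {a["href"] for a in soup.select("a[href]") if a["href"].startswith("http")}
print(len(item_links))
```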
Please give the directory structure for the submission, as I am a bit confused. What exactly do I have to submit?
This is the directory structure:
PepperFry_data/    (top-level directory)
And regarding the submission, we will update you soon.
@Maxaravind Here it is mentioned that we have to submit just one script. Also, what exactly is expected in metadata.txt? It is not clear here. And isn't the metadata for one class rather than for one item?
@Maxaravind @VinayKyatham There is an issue with the scrapy image pipeline: it cannot download many of the images on pepperfry.com, giving the following error: OSError: cannot identify image file <_io.BytesIO object at 0x000001E3E7A94678>. When I request the image URL and write the body of the response to an image file, the image is not recognized by Windows Image Viewer, but it can be viewed in VSCode. However, when I use urllib to download the image and write the response to a file, it is readable by Windows Image Viewer. The problem with this approach is that it is much slower than downloading through scrapy's request mechanism. Interestingly, for one of the images where this issue occurred, even downloading it from the browser gives a file that Windows Image Viewer (and even PIL in Python) does not recognize, though VSCode opens it. (I am thinking it could also be an issue with some images on pepperfry.com, as I have not seen this problem yet on other websites.)
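For reference, the urllib fallback described above is roughly this (the image URL is a placeholder):

```python
import urllib.request

# Placeholder -- substitute the image URL scraped from the item page.
img_url = "https://www.pepperfry.com/..."
req = urllib.request.Request(img_url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as resp, open("item_image_1.jpg", "wb") as out:
    out.write(resp.read())
```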
It turns out the images are WebP images, and the Pillow installed on my system isn't able to read them. An online tool (and Google Chrome) was able to read the file as WebP, so I am assuming this is not a problem.
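If anyone else runs into this, a quick way to check whether the installed Pillow build has WebP support (assuming a reasonably recent Pillow):

```python
from PIL import features

# True only if this Pillow build was compiled with libwebp support.
print(features.check("webp"))
```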
Scrape pepperfry to create a dataset of these categories:
2-seater-sofa
bench
book-cases
coffee-table
dining-set
queen-beds
arm-chairs
chest-drawers
garden-seating
bean-bags
king-beds
For each category, I want students to extract up to 20 items (and not fewer than 10 items).
For each item, download more than one image (each item has multiple images in different poses)
and whatever metadata is available.
(Metadata may also be available in the URL.)
To do this we usually use scrapy:
https://doc.scrapy.org/en/latest/intro/tutorial.html
The idea is to find a link for each category, extract from each category page a set of links to items, and recursively parse them.
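A rough spider skeleton along those lines (the start URLs and CSS selectors below are placeholders, not the site's actual markup):

```python
import scrapy


class PepperfrySpider(scrapy.Spider):
    name = "pepperfry"
    # Placeholder: add one category URL per category listed above.
    start_urls = ["https://www.pepperfry.com/..."]

    def parse(self, response):
        # Category page: follow each item link and parse it recursively.
        # "a.item-link" is a placeholder selector; inspect the real item-card markup.
        for href in response.css("a.item-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        # Item page: collect every image URL plus whatever metadata is exposed.
        yield {
            "url": response.url,                      # metadata may also sit in the URL
            "title": response.css("h1::text").get(),  # placeholder selector
            "image_urls": response.css("img::attr(src)").getall(),
        }
```

Running it with scrapy crawl pepperfry -o items.json dumps the metadata; the image_urls field can then feed scrapy's ImagesPipeline (or a custom download step) if you choose to use it.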
The result should be dumped in a file structure:
category_name_dir/
    item_name_dir/
        item_image_1
        item_image_2
        metadata.txt
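A minimal sketch of writing one item into that layout (the function and argument names are just illustrative):

```python
import os


def save_item(base_dir, category, item_name, images, metadata):
    """Write image bytes and a metadata.txt for one item into the layout above.

    images: list of (filename, raw_bytes); metadata: dict mapping tag -> text.
    """
    item_dir = os.path.join(base_dir, category, item_name)
    os.makedirs(item_dir, exist_ok=True)
    for filename, data in images:
        with open(os.path.join(item_dir, filename), "wb") as f:
            f.write(data)
    with open(os.path.join(item_dir, "metadata.txt"), "w", encoding="utf-8") as f:
        for tag, text in metadata.items():
            f.write("<{}: {}>\n".format(tag, text))
```

For example, save_item("PepperFry_data", "2-seater-sofa", "item_01", images, metadata) produces PepperFry_data/2-seater-sofa/item_01/ with the images and metadata.txt inside.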
Deadline for this is next Wednesday: 12 Sept 2018. Please reach out to the TAs for help and submission guidelines.
Post here for clarification.