
Assignment 2 #6

Open
raghavsi opened this issue Sep 5, 2018 · 14 comments

@raghavsi
Collaborator

raghavsi commented Sep 5, 2018

Scrape pepperfry to create a dataset of these categories:
2-seater-sofa
bench
book-cases
coffee-table
dining-set
queen-beds
arm-chairs
chest-drawers
garden-seating
bean-bags
king-beds

For each category I want students to extract up to 20 items (and not fewer than 10 items).
For each item, download more than one image (each item has multiple images in different poses)
and whatever metadata is available.
(Metadata may also be embedded in the URL.)

To do this we usually use scrapy:
https://doc.scrapy.org/en/latest/intro/tutorial.html

The idea is to find a link for each category, extract from each category page a set of item links, and recursively parse them.
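The recursive link-following step can be sketched with just the standard library; in scrapy you would use `response.css(...)` selectors instead. Note the markup and the `"coffee-table"` slug below are invented for illustration; the real Pepperfry HTML will differ:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from anchor tags -- a stand-in for
    scrapy's response.css('a::attr(href)') selectors."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def extract_links(html, keyword):
    # Keep only links whose URL mentions the keyword
    # (e.g. a category slug like "coffee-table").
    parser = LinkCollector()
    parser.feed(html)
    return [link for link in parser.links if keyword in link]

# Toy page: this markup is made up for the example.
page = ('<a href="/furniture/coffee-table.html">Coffee tables</a>'
        '<a href="/about-us">About</a>')
print(extract_links(page, "coffee-table"))  # ['/furniture/coffee-table.html']
```

In a real spider the same filtering would feed `scrapy.Request` callbacks: category links yield item-page requests, and the item callback extracts images and metadata.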

The result should be dumped in a file structure:
category_name_dir/
    item_name_dir/
        item_image_1
        item_image_2
        metadata.txt

Deadline for this is next Wednesday : 12 Sept 2018. Please reach out to TAs for help and submission guidelines.
Post here for clarification.

@VarunSrivastavaIITD

On trying out the scrapy shell on the homepage 'https://www.pepperfry.com/', i.e. "scrapy shell https://www.pepperfry.com/ --nolog" I get a 403 access denied response.

As suggested both on the scrapy docs as well as SO, this is probably due to anti scraping measures taken by the site. Can anyone please confirm if they are facing the same issue, since I assume getting around such measures wasn't the intention of the assignment?

P.S. To ensure there was nothing wrong with my scrapy setup, I tried the same on other websites too, which gave no such problems.

@ankursharma-iitd

ankursharma-iitd commented Sep 8, 2018

@VarunSrivastavaIITD Add the corresponding user agent, and it will work while scraping from the shell.
https://stackoverflow.com/questions/48033398/unable-to-scrape-snapdeal-data-using-scrapy
Can someone please elaborate on the dataset format and the corresponding file structure?
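For anyone else hitting the 403: the fix from that SO answer amounts to setting a browser-like user agent. A sketch of both ways to set it (the UA string below is just an example; any common browser string should do):

```python
# In the scrapy project's settings.py:
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Or per-invocation on the shell, via the -s settings flag:
#   scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64)" "https://www.pepperfry.com/"
```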

@VarunSrivastavaIITD

@ankursharma-iitd Thanks a lot, it works now.

@raghavsi
Collaborator Author

raghavsi commented Sep 9, 2018

Please reach out to TAs also if you have issues.
PepperFry_data/
    Bench/
        Item1/
            Image1.jpg
            Image2.jpg
            ...
            metadata.txt

Metadata can be plain text, or <tag: text> pairs if tags are available.
If metadata is available or stored in JSON format, that is also acceptable.
BTW there may be some metadata available for each image, e.g. image1 is "front", image2 is "back" -- that could also be useful.
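A minimal sketch of dumping one item in that layout (the category, item name, metadata fields, and image bytes here are all placeholders; the real values come from the scraped pages):

```python
import json
import tempfile
from pathlib import Path

def dump_item(root, category, item, images, metadata):
    """Write images and a single metadata.txt under
    root/category/item/, matching the required layout."""
    item_dir = Path(root) / category / item
    item_dir.mkdir(parents=True, exist_ok=True)
    for i, data in enumerate(images, start=1):
        (item_dir / f"Image{i}.jpg").write_bytes(data)
    # JSON is explicitly allowed for the metadata file.
    (item_dir / "metadata.txt").write_text(json.dumps(metadata, indent=2))
    return item_dir

out = dump_item(tempfile.mkdtemp(), "Bench", "Item1",
                [b"\xff\xd8fake-jpeg-bytes"],  # placeholder image data
                {"name": "Item1", "poses": {"Image1.jpg": "front"}})
print(sorted(p.name for p in out.iterdir()))  # ['Image1.jpg', 'metadata.txt']
```

In a scrapy project this would typically live in an item pipeline, called once per scraped item.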

@abhudev

abhudev commented Sep 9, 2018

Do we have to download just two images per object, or as many as are there?

@Adi-iitd

Adi-iitd commented Sep 9, 2018

@Maxaravind @VinayKyatham could you please tell us the deadline (time and date) for this assignment?

@Maxaravind

Hi all,

The deadline for submitting the assignment is 11:59 pm 12th September. Please make your submission as a tarball/zip file named as Assignment_2_ELL881_. Please make the subject of your email as Assignment_2_ELL881.

Thanks

@utkarsh1097

Is it mandatory to use scrapy? Can we not use bs4 or maybe selenium?

@Maxaravind

@utkarsh1097 You are free to choose any tool. The only restriction is that you should be able to create a Python script that does the task automatically. You are not supposed to use GUI-based tools for scraping. You have to submit the Python script for collecting the data as part of the assignment, so any Python library is OK. But we highly recommend that you go with scrapy.

@anshumitts

Please give the directory structure for submission, as I am a bit confused. What do I have to submit?
The Python code, the metadata, or both? Also, please specify the ID to which we have to submit the data.

@Maxaravind

@anshumitts

This is the directory structure:

PepperFry_data/        -- top-level directory
    Bench/             -- category
        Item1/         -- item
            Image1.jpg
            Image2.jpg
            ...
            metadata.txt   -- one and only one should be present in each item folder

And regarding the submission, we will update you soon.

@anshumitts

anshumitts commented Sep 10, 2018

@Maxaravind Here it is mentioned that we have to submit just one script. And what is expected in metadata.txt? It is not clear here. Also, isn't the metadata per class rather than per item?

@abhudev
Copy link

abhudev commented Sep 11, 2018

@Maxaravind @VinayKyatham there is an issue in the scrapy image pipeline - it cannot download many of the images in pepperfry.com, giving the following error:

OSError: cannot identify image file <_io.BytesIO object at 0x000001E3E7A94678>

When I request the image URL and write the body of the response to an image file, the image is not recognized by Windows image viewer, but can be viewed in VS Code.

However, when I use urllib to download the image and write the response to file, it is readable by Windows image viewer. The problem with this approach is that it is much, much slower than downloading via scrapy's request mechanism.

One of the images for which this issue was coming:
https://ii1.pepperfry.com/media/catalog/product/l/o/494x544/lounge-chair-in-high-quality-wicker-by-ventura-lounge-chair-in-high-quality-wicker-by-ventura-ertieh.jpg

Interestingly, even when I download this image from browser, it is not recognized by Windows Image viewer (or even PIL in python) but is opened in VScode.

(I am thinking it could also be an issue with some images in pepperfry.com, as I have not seen this problem yet on other websites)

@abhudev

abhudev commented Sep 11, 2018

Turns out the images are WebP images, and the Pillow installed on my system isn't able to read them. An online tool (and Google Chrome) was able to read them as WebP, so I am assuming this is not a problem.
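For anyone else hitting this: WebP files sit in a RIFF container, so they can be identified from the first few bytes without Pillow. A sketch (the sample bytes below are synthetic, not a real image):

```python
def is_webp(data: bytes) -> bool:
    # WebP = RIFF container: bytes 0-3 are "RIFF", bytes 8-11 are "WEBP".
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP"

# Synthetic header: "RIFF" + 4-byte chunk size + "WEBP" + chunk tag.
sample = b"RIFF" + (1000).to_bytes(4, "little") + b"WEBPVP8 "
print(is_webp(sample))          # True
print(is_webp(b"\x89PNG\r\n"))  # False
```

Recent Pillow releases ship with WebP support (when built against libwebp), so upgrading Pillow may also resolve the original `OSError`.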
