
Assignment 2 #6

Open
raghavsi opened this issue Sep 5, 2018 · 14 comments

@raghavsi
Collaborator

raghavsi commented Sep 5, 2018

Scrape pepperfry to create a dataset of these categories:
2-seater-sofa
bench
book-cases
coffee-table
dining-set
queen-beds
arm-chairs
chest-drawers
garden-seating
bean-bags
king-beds

For each category I want students to extract up to 20 items (and not fewer than 10 items).
For each item, download more than one image (each item has multiple images in different poses)
and whatever metadata is available.
(Metadata may also be embedded in the URL.)

To do this we usually use scrapy:
https://doc.scrapy.org/en/latest/intro/tutorial.html

The idea is to find a link for each category, extract from each category page a set of item links, and recursively parse them.
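The recursive link-following step can be sketched with just the standard library; in scrapy you would use `response.css(...)` selectors instead. Note the markup and the `"coffee-table"` slug below are invented for illustration; the real Pepperfry HTML will differ:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from anchor tags -- a stand-in for
    scrapy's response.css('a::attr(href)') selectors."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def extract_links(html, keyword):
    # Keep only links whose URL mentions the keyword
    # (e.g. a category slug like "coffee-table").
    parser = LinkCollector()
    parser.feed(html)
    return [link for link in parser.links if keyword in link]

# Toy page: this markup is made up for the example.
page = ('<a href="/furniture/coffee-table.html">Coffee tables</a>'
        '<a href="/about-us">About</a>')
print(extract_links(page, "coffee-table"))  # ['/furniture/coffee-table.html']
```

In a real spider the same filtering would feed `scrapy.Request` callbacks: category links yield item-page requests, and the item callback extracts images and metadata.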

The result should be dumped in a file structure:
category_name_dir/
    item_name_dir/
        item_image_1
        item_image_2
        metadata.txt

Deadline for this is next Wednesday : 12 Sept 2018. Please reach out to TAs for help and submission guidelines.
Post here for clarification.

@VarunSrivastavaIITD

On trying out the scrapy shell on the homepage 'https://www.pepperfry.com/', i.e. "scrapy shell https://www.pepperfry.com/ --nolog" I get a 403 access denied response.

As suggested both on the scrapy docs as well as SO, this is probably due to anti scraping measures taken by the site. Can anyone please confirm if they are facing the same issue, since I assume getting around such measures wasn't the intention of the assignment?

P.S. To ensure there was nothing wrong with my scrapy setup, I tried the same on other websites too, which gave no such problems.

@ankursharma-iitd

ankursharma-iitd commented Sep 8, 2018

@VarunSrivastavaIITD Add the corresponding user agent, and it will work while scraping from the shell.
https://stackoverflow.com/questions/48033398/unable-to-scrape-snapdeal-data-using-scrapy
Can someone please elaborate on the dataset format and the corresponding file structure?
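For anyone else hitting the 403: the fix from that SO answer amounts to setting a browser-like user agent. A sketch of both ways to set it (the UA string below is just an example; any common browser string should do):

```python
# In the scrapy project's settings.py:
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Or per-invocation on the shell, via the -s settings flag:
#   scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64)" "https://www.pepperfry.com/"
```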

@VarunSrivastavaIITD

@ankursharma-iitd Thanks a lot, it works now.

@raghavsi
Collaborator Author

raghavsi commented Sep 9, 2018

Please reach out to TAs also if you have issues.
PepperFry_data/
    Bench/
        Item1/
            Image1.jpg
            Image2.jpg
            ...
            metadata.txt

Metadata can be plain text, or <tag: text> pairs if tags are available.
If metadata is available or stored in JSON format, that is also acceptable.
BTW there may be some metadata available for each image, e.g. image1 is "front", image2 is "back" -- that could also be useful.
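A minimal sketch of dumping one item in that layout (the category, item name, metadata fields, and image bytes here are all placeholders; the real values come from the scraped pages):

```python
import json
import tempfile
from pathlib import Path

def dump_item(root, category, item, images, metadata):
    """Write images and a single metadata.txt under
    root/category/item/, matching the required layout."""
    item_dir = Path(root) / category / item
    item_dir.mkdir(parents=True, exist_ok=True)
    for i, data in enumerate(images, start=1):
        (item_dir / f"Image{i}.jpg").write_bytes(data)
    # JSON is explicitly allowed for the metadata file.
    (item_dir / "metadata.txt").write_text(json.dumps(metadata, indent=2))
    return item_dir

out = dump_item(tempfile.mkdtemp(), "Bench", "Item1",
                [b"\xff\xd8fake-jpeg-bytes"],  # placeholder image data
                {"name": "Item1", "poses": {"Image1.jpg": "front"}})
print(sorted(p.name for p in out.iterdir()))  # ['Image1.jpg', 'metadata.txt']
```

In a scrapy project this would typically live in an item pipeline, called once per scraped item.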

@abhudev

abhudev commented Sep 9, 2018

Do we have to download just two images per object, or as many as are there?

@Adi-iitd

Adi-iitd commented Sep 9, 2018

@Maxaravind @VinayKyatham could you please tell us the deadline (time and date) for this assignment?

@Maxaravind

Hi all,

The deadline for submitting the assignment is 11:59 pm 12th September. Please make your submission as a tarball/zip file named as Assignment_2_ELL881_. Please make the subject of your email as Assignment_2_ELL881.

Thanks

@utkarsh1097

Is it mandatory to use scrapy? Can we not use bs4 or maybe selenium?

@Maxaravind

@utkarsh1097 You are free to choose any tool. The only restriction is that you should be able to create a Python script that does the task automatically. You are not supposed to use GUI-based tools for scraping. You have to submit the Python script for collecting the data as part of the assignment, so any Python library is OK. But we highly recommend that you go with scrapy.

@anshumitts

Please give the directory structure for submission, as I am a bit confused. What do I have to submit?
The Python code, the metadata, or both? Also, please specify the ID to which we have to submit the data.

@Maxaravind

@anshumitts

This is the directory structure:

PepperFry_data/        -- top-level directory
    Bench/             -- category
        Item1/         -- item
            Image1.jpg
            Image2.jpg
            ...
            metadata.txt   -- one and only one should be present in each item folder

And regarding the submission, we will update you soon.

@anshumitts

anshumitts commented Sep 10, 2018

@Maxaravind Here it is mentioned that we have to submit just one script. And what is expected in metadata.txt? It is not clear here. Also, isn't the metadata per class rather than per item?

@abhudev
Copy link

abhudev commented Sep 11, 2018

@Maxaravind @VinayKyatham there is an issue in the scrapy image pipeline - it cannot download many of the images in pepperfry.com, giving the following error:

OSError: cannot identify image file <_io.BytesIO object at 0x000001E3E7A94678>

When I request the image URL and write the body of the response to an image file, the image is not recognized by Windows image viewer, but can be viewed in VS Code.

However, when I use urllib to download the image and write the response to file, it is readable by Windows image viewer. The problem with this approach is that it is much, much slower than downloading via scrapy's request mechanism.

One of the images for which this issue was coming:
https://ii1.pepperfry.com/media/catalog/product/l/o/494x544/lounge-chair-in-high-quality-wicker-by-ventura-lounge-chair-in-high-quality-wicker-by-ventura-ertieh.jpg

Interestingly, even when I download this image from browser, it is not recognized by Windows Image viewer (or even PIL in python) but is opened in VScode.

(I am thinking it could also be an issue with some images in pepperfry.com, as I have not seen this problem yet on other websites)

@abhudev

abhudev commented Sep 11, 2018

Turns out the images are WebP images, and the Pillow installed on my system isn't able to read them. An online tool (and Google Chrome) was able to read them as WebP, so I am assuming this is not a problem.
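For anyone else hitting this: WebP files sit in a RIFF container, so they can be identified from the first few bytes without Pillow. A sketch (the sample bytes below are synthetic, not a real image):

```python
def is_webp(data: bytes) -> bool:
    # WebP = RIFF container: bytes 0-3 are "RIFF", bytes 8-11 are "WEBP".
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP"

# Synthetic header: "RIFF" + 4-byte chunk size + "WEBP" + chunk tag.
sample = b"RIFF" + (1000).to_bytes(4, "little") + b"WEBPVP8 "
print(is_webp(sample))          # True
print(is_webp(b"\x89PNG\r\n"))  # False
```

Recent Pillow releases ship with WebP support (when built against libwebp), so upgrading Pillow may also resolve the original `OSError`.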
