Add PDF extractor #557

Merged
merged 1 commit into from
Jul 11, 2019
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -15,6 +15,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Auto annotation using Faster R-CNN with Inception v2 (utils/open_model_zoo)
- Auto annotation using Pixel Link mobilenet v2 - text detection (utils/open_model_zoo)
- Ability to create a custom extractors for unsupported media types
- Added in PDF extractor

### Changed
- Outside and keyframe buttons in the side panel for all interpolation shapes (they were only for boxes before)
3 changes: 3 additions & 0 deletions cvat/apps/engine/media.mimetypes
@@ -220,3 +220,6 @@ application/x-tarz tar.z
application/x-tzo tar.lzo
application/x-xz-compressed-tar txz
application/zip zip

# PDF
application/pdf pdf
60 changes: 60 additions & 0 deletions cvat/apps/engine/media_extractors.py
@@ -72,6 +72,56 @@ def save_image(self, k, dest_path):
        image.close()
        return width, height

class PDFExtractor(MediaExtractor):
@nmanovic (Contributor), Jul 9, 2019:

I would look at the ArchiveExtractor implementation and inherit this class from DirectoryExtractor, implementing the _extract method here. What do you think?

@benhoff (Contributor, Author), Jul 9, 2019:

I used VideoExtractor as the basis here because, in my case, PDFs can have multiple pages. Is there a better way to handle multi-page PDFs?
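
One way to read the suggestion in this thread is sketched below: an extraction step writes one image per page into a directory, and a directory-based extractor then lists those files, so multi-page documents need no special handling. `DirectoryLister`, `PagedDocumentExtractor`, and the `render_pages` callable are hypothetical stand-ins and not CVAT classes; in this PR the rendering step would be `pdf2image.convert_from_path`, and the real DirectoryExtractor and ArchiveExtractor may be structured differently.

```python
import os
import tempfile


class DirectoryLister:
    """Hypothetical stand-in for a directory-based extractor: lists image files in sorted order."""

    def __init__(self, directory):
        self._paths = sorted(
            os.path.join(directory, name)
            for name in os.listdir(directory)
            if name.lower().endswith(('.jpg', '.jpeg', '.png'))
        )

    def __len__(self):
        return len(self._paths)

    def __getitem__(self, k):
        return self._paths[k]


class PagedDocumentExtractor(DirectoryLister):
    """Hypothetical: write every page to a file first, then reuse the directory listing."""

    def __init__(self, source_path, dest_dir, render_pages):
        self._extract(source_path, dest_dir, render_pages)
        super().__init__(dest_dir)

    @staticmethod
    def _extract(source_path, dest_dir, render_pages):
        # render_pages is an assumed callable that yields one bytes object per page.
        for page_num, data in enumerate(render_pages(source_path)):
            # Zero-padded names keep sorted() in page order beyond page 9.
            name = f'page_{page_num:06d}.jpg'
            with open(os.path.join(dest_dir, name), 'wb') as f:
                f.write(data)
```

Because pages become ordinary files, length and indexing come from the base class, and the subclass only has to implement the extraction step.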

    def __init__(self, source_path, dest_path, image_quality, step=1, start=0, stop=0):
        if not source_path:
            raise Exception('No PDF found')

        from pdf2image import convert_from_path
        self._temp_directory = tempfile.mkdtemp(prefix='cvat-')
        super().__init__(
            source_path=source_path[0],
Contributor:

Why source_path[0]?

@benhoff (Contributor, Author), Jul 9, 2019:

I'm not sure why source_path is a list here, but I believe that, given the implementation, it will be a list with a single item. I didn't dive into the overall architecture for custom extractors.

Contributor:

@benhoff, you specified in the description of the PDF extractor that multiple PDF documents can be uploaded. For the video extractor, the unique flag is True.

Contributor:

An extractor's constructor always receives a list as the source_path argument. I don't see any problem here, but the extractor is responsible for correctly handling the source list it is passed. Could you please adjust the description to match the extractor's behaviour? I mean the case where you try to create a task with several PDF files but only one of them is used.

'pdf': {
  ...
  'unique': **False**
},

Maybe it would be better to change the behaviour and pass the constructor either a list or a single item, according to its description. I'll think about that.

@benhoff (Contributor, Author):

I changed unique to True. In my case, PDFs can have multiple pages, and I want to be able to flip through them like a video. Let me know if you think the implementation should be different.

            dest_path=dest_path,
            image_quality=image_quality,
            step=1,
            start=0,
            stop=0,
        )

        self._dimensions = []
        file_ = convert_from_path(self._source_path)
        self._basename = os.path.splitext(os.path.basename(self._source_path))[0]
        for page_num, page in enumerate(file_):
            output = os.path.join(self._temp_directory, self._basename + f'{page_num}' + '.jpg')
            self._dimensions.append(page.size)
            page.save(output, 'JPEG')

        self._length = len(os.listdir(self._temp_directory))

    def _get_imagepath(self, k):
        img_path = os.path.join(self._temp_directory, self._basename + f'{k}' + '.jpg')
        return img_path

    def __iter__(self):
        i = 0
        while os.path.exists(self._get_imagepath(i)):
            yield self._get_imagepath(i)
            i += 1

    def __del__(self):
        if self._temp_directory:
            shutil.rmtree(self._temp_directory)

    def __getitem__(self, k):
        return self._get_imagepath(k)

    def __len__(self):
        return self._length

    def save_image(self, k, dest_path):
        shutil.copyfile(self[k], dest_path)
        return self._dimensions[k]
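
The extractor above keeps its page images in a private temporary directory and removes it in `__del__`. If `__init__` raises before `_temp_directory` is assigned (for example on the empty-source check), `__del__` would hit a missing attribute. A minimal sketch of a more defensive lifecycle follows; `TempPageStore` and its `close` method are hypothetical names used only for illustration and are not part of this PR.

```python
import os
import shutil
import tempfile


class TempPageStore:
    """Hypothetical holder for per-page images in a private temporary directory."""

    def __init__(self, basename):
        if not basename:
            raise ValueError('basename must be non-empty')
        self._basename = basename
        self._temp_directory = tempfile.mkdtemp(prefix='cvat-')

    def page_path(self, k):
        # Same naming scheme as the extractor above: <basename><page index>.jpg
        return os.path.join(self._temp_directory, f'{self._basename}{k}.jpg')

    def close(self):
        # getattr guards against __init__ having raised before the attribute was set.
        temp_directory = getattr(self, '_temp_directory', None)
        if temp_directory:
            shutil.rmtree(temp_directory, ignore_errors=True)
            self._temp_directory = None

    def __del__(self):
        self.close()
```

An explicit `close()` lets the caller release the directory at a known point, with `__del__` kept only as a fallback; calling `close()` twice is harmless.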

#Note step, start, stop have no affect
class DirectoryExtractor(ImageListExtractor):
    def __init__(self, source_path, dest_path, image_quality, step=1, start=0, stop=0):
@@ -180,6 +230,10 @@ def _is_image(path):
def _is_dir(path):
    return os.path.isdir(path)

def _is_pdf(path):
    mime = mimetypes.guess_type(path)
    return mime[0] == 'application/pdf'

# 'has_mime_type': function receives 1 argument - path to file.
# Should return True if file has specified media type.
# 'extractor': class that extracts images from specified media.
@@ -213,4 +267,10 @@ def _is_dir(path):
        'mode': 'annotation',
        'unique': False,
    },
    'pdf': {
        'has_mime_type': _is_pdf,
        'extractor': PDFExtractor,
        'mode': 'annotation',
        'unique': True,
    },
}
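
To illustrate how a registry entry like `'pdf'` might be consumed, here is a minimal sketch of dispatching on `has_mime_type` and enforcing `unique`. `MEDIA_TYPES`, `find_media_type`, and `check_sources` are hypothetical names, the extractor class is replaced by a string, and the sketch uses the standard library's default MIME table instead of the project's `media.mimetypes` file. Reading `unique` as "exactly one source file per task" is an assumption drawn from the review thread on `source_path[0]`.

```python
import mimetypes


def _is_pdf(path):
    # guess_type looks only at the file name, never at the file contents.
    return mimetypes.guess_type(path)[0] == 'application/pdf'


# Hypothetical, trimmed-down registry; the extractor class is replaced by a string.
MEDIA_TYPES = {
    'pdf': {'has_mime_type': _is_pdf, 'extractor': 'PDFExtractor', 'unique': True},
}


def find_media_type(path, registry=MEDIA_TYPES):
    """Return the first registry key whose predicate accepts the path, else None."""
    for name, entry in registry.items():
        if entry['has_mime_type'](path):
            return name
    return None


def check_sources(media_type, source_paths, registry=MEDIA_TYPES):
    """Reject several files for a 'unique' type instead of silently using the first."""
    if not source_paths:
        raise ValueError('No source files were provided')
    if registry[media_type]['unique'] and len(source_paths) > 1:
        raise ValueError(f'{media_type!r} accepts one file, got {len(source_paths)}')
    return list(source_paths)
```

Failing loudly on a second file is one possible answer to the concern raised in review, where a task created with several PDFs would otherwise use only the first.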
1 change: 1 addition & 0 deletions cvat/requirements/base.txt
@@ -33,3 +33,4 @@ djangorestframework==3.9.1
Pygments==2.3.1
drf-yasg==1.15.0
Shapely==1.6.4.post2
pdf2image==1.6.0