This repository has been archived by the owner on Mar 20, 2023. It is now read-only.

blobxfer broken Azure File download #255

Closed
veonua opened this issue Dec 23, 2018 · 2 comments

veonua commented Dec 23, 2018

Problem Description

I have 140 files to OCR. 57 of them do not seem to be downloaded fully,
and OCR fails with:

Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
Error in readHeaderMemJp2k: image parameters not found
Error in pixReadStreamJp2k: failed to read the header
Error in pixReadStream: jp2: no pix returned
Error in pixRead: pix not read
Error during processing.

Batch Shipyard Version

3.6.1

Expected Results

All files are downloaded in full.

Actual Results

Only part of each affected file is copied.

Redacted Configuration

jobs

job_specifications:
- id: ocr
  tasks:  
  - docker_image: tesseractshadow/tesseract4re
    
    task_factory:
        file:
          azure_storage:
            storage_account_settings: mystorageaccount
            remote_path: test
            is_file_share: true
            include:
            - '*_1.jpe'
            
          task_filepath: file_name

    command: /bin/bash -c "install -Dv /dev/null {file_path} | tesseract {file_name} {file_path}"
    output_data:
      azure_storage:
      - storage_account_settings: mystorageaccount
        remote_path: test
        local_path: $AZ_BATCH_TASK_WORKING_DIR/
        is_file_share: true
        include:
        - "*.txt"

  merge_task:
    docker_image: python:3.7-alpine3.7
    input_data:
      azure_storage:
      - storage_account_settings: mystorageaccount
        remote_path: test
        is_file_share: true
        blobxfer_extra_options: '--strip-components 2'
    command: /bin/sh -c "cat ./*/*/*/*.txt > results.txt"
    output_data:
      azure_storage:
      - storage_account_settings: mystorageaccount
        remote_path: output/results
        is_file_share: true
        local_path: $AZ_BATCH_TASK_WORKING_DIR/results.txt

config

batch_shipyard:
  storage_account_settings: mystorageaccount 
global_resources:
  docker_images:
  - tesseractshadow/tesseract4re
  - python:3.7-alpine3.7

pool

pool_specification:
  id: poolf3234
  virtual_network:
    arm_subnet_id: /subscriptions/82a0c17e-006b-470a-967d-f5f4096fe264/resourceGroups/rdtestenv-rg/providers/Microsoft.Network/virtualNetworks/rdtestenv-vnet3/subnets/labvm-subnet3
    
  vm_configuration:
    platform_image:
      offer: UbuntuServer
      publisher: Canonical
      sku: 18.04-LTS

  vm_count:
    dedicated: 1
    low_priority: 0
  vm_size: STANDARD_D1_V2
  ssh:
    username: shipyard

Additional Logs

stdout

2018-12-23 21:21:02.354 INFO - 
============================================
         Azure blobxfer parameters
============================================
         blobxfer version: 1.5.5
                 platform: Linux-4.15.0-1035-azure-x86_64-with
               components: CPython=3.6.6-64bit azstor.blob=1.4.0 azstor.file=1.4.0 crypt=2.4.1 req=2.20.1
       transfer direction: Azure -> local
                  workers: disk=4 xfer=3 md5=0 crypto=0
                 log file: None
                  dry run: False
              resume file: None
                  timeout: connect=10 read=200 max_retries=1000
                     mode: StorageModes.File
                  skip on: fs_match=False lmt_ge=False md5=False
        delete extraneous: False
                overwrite: True
                recursive: True
            rename single: True
         chunk size bytes: 0
         strip components: 0
         compute file md5: False
       restore properties: attr=False lmt=False
          rsa private key: None
        local destination: /mnt/batch/tasks/workitems/ocrjcdssadww/job-1/task-00000/wd/4690158_1.jpe
============================================
2018-12-23 21:21:02.357 INFO - blobxfer start time: 2018-12-23 21:21:02.357239+00:00
2018-12-23 21:21:02.388 DEBUG - dest is_dir=False for 1 specs
2018-12-23 21:21:02.389 INFO - downloading blobs/files to local path: /mnt/batch/tasks/workitems/ocrjcdssadww/job-1/task-00000/wd/4690158_1.jpe
2018-12-23 21:21:02.389 DEBUG - spawning 3 transfer threads
2018-12-23 21:21:02.415 DEBUG - spawning 4 disk threads
2018-12-23 21:21:02.628 INFO - MD5: SKIPPED, test/DN/invoices/998/4670117_1.tif None <L..R> None
2018-12-23 21:21:02.696 INFO - MD5: SKIPPED, test/DN/invoices/998/4670117_2.tif None <L..R> None
2018-12-23 21:21:02.779 DEBUG - 0 files 0.0000 MiB filesize and/or lmt_ge skipped
2018-12-23 21:21:02.780 DEBUG - 21 remote files processed, waiting for download completion of approx. 0.6656 MiB
2018-12-23 21:21:02.850 ERROR - exceptions encountered while downloading
2018-12-23 21:21:02.850 ERROR - PosixPath('/mnt/batch/tasks/workitems/ocrjcdssadww/job-1/task-00000/wd/4690158_1.jpe')
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/blobxfer-1.5.5-py3.6.egg/blobxfer/operations/download.py", line 873, in start
    self._run()
  File "/usr/lib/python3.6/site-packages/blobxfer-1.5.5-py3.6.egg/blobxfer/operations/download.py", line 833, in _run
    raise self._exceptions[0]
  File "/usr/lib/python3.6/site-packages/blobxfer-1.5.5-py3.6.egg/blobxfer/operations/download.py", line 494, in _worker_thread_transfer
    self._process_download_descriptor(dd)
  File "/usr/lib/python3.6/site-packages/blobxfer-1.5.5-py3.6.egg/blobxfer/operations/download.py", line 584, in _process_download_descriptor
    self._transfer_cc[dd.final_path] -= 1
KeyError: PosixPath('/mnt/batch/tasks/workitems/ocrjcdssadww/job-1/task-00000/wd/4690158_1.jpe')

Additional Comments

The original file is a 102.2 kB JPEG, but only 37.2 kB was copied, apparently from some buffer.
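
A size gap like this can be checked across the whole batch. The following is a minimal sketch, not part of blobxfer or Batch Shipyard: it assumes a known-good copy of the share is reachable as a local directory (for example a mounted share) and compares each file there against what landed in the download directory. The function names, the directory arguments, and the default pattern are illustrative assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large images are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_bad_downloads(reference_dir, download_dir, pattern="*_1.jpe"):
    """Return (relative_name, reference_size, downloaded_size) per mismatch.

    reference_dir: hypothetical known-good copy (e.g. the mounted share).
    download_dir:  hypothetical directory the transfer tool wrote into.
    A downloaded_size of -1 means the file is missing entirely.
    """
    reference_dir, download_dir = Path(reference_dir), Path(download_dir)
    bad = []
    for ref in sorted(reference_dir.rglob(pattern)):
        rel = ref.relative_to(reference_dir)
        got = download_dir / rel
        ref_size = ref.stat().st_size
        got_size = got.stat().st_size if got.is_file() else -1
        # Compare sizes first; only hash when the sizes agree.
        if got_size != ref_size or sha256_of(got) != sha256_of(ref):
            bad.append((rel.as_posix(), ref_size, got_size))
    return bad
```

Any entry whose downloaded size is smaller than the reference size points at a truncated transfer; a size of -1 means the file never arrived.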

veonua (Author) commented Dec 24, 2018

cp /mnt/batch/tasks/mounts/azfile-storage-test/{file_path} image
produces a good file, while blobxfer copies garbage.

cp $AZ_BATCH_NODE_SHARED_DIR/test/{file_path} image
fails with "file not found".

batch_shipyard:
  storage_account_settings: mystorageaccount 
global_resources:
  docker_images:
  - tesseractshadow/tesseract4re
  - python:3.7-alpine3.7
  volumes:
    shared_data_volumes:
      azurefile_vol:
        volume_driver: azurefile
        storage_account_settings: mystorageaccount
        azure_file_share_name: test
        container_path: $AZ_BATCH_NODE_SHARED_DIR/test
        mount_options:
        - file_mode=0777
        - dir_mode=0777
        bind_options: rw

alfpark (Collaborator) commented Jan 10, 2019

This will be fixed when the blobxfer issue is resolved. As a workaround, mount the Azure File share as a shared_data_volume and directly copy.
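
For reference, the "directly copy" step could look like the sketch below. It assumes the share is already mounted on the node; the mount root in the usage comment is the path reported earlier in this thread and may differ on other pools. `copy_from_share` is a hypothetical helper, not a Batch Shipyard or blobxfer API.

```python
import shutil
from pathlib import Path


def copy_from_share(mount_root, relative_path, dest):
    """Copy one file from a mounted share and confirm the byte count.

    mount_root:    where the Azure File share is mounted (assumption).
    relative_path: path of the file inside the share.
    dest:          local destination file, e.g. in the task working dir.
    """
    src = Path(mount_root) / relative_path
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dest)
    expected, actual = src.stat().st_size, dest.stat().st_size
    if actual != expected:
        raise IOError(f"short copy of {src}: {actual} of {expected} bytes")
    return actual


# Example call (paths are illustrative):
# copy_from_share("/mnt/batch/tasks/mounts/azfile-storage-test",
#                 "DN/invoices/998/4690158_1.jpe", "image")
```

The size check after the copy is a cheap guard so that a short file fails the task immediately instead of surfacing later as an OCR error.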

alfpark changed the title from "it seems blobxfer does not copy file completely" to "blobxfer broken Azure File download" on Jan 10, 2019
alfpark added the defect label on Jan 15, 2019
alfpark closed this as completed on Feb 28, 2019