Unable to load sharded data from files without filetype extensions #573

Open
mhlr opened this issue Sep 13, 2017 · 8 comments · May be fixed by #574

Comments

@mhlr

mhlr commented Sep 13, 2017

I am trying to load data from a directory of jsonlines-formatted files that lack the .json extension.

I have tried:

data('/path/to/dir/')
data('/path/to/dir/*')
data(JSONLines('/path/to/dir/'))
data(JSONLines('/path/to/dir/*'))
data(Directory(JSONLines)('/path/to/dir/'))
data(Directory(JSONLines)('/path/to/dir/*'))

all of which throw either Unable to parse uri to data resource or No such file or directory.

I am able to parse a single file with:

data(JSONLines('/path/to/dir/file1'))

Is this a bug / unimplemented functionality or am I doing something wrong?

@llllllllll
Member

When using just data, blaze delegates to odo.resource, which uses a sequence of regular expressions to resolve the uri to a type. If there is no extension, you will need to manually construct the box type (for example JSONLines) so odo and blaze know what the uri is.
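
To make the regex-based resolution concrete, here is a toy sketch of the idea. It is not odo's actual code: the names `register`, `resolve`, `csv_resource` and `json_resource` are hypothetical, and the real registry has many more patterns. It only illustrates why a path with no extension falls through to the "Unable to parse uri" error.

```python
import re

# Toy registry of (priority, compiled pattern, factory) entries.
# Hypothetical stand-in for odo.resource; not the library's real code.
_registry = []


def register(pattern, priority=10):
    """Register a factory for every uri that fully matches `pattern`."""
    def decorator(func):
        _registry.append((priority, re.compile(pattern), func))
        return func
    return decorator


@register(r".*\.csv")
def csv_resource(uri):
    return ("CSV", uri)


@register(r".*\.json")
def json_resource(uri):
    return ("JSON", uri)


def resolve(uri):
    """Try patterns from highest to lowest priority; the first full match wins."""
    for _, pattern, func in sorted(_registry, key=lambda entry: -entry[0]):
        if pattern.fullmatch(uri):
            return func(uri)
    # No pattern matched, e.g. a file name without any extension.
    raise NotImplementedError("Unable to parse uri to data resource: " + uri)
```

With this sketch, `resolve('/path/to/dir/file1.json')` is routed to `json_resource`, while `resolve('/path/to/dir/file1')` raises `NotImplementedError` because no pattern matches.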

My intuition is that data(Directory(JSONLines)('/path/to/dir/')) is the correct call. Does that produce No such file or directory? If so, can you confirm that the path actually exists? Also, maybe try removing the trailing slash. If removing the trailing slash fixes the problem, that is certainly a bug.

@mhlr
Author

mhlr commented Sep 13, 2017

@llllllllll

I have the files

/home/dm/wikipedia/AA/wiki_00
...
/home/dm/wikipedia/AA/wiki_99

When I run

d = data(Directory(JSON)('/home/dm/wikipedia/AA/'))

I get

NotImplementedError                       Traceback (most recent call last)
<ipython-input-1-a1471c617821> in <module>()
----> 1 import codecs, os;__pyfile = codecs.open('''/tmp/py7956cPk''', encoding='''utf-8''');__code = __pyfile.read().encode('''utf-8''');__pyfile.close();os.remove('''/tmp/py7956cPk''');exec(compile(__code, '''/home/dm/Scripts/vndf.py''', 'exec'));

/home/dm/Scripts/vndf.py in <module>()
     14 d = data(Directory(JSONLines)('/home/dm/wikipedia/AA/wiki_01'))
     15 
---> 16 #d = data(Directory(JSONLines)('/home/dm/wikipedia/AA/'))
     17 
     18 #d = data(Directory(JSONLines)('/home/dm/wikipedia/AA/.*'))

/home/dm/anaconda3/lib/python3.6/site-packages/blaze/interactive.py in data(data_source, dshape, name, fields, schema, **kwargs)
    151         dshape = datashape.dshape(dshape)
    152     if not dshape:
--> 153         dshape = discover(data_source)
    154         types = None
    155         if isinstance(dshape.measure, Tuple) and fields:

/home/dm/anaconda3/lib/python3.6/site-packages/multipledispatch/dispatcher.py in __call__(self, *args, **kwargs)
    162             self._cache[types] = func
    163         try:
--> 164             return func(*args, **kwargs)
    165 
    166         except MDNotImplementedError:

/home/dm/anaconda3/lib/python3.6/site-packages/odo/directory.py in discover_Directory(c, **kwargs)
     48 @discover.register(_Directory)
     49 def discover_Directory(c, **kwargs):
---> 50     return var * discover(first(c)).subshape[0]
     51 
     52 

/home/dm/anaconda3/lib/python3.6/site-packages/toolz/itertoolz.py in first(seq)
    366     'A'
    367     """
--> 368     return next(iter(seq))
    369 
    370 

/home/dm/anaconda3/lib/python3.6/site-packages/odo/directory.py in <genexpr>(.0)
     32     def __iter__(self):
     33         return (resource(os.path.join(self.path, fn), **self.kwargs)
---> 34                     for fn in sorted(os.listdir(self.path)))
     35 
     36 

/home/dm/anaconda3/lib/python3.6/site-packages/odo/regex.py in __call__(self, s, *args, **kwargs)
     89 
     90     def __call__(self, s, *args, **kwargs):
---> 91         return self.dispatch(s)(s, *args, **kwargs)
     92 
     93     @property

/home/dm/anaconda3/lib/python3.6/site-packages/odo/resource.py in resource_all(uri, *args, **kwargs)
     98     discover
     99     """
--> 100     raise NotImplementedError("Unable to parse uri to data resource: " + uri)
    101 
    102 

NotImplementedError: Unable to parse uri to data resource: /home/dm/wikipedia/AA/wiki_00

Note that the error message contains the name of a specific file, so blaze is seeing the directory and the files therein. It is just getting confused somehow.

@mhlr
Author

mhlr commented Sep 13, 2017

@llllllllll

Leaving off the final '/' makes no difference.

@llllllllll
Member

Ah, it looks like Directory isn't respecting that you have told it the type of the resource already:

/home/dm/anaconda3/lib/python3.6/site-packages/odo/directory.py in <genexpr>(.0)
     32     def __iter__(self):
     33         return (resource(os.path.join(self.path, fn), **self.kwargs)
---> 34                     for fn in sorted(os.listdir(self.path)))
     35 
     36 

This is the frame that is incorrect.
The basic idea is that we need to treat "bound" Directory subclasses differently in __iter__. It is pretty late here, but I should be able to fix this tomorrow.
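
A rough sketch of what treating a bound subclass differently could look like, using toy classes only. This is not the actual odo fix: `JSONLines`, `guess_resource`, `bind`, and the `container` attribute here are hypothetical stand-ins written for illustration.

```python
import os


class JSONLines:
    """Hypothetical stand-in for a file-type wrapper such as JSONLines."""

    def __init__(self, path, **kwargs):
        self.path = path


def guess_resource(path, **kwargs):
    """Stand-in for extension-based resolution; fails without an extension."""
    if path.endswith(".json"):
        return JSONLines(path, **kwargs)
    raise NotImplementedError("Unable to parse uri to data resource: " + path)


class Directory:
    """Toy directory container; `container` stays None when unbound."""

    container = None

    def __init__(self, path, **kwargs):
        self.path = path
        self.kwargs = kwargs

    def __iter__(self):
        # A bound subclass uses the type it was given; only an unbound
        # Directory falls back to guessing from the file name.
        make = self.container if self.container is not None else guess_resource
        return (make(os.path.join(self.path, fn), **self.kwargs)
                for fn in sorted(os.listdir(self.path)))


def bind(cls):
    """Return a Directory subclass bound to the file type `cls`."""
    return type("Directory(%s)" % cls.__name__, (Directory,), {"container": cls})
```

In this sketch, iterating `bind(JSONLines)('/home/dm/wikipedia/AA')` wraps each file in `JSONLines` directly, whereas iterating a plain `Directory` over the same path still raises on the first extensionless file.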

@mhlr
Author

mhlr commented Sep 13, 2017

@llllllllll Thanks

@mhlr
Author

mhlr commented Sep 13, 2017

What would be the way to supply type information when using a file pattern rather than the whole directory, e.g.:

data('/home/dm/wikipedia/AA/wiki_0*')

@llllllllll
Member

The call I showed before is the correct way to do it; it is just broken. I'm not sure there is a simple workaround other than adding an extension. This should be a small fix, though.
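
If renaming the originals is not an option, one possible way to "add an extension" is to symlink the files into a scratch directory under new names. This is only a sketch: the helper `link_with_extension` is hypothetical, it assumes a system where symlinks are permitted, and whether the resulting `.json` names get resolved to JSONLines by your odo version is an assumption to verify.

```python
import glob
import os
import tempfile


def link_with_extension(pattern, ext=".json", target_dir=None):
    """Symlink every file matching `pattern` into `target_dir`, appending `ext`.

    Returns the directory that holds the links. The originals are untouched.
    """
    target_dir = target_dir or tempfile.mkdtemp()
    for src in sorted(glob.glob(pattern)):
        if os.path.isfile(src):
            dst = os.path.join(target_dir, os.path.basename(src) + ext)
            os.symlink(os.path.abspath(src), dst)
    return target_dir
```

For example, `link_with_extension('/home/dm/wikipedia/AA/wiki_0*')` returns a directory of `wiki_0*.json` links, which could then be tried with data in place of the original pattern.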

I am about to go to sleep but I'll fix this tomorrow.

@mhlr
Author

mhlr commented Sep 14, 2017

@llllllllll Cool, thanks! I think this is not JSON-specific, though.
I have tried things like Directory(TextFile) and also Directory(Directory(JSONLines)) pointed at the parent directory, and both exhibit the same problem. That makes me think that it is primarily a Directory problem.
In the second case the error was about the inner directory; it did not reach through to the file before failing.
I wonder whether a similar problem also affects some of the other modifiers like S3, SSH and HDFS.
