ReaderError with ASCII-8BIT encoding. #142

Closed
jcoyne opened this issue Dec 16, 2013 · 2 comments

jcoyne commented Dec 16, 2013

This code works:

data = "<info:fedora/scholarsphere:qv33rx50r> <http://purl.org/dc/terms/description> \"\\n\xE2\x80\x99 \" .\n"

repository = RDF::Repository.new
RDF::Reader.for(:ntriples).new(data) do |reader|
  reader.each_statement do |statement|
    repository << statement
  end
end

repository.dump(:ntriples)

This causes an error:

data = "<info:fedora/scholarsphere:qv33rx50r> <http://purl.org/dc/terms/description> \"\\n\xE2\x80\x99 \" .\n".force_encoding('ASCII-8BIT')

repository = RDF::Repository.new
RDF::Reader.for(:ntriples).new(data) do |reader|
  reader.each_statement do |statement|
    repository << statement
  end
end

repository.dump(:ntriples)
RDF::ReaderError: expected object in line 1: "\"\n’ \" ."
    from /Users/justin/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/rdf-1.1.0.1/lib/rdf/reader.rb:419:in `fail_object'
    from /Users/justin/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/rdf-1.1.0.1/lib/rdf/ntriples/reader.rb:212:in `block in read_triple'
    from /Users/justin/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/rdf-1.1.0.1/lib/rdf/ntriples/reader.rb:204:in `loop'
    from /Users/justin/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/rdf-1.1.0.1/lib/rdf/ntriples/reader.rb:204:in `read_triple'
    from /Users/justin/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/rdf-1.1.0.1/lib/rdf/reader.rb:382:in `read_statement'
    from /Users/justin/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/rdf-1.1.0.1/lib/rdf/reader.rb:299:in `block in each_statement'
    from /Users/justin/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/rdf-1.1.0.1/lib/rdf/reader.rb:299:in `loop'
    from /Users/justin/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/rdf-1.1.0.1/lib/rdf/reader.rb:299:in `each_statement'
    from (irb):224:in `block in irb_binding'
    from /Users/justin/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/rdf-1.1.0.1/lib/rdf/reader.rb:202:in `call'
    from /Users/justin/.rbenv/versions/2.0.0-p353/lib/ruby/gems/2.0.0/gems/rdf-1.1.0.1/lib/rdf/reader.rb:202:in `initialize'
    from (irb):223:in `new'
    from (irb):223

gkellogg commented

This can be simplified to the following:

data = "<info:fedora/scholarsphere:qv33rx50r> <http://purl.org/dc/terms/description> \"they\\r\\ncapture design\xE2\x80\x99s \" .\n".force_encoding('ASCII-8BIT')
repo = RDF::Repository.new << RDF::NTriples::Reader.new(data)
repo.dump(:ntriples)

The problem is in RDF::NTriples::Reader#read_literal, which fails when it tries to match on LITERAL_PLAIN. There must be some other place where the input is not transformed to UTF-8, which is certainly annoying.

jcoyne added a commit to samvera/active_fedora that referenced this issue Dec 16, 2013
The RDF reader chokes on certain character combinations if the raw data
is not encoded UTF-8.  See ruby-rdf/rdf#142
Fedora does not store character encoding, so by default they come back as
ASCII-8BIT.

gkellogg commented

The problem seems to be that the object contains bytes that have no defined conversion from ASCII-8BIT. #force_encoding masks this because it only relabels the string, but if you use #encode instead, you get an undefined conversion error. What seems to be tripping things up is transcoding those bytes from one encoding to another.
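
The difference between the two methods can be seen with plain Ruby, without RDF.rb. This is only a sketch using the three bytes from the literal in the report; the variable names are illustrative.

```ruby
# The three bytes of U+2019 (right single quotation mark), labelled as binary.
bytes = "\xE2\x80\x99".b
bytes.encoding # ASCII-8BIT

# #encode tries to transcode each byte. Bytes above 0x7F have no defined
# mapping from ASCII-8BIT, so this raises.
encode_error = nil
begin
  bytes.encode('UTF-8')
rescue Encoding::UndefinedConversionError => e
  encode_error = e
end

# #force_encoding leaves the bytes untouched and only changes the label.
relabelled = bytes.dup.force_encoding('UTF-8')
relabelled.valid_encoding? # true, the bytes happen to be well-formed UTF-8
```

Since #force_encoding never inspects the bytes, #valid_encoding? is the check that tells you whether the new label is actually correct.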

As you note in your patch, always giving RDF.rb UTF-8 is probably the right thing. In any case, RDF Literals only cover Unicode (see RDF Concepts), so giving them something else is sort of nonsensical.
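
One way to always hand RDF.rb UTF-8 is to relabel the raw bytes before they reach the reader. The following is a minimal sketch, not the actual ActiveFedora patch: `ensure_utf8` is a hypothetical helper name, and it assumes the stored bytes really are UTF-8.

```ruby
# Hypothetical helper (not part of RDF.rb or ActiveFedora): relabel raw bytes
# as UTF-8 without changing them, and reject input that is not valid UTF-8.
def ensure_utf8(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, 'input is not valid UTF-8' unless text.valid_encoding?
  text
end

# Same N-Triples line as in the report, labelled ASCII-8BIT as it would be
# when no character encoding is stored with the data.
raw = "<info:fedora/scholarsphere:qv33rx50r> <http://purl.org/dc/terms/description> \"\\n\xE2\x80\x99 \" .\n".b

data = ensure_utf8(raw)
# data is now tagged UTF-8 and can be passed to the reader as in the working
# example, e.g. RDF::Reader.for(:ntriples).new(data) { |reader| ... }
```

Raising on invalid input is a design choice for this sketch; a caller that prefers to replace bad bytes instead could use String#scrub.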

jcoyne added a commit to samvera/active_fedora that referenced this issue Dec 17, 2013
barmintor pushed a commit to barmintor/active_fedora that referenced this issue Apr 8, 2014
cbeer pushed a commit to samvera-labs/active_fedora-datastreams that referenced this issue Aug 18, 2016