[File] recognize encoding for remote resource #37

Open
2 tasks
AcckiyGerman opened this issue Feb 27, 2018 · 4 comments

@AcckiyGerman
Contributor

AcckiyGerman commented Feb 27, 2018

Original: datopian/datahub-qa#105

File = require('data.js').File
// loading ISO8859 resource:
> file = File.load('https://raw.githubusercontent.com/frictionlessdata/test-data/master/files/csv/encodings/iso8859.csv')
> file.encoding
'utf-8'

Acceptance criteria

  • File.load('https://raw.githubusercontent.com/frictionlessdata/test-data/master/files/csv/encodings/iso8859.csv').encoding == 'ISO-8859-1'
  • File.load('https://raw.githubusercontent.com/frictionlessdata/test-data/master/files/csv/encodings/western-macos-roman.csv').encoding == <macOS-roman-or-so>

Tasks

  • add test
  • implement encoding detection

Analysis

We need to change this method:

class FileRemote extends File {
  ...
  get encoding() {
    return DEFAULT_ENCODING
  }
}

Analysis update

The encoding getter should:

  • connect to the remote resource
  • fetch a small portion of the raw data
  • try to detect the encoding from that portion

I tried to implement this scheme with chardet.detectFileSync(), but it only works with local files: any argument is treated as a file name.

Possible solutions:

  • save part of the remote resource to a local temp file, then call chardet.detectFileSync(temp)
  • use some other library that can detect the encoding from a remote stream
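
A cheap first step that needs neither a temp file nor an extra library is to check whether the first chunk is valid UTF-8 before handing it to a sniffer. The following is only a minimal sketch, assuming Node's built-in TextDecoder (with ICU, as in default builds); looksLikeUtf8 is a hypothetical helper name, not part of data.js or chardet.

```javascript
// Hypothetical helper, not part of data.js or chardet.
// Returns true when the chunk decodes cleanly as UTF-8.
function looksLikeUtf8(chunk) {
  const decoder = new TextDecoder('utf-8', { fatal: true })
  try {
    // stream: true tolerates a multi-byte character cut off at the end of the chunk
    decoder.decode(chunk, { stream: true })
    return true
  } catch (err) {
    // invalid byte sequence: some other encoding, so a sniffer is still needed
    return false
  }
}

const utf8Chunk = Buffer.from('h\u00e9llo', 'utf8')
const latin1Chunk = Buffer.from('h\u00e9llo', 'latin1') // ISO-8859-1 bytes
console.log(looksLikeUtf8(utf8Chunk), looksLikeUtf8(latin1Chunk))
```

This only answers "is it UTF-8 or not". Pure ASCII data passes, and a chunk whose only non-ASCII byte is the very last one is not flagged, so it would complement a sniffer rather than replace it.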
@AcckiyGerman
Contributor Author

AcckiyGerman commented Mar 2, 2018

gist

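// Fragment from inside the FileRemote encoding getter, so `this` is the FileRemote instance.
const fs = require('fs')
const os = require('os')
const path = require('path')
const chardet = require('chardet')

const guessRemoteEncoding = async () => {
  let stream
  try {
    stream = await this.stream()
  } catch (err) {
    console.warn('Warning! Cannot reach remote file to guess encoding:\n', this.path)
    return
  }

  // Save the first chunk of the remote file to disk and let the 'chardet' lib guess its encoding:
  stream.once('data', chunk => {
    stream.destroy()  // one chunk is enough, so stop downloading
    const tmpFileName = path.join(os.tmpdir(), 'tmp' + Math.random())
    fs.writeFile(tmpFileName, chunk, err => {
      if (err) return console.warn('Warning! Cannot write temp file:', err.message)
      const encoding = chardet.detectFileSync(tmpFileName)
      fs.unlinkSync(tmpFileName)
      this._descriptor.encoding = encoding  // set FileRemote object encoding
      this._encoding = encoding  // cache this._encoding for future calls
    })
  })
}

if (!this._encoding) {
  guessRemoteEncoding()  // not awaited: the encoding is set asynchronously
}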

@zelima
Contributor

zelima commented Mar 8, 2018

@AcckiyGerman What is the status for this one?

@AcckiyGerman
Contributor Author

I haven't had time to work on this, but I still want to finish it.

@AcckiyGerman
Contributor Author

AcckiyGerman commented Mar 13, 2018

Uploaded a bigger western-macos-roman encoded file:
https://github.com/frictionlessdata/test-data/blob/master/files/csv/encodings/big-western-macos-roman.csv

Size: ~24 KB.

Results of trying to detect its encoding with different sniffer libs:

  • chardet (originally used in data.js): FAIL (windows-1252)
  • html-encoding-sniffer: FAIL (windows-1252)
  • Python version of chardet: FAIL (no result)

Conclusion:

macOS Western Roman encoding: not recognized

None of the libs I found detect the macOS Roman encoding correctly. I know detection is possible, because LibreOffice opens this file and recognizes it as macos-roman.
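
One likely reason the sniffers answer windows-1252: both encodings are single-byte, so nearly any byte sequence decodes without error in either, and only the resulting text differs. A small sketch of that ambiguity, assuming a Node build with full ICU so that TextDecoder accepts the 'macintosh' label; decodeAs is a hypothetical helper and the sample bytes are made up for illustration, not taken from the test file.

```javascript
// Hypothetical helper: decode the same bytes under a given encoding label.
function decodeAs(bytes, label) {
  return new TextDecoder(label).decode(bytes)
}

// "caf" followed by byte 0x8E, which is "é" in Mac Roman
const sample = Uint8Array.from([0x63, 0x61, 0x66, 0x8e])

const asMacRoman = decodeAs(sample, 'macintosh')
const asWindows1252 = decodeAs(sample, 'windows-1252')

// Both calls succeed; only the last character differs.
console.log(asMacRoman, asWindows1252)
```

Since neither decode fails, a detector has to judge which decoded text looks more plausible, which presumably is what LibreOffice does better here.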

iso8859 encoding - OK

  • chardet works OK even with small samples.

This issue is frozen for now, as we have other priorities.

@zelima zelima modified the milestones: Sprint - 26 Mar 2018, Backlog Mar 28, 2018