
source name changed if use more than one csv file #89

Closed
bartek5186 opened this issue Jul 18, 2021 · 9 comments · Fixed by #93
Labels
bug Something isn't working

Comments

@bartek5186

bartek5186 commented Jul 18, 2021

If I import multiple CSV files, the source name is automatically changed to "csv" (note: in the files, the source column is properly set to a custom value).

source, popularity, layer, id, lat, lon, name, postalcode, country, name_json
custom_name, 1, layer_name, .... etc

The source is named "csv" in the imported data.

The problem doesn't exist if I use a single CSV file.

(screenshot)
Example:
(screenshot)

@bartek5186 bartek5186 added the bug Something isn't working label Jul 18, 2021
@missinglink
Member

I am not able to reproduce this, using the example you posted in the linked issue:

[
  Document {
    name: { default: 'Grudziądz' },
    phrase: { default: 'Grudziądz' },
    parent: {},
    address_parts: { zip: '86-300' },
    center_point: { lon: 18.76077, lat: 53.468958 },
    category: [],
    addendum: {},
    source: 'bdp',
    layer: 'postalcode',
    source_id: '71ff447b-972b-4f7d-a8c1-e0c8c02a1a19',
    popularity: 100
  }
]

bdp:postalcode:71ff447b-972b-4f7d-a8c1-e0c8c02a1a19

It may have something to do with how your CSV is encoded. Can you please upload a copy of the file you're using (either the whole thing, or at least the header line and first data line)?

@missinglink
Member

agh sorry, you're saying this is only triggered when using multiple files... let me check that

@bartek5186
Author

bartek5186 commented Nov 30, 2021

agh sorry, you're saying this is only triggered when using multiple files... let me check that

I can send 2 files if it doesn't reproduce in the normal "multiple files" case.

@missinglink
Member

I'm not able to reproduce this, please send me the files you're using:

cat data/example2.csv
source,popularity,layer,id,lat,lon,name,postalcode,country,name_jso
bdp,100,postalcode,71ff447b-972b-4f7d-a8c1-e0c8c02a1a19,53.468958363988,18.760770296251,Grudziądz,86-300,PL,"[""86-300"", "" 86-301"", "" 86-302"", "" 86-303"", "" 86-304"", "" 86-305"", "" 86-306"", "" 86-307"", "" 86-308"", "" 86-309"", "" 86-310"", "" 86-311""]"
diff --git a/test/streams/recordStream.js b/test/streams/recordStream.js
index 1fdd378..3be6427 100644
--- a/test/streams/recordStream.js
+++ b/test/streams/recordStream.js
@@ -144,6 +144,45 @@ tape(
   }
 );
 
+const path = require('path');
+
+/**
+ * Test the full recordStream pipeline from input CSV to output pelias-model
+ *
+ * issue: https://github.com/pelias/csv-importer/issues/89
+ */
+tape(
+  'import multiple CSV files and test the results.',
+  function (test) {
+    const fixtures = [
+      path.resolve(`${__dirname}/../../data/example.csv`),
+      path.resolve(`${__dirname}/../../data/example2.csv`),
+    ]
+    const dataStream = recordStream.create(fixtures);
+    test.ok(dataStream.readable, 'Stream is readable.');
+
+    // read documents into array
+    const docs = []
+    const testStream = through.obj((doc, enc, next) => {
+      docs.push(doc);
+      next();
+    });
+
+    // test assertions
+    testStream.on('finish', () => {
+      test.equal(docs.length, 4, 'total of 4 valid documents');
+      test.ok(docs[0].getGid() === 'pelias:example_layer:1');
+      test.ok(docs[1].getGid() === 'pelias:address:2');
+      test.ok(docs[2].getGid() === 'pelias:with_custom_data:4');
+      test.ok(docs[3].getGid() === 'bdp:postalcode:71ff447b-972b-4f7d-a8c1-e0c8c02a1a19');
+      test.end();
+    });
+
+    // run
+    dataStream.pipe(testStream);
+  }
+);
+
 tape( 'getIdPrefix returns prefix based on directory structure', function( test ) {
   var filename = '/base/path/us/ca/san_francisco.csv';
   var basePath = '/base/path';

@bartek5186
Author

bartek5186 commented Nov 30, 2021

bdp_pl.csv
bdp_de.csv
bdp_uk.csv

My config is set up like this:
(screenshot)
and the API:
(screenshot)

@missinglink
Member

missinglink commented Nov 30, 2021

Hmmm.. okay, so I don't think this has to do with having multiple files.

I'm about to finish for the day, but it appears that the issue is with three leading bytes being present in the files you provided.

My suspicion is that these are a Byte Order Mark.

A BOM is not required for UTF-8 encoding, so I'm not sure why it's there.
We should be able to detect these and ignore them, although if your data is in UTF-16 then converting it should also remove the BOM.

head -n1 data/bdp_de.csv | xxd -b
00000000: 11101111 10111011 10111111 01110011 01101111 01110101  ...sou
00000006: 01110010 01100011 01100101 00101100 01110000 01101111  rce,po
0000000c: 01110000 01110101 01101100 01100001 01110010 01101001  pulari
00000012: 01110100 01111001 00101100 01101100 01100001 01111001  ty,lay
00000018: 01100101 01110010 00101100 01101001 01100100 00101100  er,id,
0000001e: 01101100 01100001 01110100 00101100 01101100 01101111  lat,lo
00000024: 01101110 00101100 01101110 01100001 01101101 01100101  n,name
0000002a: 00101100 01110000 01101111 01110011 01110100 01100001  ,posta
00000030: 01101100 01100011 01101111 01100100 01100101 00101100  lcode,
00000036: 01100011 01101111 01110101 01101110 01110100 01110010  countr
0000003c: 01111001 00101100 01101110 01100001 01101101 01100101  y,name
00000042: 01011111 01101010 01110011 01101111 01101110 00001101  _json.
00000048: 00001010  

head -n1 data/bdp_de.csv | xxd
00000000: efbb bf73 6f75 7263 652c 706f 7075 6c61  ...source,popula
00000010: 7269 7479 2c6c 6179 6572 2c69 642c 6c61  rity,layer,id,la
00000020: 742c 6c6f 6e2c 6e61 6d65 2c70 6f73 7461  t,lon,name,posta
00000030: 6c63 6f64 652c 636f 756e 7472 792c 6e61  lcode,country,na
00000040: 6d65 5f6a 736f 6e0d 0a                   me_json..
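As an aside, detecting and dropping those three bytes is straightforward. Here is a minimal sketch (a hypothetical helper, not the csv-importer's actual code) that strips a leading UTF-8 BOM from a buffer before handing it to the CSV parser:

```javascript
// Hypothetical helper: strip a leading UTF-8 BOM (0xEF 0xBB 0xBF) from a buffer.
// Only the very first bytes of a file can be a BOM, so one check suffices.
const UTF8_BOM = Buffer.from([0xEF, 0xBB, 0xBF]);

function stripBOM(buf) {
  if (buf.length >= 3 && buf.subarray(0, 3).equals(UTF8_BOM)) {
    return buf.subarray(3);
  }
  return buf;
}

// example: a header line prefixed with a BOM, as in the attached files
const raw = Buffer.concat([UTF8_BOM, Buffer.from('source,popularity,layer\n')]);
console.log(stripBOM(raw).toString('utf8')); // prints the clean header
```

Without this, the parser sees the first column name as `\uFEFFsource` rather than `source`, so the custom source value is never picked up and the importer falls back to its default.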

@missinglink
Member

From the wiki:

The UTF-8 representation of the BOM is the (hexadecimal) byte sequence 0xEF,0xBB,0xBF.

So yeah, that's exactly what it is: 11101111 10111011 10111111
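As a workaround until the importer handles this, the BOM can be stripped from the affected files before importing. A sketch assuming GNU sed (the `\xHH` escapes are a GNU extension; the `data/bdp_de.csv` path is from the files attached above):

```shell
# remove a leading UTF-8 BOM (0xEF 0xBB 0xBF) in place; GNU sed syntax
sed -i '1s/^\xEF\xBB\xBF//' data/bdp_de.csv

# verify: the file should now start with the 's' of 'source'
head -c 6 data/bdp_de.csv
```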

@bartek5186
Author

Yes, I have this on the server too:
(screenshot)

@missinglink
Member

proposed fix #93
