Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV/TSV parsing issues #3

Open
warpech opened this issue Apr 9, 2013 · 5 comments
Open

CSV/TSV parsing issues #3

warpech opened this issue Apr 9, 2013 · 5 comments

Comments

@warpech
Copy link
Owner

warpech commented Apr 9, 2013

There are some issues when parsing strings copied from Excel that contain double quotes.

For example, when you write "A in one cell and "B in the next one, the text copied by Excel is "A "B. When you copy that to Notepad and then back to Excel, even Excel does not know what to do!

I have tried other JavaScript CSV parsing engines and they are also confused:

I haven't tried this one because it seems NodeJS only:

I am writing this issue because I want to get back to this at some point. Anyone cares to help? Pull request #2 was a good start but more work is requied apparently.

@redchair123
Copy link

So I did some digging with nodobjc on osx (https://npmjs.org/package/NodObjC). Basically, there are multiple types of clipboard data (and excel and friends add to multiple clipboard types).

On OSX at least, you can get/set the content for various types. You could theoretically load an image and the corresponding ALT text so that pasting on the command line gives you the text but pasting in photoshop gives you the image.

Sample sheet is https://www.dropbox.com/s/khrr6l0g7s3dho4/CopyPaste.xlsx and I tried copying all 8 cells

$ cat pboard.njs
var $ = require('nodobjc')
$.framework('AppKit')
var x = $.NSPasteboard('generalPasteboard')
console.log(String(x('stringForType',$.NSStringPboardType)).replace(/[\u000D]/g,"\n"))
console.log(x('stringForType',$.NSHTMLPboardType))
$ node pboard.njs

The TSV part looked fine (I replaced the windows line ending with a normal \n):

1581:36:00  [h]:mm:ss
94896:00    [mm]:ss
5693760 [ss]
09 PM 9 hh AM/PM h

Then there's the HTML monstrosity.

In this case, I was testing with various number formats and those are explicitly plugged in the style section:

.xl66
    {font-weight:700;
    mso-number-format:"hh\\ AM\/PM\\ h";}

I have a very basic component to handle those number formats (handles date formats but is missing some stuff), so those formats can be parsed explicitly. (https://github.com/SheetJS/ssf)

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
<meta name=ProgId content=Excel.Sheet>
<meta name=Generator content="Microsoft Excel 14">
<link id=Main-File rel=Main-File
href="file://localhost/Users/SheetJS/Library/Caches/TemporaryItems/msoclip/0/clip.htm">
<link rel=File-List
href="file://localhost/Users/SheetJS/Library/Caches/TemporaryItems/msoclip/0/clip_filelist.xml">
<style>
<!--table
    {mso-displayed-decimal-separator:"\.";
    mso-displayed-thousand-separator:"\,";}
@page
    {margin:1.0in .75in 1.0in .75in;
    mso-header-margin:.5in;
    mso-footer-margin:.5in;}
td
    {padding-top:1px;
    padding-right:1px;
    padding-left:1px;
    mso-ignore:padding;
    color:black;
    font-size:12.0pt;
    font-weight:400;
    font-style:normal;
    text-decoration:none;
    font-family:Calibri, sans-serif;
    mso-font-charset:0;
    mso-number-format:General;
    text-align:general;
    vertical-align:bottom;
    border:none;
    mso-background-source:auto;
    mso-pattern:auto;
    mso-protection:locked visible;
    white-space:nowrap;
    mso-rotate:0;}
.xl63
    {font-weight:700;
    mso-number-format:"\[h\]\:mm\:ss";}
.xl64
    {font-weight:700;
    mso-number-format:"\[mm\]\:ss";}
.xl65
    {font-weight:700;
    mso-number-format:"\[ss\]";}
.xl66
    {font-weight:700;
    mso-number-format:"hh\\ AM\/PM\\ h";}
-->
</style>
</head>

<body link=blue vlink=purple>

<table border=0 cellpadding=0 cellspacing=0 width=130 style='border-collapse:
 collapse;width:130pt'>
<!--StartFragment-->
 <col width=65 span=2 style='width:65pt'>
 <tr height=15 style='height:15.0pt'>
  <td height=15 class=xl63 align=right width=65 style='height:15.0pt;
  width:65pt'>1581:36:00</td>
  <td width=65 style='width:65pt'>[h]:mm:ss</td>
 </tr>
 <tr height=15 style='height:15.0pt'>
  <td height=15 class=xl64 align=right style='height:15.0pt'>94896:00</td>
  <td>[mm]:ss</td>
 </tr>
 <tr height=15 style='height:15.0pt'>
  <td height=15 class=xl65 align=right style='height:15.0pt'>5693760</td>
  <td>[ss]</td>
 </tr>
 <tr height=15 style='height:15.0pt'>
  <td height=15 class=xl66 align=right style='height:15.0pt'>09 PM 9</td>
  <td>hh AM/PM h</td>
 </tr>
<!--EndFragment-->
</table>

</body>

</html>

@redchair123
Copy link

I put up the code I used to grab that content in a separate repo: https://github.com/SheetJS/pb

@warpech
Copy link
Owner Author

warpech commented Apr 9, 2013

Exactly, also on Windows Excel knows more about the copied cells than there is transported over TSV in plain text clipboard. Unfortunately, in web browser I only have access to the text clipboard, AFAIK...

@redchair123
Copy link

So on chrome, you can look at event.clipboardData:

http://jsbin.com/uwuvan/1

Copy from a spreadsheet and paste into the box. Even though the content pasted is in text form, you can see both the text and the html version in the log

@redchair123
Copy link

This appears to be a far simpler exercise: just generate/parse the HTML output.

http://jsbin.com/uwuvan/5 shows you the HTML clipboard data in an iframe

As for loading data, the following seems to work:

$ echo "<table><tr><td>1</td><td>2</td></tr></table>" | pb -s html

(this populates values 1,2 in a row).

The right approach would be

  1. Add the logic that parses and generates HTML in this repo (make them separate, like htmlparse/htmlstringify)

  2. Add the logic to capture/set the html clipboard in the grid component.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants