hxl2pandas: Pandas DataFrame #4

Open
fititnt opened this issue Jan 26, 2021 · 0 comments
Labels
data-transformation https://en.wikipedia.org/wiki/Data_transformation


fititnt commented Jan 26, 2021

hxl +public  
meta +status working-draft
meta +discussion+public  
meta +id EticaAI-Data_HXL-Data-Science-file-formats_Pandas
meta +hxlproxy +url https://proxy.hxlstandard.org/data?dest=data_view&url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1vFkBSharAEg5g5K2u_iDLCBvpWWPqpzC1hcL6QpFNZY%2Fedit%23gid%3D723336363
meta +specification +url https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dtypes
meta +seealso +url https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
meta +description +i_eng Important point: both hxl2pandas and the EticaAI-Data_HXL-Data-Science-file-formats_Pandas reference table are mostly a reference for how pandas (more specifically, DataFrame) could be used as an intermediate format to export HXL to other formats already supported by pandas. While the reference table may still be useful for those doing manual conversion, or to help understand how different tools used for data mining / machine learning would use HXL attributes, hxl2pandas may not be implemented at all. Also, some of the intermediate formats may be converted using other libraries.

At this moment I'm not 100% sure that using pandas just because it can export to several formats is a good approach.
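For context, a minimal sketch of what "pandas as an intermediate format" could look like. Everything here is an assumption for illustration: the sample data is made up, the layout (one human-readable header row followed by one hashtag row) is the common HXL convention but not the only one, and `hxl_csv_to_dataframe` is a hypothetical helper, not an existing hxl2pandas function.

```python
import io
import json

import pandas as pd

# Hypothetical HXLated CSV: row 1 is human-readable headers, row 2 is HXL hashtags.
HXL_CSV = """Province,Affected people,Date
#adm1+name,#affected+num,#date
Coast,1200,2021-01-20
Plains,850,2021-01-21
"""


def hxl_csv_to_dataframe(text: str) -> pd.DataFrame:
    """Load an HXLated CSV, using the hashtag row as column names (assumed layout)."""
    # Read only the hashtag row (second line of the file).
    hashtags = (
        pd.read_csv(io.StringIO(text), header=None, skiprows=1, nrows=1)
        .iloc[0]
        .tolist()
    )
    # Read the data rows, skipping the header row and the hashtag row.
    return pd.read_csv(io.StringIO(text), header=None, skiprows=2, names=hashtags)


df = hxl_csv_to_dataframe(HXL_CSV)

# Once the data is in a DataFrame, the exporters pandas already ships can be reused.
json_text = df.to_json(orient="records")
csv_text = df.to_csv(index=False)
records = json.loads(json_text)
```

Note that the whole file ends up in memory before anything is exported.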

First, there is a problem with overhead (though this alone is not the main reason). On the other hand, if the underlying libraries could eventually store some additional metadata (for example, enough to reconstruct the source hashtags), that would be very nice to have.
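Until a target format can carry that metadata itself, one possible workaround is a sidecar mapping. The sketch below is only an illustration under my own assumptions: the `__` naming scheme and the helpers `to_safe_name` and `build_sidecar` are hypothetical, and the scheme assumes the resulting safe names do not collide.

```python
import json


def to_safe_name(hashtag: str) -> str:
    """Turn '#affected+num' into 'affected__num' (hypothetical naming scheme)."""
    return hashtag.lstrip("#").replace("+", "__")


def build_sidecar(hashtags):
    """Map each exported column name back to its original HXL hashtag."""
    return {to_safe_name(tag): tag for tag in hashtags}


hashtags = ["#adm1+name", "#affected+num", "#date"]

# The sidecar would be saved next to the exported file, e.g. as a small JSON document.
sidecar_json = json.dumps(build_sidecar(hashtags))
exported_columns = [to_safe_name(tag) for tag in hashtags]

# Later, the source hashtags can be reconstructed from the exported column names.
mapping = json.loads(sidecar_json)
restored = [mapping[name] for name in exported_columns]
```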

The overhead starts to become a problem if it is 100% certain that the DataFrame loads everything into memory (even if it is just a numerical representation of strings) before saving to the target formats. While this is still more efficient than, say, loading an entire Excel file or CSV, I think that if someone is using this to convert a huge CSV, it would be acceptable for the process to be slower: first save to a local file in /tmp, then convert the HXLated CSV using the header as additional instructions for whatever the new format is, with the most efficient loader possible.
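A rough sketch of that row-by-row alternative, using only the Python standard library. The assumptions are mine: the input has one human-readable header row followed by the hashtag row, the `+num` attribute is read as "convert to a number", JSON Lines is just a stand-in target format, and `stream_hxl_csv_to_jsonl` is a hypothetical name.

```python
import csv
import io
import json


def stream_hxl_csv_to_jsonl(source, sink) -> int:
    """Convert an HXLated CSV to JSON Lines one row at a time.

    Only the current row is held in memory. Returns the number of data rows written.
    """
    reader = csv.reader(source)
    next(reader)             # skip the human-readable header row (assumed present)
    hashtags = next(reader)  # the hashtag row acts as conversion instructions
    numeric = ["num" in tag.split("+")[1:] for tag in hashtags]

    count = 0
    for row in reader:
        record = {}
        for tag, is_num, value in zip(hashtags, numeric, row):
            # Columns tagged +num become numbers; empty cells stay as empty strings.
            record[tag] = float(value) if is_num and value != "" else value
        sink.write(json.dumps(record) + "\n")
        count += 1
    return count


SAMPLE = (
    "Province,Affected people,Date\n"
    "#adm1+name,#affected+num,#date\n"
    "Coast,1200,2021-01-20\n"
    "Plains,,2021-01-21\n"
)

sink = io.StringIO()
rows_written = stream_hxl_csv_to_jsonl(io.StringIO(SAMPLE), sink)
lines = sink.getvalue().splitlines()
```

In real use, `source` and `sink` would be file objects opened on the local copy (for example under /tmp) and on the output file, so memory use stays flat regardless of the CSV size.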

Anyway, if I have to focus, strategies that generate file formats for tools with friendly interfaces (like Orange and Weka; neither may require any command line at all) seem more of a win-win than formats where the end user could simply consume CSVs directly. But these advanced cases can still serve as a reference on how to choose the attributes, rather than considering only two applications (Orange and Weka).
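To make the Weka case a bit more concrete, here is a small sketch that writes an ARFF file from HXL hashtags. The mapping is an assumption, not a settled convention: `+num` becomes `NUMERIC` and everything else (including dates) becomes `STRING`; missing values are not handled; and `hashtag_to_arff_attribute` and `build_arff` are hypothetical helpers.

```python
def is_numeric(hashtag: str) -> bool:
    """Treat a hashtag carrying the +num attribute as numeric (assumed rule)."""
    return "num" in hashtag.split("+")[1:]


def hashtag_to_arff_attribute(hashtag: str) -> str:
    """Build one ARFF attribute declaration, e.g. '@ATTRIBUTE affected_num NUMERIC'."""
    name = "_".join(hashtag.lstrip("#").split("+"))
    arff_type = "NUMERIC" if is_numeric(hashtag) else "STRING"
    return f"@ATTRIBUTE {name} {arff_type}"


def build_arff(relation: str, hashtags, rows) -> str:
    """Render a minimal ARFF document from hashtags and data rows."""
    lines = [f"@RELATION {relation}", ""]
    lines += [hashtag_to_arff_attribute(tag) for tag in hashtags]
    lines += ["", "@DATA"]
    for row in rows:
        cells = []
        for tag, value in zip(hashtags, row):
            if is_numeric(tag):
                cells.append(str(value))
            else:
                # Quote string values and escape backslashes and single quotes.
                escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
                cells.append(f"'{escaped}'")
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


arff_text = build_arff(
    "hxl_example",
    ["#adm1+name", "#affected+num", "#date"],
    [["Coast", 1200, "2021-01-20"], ["Plains", 850, "2021-01-21"]],
)
```

The same hashtag-to-type decision would have to be made for Orange's header rows, which is where a shared reference table for the attributes could help.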
