Replace Spark+Petastorm with Sqlite+SqlAlchemy #445
Conversation
if len(data_batch) * row_size >= batch_mem_size:
    yield Batch(pd.DataFrame(data_batch))
    data_batch = []
if data_batch:
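The snippet above accumulates rows until their estimated memory footprint crosses `batch_mem_size`, yields them as one batch, and flushes any trailing partial batch. A minimal self-contained sketch of that loop (with a plain `DataFrame` standing in for `Batch`, and `rows`/`read_batches` as hypothetical names):

```python
import pandas as pd

def read_batches(rows, row_size: int, batch_mem_size: int):
    """Sketch of the memory-bounded batching loop: accumulate rows until
    len(batch) * row_size reaches batch_mem_size, then yield a batch."""
    data_batch = []
    for row in rows:
        data_batch.append(row)
        if len(data_batch) * row_size >= batch_mem_size:
            yield pd.DataFrame(data_batch)
            data_batch = []
    if data_batch:  # flush the trailing partial batch
        yield pd.DataFrame(data_batch)

# 5 rows at 10 "bytes" each with a 30-byte budget -> batches of 3 and 2 rows
batches = list(read_batches([{"id": i} for i in range(5)],
                            row_size=10, batch_mem_size=30))
```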
We should add a pretty print function for a Batch as well
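A pretty-print hook could be as small as delegating to pandas. A minimal sketch, assuming `Batch` simply wraps a `DataFrame` (the attribute name `_frames` is hypothetical):

```python
import pandas as pd

class Batch:
    """Hypothetical minimal Batch wrapper around a pandas DataFrame."""

    def __init__(self, frames: pd.DataFrame):
        self._frames = frames

    def __str__(self) -> str:
        # Pretty-print as the underlying DataFrame's tabular form.
        return self._frames.to_string(index=False)

print(Batch(pd.DataFrame({"id": [1, 2], "label": ["cat", "dog"]})))
```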
for col in columns:
    if col.type == ColumnType.NDARRAY:
        dict_row[col.name] = self._serializer.serialize(dict_row[col.name])
    elif isinstance(dict_row[col.name], (np.generic,)):
What is the ColumnType for this case? It seems only TEXT, INTEGER, and FLOAT are left. In which case will the type be np.generic?
Ah, I see. np.generic is the base class for all NumPy scalar types, and tolist() is a general way to convert them to native Python types (https://stackoverflow.com/a/53067954). Maybe update the comment; I was initially confused by the tolist() call.
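The behavior under discussion can be demonstrated directly: despite its name, tolist() on a NumPy scalar returns a single native Python value, not a list.

```python
import numpy as np

# np.generic is the common base class of every NumPy scalar type,
# so one isinstance check covers int64, float32, bool_, etc.
val = np.int64(7)
assert isinstance(val, np.generic)

# tolist() on a 0-d value unwraps it to the native Python type.
py_val = val.tolist()
```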
        dict_row[col.name] = sql_row[idx]
    return dict_row

def create(self, table: DataFrameMetadata, **kwargs):
What happens if the table already exists? This raises an issue I did not notice before. I was using if_not_exists in the load executor. In the mat executor, we can use handle_if_not_exists because the catalog entry is created inside the mat executor. But that is not the case for the load operator: its catalog entry is created before the load executor runs. I think we should pick one approach for both of them, and update the mat and load designs and this create (e.g., document what happens when the table exists, remove **kwargs, and add a check for whether the table exists, or can we use read to do that?)
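One way to make the create path explicit is to check for the table up front with SQLAlchemy's inspector and let a flag decide between returning the existing table and raising. A minimal sketch against an in-memory SQLite engine; the column layout and the standalone `create` signature are illustrative, not the project's actual API:

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, inspect)

engine = create_engine("sqlite://")  # in-memory SQLite for illustration
metadata = MetaData()

def create(table_name: str, if_not_exists: bool = True) -> Table:
    """Create table_name; on conflict, return the existing table or raise."""
    if inspect(engine).has_table(table_name):
        if if_not_exists:
            return metadata.tables[table_name]
        raise RuntimeError(f"table {table_name} already exists")
    table = Table(table_name, metadata,
                  Column("id", Integer, primary_key=True),
                  Column("name", String))
    table.create(engine)
    return table

t1 = create("frames")
t2 = create("frames")  # existence check makes the second call a no-op
```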
Can you raise it as an issue? We can fix it later, as I want to get rid of PySpark ASAP.
No description provided.