How to perform deduplication in a cluster environment? #759
When setting up the cluster environment, I want to run a deduplication task on a large dataset (1 TB, stored locally), but how should I load the data? Should I put all the data on the supervisor node and load it from there? Or should I divide the data equally across the nodes and then run `xorbits.init(address='http://supervisor_ip:web_port')` on the supervisor node to load all the nodes' data for deduplication? Please advise, thank you~

Comments
Hi @Dzg0309. You should upload your data to a filesystem like S3 and use xorbits interfaces such as `read_csv` or `read_parquet` (depending on your data format) to read it, then use the deduplication operation to complete what you want.
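A minimal sketch of that flow, assuming the data is stored as parquet under a hypothetical bucket path `s3://my-bucket/data/`, that S3 credentials are available on every worker, and that the cluster address below is a placeholder:

```python
import xorbits
import xorbits.pandas as pd
from xorbits.experimental import dedup

# Connect to the running cluster through the supervisor's web address
# (hypothetical address).
xorbits.init(address='http://supervisor_ip:web_port')

# Each worker reads its own shards straight from S3, so no single node
# needs to hold the full dataset locally.
df = pd.read_parquet('s3://my-bucket/data/')

# MinHash-based near-duplicate removal on the 'content' column.
res = dedup(df, col='content', method='minhash', threshold=0.7, num_perm=128)
res.to_parquet('s3://my-bucket/deduped/')

xorbits.shutdown()
```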
Is it necessary to use the S3 file system? Can you use the Hadoop file system instead, or read the data files directly from local disk? I tried putting all the data under the supervisor, and then, when starting the cluster, the other workers reported file-not-found errors. When the data is split across all the nodes (with each node's data directory kept consistent), there is still a problem that a file does not exist during the run, which confuses me. Below is my code:

```python
import xorbits
import xorbits.pandas as pd
from xorbits.experimental import dedup
import xorbits.datasets as xdatasets

xorbits.init(address='http://xx.xx.xxx.xx:xxxx',
             session_id='xorbits_dedup_test_09')

ds = xdatasets.from_huggingface("/passdata/xorbits_data/mnbvc_test",
                                split='train', cache_dir='/passdata/.cache')
df = ds.to_dataframe()
res = dedup(df, col="content", method="minhash", threshold=0.7,
            num_perm=128, min_length=5, ngrams=5, seed=42)  # for 'minhash' method
res.to_parquet('/passdata/xorbits_data/output')

xorbits.shutdown()
```
If your data is in csv or parquet format, you can try the `read_csv` / `read_parquet` interfaces.
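As a sketch of that approach, assuming a hypothetical directory `/shared/data` of csv files that is mounted at the same path on every node (e.g. over NFS), and that wildcard paths are accepted the way the xorbits `read_csv` docs describe:

```python
import xorbits
import xorbits.pandas as pd
from xorbits.experimental import dedup

xorbits.init(address='http://supervisor_ip:web_port')  # hypothetical address

# With a plain local path, each worker opens the files on its own disk,
# so the files must be visible at the same location on every node;
# otherwise workers will fail with file-not-found errors.
df = pd.read_csv('/shared/data/*.csv')

res = dedup(df, col='content', method='minhash', threshold=0.7, num_perm=128)
res.to_parquet('/shared/output')

xorbits.shutdown()
```

This would also explain the file-not-found errors above: a local path that exists only on the supervisor is invisible to the workers that are assigned chunks of the read.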
OK, I tried copying the data to each node and it worked, but at the same time two other problems occurred:

(two error screenshots attached; contents not captured here)