
Should repartitioning be forcibly enabled when the number of nebula space partitions is greater than 1? #71

Closed
df1-df1 opened this issue Mar 7, 2022 · 3 comments · Fixed by #102
Labels
doc affected PR: improvements or additions to documentation

Comments


df1-df1 commented Mar 7, 2022

I found a problem that results in the generated SST file containing only the key without the TagID.

[screenshot: keys in the generated SST file]

Description: According to the structure of 3.0 vertex data:

[diagram: 3.0 vertex data structure]

If all goes well, when the Exchange program finishes, the SST file will contain data for both keys. The tag configuration used was:

    {
      name: tag-name-1
      type: {
        source: csv
        sink: sst
      }
      path: hdfs tag path 2

      fields: [csv-field-0, csv-field-1, csv-field-2]
      nebula.fields: [nebula-field-0, nebula-field-1, nebula-field-2]
      vertex: {
        field: csv-field-0
      }
      separator: ","
      header: true
      batch: 256
      partition: 32
      repartitionWithNebula: false
    }

However, if you follow the configuration file above, the generated SST files will contain only the key without the TagID.

Here's why: the SST writer is re-created whenever the partition of the incoming key changes, so rows that arrive later in the same task overwrite data already written for the same part.

https://github.com/DemocracyAndLiberty/nebula-exchange/blob/master/exchange-common/src/main/scala/com/vesoft/exchange/common/writer/FileBaseWriter.scala

        // A new writer is created every time the part of the incoming key differs
        // from the part of the previous key.
        if (part != currentPart) {
          if (writer != null) {
            writer.close()
            val localFile = s"$localPath/$currentPart-$taskID.sst"
            HDFSUtils.upload(localFile,
                             s"$remotePath/${currentPart}/$currentPart-$taskID.sst",
                             namenode)
            Files.delete(Paths.get(localFile))
          }
          currentPart = part
          val tmp = s"$localPath/$currentPart-$taskID.sst"
          // If the rows in this task are not grouped by part, the same part shows up
          // again later, and re-creating the writer on the same local path discards
          // the data already written for that part.
          writer = new NebulaSSTWriter(tmp)
          writer.prepare()
        }
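
As an aside, here is a minimal Spark/Scala sketch of the idea behind repartitionWithNebula (not the actual Exchange implementation): route every encoded (key, value) pair to the Spark partition that matches its NebulaGraph part and sort the keys inside it, so the writer above is created exactly once per part and nothing is overwritten. NebulaPartitioner, groupByNebulaPart, and the caller-supplied partOf function are illustrative names only.

    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.RDD

    // Routes each encoded key to the Spark partition that matches its Nebula part.
    // `partOf` is a caller-supplied function that extracts the part id from the key
    // (Nebula part ids start at 1, Spark partition indices at 0).
    class NebulaPartitioner(nebulaParts: Int, partOf: Array[Byte] => Int)
        extends Partitioner {
      override def numPartitions: Int = nebulaParts
      override def getPartition(key: Any): Int =
        partOf(key.asInstanceOf[Array[Byte]]) - 1
    }

    object SstRepartitionSketch {
      // Unsigned lexicographic byte order (like RocksDB's default comparator),
      // so the keys written into each SST file come out in ascending order.
      implicit val keyOrdering: Ordering[Array[Byte]] = new Ordering[Array[Byte]] {
        def compare(a: Array[Byte], b: Array[Byte]): Int = {
          var i = 0
          while (i < a.length && i < b.length) {
            val cmp = (a(i) & 0xff) - (b(i) & 0xff)
            if (cmp != 0) return cmp
            i += 1
          }
          a.length - b.length
        }
      }

      // One Spark partition per Nebula part, keys sorted within each partition,
      // so the writer only ever sees `part != currentPart` once per part.
      def groupByNebulaPart(kv: RDD[(Array[Byte], Array[Byte])],
                            nebulaParts: Int,
                            partOf: Array[Byte] => Int): RDD[(Array[Byte], Array[Byte])] =
        kv.repartitionAndSortWithinPartitions(new NebulaPartitioner(nebulaParts, partOf))
    }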

According to https://github.com/DemocracyAndLiberty/nebula-exchange/blob/master/exchange-common/src/main/scala/com/vesoft/exchange/common/processor/Processor.scala, I noticed that setting repartitionWithNebula to true solves this problem when the number of nebula space partitions is greater than 1.
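
For reference, the workaround is just flipping that flag in the tag config above (all other fields unchanged):

      batch: 256
      partition: 32
      repartitionWithNebula: true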

So, should repartitioning be forcibly enabled when the number of nebula space partitions is greater than 1?

@wey-gu added the doc affected label on Mar 8, 2022

wey-gu commented Mar 8, 2022

Dear @DemocracyAndLiberty

Thanks a lot for your excellent analysis and suggestions!

  • The default value (false) could be revisited
    • What is the cost of turning repartitionWithNebula to true?
  • The impact of this value on v3.0.0 (with the default repartitionWithNebula: false, the TagID is lost in the SST file) should be highlighted in the documentation

ref:

cc @Aiee @Sophie-Xie


wey-gu commented May 7, 2022

We should revisit when repartitionWithNebula: false is OK to use.

@Nicole00


Minnull commented Sep 26, 2022

I found a problem after turning on repartitioning: it reduces task concurrency, so execution speed is limited.
