Change the mapper output directory from $TMP/shards to $TMP/map_output #3960

gitlw · 2019-09-11T00:51:36Z

In #3959, the bulk loader crashes when trying to move a directory into itself under a new name:
/dgraph/tmp/shards/shard_0
/dgraph/tmp/shards/shard_0/shard_0

The bulk loader logic is:

1. The mapper produces output as
   .../tmp/shards/000
   .../tmp/shards/001
2. Read the list of shards under .../tmp/shards/
3. Create the reducer shards as
   .../tmp/shards/shard_0
   .../tmp/shards/shard_1
4. Move the list read in step 2 into the reducer shards created in step 3

Although I cannot reproduce the problem, it seems that the creation of the reducer shard directory .../tmp/shards/shard_0 in step 3 and the listing of the mapper shards in step 2 are reordered. Something similar is mentioned in etcd-io/etcd#6368.

This PR avoids that possibility by putting the mapper output into a separate directory,
.../tmp/map_output, so that the program works correctly even if the reordering happens.

pullrequest

Thanks for the detailed context in the PR description. Going through it, the PR seems to work around the issue, as mentioned. The actual bug is still there, though, and could resurface if someone does something similar in the future, so I would suggest adding a comment noting that the root issue still exists but that you were unable to reproduce it.

Reviewed with ❤️ by PullRequest

ashish-goswami

Reviewable status: 0 of 3 files reviewed, all discussions resolved (waiting on @manishrjain)

mangalaman93 · 2019-09-11T17:01:33Z

I have a question. Wouldn't it make more sense to run fsync after creating or moving directories/files? That will guarantee that operations are executed in order.

gitlw · 2019-09-11T22:35:25Z

@mangalaman93 I think fsync would help with other types of problems.
For instance, if we are creating a subdirectory and then try to list all children under the parent, and we cannot find the newly created child, fsync could potentially help.
But in this case, fsync would not help, because the listing returns a child directory that should not show up.

martinmr

Reviewed 3 of 3 files at r1.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @gitlw and @manishrjain)

dgraph/cmd/bulk/mapper.go, line 92 at r1 (raw file):

	filename := filepath.Join(
		m.opt.TmpDir,
		"map_output",

perhaps make this a constant.

dgraph/cmd/bulk/merge_shards.go, line 30 at r1 (raw file):

func mergeMapShardsIntoReduceShards(opt options) {
	mapShards := shardDirs(opt.TmpDir + "/map_output")

use filepath.Join to make this logic more robust

dgraph/cmd/bulk/reduce.go, line 48 at r1 (raw file):

func (r *reducer) run() error {
	dirs := shardDirs(r.opt.TmpDir + "/shards")

also use filepath.Join here
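To illustrate the three suggestions together, here is a small sketch. The constant name mapOutputDir and both helper functions are assumptions made for this example, not identifiers from the PR.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// mapOutputDir is a hypothetical constant for the mapper output directory
// name, so the string literal is not repeated across files.
const mapOutputDir = "map_output"

// concatPath builds the path by string concatenation, as in r1.
func concatPath(tmpDir string) string {
	return tmpDir + "/" + mapOutputDir
}

// joinedPath builds the same path with filepath.Join, which uses the
// platform separator and cleans the result.
func joinedPath(tmpDir string) string {
	return filepath.Join(tmpDir, mapOutputDir)
}

func main() {
	// A trailing slash in the configured tmp directory exposes the difference.
	fmt.Println(concatPath("/dgraph/tmp/"))
	fmt.Println(joinedPath("/dgraph/tmp/"))
}
```

On a Unix-like system the concatenated path keeps a doubled slash, /dgraph/tmp//map_output, while filepath.Join yields /dgraph/tmp/map_output.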

ashish-goswami

Reviewable status: 0 of 3 files reviewed, 3 unresolved discussions (waiting on @manishrjain and @martinmr)

dgraph/cmd/bulk/mapper.go, line 92 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

perhaps make this a constant.

Done.

dgraph/cmd/bulk/merge_shards.go, line 30 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

use filepath.Join to make this logic more robust

Done.

dgraph/cmd/bulk/reduce.go, line 48 at r1 (raw file):

Previously, martinmr (Martin Martinez Rivera) wrote…

also use filepath.Join here

Done.

manishrjain

Reviewed 3 of 3 files at r2.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @martinmr)

Change the mapper output directory from $TMP/shards to $TMP/map_output

27bc0cc

gitlw requested review from manishrjain and a team as code owners September 11, 2019 00:51

pullrequest bot reviewed Sep 11, 2019

ashish-goswami approved these changes Sep 11, 2019

martinmr suggested changes Sep 11, 2019

mangalaman93 assigned mangalaman93 and ashish-goswami and unassigned mangalaman93 Sep 17, 2019

Address review comments

756d9bd

ashish-goswami reviewed Sep 18, 2019

manishrjain approved these changes Sep 18, 2019

ashish-goswami merged commit cd0e208 into master Sep 19, 2019

ashish-goswami deleted the gitlw/change_mapper_dir branch September 19, 2019 08:13
