Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update worker hash ring #58

Merged
merged 22 commits into from
Feb 21, 2024
Merged

Update worker hash ring #58

merged 22 commits into from
Feb 21, 2024

Conversation

LuQQiu
Copy link
Contributor

@LuQQiu LuQQiu commented Feb 7, 2024

Update worker hash ring to use worker identity, align with AOS 309

@LuQQiu
Copy link
Contributor Author

LuQQiu commented Feb 7, 2024

Add and create the test file via adding following code to dora

ConsistentHashProviderTest
  @Test
  public void sourceOfTruth() throws IOException {
    Gson gson = new GsonBuilder().setPrettyPrinting().create();
    ConsistentHashProvider provider = new ConsistentHashProvider(
        100, 100000000, 5);
    Set<WorkerIdentity> workerList = generateRandomWorkerList(50);
    // set initial state
    provider.refresh(workerList);
    NavigableMap<Integer, WorkerIdentity> map = provider.getActiveNodesMap();

    try (FileWriter writer = new FileWriter("/Users/alluxio/downloads/testData/workerList.json")) {
      gson.toJson(workerList, writer);
    }

    // Serialize and write active nodes map to a file
    try (FileWriter writer = new FileWriter("/Users/alluxio/downloads/testData/activeNodesMap.json")) {
      gson.toJson(map, writer);
    }

    List<String> ufsUrls = Arrays.asList(
        "s3://bucket/path/to/dir",
        "s3://bucket/path/to/file",
        "s://ai-testing/",
        "hdfs://host:port/path/to/dir",
        "hdfs://host:port/path/to/file",
        "s3://ai-ref-arch/yelp-review/model.pt",
        "s3://ai-ref-arch/yelp-review/yelp_academic_dataset_business.json",
        "s3://ai-ref-arch/yelp-review/yelp_academic_dataset_checkin.json",
        "s3://ai-ref-arch/yelp-review/yelp_academic_dataset_review.json",
        "s3://ai-ref-arch/yelp-review/yelp_academic_dataset_tip.json",
        "s3://ai-ref-arch/yelp-review/yelp_academic_dataset_user.json",
        "s3://ai-ref-arch/yelp-review/yelp_review_sample.csv",
        "s3://ai-ref-arch/yelp-review/yelp_review_sample_large.csv",
        "hdfs://namenode:8020/user/hadoop/dir",
        "hdfs://namenode:8020/user/hadoop/file.txt",
        "wasbs://container@account.blob.core.windows.net/dir",
        "wasbs://container@account.blob.core.windows.net/file.txt",
        "gs://bucket/dir",
        "gs://bucket/file.txt",
        "gcs://bucket/dir",
        "gcs://bucket/file.txt"
    );

    try (FileWriter writer = new FileWriter("/Users/alluxio/downloads/testData/fileUrlWorkers.json")) {
      Map<String, List<WorkerIdentity>> fileUrlWorkers = new HashMap<>();
      for (String ufsUrl : ufsUrls) {
        List<WorkerIdentity> selectedWorkers = provider.getMultiple(ufsUrl, 5);
        
        fileUrlWorkers.put(ufsUrl, selectedWorkers);
      }
      gson.toJson(fileUrlWorkers, writer);
    }
  }

@ChunxuTang @jja725 Cannot get the hash ring result correct so far

@LuQQiu LuQQiu changed the title [WIP] Test worker hash ring Update worker hash ring Feb 15, 2024
@LuQQiu LuQQiu requested review from jja725 and ChunxuTang February 15, 2024 23:08
Copy link
Member

@ChunxuTang ChunxuTang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good to me. Just left some minor comments.

BTW, I saw there are quite a number of commits in this PR. Could you squash them into fewer comments?

Comment on lines 117 to 118
worker_hosts (str, optional):
The worker hostnames in host1,host2,host3 format. Either etcd_hosts or worker_hosts should be provided, not both.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest keeping the worker_hosts for compatibility, although the functionality is probably broken at this moment. Instead, we can add a warning when users only provide workers without etcd host(s).

This is a change of the AlluxioFileSystem, the interface used by other applications. If we remove the variable (backward compatibility breaks), we have to update the fsspec. Afterward, when we add worker hosts back, we'll have to update the fsspec again (break again).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, i will leave it here and fix it in the later static membership PR

Comment on lines 135 to 142
if etcd_hosts is None and worker_hosts is None:
raise ValueError(
"Must supply either 'etcd_hosts' or 'worker_hosts'"
)
if etcd_hosts and worker_hosts:
raise ValueError(
"Supply either 'etcd_hosts' or 'worker_hosts', not both"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest changing it to a warning if users only provide worker hosts without etcd hosts.

Comment on lines -117 to +116
self.host = host
self.port = port
self._host = host
self._port = port
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reviewing other parts of the code, I would say we generally use names without the leading underscore for data members. Changing them back to self.host and self.port?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the google python convention is using underscore to indicate private member values. I remember some other parts have changed to this new convention

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about i create another PR to fix other parts of the code base after this PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Thanks.

self.etcd_username = None
self.etcd_password = None
self.prefix = ETCD_PREFIX_FORMAT.format(
self._etcd_username = None
self._etcd_password = None
self._prefix = ETCD_PREFIX_FORMAT.format(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

worker_addresses = []
for worker_identity in worker_identities:
worker_address = self._worker_info_map.get(worker_identity)
if (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary to have the parenthesis around the worker_address?

if worker_addresses != self._worker_addresses:
self._update_hash_ring(worker_addresses)
worker_info_map = {}
detect_diff_in_worker_info = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

=> diff_in_worker_info_deteced

Copy link
Contributor

@jja725 jja725 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improvement! Make some comments

self.logger = logger or logging.getLogger("AlluxioFileSystem")
if etcd_hosts and not worker_hosts:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use Union type in python to indicate OR semantic

self.session = self._create_session(concurrency)

# parse options
page_size = ALLUXIO_PAGE_SIZE_DEFAULT_VALUE
hash_node_per_worker = int(ALLUXIO_HASH_NODE_PER_WORKER_DEFAULT_VALUE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we put default value as string and convert here? We can just define as int at the beginning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChunxuTang previously put Alluxio related key=value string pair into the const file, we can discuss about what are the best way for dealing with property key and values

if options:
if ALLUXIO_PAGE_SIZE_KEY in options:
page_size = options[ALLUXIO_PAGE_SIZE_KEY]
self.logger.debug(f"Page size is set to {page_size}")
if ALLUXIO_HASH_NODE_PER_WORKER_KEY in options:
hash_node_per_worker = int(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not int?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The const is all string format, the const file is kind of Alluxio property key files now

worker_entity.worker_identity
] = worker_entity.worker_net_address
if worker_entity.worker_identity not in self._worker_info_map:
diff_in_worker_info_detected = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can early break here if we detect diff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worker_info_map[
worker_entity.worker_identity
] = worker_entity.worker_net_address

here we also need to update the local worker_info_map for update hash ring

@LuQQiu
Copy link
Contributor Author

LuQQiu commented Feb 20, 2024

The docker related tests failed with cannot launch worker issues

Copy link
Member

@ChunxuTang ChunxuTang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM. Just left some minor comments.

@@ -141,20 +143,32 @@ def __init__(
"Supply either 'etcd_hosts' or 'worker_hosts', not both"
)
self.logger = logger or logging.getLogger("AlluxioFileSystem")
if etcd_hosts and not worker_hosts:
self.logger.warning(
"Does not supply 'etcd_host'. Cannot tolerate Alluxio cluster changes."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

etcd_host => etcd_hosts
Cannot tolerate Alluxio cluster changes. => A etcd cluster is required for dynamic cluster changes.

Comment on lines -117 to +116
self.host = host
self.port = port
self._host = host
self._port = port
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Thanks.

@LuQQiu LuQQiu merged commit 6c76b86 into Alluxio:dev Feb 21, 2024
1 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants