@morningman
Contributor

@morningman morningman commented Dec 10, 2025

picked from #58166 #58606 #58748

zy-kkk and others added 3 commits December 10, 2025 11:23
…che#58166)

For Hive tables with a very large number of partitions (10K+), INSERT
operations are extremely slow because:
  - The FE fetches all partition metadata directly from HMS (expensive
    RPC calls)
  - The full table cache is invalidated after each insert (unnecessary)

Problem Summary:

1. **Use cache for partition metadata in INSERT**
  - The FE now fetches partition info from the cache instead of querying
    HMS directly when preparing an INSERT
  - Avoids expensive HMS RPC calls for every INSERT operation

2. **Selective cache refresh after commit**
  - Only invalidates the affected partitions instead of the full table cache
  - Based on partition update info from the BE (NEW/APPEND/OVERWRITE)
  - Significantly reduces cache invalidation overhead

3. **Handle cache inconsistency gracefully**
  - When the BE marks a partition as NEW but it already exists in HMS
    (cache miss), the FE detects this by checking HMS and treats it as
    APPEND instead of failing
  - Prevents `AlreadyExistsException` errors
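
The commit-time logic in points 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the Doris implementation: `SelectiveRefreshSketch`, `CommitPlan`, `plan`, and `invalidate` are hypothetical names, the partition cache is modelled as a plain map, and the HMS existence check is modelled as a set lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names are taken from the Doris code base.
class SelectiveRefreshSketch {
    // Update modes reported by the BE for each written partition.
    enum UpdateMode { NEW, APPEND, OVERWRITE }

    // What the FE decides to do at commit time.
    static class CommitPlan {
        final List<String> toCreate = new ArrayList<>();     // partitions to add to HMS
        final List<String> toInvalidate = new ArrayList<>(); // cache entries to drop
    }

    // partitionsInHms stands in for an existence check against HMS (assumption).
    static CommitPlan plan(Map<String, UpdateMode> updatesFromBe, Set<String> partitionsInHms) {
        CommitPlan plan = new CommitPlan();
        for (Map.Entry<String, UpdateMode> update : updatesFromBe.entrySet()) {
            String partition = update.getKey();
            UpdateMode mode = update.getValue();
            // The BE believed the partition was new, but HMS already has it:
            // treat the write as APPEND instead of failing on creation.
            if (mode == UpdateMode.NEW && partitionsInHms.contains(partition)) {
                mode = UpdateMode.APPEND;
            }
            if (mode == UpdateMode.NEW) {
                plan.toCreate.add(partition);
            } else {
                plan.toInvalidate.add(partition);
            }
        }
        return plan;
    }

    // Drop only the affected entries; every other cached partition is kept.
    static void invalidate(Map<String, String> partitionCache, CommitPlan plan) {
        for (String partition : plan.toInvalidate) {
            partitionCache.remove(partition);
        }
    }
}
```

In this sketch an INSERT that touches two partitions removes at most two cache entries, however many partitions the table has.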

For tables with partitions:
  - **Before**: HMS calls per INSERT + full cache invalidation
  - **After**: cache lookup + selective partition refresh
  - Expected speedup: 10x-100x for the partition metadata fetching phase
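
The before/after difference can be illustrated with a small cache-first lookup. This is only a sketch under stated assumptions: `CachedPartitionLookupSketch` and its methods are hypothetical, the HMS call is represented by a caller-supplied function, and the real FE cache has expiry and refresh behaviour that is not modelled here.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: a cache-first partition lookup that counts HMS round trips.
class CachedPartitionLookupSketch {
    private final Map<String, List<String>> partitionsByTable = new ConcurrentHashMap<>();
    private final Function<String, List<String>> hmsLoader; // stand-in for the HMS RPC
    private final AtomicInteger hmsCalls = new AtomicInteger();

    CachedPartitionLookupSketch(Function<String, List<String>> hmsLoader) {
        this.hmsLoader = hmsLoader;
    }

    // Only a cache miss reaches HMS; later INSERTs on the same table reuse the entry.
    List<String> partitionsOf(String table) {
        return partitionsByTable.computeIfAbsent(table, t -> {
            hmsCalls.incrementAndGet();
            return hmsLoader.apply(t);
        });
    }

    int hmsCallCount() {
        return hmsCalls.get();
    }
}
```

Preparing two INSERTs against the same table calls the loader once in this sketch; the second lookup is served from memory.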
…les (apache#58606)

### Problem

Reproduction Steps: Create a Hive Catalog, create an unpartitioned
table, then insert data. The following failure occurs.

```
copy file failed: software.amazon.awssdk.services.s3.model.NoSuchKeyException: The specified key does not exist. (Service: S3, Status Code: 404,
```

The BE mistakenly treats non-partitioned tables as partitioned ones. For
partitioned tables, the system always appends a folder suffix for each
partition, organizing data into partition directories. However,
non-partitioned tables do not require partition information. In this
case, the BE incorrectly added a partition folder suffix for
non-partitioned tables, causing the insert operation to fail.

### Solution
- Skip setting partition information for non-partitioned tables in the
BE.
- Maintain current behavior for partitioned tables, including folder
suffix handling.
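
The intended behaviour can be sketched as below. The BE itself is written in C++; this Java snippet only illustrates the logic, and `WritePathSketch.targetDir` is a hypothetical helper. It assumes the conventional Hive `column=value` directory layout for partitioned tables.

```java
import java.util.List;

// Hypothetical sketch: choose the directory that data files are written to.
class WritePathSketch {
    static String targetDir(String tableLocation, List<String> partitionColumns,
                            List<String> partitionValues) {
        if (partitionColumns.size() != partitionValues.size()) {
            throw new IllegalArgumentException("partition columns and values must match");
        }
        // Non-partitioned table: write directly under the table location,
        // with no partition folder suffix.
        if (partitionColumns.isEmpty()) {
            return tableLocation;
        }
        // Partitioned table: append one "column=value" folder per partition column.
        StringBuilder path = new StringBuilder(tableLocation);
        for (int i = 0; i < partitionColumns.size(); i++) {
            path.append('/').append(partitionColumns.get(i))
                .append('=').append(partitionValues.get(i));
        }
        return path.toString();
    }
}
```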

### Result
- Inserts into non-partitioned object storage tables succeed.
- Partitioned tables continue to work as expected.

This issue was introduced in apache#58166
…apache#58748)

### What problem does this PR solve?

Follow-up to apache#58166.
With the changes in apache#58166, the edit log needs to record "modified
partitions" and "new partitions" separately so that non-master FEs can
correctly update their partition cache. Otherwise, some new partitions
cannot be queried on a non-master FE after an insert.
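
A rough sketch of why the two lists are kept apart: on replay, a modified partition only needs its cached metadata dropped, while a new partition must also be added to the cached partition name list. All names here (`PartitionEditLogSketch`, `InsertLogEntry`, `FollowerCache`) are hypothetical and do not reflect the actual Doris edit log format.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: an edit log entry that separates modified and new partitions.
class PartitionEditLogSketch {
    static class InsertLogEntry {
        final List<String> modifiedPartitions;
        final List<String> newPartitions;

        InsertLogEntry(List<String> modifiedPartitions, List<String> newPartitions) {
            this.modifiedPartitions = new ArrayList<>(modifiedPartitions);
            this.newPartitions = new ArrayList<>(newPartitions);
        }
    }

    // Stand-in for the partition cache held by a non-master FE.
    static class FollowerCache {
        final Set<String> partitionNames = new TreeSet<>();
        final Map<String, String> partitionMetadata = new HashMap<>();

        void replay(InsertLogEntry entry) {
            // Modified partitions: drop stale metadata so it is reloaded on next use.
            for (String partition : entry.modifiedPartitions) {
                partitionMetadata.remove(partition);
            }
            // New partitions: make them visible to queries on this FE.
            partitionNames.addAll(entry.newPartitions);
        }
    }
}
```

If both kinds were logged as a single "changed" list and only invalidated, the follower in this sketch would never learn the new partition names, which matches the symptom described above.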
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@morningman
Contributor Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 24.48% (35/143) 🎉
Increment coverage report
Complete coverage report

@morningman
Contributor Author

LGTM

@morningman morningman changed the title from "[opt](hive) Speed up Hive insert on partition tables using cache (#58166)(#58606)(#58748)" to "branch-3.1: [opt](hive) Speed up Hive insert on partition tables using cache (#58166)(#58606)(#58748)" Dec 15, 2025
@morrySnow morrySnow changed the title from "branch-3.1: [opt](hive) Speed up Hive insert on partition tables using cache (#58166)(#58606)(#58748)" to "branch-3.1: [opt](hive) Speed up Hive insert on partition tables using cache #58166 #58606 #58748" Dec 15, 2025
@morrySnow morrySnow merged commit a5ce97a into apache:branch-3.1 Dec 15, 2025
25 of 26 checks passed
zy-kkk added a commit to zy-kkk/doris that referenced this pull request Dec 25, 2025
morrySnow pushed a commit that referenced this pull request Dec 25, 2025