What happened?
public boolean isMinorNecessary() {
  int smallFileCount = fragmentFileCount + equalityDeleteFileCount;
  return smallFileCount >= config.getMinorLeastFileCount()
      || (smallFileCount > 1 && reachMinorInterval())
      || combinePosSegmentFileCount > 0;
}

protected boolean reachMinorInterval() {
  return config.getMinorLeastInterval() >= 0
      && planTime - lastMinorOptimizingTime > config.getMinorLeastInterval();
}
If a table has some partitions with many small files and others with only two or three, the condition (smallFileCount > 1 && reachMinorInterval()) never evaluates to true for the partitions with two or three small files. lastMinorOptimizingTime appears to be tracked for the table as a whole, so each minor optimization triggered by the busy partitions resets it before the interval can elapse. Consequently, the sparse partitions are never included in minor optimization. reachMinorInterval should be evaluated at the partition level rather than the table level.
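The starvation described above can be reproduced with a small standalone simulation. This is only a sketch, not the project's code: the class name, the constants, and the two fixed partitions are hypothetical, the combinePosSegmentFileCount clause is left out, and it assumes that the table-level lastMinorOptimizingTime is refreshed whenever any partition is picked for minor optimizing.

```java
final class MinorIntervalStarvationDemo {
  // Hypothetical stand-ins for config.getMinorLeastFileCount() and config.getMinorLeastInterval().
  static final int MINOR_LEAST_FILE_COUNT = 12;
  static final long MINOR_LEAST_INTERVAL_MS = 60 * 60 * 1000L; // 1 hour

  // Mirrors isMinorNecessary() from the report, minus the combinePosSegmentFileCount clause.
  static boolean isMinorNecessary(int smallFileCount, long planTime, long lastMinorOptimizingTime) {
    boolean reachMinorInterval = MINOR_LEAST_INTERVAL_MS >= 0
        && planTime - lastMinorOptimizingTime > MINOR_LEAST_INTERVAL_MS;
    return smallFileCount >= MINOR_LEAST_FILE_COUNT
        || (smallFileCount > 1 && reachMinorInterval);
  }

  // Returns how often the sparse partition (3 small files) is selected over the given
  // number of planning rounds, while a busy partition (20 small files) shares the table.
  static int countSparsePartitionSelections(int rounds, long planPeriodMs) {
    long lastMinorOptimizingTime = 0L; // one value for the whole table
    int sparseSelected = 0;
    for (int round = 1; round <= rounds; round++) {
      long planTime = round * planPeriodMs;
      boolean busy = isMinorNecessary(20, planTime, lastMinorOptimizingTime);
      boolean sparse = isMinorNecessary(3, planTime, lastMinorOptimizingTime);
      if (sparse) {
        sparseSelected++;
      }
      // Assumption: any minor optimizing on the table refreshes the table-level timestamp.
      if (busy || sparse) {
        lastMinorOptimizingTime = planTime;
      }
    }
    return sparseSelected;
  }

  public static void main(String[] args) {
    // Planning every 10 minutes: the busy partition keeps resetting the timestamp.
    System.out.println(countSparsePartitionSelections(100, 10 * 60 * 1000L));
    // Planning every 2 hours: the interval elapses between rounds.
    System.out.println(countSparsePartitionSelections(5, 2 * 60 * 60 * 1000L));
  }
}
```

Under these assumptions, whenever planning runs more often than the configured interval, the busy partition resets the timestamp on every round and the sparse partition is never selected.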
Affects Versions
0.8.1
What table formats are you seeing the problem on?
Iceberg
What engines are you seeing the problem on?
Spark
How to reproduce
No response
Relevant log output
Anything else
protected boolean reachMinorInterval() {
  if (config.getMinorLeastInterval() < 0) {
    return false;
  }
  long interval = planTime - lastMinorOptimizingTime;
  if (interval > config.getMinorLeastInterval()) {
    return true;
  }
  return isDifferentDay(lastMinorOptimizingTime, planTime);
}
Perhaps reachMinorInterval could be modified to follow the logic above, so that it evaluates to true at least once per day. Partitions with only two or three small files would then also get a chance to be optimized.
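The proposal calls isDifferentDay without defining it. One possible shape, as a minimal sketch: it assumes both timestamps are epoch milliseconds and that "day" means a calendar day in an explicitly supplied time zone. The class name, the extra ZoneId parameter, and passing the interval as an argument are all illustrative choices, not part of the existing code.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;

final class MinorIntervalSketch {
  // Hypothetical helper: true when the two epoch-millisecond timestamps
  // fall on different calendar days in the given zone.
  static boolean isDifferentDay(long earlierMillis, long laterMillis, ZoneId zone) {
    LocalDate earlier = Instant.ofEpochMilli(earlierMillis).atZone(zone).toLocalDate();
    LocalDate later = Instant.ofEpochMilli(laterMillis).atZone(zone).toLocalDate();
    return !earlier.equals(later);
  }

  // Same control flow as the proposed reachMinorInterval(), with the fields
  // and config value passed in so the sketch runs on its own.
  static boolean reachMinorInterval(
      long minorLeastInterval, long planTime, long lastMinorOptimizingTime, ZoneId zone) {
    if (minorLeastInterval < 0) {
      return false; // a negative interval disables the time-based trigger
    }
    if (planTime - lastMinorOptimizingTime > minorLeastInterval) {
      return true;
    }
    return isDifferentDay(lastMinorOptimizingTime, planTime, zone);
  }
}
```

With this shape, the first planning round after midnight in the chosen zone would return true even if the table-level timestamp was refreshed only minutes earlier. Which zone to use is left open here.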
Are you willing to submit a PR?
Code of Conduct