Correct physical schema inference for DF files with non-logical types. #5976

Base: develop
Conversation
Force-pushed from 4f1074e to 86a6685.

CodSpeed Performance Report: merging this PR will not alter performance.
Signed-off-by: Alfonso Subiotto Marques <[email protected]>
Signed-off-by: Adam Gutglick <[email protected]>
Force-pushed from 86a6685 to 5f66406.
Signed-off-by: Adam Gutglick <[email protected]>
Benchmark reports were posted for the following suites (summaries and detailed results tables elided):

- FineWeb on NVMe
- TPC-H SF=10 on NVMe
- TPC-H SF=1 on NVMe
- TPC-H SF=1 on S3
- TPC-DS SF=1 on NVMe
- FineWeb on S3
- Statistical and Population Genetics
- TPC-H SF=10 on S3
- Clickbench on NVMe

Codecov Report: ❌ Patch coverage is
```rust
/// For these types, we use the logical schema's type instead of the DType's natural Arrow
/// conversion, since Vortex's Arrow executor can produce these types when requested.
fn calculate_physical_file_schema(
```
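A minimal, self-contained sketch of the selection rule the doc comment describes. The enum and function names here are stand-ins for illustration, not the real Vortex or arrow-rs types: for wrapper types such as Dictionary or run-end encoding, which a Vortex `DType` cannot express, keep the logical schema's type; otherwise keep the file's natural conversion.

```rust
// Stand-in for arrow_schema::DataType; only the variants relevant here.
#[derive(Clone, Debug, PartialEq)]
enum ArrowType {
    Int64,
    Utf8,
    // Non-logical wrappers that don't round-trip through a Vortex DType.
    Dictionary(Box<ArrowType>),
    RunEndEncoded(Box<ArrowType>),
}

// True for types a DType's natural Arrow conversion can never produce.
fn is_non_logical(t: &ArrowType) -> bool {
    matches!(t, ArrowType::Dictionary(_) | ArrowType::RunEndEncoded(_))
}

// Per column: prefer the logical (reader) schema's type when it is a
// non-logical wrapper, since the Arrow executor can produce it on request;
// otherwise fall back to the file's natural (writer) type.
fn pick_physical_type(logical: &ArrowType, natural: &ArrowType) -> ArrowType {
    if is_non_logical(logical) {
        logical.clone()
    } else {
        natural.clone()
    }
}
```

With this rule, a column stored as plain `Utf8` but requested as `Dictionary(Utf8)` keeps the dictionary type, so the cast can be pushed into the scan rather than re-applied in Arrow afterwards.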
I think we should remove this. See my comments on the callsites.
```rust
let physical_file_schema = Arc::new(vxf.dtype().to_arrow_schema().map_err(|e| {
    exec_datafusion_err!("Failed to convert file schema to arrow: {e}")
})?);
let physical_file_schema = Arc::new(calculate_physical_file_schema(
```
I think the nomenclature here is a bit confusing because datafusion's classic interpretation of logical (i.e. table types) vs physical (i.e. file types) doesn't really make sense in a vortex world, where the output arrow types can be dynamic, not static. I think s/physical_file_schema/this_file_schema and s/logical_file_schema/unified_file_schema would make more sense. What do you think?
Also, given the vortex dynamic output schema nature, calculating this_file_schema should just be the subset of unified_file_schema that the file actually stores, no extra type calculations necessary.
The output Arrow types that Vortex reports are always going to be the simple types, e.g. Dictionary and RunEnd will never be reported as a column type even if Vortex encodes it that way.
So I think we need to keep the calculation?
Also I agree that I find DF's nomenclature confusing here. In my head, I always call this the "reader schema" (logical file schema) and the "writer schema" (physical file schema).
If we don't do the calculation here, then by default we never pushdown the cast to Dictionary/RunEnd and always end up doing it in Arrow after.
Right, but what I am suggesting is removing any reference to vxf.dtype() here. The “calculation” would become just finding the subset of columns in logical_file_schema (physical arrow types) that this specific file stores. This should all be done by name.
The complexity of the current calculation, which chooses the physical arrow type from one schema or the other based on the type itself, is the part that I think is unnecessary, especially since the physical arrow types from the logical_file_schema are what is “desired” by datafusion and what you need to end up outputting (minus projections) anyway.
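The subset-by-name approach suggested here can be sketched in a few lines. This is an illustration with stand-in types (real code would operate on arrow `Field`s): the per-file schema is just the unified schema restricted to the columns the file actually stores, with no type recomputation.

```rust
use std::collections::HashSet;

// Stand-in field: (name, arrow type rendered as a string).
type Field = (String, String);

// this_file_schema = the subset of the unified (table-wide) schema whose
// columns this file actually stores, matched by name, keeping the unified
// schema's arrow types untouched.
fn subset_by_name(unified: &[Field], stored: &HashSet<String>) -> Vec<Field> {
    unified
        .iter()
        .filter(|(name, _)| stored.contains(name))
        .cloned()
        .collect()
}
```

The design point is that the unified schema's types are what DataFusion expects to receive anyway, so copying them by name avoids deciding per column which type system "wins".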
This conversation split up a bit, but I do think it's important to keep pulling from vxf.dtype() and using the names; that's how the expression adapter rewrites the expression to take advantage of the data's concrete structure (in the case of filters like non_exist_col > 5).
```rust
    exec_datafusion_err!("Couldn't get the schema for the underlying Vortex scan")
})?;
let stream_schema =
    calculate_physical_file_schema(&scan_dtype, &projected_physical_schema)?;
```
I think this will be brittle because it's a lookup by name which can be changed with aliases etc... What about retaining the physical arrow type information when converting expressions in split_projection? You could then get a post-scan-projection arrow schema by calling something on the ProcessedProjection.
I agree that it's brittle, but it should also be able to maintain the names, as both (should) be projected in the same way. I'm thinking through the comments; would definitely love to make this more robust.
That would be a very similar flow right? Just only available after splitting the projection which should make sure all the names match.
Yes, the flow would be pretty much exactly the same, just that every DF expression you successfully convert to vortex would also keep the DF result type, i.e. you calculate the physical arrow schema in the split_projection recursion. I'll defer to you on what to do, but I think we can avoid the complicated calculate_physical_file_schema this way.
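A tiny sketch of that idea, with a stand-in expression type rather than real DataFusion `PhysicalExpr`s: the conversion recursion returns each converted expression together with its DataFusion result type, so the post-scan-projection arrow schema falls out of the recursion instead of a separate by-name lookup.

```rust
// Stand-in expression tree; real code would convert DataFusion expressions
// to Vortex expressions. Types are rendered as strings for illustration.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Column { name: String, df_type: String },
    Literal(i64),
    Gt(Box<Expr>, Box<Expr>),
}

// Convert recursively, returning the converted expression (echoed here)
// paired with the DataFusion result type it will produce.
fn convert(expr: &Expr) -> (Expr, String) {
    match expr {
        Expr::Column { df_type, .. } => (expr.clone(), df_type.clone()),
        Expr::Literal(_) => (expr.clone(), "Int64".to_string()),
        Expr::Gt(l, r) => {
            let (cl, _) = convert(l);
            let (cr, _) = convert(r);
            (Expr::Gt(Box::new(cl), Box::new(cr)), "Boolean".to_string())
        }
    }
}
```

Collecting the returned types across the split projection's outputs would yield the stream schema directly, with no name-based lookup against a separate schema.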
```rust
assert_eq!(filtered_scan[0].schema(), read_schema);
```

```rust
assert_batches_eq!(
```
I think this needs to be assert_batches_sorted_eq! as well
Signed-off-by: Adam Gutglick <[email protected]>
We also would need to add

I don't think so, direct conversion is taken care of by vortex/vortex-array/src/arrow/executor/mod.rs, line 122 at cc81639.
This PR fixes how a file's or a scan's `DType` is mapped into Arrow types in cases where the table has non-logical types that don't round trip between the type systems.

This change both simplifies how we push projection expressions and lets us avoid unnecessary work converting arrays between various logically-equivalent types.
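To illustrate the round-trip failure the description refers to, here is a self-contained sketch with stand-in enums (the real types are arrow-rs `DataType` and Vortex `DType`): Arrow's Dictionary wrapper is an encoding detail that a `DType` cannot express, so the natural conversion back to Arrow loses it.

```rust
// Stand-ins for the two type systems, reduced to the relevant variants.
#[derive(Clone, Debug, PartialEq)]
enum Arrow {
    Utf8,
    Dictionary(Box<Arrow>),
}

#[derive(Clone, Debug, PartialEq)]
enum DType {
    Utf8,
}

// Arrow -> Vortex: the Dictionary wrapper is dropped; only the value type
// survives in the DType.
fn to_dtype(t: &Arrow) -> DType {
    match t {
        Arrow::Utf8 => DType::Utf8,
        Arrow::Dictionary(values) => to_dtype(values),
    }
}

// Vortex -> Arrow: the natural conversion can only recover the plain type,
// so Dictionary never round trips. This is the mismatch the PR corrects by
// consulting the table's logical schema for such columns.
fn to_arrow(t: &DType) -> Arrow {
    match t {
        DType::Utf8 => Arrow::Utf8,
    }
}
```

A `Dictionary(Utf8)` column thus comes back as plain `Utf8` under the natural conversion, which is why the physical file schema cannot be inferred from the `DType` alone for these columns.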