Skip to content

Conversation

@pingtimeout
Copy link
Contributor

As part of #3256 , the following test libraries were added from the Nessie project:

  • objet-storage-mock - Allows for mocking the object storage layer and simulate any Iceberg table, including tables with extremely large number of files
  • azurite-testcontainer - Allows for simulating Azure Storage with testcontainers in Polaris codebase
  • gcs-testcontainer - Allows for simulating GCS with testcontainers in Polaris codebase

There was an explicit request to pull the code into Polaris, instead of depending on that code as a library (#3256 (comment), #3256 (review)). This is what this PR does.

This PR introduces three test-only libraries in the tools directory so that they can be used by #3256.

In case it can be useful, here is a bash script that generates diff commands for all files added in this PR against a Nessie folder. It can be used to verify that the code that comes from Nessie has not been modified, apart from package names, code-copied-to-polaris marker and other minor tweaks.

I have also tested the current PR by integrating it with #3256 and verifying that all the tests from #3256 still pass.

Checklist

  • 🛡️ Don't disclose security issues! (contact [email protected])
  • 🔗 Clearly explained why the changes are needed, or linked related issues: Fixes #
  • 🧪 Added/updated tests with good coverage, or manually tested (and explained how)
  • 💡 Added comments for complex logic
  • 🧾 Updated CHANGELOG.md (if needed)
  • 📚 Updated documentation in site/content/in-dev/unreleased (if needed)

@pingtimeout pingtimeout force-pushed the test-libs-for-storage-operations branch from fb6999e to 7d138bf Compare January 23, 2026 12:49
Copy link
Contributor

@dimas-b dimas-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your diligent porting work, @pingtimeout !

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Jan 23, 2026
@RussellSpitzer
Copy link
Member

An easy fix here to trim the size of this PR would be to split it into one per container rather than moving all of them at once. Just a guide for the future.

@@ -0,0 +1,23 @@
#
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this all new? Just noting it has the Apache license and not the Dremio one

public class S3Resource {
@Inject ObjectStorageMock mockServer;

private static final Owner TEST_OWNER = Owner.of(42, "nessie-iceberg-s3-mock");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be confusing later, not a big deal for now though

@singhpk234
Copy link
Contributor

i believe we are reinventing the wheel here, why not use something like S3Mock ?
https://github.com/adobe/S3Mock, it a well maintained library ~1k stars

can we do more thorough analysis on why we need this custom code and not use something like libraries like S3Mock

@RussellSpitzer
Copy link
Member

I'm fine with adding this for now, since it's just a test dependency, if we want to use S3 mock or something like that in the future it's still available.

@dimas-b
Copy link
Contributor

dimas-b commented Jan 23, 2026

@singhpk234 :

i believe we are reinventing the wheel here, [...]

In short, the technical capabilities of these testing utilities are different from S3Mock, which is why they were developed under Project Nessie in the first place. One such aspect is InterceptingBucket, which is quite useful in tests. This functionality is currently used in #3256, where this PR originates.

As @RussellSpitzer noted, it is certainly possible to make adjustments later if developers find more efficient ways to work with and test the code in #3256.

@singhpk234
Copy link
Contributor

singhpk234 commented Jan 23, 2026

Thank you for sharing this @dimas-b ! I believe thats precisely i was looking to get answer for !

I wonder if can't we adjust the tests to work with s3mocks rather than writing whole req / responses even introducing things like Owner.

my understanding was that iceberg uses s3 mock pointer, even trino uses s3 mock Trino so i was confused on what exactly are we testing thats not possible with s3mock as we see other projects which do similar stuff is capable of.

I believe we just want to check it in and use it for tests (#3256) since its tests we are fine, without changing we have and think about it later ?

@flyrain
Copy link
Contributor

flyrain commented Jan 26, 2026

I think @singhpk234 has a good point. Given the size of this change, around 5600 LOC(79 new files), I think it is worth pausing and discussing this on the dev mailing list. Reusing an existing, well maintained library where possible is generally preferable, and we should be careful about reinventing the wheel unless there is a clear technical gap that existing solutions cannot cover.

It would be good to have a broader discussion on dev ML about:
• what concrete requirements are not met by existing libraries like S3Mock
• whether those gaps are fundamental or could be addressed with extension or configuration
• the long term maintenance cost of carrying a large custom test stack in Polaris

If we agree that custom code is still the right direction after that discussion, at least we will have alignment and a documented rationale.

Happy to help start a dev list thread if that sounds reasonable.

@jbonofre
Copy link
Member

I agree for a discussion on the dev mailing list. Happy to drive it if needed.
Also this PR is large with code donation (I have to check if it doesn't require SGA regarding the scope and size).
I'm not convinced we need everything in there. Maybe worth to cleanup and see what's actually needed.

@jbonofre
Copy link
Member

If it's a concern about the size of this PR, I wonder if it's not better to be back on the initial proposal: using nessie test dependecy as a library and create an issue to find alternative.
I think it would be reasonable because some asked to remove the nessie test dependency in favor of the code copy, and now we have concerns about this code.
It looks like we do ping pong here.
As a consensus I propose:

  1. Let's use the Nessie test dependency (it doesn't hurt)
  2. Create an issue to keep in mind to find alternative (using s3mock or others)
    Thoughts ?

@dimas-b
Copy link
Contributor

dimas-b commented Jan 26, 2026

Let's use the Nessie test dependency (it doesn't hurt)

+1 to this from my POV. I believe it is the easiest option both in terms of license (Nessie is ALv2) and maintenance (the code is covered by active CI under Project Nessie).

@flyrain
Copy link
Contributor

flyrain commented Jan 26, 2026

Just to clarify my position, I am not against adding dependencies or accepting donations. My concern is more about locking ourselves into a single option too early. I think it would be good to explore a bit more, look at multiple libraries or approaches, and understand what concrete requirements we actually have before converging. Is that allowed by the community?

From that angle, a discussion on the dev mailing list feels reasonable, both to validate the direction and to surface any other concerns the wider community may have.

Happy to proceed once we have that shared context and alignment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants