[DataPipe] file cache by tmbdev · Pull Request #407 · meta-pytorch/data

tmbdev · 2022-05-13T20:10:22Z

This PR adds a file caching filter. The filter receives (fname, stream) pairs; if the file is not already cached, it downloads all the data in the stream to a local file whose name is derived from fname. It then passes an (fname, stream) pair to the next stage.

This is particularly useful with WebDataset, where FileCache can be used to cache shards incrementally as they are downloaded from remote locations, but the filter works with arbitrary (fname, stream) pairs.
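The behavior described above can be sketched as a plain Python generator. This is only an illustration under stated assumptions, not the code in this PR: the function name `file_cache`, the `cache_dir` parameter, and the rule of naming the cached file after the basename of `fname` are all hypothetical.

```python
import os
import shutil
import tempfile


def file_cache(pairs, cache_dir):
    """Yield (fname, stream) pairs whose streams read from a local cached copy.

    Hypothetical sketch: names and the basename-based naming rule are assumptions.
    """
    os.makedirs(cache_dir, exist_ok=True)
    for fname, stream in pairs:
        # Assumed naming rule: the cached file is named after the basename.
        local = os.path.join(cache_dir, os.path.basename(fname))
        if not os.path.exists(local):
            # Copy to a temporary file first, then rename, so an interrupted
            # download never leaves a partial file under the final name.
            fd, tmp = tempfile.mkstemp(dir=cache_dir)
            with os.fdopen(fd, "wb") as out:
                shutil.copyfileobj(stream, out)
            os.replace(tmp, local)
        # The incoming stream is no longer needed; downstream reads the cache.
        stream.close()
        with open(local, "rb") as cached:
            yield fname, cached
```

Each yielded stream is closed when the consumer asks for the next pair, so it should be read inside the loop. Two inputs that share a basename would collide under this naming rule; a real implementation would need a more careful mapping from names to cache paths.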

VitalyFedyunin · 2022-05-19T19:36:58Z

This can be achieved with the existing on_disk_cache DataPipe, PTAL:

        nfiles = 100
        testdata = "hello, world"
        dest = os.path.join(self.temp_dir.name, "testdata")
        with open(dest, "w") as stream:
            stream.write(testdata)

        dp = IterableWrapper([dest] * nfiles)

        def _noop(x):
            return x

        dp = dp.on_disk_cache(filepath_fn=_noop)

        # This could be a download; for the sake of the example,
        # I am just writing text into the file.
        def _write(filename):
            with open(filename, 'w') as fh:
                fh.write(testdata)
        dp = dp.map(lambda filename: _write(filename))

        dp = dp.end_caching(mode="t", filepath_fn=_noop, timeout=120)
        dp = FileOpener(dp)

        count = 0
        for path, stream in dp:
            data = stream.read()
            count += 1
            assert data == testdata
        assert count == nfiles

tmbdev · 2022-05-20T02:15:27Z

I'm not particularly attached to my implementation, but I think file caching is something that people should be able to add very easily to a pipeline with just "dp.filecache(dirname)".

(In fact, it might be a good idea to just have it default to the environment variable and not cache if the environment variable is unset.)

I haven't seen use cases for the generality that the current cache implementation provides, and I think that generality will discourage the use of caching. Also, it looks like your caching implementation may mix downloading with caching, whereas the .popen/.filecache combo separates the two, making it easier to make training pipelines location-transparent.
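The environment-variable default suggested above might look roughly like the following sketch. The variable name `FILECACHE_DIR` and the helpers `resolve_cache_dir` and `maybe_cached` are hypothetical and do not come from this PR; `cache_fn` stands in for whatever caching stage the pipeline uses.

```python
import os


def resolve_cache_dir(dirname=None, env_var="FILECACHE_DIR"):
    """Return the cache directory to use, or None to disable caching.

    Hypothetical helper: the environment variable name is an assumption.
    """
    if dirname is not None:
        return dirname
    # An unset or empty variable means "do not cache".
    return os.environ.get(env_var) or None


def maybe_cached(pairs, cache_fn, dirname=None):
    """Apply cache_fn(pairs, cache_dir) only when a cache directory is configured."""
    cache_dir = resolve_cache_dir(dirname)
    if cache_dir is None:
        # No directory configured: pass the pairs through untouched.
        return pairs
    return cache_fn(pairs, cache_dir)
```

With this shape, the same pipeline definition runs with or without a local cache, and the choice is made entirely by the environment it runs in.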

Tom added 2 commits May 13, 2022 12:28

merged (b23cef5)

added filecache (360b1c6)

facebook-github-bot added the CLA Signed label May 13, 2022

VitalyFedyunin self-requested a review May 19, 2022 19:37

VitalyFedyunin changed the title ~~file cache~~ [DataPipe] file cache May 19, 2022

tmbdev wants to merge 2 commits into meta-pytorch:main from tmbdev:wdsfilecache