Incremental processing or streaming in micro-batches #5917
Replies: 47 comments 10 replies
-
Hi @kskyten
-
Let's say you have a folder of images and you define a computationally intensive pipeline for them. After having defined this pipeline, you acquire new images that you want to process. This corresponds to the inter-datum incrementality in Pachyderm, if I understood correctly.
-
We are going to implement Dependency to directory #154, which is somewhat related to this incremental feature, but it is not enough. This is a good scenario. Let's keep this feature request - it will be implemented; we just need to decide on a priority.
-
@dmpetrov, are we there yet, or should we keep this open?
-
@MrOutis Yep, this should work now. Closing, please feel free to reopen.
-
Guys, this is actually a request for streaming/micro-batch processing. Scenario: I have a dir with images and I'd like to process each image. It is mostly a data engineering scenario, not a data science one, so we should think carefully about whether DVC should handle this or not. But it is still a valid case, and it will be especially useful when we implement multi-processing #755. So, I'm renaming and reopening.
-
Thanks, @dmpetrov
-
Hi, has this been implemented yet?
-
Hi @Pierre-Bartet, it has not been implemented yet. Would you find this feature useful? Could you describe your use case?
-
@pared Use cases equivalent to the one already described. In fact, this is exactly the point at which data version control tools could become really useful, because the situation is messier than in the 'vanilla' use case. AFAIK Pachyderm is the only tool with this kind of feature, but it is super heavy. I think what is needed is a way to declare the dependency so that a `dvc repro` would only run on the needed files. The hard part is that a directory dependency could mean two different things (apply the command to each file independently, or to all files together), so I guess the `-d` flag would need to be replaced by something else to distinguish between those two cases.
-
I don't think that distinguishing whether to apply given code to each image or to all images should be something DVC decides implicitly. Having said that, I think we could implement this feature (applying a command only to "new" data) for directory dependencies, but it would need to be the user's conscious decision to use it. Some notes: after writing this, it seems to me that my proposition for this solution is wrong, since it can easily be misused. Leaving it here for further discussion; maybe we could implement this as an explicit, opt-in option.
-
@Pierre-Bartet @pared An interesting discussion. I also don't see how DVC, being completely agnostic to the way code and files are organized, can solve this right now. Thinking about the right abstraction for the job, I first have a question back. Let's imagine we define a stage whose command processes a directory of images as a dependency: how is that script organized internally? How does it read the data from that directory? I'm asking because, right now, as I mentioned, we don't dictate any of that. But when we start doing things incrementally, how do we pass that delta information to the script? How will it know that it needs to read only certain files from the directory? @Pierre-Bartet, do you have some ideas that come to mind?
-
I've been thinking about it, but I see no solution that would not totally change DVC's behaviour, mostly because of what you just said. The behaviour I had in mind was to have the dependencies given as arguments to the script (here computation.py), so that dvc run could decide which of them to pass, for example all of the files or only the new ones. But that is indeed a huge deviation from the current dvc behaviour.
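For illustration, here is a minimal sketch of what that script-side contract might look like, assuming a hypothetical mode in which DVC (or a thin wrapper around it) passes only the new or changed files to computation.py as command-line arguments; the `processed/` output directory and the per-file work are placeholders, not anything DVC does today:

```python
# computation.py -- hypothetical sketch: an outer tool (DVC or a wrapper)
# passes only the new/changed input files as command-line arguments, so
# unchanged files are never touched.
import sys
from pathlib import Path

OUT_DIR = Path("processed")  # placeholder output location


def process(image_path: Path) -> None:
    # Stand-in for the expensive per-image computation.
    result = OUT_DIR / (image_path.stem + ".processed")
    result.write_text(f"processed {image_path.name}\n")


def main() -> None:
    OUT_DIR.mkdir(exist_ok=True)
    for arg in sys.argv[1:]:  # each argument is one file to (re)process
        process(Path(arg))


if __name__ == "__main__":
    main()
```

A wrapper would then invoke it as, e.g., `python computation.py data/new_image_1.png data/new_image_2.png`, leaving open the question of how DVC would record the per-file outputs.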
-
This is also a feature which I would like to see handled better with DVC, if possible. Currently, the only way to avoid expensive re-computation on partial data sets when adding new data is to bypass DVC altogether, execute the processing on only the new files, and then run dvc commit on the results -- which seems rather to detract from the whole point of using DVC in the first place to track changes to datasets. And yet, it must be possible to avoid re-processing tens of thousands of files just because a single new file was added to an input directory. @Pierre-Bartet, my thoughts for a possible behaviour were also in the same direction as yours, passing [delta] dependencies to the script being run. A gentler approach along the same lines might be to have the command itself keep an index of which inputs it has already processed, and handle only what is missing from it. This is for sure "advanced" behaviour and requires users to be aware of the existence of the index, but at least it doesn't mean DVC having to change its behaviour with regard to how it calls commands with dvc run. I realise this suggestion isn't necessarily entirely thought through, for which I apologise. Please feel free to tear it down where necessary :)
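To make the index idea a bit more concrete, here is a rough sketch under the assumptions above; the index path, its one-relative-path-per-line format, and the script name are invented for illustration:

```python
# process_new.py -- hypothetical sketch of the "index file" idea: the command
# reads an index of already-processed inputs, handles only what is missing
# from it, and then records the newly processed names.
from pathlib import Path

DATA_DIR = Path("data")                    # tracked input directory
INDEX_FILE = Path("processed_index.txt")   # one file name per line


def load_index() -> set[str]:
    if INDEX_FILE.exists():
        return set(INDEX_FILE.read_text().splitlines())
    return set()


def main() -> None:
    done = load_index()
    todo = [p for p in sorted(DATA_DIR.glob("*")) if p.name not in done]
    for path in todo:
        # ... expensive per-file work goes here ...
        done.add(path.name)
    INDEX_FILE.write_text("\n".join(sorted(done)) + "\n")


if __name__ == "__main__":
    main()
```

The stage command would simply be this script; whether the index file is tracked by DVC or kept outside its view is exactly the kind of conscious, "advanced" decision described above.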
-
I find this feature really useful 😅 The simplest scenario that I can imagine is giving the user a way to tell which files are new. For example:

```bash
mkdir data
for i in {1..3}; do echo $i > data/$i; done
dvc add data
```

When you add more files:

```bash
for i in {4..10}; do echo $i > data/$i; done
```

Doing something like `dvc ls --new data.dvc` would list the new files; that way the user could iterate over all of those files, run the respective command on each, and then commit the results. The implementation of such a listing could look something like this:

```python
from dvc.repo import Repo
from dvc.stage import Stage

repo = Repo('.')
stage = Stage.load(repo, 'data.dvc')
out = stage.outs[0]
path = out.path_info

# Files recorded in the .dvc file vs. files currently in the workspace.
cached = set(path / info['relpath'] for info in out.dir_cache)
current = set(out.remote.walk_files(path))

# Files present now but not yet recorded are the "new" ones.
new = current - cached

for path in new:
    print(path)
```

We could start with a simple implementation and add more features depending on user requests, what do you think? It is a lot of manual work, but if you want to process everything with a single call, like @Pierre-Bartet suggested in #331 (comment), you can link those files to a new directory and use it as an argument for your script:

```bash
new_stuff=$(mktemp -d)
for file in $(dvc ls --new data.dvc); do
    ln -s $(realpath $file) $new_stuff
done
python script.py $new_stuff && rm -rf $new_stuff
```
-
@shcheklein: Indeed, what we've been discussing with @Suor is a different kind of 'incremental'.
-
This issue describes our use case as well. We have raw data which is produced continuously. I intended to bin this into e.g. daily chunks, then run multiple stages of processing on each chunk, possibly using the inputs of other stages from the same time bin. My problem is that I don't want to reprocess years' worth of data every day when a new file is added to the raw data, but I also don't want to manually create thousands of processing stages for every raw data file and all their children.
-
I think this depends on what DVC considers itself to be. For me it's a data tracking tool for the training stage; data tracking and metrics comparison help me a lot when I run experiments on a model, and after that I use other tools to deploy the model. So I use DVC as a data tracking tool, like git, not as a computation engine. I don't know what kind of tool you want it to be.
-
One more case and request - https://discordapp.com/channels/485586884165107732/563406153334128681/718108510671601824
-
So we might have map-style incremental processing and a reduce-style one. The map case seems a lot simpler; should we separate it?
-
@Suor it seems to me that both map and reduce would require some general mechanism first: how to apply certain logic only to new files. The way that logic is applied and the actual output is produced seems to be a simpler question to solve (and we even have some hacks for that).
-
I am the author of the comment above on Discord. I think it's also interesting to look at that use case in relation to the issues linked above. A potential different workaround would be to allow pipelines to create pipelines, so a "master" pipeline could create N different pipelines (one per file to process), and those would be executed one after the other. This would also allow very flexible workflows, with dynamic pipeline reconfiguration based on scripts. "ML experiments and hyperparameters tuning" #2799 could be related. I can't find any issue related to that one; is it hidden, or should I create it?
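A hedged sketch of the "pipeline that creates pipelines" idea: a small generator that writes one stage per input file into dvc.yaml, so that a subsequent `dvc repro` only re-runs the stages whose inputs changed. The `data/` layout and the `process_one.py` helper are invented for illustration:

```python
# generate_stages.py -- sketch of a "master" pipeline step: it emits one DVC
# stage per input file, and a wrapper then runs `dvc repro`, which only
# re-executes stages whose dependencies changed.
from pathlib import Path

DATA_DIR = Path("data")            # directory of files to process (assumed)
PIPELINE_FILE = Path("dvc.yaml")   # overwritten with the generated stages


def main() -> None:
    lines = ["stages:"]
    for src in sorted(DATA_DIR.glob("*.csv")):
        out = Path("processed") / src.name
        lines += [
            f"  process_{src.stem}:",
            f"    cmd: python process_one.py {src} {out}",
            "    deps:",
            f"    - {src}",
            "    - process_one.py",
            "    outs:",
            f"    - {out}",
        ]
    PIPELINE_FILE.write_text("\n".join(lines) + "\n")


if __name__ == "__main__":
    main()
```

The generator itself could even be run as a wrapper step before `dvc repro`, which is close to the wrapper approach described further down in this thread.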
-
@MatthieuBizien Please feel free to create an issue for that :)
-
I just want to use it for what its name stands for: "Data Version Control". The difficulty is that since we are talking about data, it can be so big that we don't want to actually commit it or have all of it on our computers (otherwise we would just use git and commit / push / pull / clone everything). Going from code (git) to data (dvc) means you want to be able to deal with only a subset of the full repository. I don't want it to be a computation engine, but that is a direct consequence of this.
-
I hope this suggestion adds something to the above thread. As we all know, if I define a pipeline whose dependency is a whole directory, then changing any file in it re-runs the stage on everything. The suggestion: if we were able to define a rule that connects pairs of input and output files directly in the DAG, only the affected pairs would need to be recomputed. It's like we are enhancing the DAG to include more local dependencies (and independencies) between input files and output files.
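One way to picture such a rule, purely as an illustration (this is not an existing DVC mechanism): a per-file mapping from input to expected output plus a staleness check, which an incremental driver could use to rebuild only the affected pairs. The path and suffix conventions are made up:

```python
# pairwise_rule.py -- illustrative only: a per-file input -> output rule,
# plus a staleness check an incremental driver could rely on.
from pathlib import Path


def output_for(input_path: Path) -> Path:
    # Hypothetical convention: data/raw/x.csv -> data/processed/x.parquet
    return Path("data/processed") / (input_path.stem + ".parquet")


def stale_inputs(input_dir: Path) -> list[Path]:
    """Inputs whose paired output is missing or older than the input."""
    stale = []
    for src in sorted(input_dir.glob("*.csv")):
        dst = output_for(src)
        if not dst.exists() or dst.stat().st_mtime < src.stat().st_mtime:
            stale.append(src)
    return stale


if __name__ == "__main__":
    for src in stale_inputs(Path("data/raw")):
        print(f"would reprocess {src} -> {output_for(src)}")
```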
-
@jonilaserson the discussion here circles around that in a way. Your use case, however, is a subset of what we are discussing. It also has a workaround employing an undocumented feature.
-
I'm using DVC for a project which processes continually produced time-series data in a series of steps. I described my use case above a little, but maybe I'll mention how I'm using it now.

My current system wraps DVC in code which produces DVC jobs based on the input data: it scans the input directories, creates a DVC file describing a job for each file that it finds (unless a DVC file is already present), and then runs DVC repro to compute any steps which aren't yet present. The benefit of this structure is that, once the wrapper step has run, the repository is a fully valid DVC repository which doesn't need any special tool beyond DVC itself to interact with. In order to make this work, I needed to bin my data into preset chunks and have a way of mapping from inputs to outputs (since those outputs are then used in subsequent steps). In my time-series application, I did this by mandating filenames like "20200720_stepA_1.csv". To bring this functionality into DVC itself, I'd imagine a couple of changes.

**Meta jobs**

I found the generation of DVC jobs to work quite well. You could imagine a "meta-job" which defines how input data should be processed into output data, and can then be used in later steps. Meta-jobs can form their own DAG, even if there's no input data yet. Running a meta-job creates normal DVC jobs for each valid combination of input chunks present (where "present" would need a precise definition). Meta-jobs can also depend on normal DVC jobs/files, in which case the dependency is passed to all generated jobs. All meta-jobs are run before normal jobs, to generate a set of normal jobs for all available input data. The normal jobs can then be run in the same way as they currently are.

**Step-to-step mapping - inputs**

The step-to-step mapping would need to be configurable, and in such a way that subsequent steps can use the outputs of previous ones. You'd need to decide "do I always map a single step output to a single step input?" I.e. if my input source (call it "stepA") has, say, six files spread over three days, and "stepB" depends on "stepA", should "stepB" be called three times (for the three days, with all inputs for a given day passed as inputs to stepB), six times (for each file, passed individually), or once (most easily implemented by passing the whole folder, as in the current version of DVC)? You might achieve this by e.g. defining a regex for the inputs of stepB. For case 2 (stepB should be called six times), a per-file regex would be enough; for the day-grouped case you'd additionally need something that tells DVC to group by the contents of capture group 1 in the regex (i.e. the timestamp).

**Step-to-step mapping - outputs**

The outputs would also need defining; "stepB" might only have a single output per invocation, following a similar naming pattern.

**Input passing**

Since your steps are being called with different inputs each time, you'll also need some way of passing these inputs to the command that's being called. That might involve a substitution token in the command. That's a bit ugly right now; I'm sure some thought could make it more elegant.

**Summary**

That's a bit of a brain-dump and I'm sorry if it's not totally coherent. I thought I'd document my experience with adding this kind of functionality on, in case it informs the requirements for adding this into DVC itself. I'd summarise my suggestion as: meta-jobs that generate per-chunk DVC jobs, plus configurable input grouping and input passing.
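To make the regex-grouping part of the above concrete, here is a small sketch of what a meta-job might do internally: bucket the stepA files by the timestamp capture group so that one stepB job is generated per day. The directory name and regex are assumptions based on the "20200720_stepA_1.csv" convention described above:

```python
# group_inputs.py -- sketch of the "group by capture group 1" idea for a
# meta-job: bucket stepA files by date so one stepB job is generated per
# bucket. File names follow the 20200720_stepA_1.csv convention above.
import re
from collections import defaultdict
from pathlib import Path

STEP_A_PATTERN = re.compile(r"^(\d{8})_stepA_\d+\.csv$")  # group 1 = date


def group_by_date(step_a_dir: Path) -> dict[str, list[Path]]:
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(step_a_dir.glob("*.csv")):
        match = STEP_A_PATTERN.match(path.name)
        if match:
            groups[match.group(1)].append(path)
    return dict(groups)


if __name__ == "__main__":
    for date, files in group_by_date(Path("stepA")).items():
        # One generated stepB job per date, with that date's files as deps.
        print(f"stepB_{date}: {[f.name for f in files]}")
```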
-
I am late to this party, but I think that a better way to deal with this would be to start with a file containing the names of the files to be processed. A processing step would look at this inventory file and compare it to the inventory file recording the files already processed (letting that file survive between runs would be the change). All files mentioned in the input inventory and not mentioned in the processed inventory would require incremental processing. The output of the processing step would be a new output inventory to be used in the next incremental step. Version controlling the input, processed, and output files would be done outside of the context of the processing step, rather than marking them as dependencies or outputs of it; only the input inventory would be a dependency and only the output inventory would be an output of the step. Back-fill could be triggered by a (version controlled) change to the processed inventory.

This approach also allows the relationship between input files and output files to not be one-to-one. This is useful if a sketch of all files processed in a single batch is put out as a single file. Since sketches are typically much smaller than the overall data, large numbers of sketches can typically be combined non-incrementally for global aggregates. The extensions required here would be relatively small.

It would also be desirable if memoization were possible. Thus, if some branch somewhere has processed the same input batch with the same code, we should be allowed to assume that the outputs will be the same and simply check them out. The effect of this would be that back-fill with a trivial new version of a processing step that doesn't actually change semantics (think formatting or comment changes) could proceed nearly instantaneously. Likewise, if you have a new version of a DAG that you want to stage, those aspects of the work that are identical to the production version could be short-circuited by checking out the output files from the production runs. Does this make sense?
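A minimal sketch of that inventory-driven step, with all file names as placeholders: the step reads the input inventory, processes only what the processed inventory does not yet list, and then writes the updated processed and output inventories:

```python
# inventory_step.py -- sketch of the inventory-driven incremental step:
# diff the input inventory against the processed inventory, handle only the
# difference, then write the updated inventories. All paths are placeholders.
from pathlib import Path

INPUT_INVENTORY = Path("input_inventory.txt")          # all known inputs
PROCESSED_INVENTORY = Path("processed_inventory.txt")  # inputs handled so far
OUTPUT_INVENTORY = Path("output_inventory.txt")        # outputs for next step


def read_lines(path: Path) -> set[str]:
    return set(path.read_text().splitlines()) if path.exists() else set()


def main() -> None:
    pending = read_lines(INPUT_INVENTORY) - read_lines(PROCESSED_INVENTORY)
    produced = []
    for name in sorted(pending):
        # ... incremental processing of one input file/batch goes here ...
        produced.append(name + ".sketch")
    # Record what has now been processed and what the next step should see.
    done = read_lines(PROCESSED_INVENTORY) | pending
    PROCESSED_INVENTORY.write_text("\n".join(sorted(done)) + "\n")
    OUTPUT_INVENTORY.write_text("\n".join(produced) + "\n")


if __name__ == "__main__":
    main()
```

In line with the suggestion above, only the inventory files would be declared to the pipeline tool; the data files themselves stay outside the step's dependency and output lists.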
-
Additional discussion here: https://discuss.dvc.org/t/need-to-build-non-ml-data-pipeline-is-dvc-good-fit/849/8. Net-net, I think that this can be done by allowing small state files for processing steps and a helper that updates a "new files" output. These do not form a circular dependency, so reproducibility and idempotency are not compromised. Further, there is no bright line between data preparation or engineering and machine learning. Speaking from a LOT of experience, getting both right and both integrated is really crucial, and DVC could really shine as an integrated path for both tasks. Speaking from recent conversations at $dayjob, Pachyderm will eat DVC's lunch without something like this.
-
I almost always work on ML projects that involve a growing dataset. Preprocessing steps are often long-running and the preprocessing of each file is typically independent.

Here is one workaround: create a script that only preprocesses the files whose outputs don't exist yet, and have your stage depend on the raw directory while outputting the preprocessed directory (a sketch of this follows below). If anyone happens to have a more elegant workaround for this scenario, please share it, as I face this situation all the time. The downside of this approach is the extra bookkeeping it requires every time you want to run it.

Whichever path you take, it's quite a lot of boilerplate for a common and simple scenario. It also doesn't seem to reach far beyond the existing capabilities of DVC, since you can almost get there with the existing DVC YAML; without knowing the DVC implementation, I imagine there are a number of ways the YAML could directly support this. Or perhaps I'm wrong and there's some showstopper that makes this not feasible? There are probably better and terser ways than this too, but that's just the natural thing that came to mind.
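For what it's worth, a bare-bones sketch of that workaround (the directory names and the `.out` suffix are invented): process only the raw files whose preprocessed counterpart is missing.

```python
# preprocess_new.py -- sketch of the common workaround: process only the raw
# files whose preprocessed counterpart does not exist yet, and skip the rest.
# All paths and the .out suffix are illustrative placeholders.
from pathlib import Path

RAW_DIR = Path("data/raw")
OUT_DIR = Path("data/preprocessed")


def main() -> None:
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    for src in sorted(RAW_DIR.glob("*")):
        dst = OUT_DIR / (src.stem + ".out")
        if dst.exists():
            continue  # already preprocessed; skip the expensive work
        # ... long-running, per-file preprocessing goes here ...
        dst.write_text(f"preprocessed {src.name}\n")


if __name__ == "__main__":
    main()
```

It could be wired in either as the stage command itself or run manually and followed by `dvc commit`, as mentioned earlier in the thread.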
-
It seems like it is only possible to replace a dataset entirely and then re-run the analysis. Incremental processing would enable more efficient processing by avoiding recomputation. Here's how Pachyderm does it.