
Any iterator supporting multi variable-length outputs in Pytorch? #2007

Open
Approximetal opened this issue Jun 8, 2020 · 17 comments

Comments

@Approximetal commented Jun 8, 2020

Hi, I ran into some issues when using DALI to replace the PyTorch dataloader. My task is to feed mel spectrograms, together with several labels, into a training model. I first hoped to pass a Python class containing labels (including IDs and strings) and GPU tensors into the pipeline, but the documentation says ExternalSource accepts input only on the CPU (via NumPy arrays), so I changed the data format to several NumPy arrays. Then I found that the iterators (DALIGenericIterator and DALIClassificationIterator) seem to support only one or two outputs in a single pipeline, while a batch in my model contains at least six (mel_inputs, input_lengths, mel_target, output_lengths, speaker_id, gate_padded). And each mel spectrogram has a different length.

I would like to ask how I can feed these data into a GPU-based training model. Do I need to write a custom function to support this? If so, may I have some guidance?

Here is my code:

class ExternalInputIterator(object):
    def __init__(self, batch_size, data_folder, target_speaker):
        self.batch_size = batch_size
        self.filepaths_pair = load_files_from_path(data_folder, target_speaker)
        self.dataset_len = len(self.filepaths_pair)

    def __iter__(self):
        self.i = 0
        shuffle(self.filepaths_pair)
        return self

    def __next__(self):
        if self.i >= self.dataset_len:
            raise StopIteration

        inputs = []
        for _ in range(self.batch_size):
            # index with self.i (not the loop counter) so successive
            # batches walk through the whole dataset
            file = self.get_mel_pair(self.filepaths_pair[self.i])
            inputs.append(file)
            self.i = (self.i + 1) % self.dataset_len
        # collate once, after the whole batch has been gathered
        batch = self.collate_fn(inputs)
        return batch

    @property
    def size(self):
        return self.dataset_len

    def get_mel_pair(self, files):
        ...
        return files

    def load_mel(self, filename):
        melspec = np.load(filename)
        return melspec

    def collate_fn(self, batch):
        ...
        return (mel_inputs, input_lengths, mel_targets, output_lengths, gate, input_voice, target_voice, speaker_id)

    next = __next__

class ExternalSourcePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id,  external_data):
        super(ExternalSourcePipeline, self).__init__(batch_size, num_threads, device_id, seed=12)
        self.mel_inputs = ops.ExternalSource()
        self.input_lengths = ops.ExternalSource()
        self.mel_targets = ops.ExternalSource()
        self.output_lengths = ops.ExternalSource()
        self.speaker_id = ops.ExternalSource()
        self.gate_padded = ops.ExternalSource()
        self.input_voice = ops.ExternalSource()
        self.target_voice = ops.ExternalSource()

        self.pad = ops.Pad(fill_value=0)
        self.gate_pad = ops.Pad(fill_value=1)
        self.external_data = external_data
        self.iterator = iter(self.external_data)

    def define_graph(self):

        self.mel_inputs = self.mel_inputs()
        mel_inputs = self.pad(self.mel_inputs)

        self.mel_targets = self.mel_targets()
        mel_targets = self.pad(self.mel_targets)

        self.gate_padded = self.gate_padded()
        gate_padded = self.gate_pad(self.gate_padded)

        self.input_lengths = self.input_lengths()
        self.output_lengths = self.output_lengths()
        self.speaker_id = self.speaker_id()
        self.input_voice = self.input_voice()
        self.target_voice = self.target_voice()

        return (mel_inputs, self.input_lengths, mel_targets, self.output_lengths, gate_padded,
                self.input_voice, self.target_voice, self.speaker_id)

    def iter_setup(self):
        try:
            (mel_inputs, input_lengths, mel_targets, output_lengths, gate_padded,
             input_voice, target_voice, speaker_id) = next(self.iterator)

            self.feed_input(self.mel_inputs, mel_inputs)
            self.feed_input(self.input_lengths, input_lengths)
            self.feed_input(self.mel_targets, mel_targets)
            self.feed_input(self.output_lengths, output_lengths)
            self.feed_input(self.speaker_id, speaker_id)
            self.feed_input(self.gate_padded, gate_padded)
            self.feed_input(self.input_voice, input_voice)
            self.feed_input(self.target_voice, target_voice)

        except StopIteration:
            self.iterator = iter(self.external_data)
            raise StopIteration

if __name__ == '__main__':

    trainset_loader = ExternalInputIterator(batch_size=4, data_folder=hparams.data_folder, target_speaker=hparams.target_speaker)
    pipe = ExternalSourcePipeline(batch_size=4, num_threads=2, device_id=0, external_data=trainset_loader)
    dali_iter = DALIGenericIterator([pipe], ['mel_inputs', 'input_lengths', 'mel_targets', 'output_lengths', 'gate_padded', 'input_voice', 'target_voice', 'speaker_id'],
                                     size=trainset_loader.size, auto_reset=True, last_batch_padded=True)
    print('dataset size:{}, batch size:{}'.format(trainset_loader.size, 4))
    for e in range(10):
        for i, data in enumerate(dali_iter):
            if i % 10 == 0:
                print('epoch {}, iteration {}'.format(e, i))
                print('mel_inputs:', data[0]['mel_inputs'].shape)
                print('input_lengths:', list(data[0]['input_lengths']))
                print('mel_targets:', data[0]['mel_targets'].shape)
                print('output_lengths:', list(data[0]['output_lengths']))
                print('speaker_id:', list(data[0]['speaker_id']))
        dali_iter.reset()
@JanuszL added the question label Jun 8, 2020
@JanuszL (Contributor) commented Jun 8, 2020

Hi,
ExternalSource accepts CPU-only input: NumPy arrays, or anything that supports the __array_interface__ protocol.
There is ongoing work to enable GPU input as well (#1997).
Regarding the number of DALIGenericIterator outputs: it can be arbitrary. In the example we show how to use two, but you can add more in the same way; just add more names to the output_map.

@Approximetal (Author) commented Jun 8, 2020

Hi @JanuszL, thank you for your reply. When I run the code, I got this error:

Traceback (most recent call last):
  File "/media/zzy/D/Programs/pycharm-2019.2.5/helpers/pydev/pydevd.py", line 1415, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/media/zzy/D/Programs/pycharm-2019.2.5/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/media/zzy/E/DL/torch-code/parrotron/test.py", line 15, in <module>
    size=trainset_loader.size, auto_reset=True, fill_last_batch = True, last_batch_padded = False)
  File "/home/zzy/anaconda3/envs/StarGAN-VC/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 162, in __init__
    self._first_batch = self.next()
  File "/home/zzy/anaconda3/envs/StarGAN-VC/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 259, in next
    return self.__next__()
  File "/home/zzy/anaconda3/envs/StarGAN-VC/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 190, in __next__
    category_tensors[category] = out.as_tensor()
RuntimeError: [/opt/dali/dali/pipeline/data/tensor_list.h:435] Assert on "this->IsDenseTensor()" failed: All tensors in the input TensorList must have the same shape and be densely packed.

In a batch the sample shapes are like [80 × 218], [80 × 156], [80 × 131], [80 × 109]. It seems the iterator doesn't support variable-length data. I tried to use self.pad = ops.Pad(fill_value=0), but then I got another error:

RuntimeError: Critical error in pipeline: [/opt/dali/dali/operators/generic/pad.cc:172] Unsupported data type: 10
...
Current pipeline object is no longer valid.
@JanuszL (Contributor) commented Jun 8, 2020

That is true, DALI doesn't support variable-length data in PyTorch. However, PaddlePaddle and MXNet (Gluon) have such support. TensorFlow supports this only on the CPU.
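As an aside, the dense-packing requirement behind the earlier IsDenseTensor assertion can be illustrated with plain NumPy (a generic sketch, independent of DALI; the shapes mirror those reported above):

```python
import numpy as np

# Variable-length "mel spectrograms": same number of mel bins (80),
# different numbers of frames, as in the batch reported above.
batch = [np.random.rand(80, n).astype(np.float32) for n in (218, 156, 131, 109)]

# The samples have different shapes, which is exactly what the
# IsDenseTensor assertion rejects: they cannot form one dense array.
shapes = {a.shape for a in batch}
assert len(shapes) > 1  # not densely packable as-is

# Zero-pad every sample along the frame axis to the longest length;
# the batch then stacks into a single dense array.
max_len = max(a.shape[1] for a in batch)
padded = np.stack([
    np.pad(a, ((0, 0), (0, max_len - a.shape[1])), constant_values=0)
    for a in batch
])
print(padded.shape)  # (4, 80, 218)
```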

@JanuszL added the enhancement label Jun 8, 2020
@JanuszL added this to ToDo in Users requests via automation Jun 8, 2020
@Approximetal changed the title Any iterator supporting multi variable length outputs in one pipeline? Any iterator supporting multi variable-length outputs in Pytorch? Jun 8, 2020
@Approximetal (Author) commented Jun 9, 2020

Another question: the ops.Pad function cannot be used to pad this variable-length data. As described above:

> In a batch the sample shapes are like [80 × 218], [80 × 156], [80 × 131], [80 × 109]. I tried to use self.pad = ops.Pad(fill_value=0), but I got another error:
>
> RuntimeError: Critical error in pipeline: [/opt/dali/dali/operators/generic/pad.cc:172] Unsupported data type: 10
> ...
> Current pipeline object is no longer valid.

What's the reason for this error? @JanuszL

@JanuszL (Contributor) commented Jun 9, 2020

It seems that you are trying to pad a data type that is not supported by the Pad operator.
@jantonguirao any thoughts?

@jantonguirao (Contributor) commented Jun 9, 2020

@Approximetal Most of our operators don't support float64. I'd recommend using 32-bit floats (float32) instead to feed your external source inputs.
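For illustration, the cast can be applied where the spectrograms are loaded, e.g. as a variant of the load_mel method shown above (a sketch; load_mel_f32 is a hypothetical name, not part of the original code):

```python
import numpy as np

# np.load preserves the dtype the file was saved with; arrays created
# with NumPy's defaults are float64, which DALI's Pad does not accept.
def load_mel_f32(filename):  # hypothetical variant of load_mel above
    melspec = np.load(filename)
    # astype with copy=False skips the copy when the data is already float32
    return melspec.astype(np.float32, copy=False)
```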

@Approximetal (Author) commented Jun 10, 2020

Hi @jantonguirao, I checked the data; it is float32. The format is like [np.array(shape=(80,216)), np.array(shape=(80,209)), np.array(shape=(80,193)), np.array(shape=(80,116))]. With the pad function self.pad = ops.Pad(fill_value=0, axes=(2,)) it still returns the error above. Is there an issue in my code?

The documentation says the parameter is data (TensorList) – Input to the operator, but the error shows TypeError: expected np.ndarray. Isn't that confusing?

@jantonguirao (Contributor) commented Jun 10, 2020

@Approximetal The error message you shared with us before indicates that the input of the Pad operator was in float64 format, which is also the default in NumPy:

>>> arr = np.zeros(shape=(2, 2))
>>> print(arr.dtype)
float64

vs

>>> arr = np.zeros(shape=(2, 2), dtype=np.float32)
>>> print(arr.dtype)
float32

The other TypeError message is probably not coming from the Pad operator but from somewhere else.

If you are able to share a reproducible code sample (with some sample data), we could analyze it and find out what the problem is.

@JanuszL (Contributor) commented Jun 10, 2020

@Approximetal

> The documentation says the parameter is data (TensorList) – Input to the operator, but the error shows TypeError: expected np.ndarray. Isn't that confusing?

Do you refer to:

RuntimeError: Critical error in pipeline: [/opt/dali/dali/operators/generic/pad.cc:172] Unsupported data type: 10
...
Current pipeline object is no longer valid.

error?

@Approximetal (Author) commented Jun 10, 2020

> @Approximetal
>
> > The documentation says the parameter is data (TensorList) – Input to the operator, but the error shows TypeError: expected np.ndarray. Isn't that confusing?
>
> Do you refer to:
>
> RuntimeError: Critical error in pipeline: [/opt/dali/dali/operators/generic/pad.cc:172] Unsupported data type: 10
> ...
> Current pipeline object is no longer valid.
>
> error?

Not this one. I mean, I didn't know why the pad function couldn't be used, so I checked the documentation, and it said the data format should be a TensorList. So I used torch.from_numpy to convert the data and then got TypeError: expected np.ndarray. Maybe I just misunderstood.

@JanuszL (Contributor) commented Jun 10, 2020

I don't quite get it; torch.from_numpy creates a Torch tensor, and DALI cannot handle that as an input. Maybe you want to use the .numpy() method instead?

@Approximetal (Author) commented Jun 10, 2020

> I don't quite get it; torch.from_numpy creates a Torch tensor, and DALI cannot handle that as an input. Maybe you want to use the .numpy() method instead?

Just want to confirm: the description in the documentation means a list of NumPy arrays, right? If DALI can only handle NumPy arrays, that's okay; I just misunderstood. Thanks for the reply!

@JanuszL (Contributor) commented Jun 10, 2020

Sure, it can be either a list of batch_size NumPy arrays, where each array corresponds to one sample, or one NumPy array where the outermost dimension corresponds to the batch size.
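In plain NumPy terms (a sketch; batch size 4 as in the code above), the two accepted layouts look like:

```python
import numpy as np

batch_size = 4

# Layout 1: a list of batch_size arrays, one per sample.
# Samples may have different shapes (e.g. variable-length spectrograms).
batch_as_list = [np.zeros((80, n), dtype=np.float32) for n in (218, 156, 131, 109)]
assert len(batch_as_list) == batch_size

# Layout 2: one array whose outermost dimension is the batch size.
# This requires all samples to share the same shape.
batch_as_array = np.zeros((batch_size, 80, 218), dtype=np.float32)
assert batch_as_array.shape[0] == batch_size
```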

@Approximetal (Author) commented Jun 10, 2020

> @Approximetal The error message you shared with us before indicates that the input of the Pad operator was in float64 format, which is also the default in NumPy:
>
> >>> arr = np.zeros(shape=(2, 2))
> >>> print(arr.dtype)
> float64
>
> vs
>
> >>> arr = np.zeros(shape=(2, 2), dtype=np.float32)
> >>> print(arr.dtype)
> float32
>
> The other TypeError message is probably not coming from the Pad operator but from somewhere else.
>
> If you are able to share a reproducible code sample (with some sample data), we could analyze it and find out what the problem is.

@jantonguirao I found that the dtype of the ID info was int64; I changed it to float32, and it works. Thanks!

@Approximetal (Author) commented Jun 10, 2020

Hi, I ran into a new error:

RuntimeError: [/opt/dali/dali/python/backend_impl.cc:138] Assert on "info.strides[i] == info.itemsize*dim_prod" failed: Strided data not supported. Detected on dimension 1

The code is like:

input_voice = np.concatenate([frame for frame, _ in input_voice], axis=0)

where frame is a NumPy array of shape (60, 80) and input_voice is the concatenated float32 NumPy array.

What does this error mean? How can I change the data format?

@JanuszL (Contributor) commented Jun 10, 2020

It looks like your NumPy array has strides and its memory is not contiguous. You can try copying the array and see whether the strides change; see, for example, this page for more info.
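A minimal NumPy illustration of the problem and the fix (a generic sketch, not DALI-specific):

```python
import numpy as np

mel = np.zeros((80, 218), dtype=np.float32)

# A transpose is a view with swapped strides; the memory is no longer
# C-contiguous, which is what triggers "Strided data not supported".
view = mel.T
print(view.flags['C_CONTIGUOUS'])  # False

# Copying materializes the data into a fresh, contiguous buffer.
fixed = np.ascontiguousarray(view)
print(fixed.flags['C_CONTIGUOUS'])  # True
```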

@Approximetal (Author) commented Jun 10, 2020

> It looks like your NumPy array has strides and its memory is not contiguous. You can try copying the array and see whether the strides change; see, for example, this page for more info.

I used np.ascontiguousarray() to solve the problem.

3 participants