
Any iterator supporting multi variable-length outputs in Pytorch? #2007

Open
Approximetal opened this issue Jun 8, 2020 · 17 comments

Comments

@Approximetal commented Jun 8, 2020

Hi, I ran into some issues when using DALI to replace the PyTorch dataloader. My task is to feed mel spectrograms, together with several labels, into a training model. I first hoped to pass a Python class containing labels (including IDs and strings) and GPU tensors into the pipeline, but the documentation says ExternalSource accepts input only on the CPU (via NumPy arrays), so I changed the data format to several NumPy arrays. Then I found that the iterators (DALIGenericIterator and DALIClassificationIterator) seem to support only one or two outputs in a single pipeline, while a batch in my model contains at least six (mel_inputs, input_lengths, mel_target, output_lengths, speaker_id, gate_padded). And each mel spectrogram has a different length.

I would like to ask how I can feed these data into a GPU-based training model. Do I need to write a custom function to support this? If so, may I have some guidance?

Here is my code:

class ExternalInputIterator(object):
    def __init__(self, batch_size, data_folder, target_speaker):
        self.batch_size = batch_size
        self.filepaths_pair = load_files_from_path(data_folder, target_speaker)
        self.dataset_len = len(self.filepaths_pair)

    def __iter__(self):
        self.i = 0
        shuffle(self.filepaths_pair)
        return self

    def __next__(self):
        if self.i >= self.dataset_len:
            raise StopIteration

        inputs = []
        for _ in range(self.batch_size):
            # index with self.i (not the loop counter) so successive
            # batches walk through the whole dataset
            file = self.get_mel_pair(self.filepaths_pair[self.i])
            inputs.append(file)
            self.i = (self.i + 1) % self.dataset_len
        # collate once, after the whole batch has been gathered
        batch = self.collate_fn(inputs)
        return batch

    @property
    def size(self):
        return self.dataset_len

    def get_mel_pair(self, files):
        ...
        return files

    def load_mel(self, filename):
        melspec = np.load(filename)
        return melspec

    def collate_fn(self, batch):
        ...
        return (mel_inputs, input_lengths, mel_targets, output_lengths, gate, input_voice, target_voice, speaker_id)

    next = __next__

class ExternalSourcePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id,  external_data):
        super(ExternalSourcePipeline, self).__init__(batch_size, num_threads, device_id, seed=12)
        self.mel_inputs = ops.ExternalSource()
        self.input_lengths = ops.ExternalSource()
        self.mel_targets = ops.ExternalSource()
        self.output_lengths = ops.ExternalSource()
        self.speaker_id = ops.ExternalSource()
        self.gate_padded = ops.ExternalSource()
        self.input_voice = ops.ExternalSource()
        self.target_voice = ops.ExternalSource()

        self.pad = ops.Pad(fill_value=0)
        self.gate_pad = ops.Pad(fill_value=1)
        self.external_data = external_data
        self.iterator = iter(self.external_data)

    def define_graph(self):

        self.mel_inputs = self.mel_inputs()
        mel_inputs = self.pad(self.mel_inputs)

        self.mel_targets = self.mel_targets()
        mel_targets = self.pad(self.mel_targets)

        self.gate_padded = self.gate_padded()
        gate_padded = self.gate_pad(self.gate_padded)

        self.input_lengths = self.input_lengths()
        self.output_lengths = self.output_lengths()
        self.speaker_id = self.speaker_id()
        self.input_voice = self.input_voice()
        self.target_voice = self.target_voice()

        return (mel_inputs, self.input_lengths, mel_targets, self.output_lengths, gate_padded,
                self.input_voice, self.target_voice, self.speaker_id)

    def iter_setup(self):
        try:
            (mel_inputs, input_lengths, mel_targets, output_lengths, gate_padded,
             input_voice, target_voice, speaker_id) = next(self.iterator)

            self.feed_input(self.mel_inputs, mel_inputs)
            self.feed_input(self.input_lengths, input_lengths)
            self.feed_input(self.mel_targets, mel_targets)
            self.feed_input(self.output_lengths, output_lengths)
            self.feed_input(self.speaker_id, speaker_id)
            self.feed_input(self.gate_padded, gate_padded)
            self.feed_input(self.input_voice, input_voice)
            self.feed_input(self.target_voice, target_voice)

        except StopIteration:
            self.iterator = iter(self.external_data)
            raise StopIteration

if __name__ == '__main__':

    trainset_loader = ExternalInputIterator(batch_size=4, data_folder=hparams.data_folder, target_speaker=hparams.target_speaker)
    pipe = ExternalSourcePipeline(batch_size=4, num_threads=2, device_id=0, external_data=trainset_loader)
    dali_iter = DALIGenericIterator([pipe], ['mel_inputs', 'input_lengths', 'mel_targets', 'output_lengths', 'gate_padded', 'input_voice', 'target_voice', 'speaker_id'],
                                     size=trainset_loader.size, auto_reset=True, last_batch_padded=True)
    print('dataset size:{}, batch size:{}'.format(trainset_loader.size, 4))
    for e in range(10):
        for i, data in enumerate(dali_iter):
            if i % 10 == 0:
                print('epoch {}, iteration {}'.format(e, i))
                print('mel_inputs:', data[0]['mel_inputs'].shape)
                print('input_lengths:', list(data[0]['input_lengths']))
                print('mel_targets:', data[0]['mel_targets'].shape)
                print('output_lengths:', list(data[0]['output_lengths']))
                print('speaker_id:', list(data[0]['speaker_id']))
        dali_iter.reset()
@JanuszL added the question label Jun 8, 2020
@JanuszL (Contributor) commented Jun 8, 2020

Hi,
ExternalSource accepts CPU-only input: NumPy arrays, or anything that supports the __array_interface__ protocol.
There is ongoing work to enable GPU input as well (#1997).
Regarding the number of DALIGenericIterator outputs: it can be arbitrary. In the example we show how to use two, but you can add more in the same way; just add more names to the output_map.

@Approximetal (Author) commented Jun 8, 2020

Hi @JanuszL, thank you for your reply. When I run the code, I got this error:

Traceback (most recent call last):
  File "/media/zzy/D/Programs/pycharm-2019.2.5/helpers/pydev/pydevd.py", line 1415, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/media/zzy/D/Programs/pycharm-2019.2.5/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/media/zzy/E/DL/torch-code/parrotron/test.py", line 15, in <module>
    size=trainset_loader.size, auto_reset=True, fill_last_batch = True, last_batch_padded = False)
  File "/home/zzy/anaconda3/envs/StarGAN-VC/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 162, in __init__
    self._first_batch = self.next()
  File "/home/zzy/anaconda3/envs/StarGAN-VC/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 259, in next
    return self.__next__()
  File "/home/zzy/anaconda3/envs/StarGAN-VC/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 190, in __next__
    category_tensors[category] = out.as_tensor()
RuntimeError: [/opt/dali/dali/pipeline/data/tensor_list.h:435] Assert on "this->IsDenseTensor()" failed: All tensors in the input TensorList must have the same shape and be densely packed.

In a batch the sample shapes are like [80 × 218], [80 × 156], [80 × 131], [80 × 109]. It seems the iterator doesn't support variable-length data. I tried to use self.pad = ops.Pad(fill_value=0), but then I got another error:

RuntimeError: Critical error in pipeline: [/opt/dali/dali/operators/generic/pad.cc:172] Unsupported data type: 10
...
Current pipeline object is no longer valid.
@JanuszL (Contributor) commented Jun 8, 2020

That is true, DALI doesn't support variable-length data in PyTorch. However, PaddlePaddle and MXNet (Gluon) have such support. TensorFlow supports this only on the CPU.
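As an aside, the dense-packing requirement behind the earlier IsDenseTensor assertion can be illustrated with plain NumPy (a generic sketch, independent of DALI; the shapes mirror those reported above):

```python
import numpy as np

# Variable-length "mel spectrograms": same number of mel bins (80),
# different numbers of frames, as in the batch reported above.
batch = [np.random.rand(80, n).astype(np.float32) for n in (218, 156, 131, 109)]

# The samples have different shapes, which is exactly what the
# IsDenseTensor assertion rejects: they cannot form one dense array.
shapes = {a.shape for a in batch}
assert len(shapes) > 1  # not densely packable as-is

# Zero-pad every sample along the frame axis to the longest length;
# the batch then stacks into a single dense array.
max_len = max(a.shape[1] for a in batch)
padded = np.stack([
    np.pad(a, ((0, 0), (0, max_len - a.shape[1])), constant_values=0)
    for a in batch
])
print(padded.shape)  # (4, 80, 218)
```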

@JanuszL added the enhancement label Jun 8, 2020
@JanuszL added this to ToDo in Users requests via automation Jun 8, 2020
@Approximetal changed the title Any iterator supporting multi variable length outputs in one pipeline? Any iterator supporting multi variable-length outputs in Pytorch? Jun 8, 2020
@Approximetal (Author) commented Jun 9, 2020

Another question: the ops.Pad function cannot be used to pad this variable-length data. As described above:

> In a batch the sample shapes are like [80 × 218], [80 × 156], [80 × 131], [80 × 109]. I tried to use self.pad = ops.Pad(fill_value=0), but I got another error:
>
> RuntimeError: Critical error in pipeline: [/opt/dali/dali/operators/generic/pad.cc:172] Unsupported data type: 10
> ...
> Current pipeline object is no longer valid.

What's the reason for this error? @JanuszL

@JanuszL (Contributor) commented Jun 9, 2020

It seems that you are trying to pad a data type that is not supported by the Pad operator.
@jantonguirao any thoughts?

@jantonguirao (Contributor) commented Jun 9, 2020

@Approximetal Most of our operators don't support float64. I'd recommend using 32-bit floats (float32) instead to feed your external source inputs.
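For illustration, the cast can be applied where the spectrograms are loaded, e.g. as a variant of the load_mel method shown above (a sketch; load_mel_f32 is a hypothetical name, not part of the original code):

```python
import numpy as np

# np.load preserves the dtype the file was saved with; arrays created
# with NumPy's defaults are float64, which DALI's Pad does not accept.
def load_mel_f32(filename):  # hypothetical variant of load_mel above
    melspec = np.load(filename)
    # astype with copy=False skips the copy when the data is already float32
    return melspec.astype(np.float32, copy=False)
```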

@Approximetal (Author) commented Jun 10, 2020

Hi @jantonguirao, I checked the data; it is float32. The format is like [np.array(shape=(80,216)), np.array(shape=(80,209)), np.array(shape=(80,193)), np.array(shape=(80,116))]. With the pad function self.pad = ops.Pad(fill_value=0, axes=(2,)) it still returns the error above. Is there an issue in my code?

The documentation says the parameter is data (TensorList) – Input to the operator, but the error shows TypeError: expected np.ndarray. Isn't that confusing?

@jantonguirao (Contributor) commented Jun 10, 2020

@Approximetal The error message you shared with us before indicates that the input of the Pad operator was in float64 format, which is also the default in NumPy:

>>> arr = np.zeros(shape=(2, 2))
>>> print(arr.dtype)
float64

vs

>>> arr = np.zeros(shape=(2, 2), dtype=np.float32)
>>> print(arr.dtype)
float32

The other TypeError message is probably not coming from the Pad operator but from somewhere else.

If you are able to share a reproducible code sample (with some sample data), we could analyze it and find out what the problem is.

@JanuszL (Contributor) commented Jun 10, 2020

@Approximetal

> The documentation says the parameter is data (TensorList) – Input to the operator, but the error shows TypeError: expected np.ndarray. Isn't that confusing?

Do you refer to:

RuntimeError: Critical error in pipeline: [/opt/dali/dali/operators/generic/pad.cc:172] Unsupported data type: 10
...
Current pipeline object is no longer valid.

error?

@Approximetal (Author) commented Jun 10, 2020

> @Approximetal
>
> > The documentation says the parameter is data (TensorList) – Input to the operator, but the error shows TypeError: expected np.ndarray. Isn't that confusing?
>
> Do you refer to:
>
> RuntimeError: Critical error in pipeline: [/opt/dali/dali/operators/generic/pad.cc:172] Unsupported data type: 10
> ...
> Current pipeline object is no longer valid.
>
> error?

Not this one. I mean, I didn't know why the pad function couldn't be used, so I checked the documentation, and it said the data format should be a TensorList. So I used torch.from_numpy to convert the data and then got TypeError: expected np.ndarray. Maybe I just misunderstood.

@JanuszL (Contributor) commented Jun 10, 2020

I don't quite get it; torch.from_numpy creates a Torch tensor, and DALI cannot handle that as an input. Maybe you want to use the .numpy() method instead?

@Approximetal (Author) commented Jun 10, 2020

> I don't quite get it; torch.from_numpy creates a Torch tensor, and DALI cannot handle that as an input. Maybe you want to use the .numpy() method instead?

Just want to confirm: the description in the documentation means a list of NumPy arrays, right? If DALI can only handle NumPy arrays, that's okay; I just misunderstood. Thanks for the reply!

@JanuszL (Contributor) commented Jun 10, 2020

Sure, it can be either a list of batch_size NumPy arrays, where each array corresponds to one sample, or one NumPy array where the outermost dimension corresponds to the batch size.
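In plain NumPy terms (a sketch; batch size 4 as in the code above), the two accepted layouts look like:

```python
import numpy as np

batch_size = 4

# Layout 1: a list of batch_size arrays, one per sample.
# Samples may have different shapes (e.g. variable-length spectrograms).
batch_as_list = [np.zeros((80, n), dtype=np.float32) for n in (218, 156, 131, 109)]
assert len(batch_as_list) == batch_size

# Layout 2: one array whose outermost dimension is the batch size.
# This requires all samples to share the same shape.
batch_as_array = np.zeros((batch_size, 80, 218), dtype=np.float32)
assert batch_as_array.shape[0] == batch_size
```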

@Approximetal (Author) commented Jun 10, 2020

> @Approximetal The error message you shared with us before indicates that the input of the Pad operator was in float64 format, which is also the default in NumPy:
>
> >>> arr = np.zeros(shape=(2, 2))
> >>> print(arr.dtype)
> float64
>
> vs
>
> >>> arr = np.zeros(shape=(2, 2), dtype=np.float32)
> >>> print(arr.dtype)
> float32
>
> The other TypeError message is probably not coming from the Pad operator but from somewhere else.
>
> If you are able to share a reproducible code sample (with some sample data), we could analyze it and find out what the problem is.

@jantonguirao I found that the dtype of the ID info was int64; I changed it to float32, and it works. Thanks!

@Approximetal (Author) commented Jun 10, 2020

Hi, I ran into a new error:

RuntimeError: [/opt/dali/dali/python/backend_impl.cc:138] Assert on "info.strides[i] == info.itemsize*dim_prod" failed: Strided data not supported. Detected on dimension 1

The code is like:

input_voice = np.concatenate([frame for frame, _ in input_voice], axis=0)

where frame is a NumPy array of shape (60, 80) and input_voice is the concatenated float32 NumPy array.

What does this error mean? How can I change the data format?

@JanuszL (Contributor) commented Jun 10, 2020

It looks like your NumPy array has strides and its memory is not contiguous. You can try copying the array and see whether the strides change; see, for example, this page for more info.
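A minimal NumPy illustration of the problem and the fix (a generic sketch, not DALI-specific):

```python
import numpy as np

mel = np.zeros((80, 218), dtype=np.float32)

# A transpose is a view with swapped strides; the memory is no longer
# C-contiguous, which is what triggers "Strided data not supported".
view = mel.T
print(view.flags['C_CONTIGUOUS'])  # False

# Copying materializes the data into a fresh, contiguous buffer.
fixed = np.ascontiguousarray(view)
print(fixed.flags['C_CONTIGUOUS'])  # True
```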

@Approximetal (Author) commented Jun 10, 2020

> It looks like your NumPy array has strides and its memory is not contiguous. You can try copying the array and see whether the strides change; see, for example, this page for more info.

I used np.ascontiguousarray() to solve the problem.

3 participants