How to pass 2-dimensional sequence to LSTM? #90

ganji15 · 2016-09-18T17:00:09Z

I found that the given examples process 1-dimensional sequence data with an RNN by passing the input sequence through an embedding layer, which turns it into a 2-dimensional sequence (a word-vector matrix, I guess).
However, sometimes the input sequence is already 2-dimensional, such as (time_step, 1-dimensional feature_vector) <=> [[f, f, f, f], [f, f, f, f], ..., [f, f, f, f]]. How can I feed such a 2-dimensional sequence directly to an RNN/LSTM?

I tried the following approach, but it failed.

DataProvider.py

import cPickle
import random

from paddle.trainer.PyDataProvider2 import *

def hook(settings, input_dim, num_class, is_train, **kwargs):
    settings.input_types = [
                dense_vector_sequence(int(input_dim)),
                integer_value_sequence(int(num_class))]
    settings.is_train = is_train

@provider(init_hook=hook)
def processData(settings, file_name):
    seqs, labels = cPickle.load(open(file_name, 'rb'))
    indexs = list(range(len(labels)))
    if settings.is_train:
        random.shuffle(indexs)
    for i in indexs:
        seq = seqs[i]      # sequence of fixed-length 1-D feature vectors
        label = labels[i]  # variable-length integer sequence
        yield seq, label

Error Information

I0919 00:57:43.780038 592 Util.cpp:138] commandline: /usr/local/bin/../opt/paddle/bin/paddle_trainer --config=OcrRecognition.py --dot_period=10 --log_period=10 --test_all_data_in_one_period=1 --use_gpu=1 --gpu_id=0 --trainer_count=1 --num_passes=100 --save_dir=./model
I0919 00:57:44.160080 592 Util.cpp:113] Calling runInitFunctions
I0919 00:57:44.160356 592 Util.cpp:126] Call runInitFunctions done.
[WARNING 2016-09-19 00:57:44,193 default_decorators.py:40] please use keyword arguments in paddle config.
[WARNING 2016-09-19 00:57:44,194 default_decorators.py:40] please use keyword arguments in paddle config.
[INFO 2016-09-19 00:57:44,195 networks.py:1122] The input order is [ocr_seq, label]
[INFO 2016-09-19 00:57:44,195 networks.py:1129] The output order is [ctc]
I0919 00:57:44.197438 592 Trainer.cpp:169] trainer mode: Normal
I0919 00:57:44.198976 592 PyDataProvider2.cpp:219] loading dataprovider dataprovider::processData
I0919 00:57:44.201690 592 PyDataProvider2.cpp:219] loading dataprovider dataprovider::processData
I0919 00:57:44.201730 592 GradientMachine.cpp:134] Initing parameters..
I0919 00:57:44.205186 592 GradientMachine.cpp:141] Init parameters done.
*** Aborted at 1474217866 (unix time) try "date -d @1474217866" if you are using GNU date ***
PC: @ 0x7fd4325a5767 (unknown)
*** SIGSEGV (@0x10) received by PID 592 (TID 0x7fd433536840) from PID 16; stack trace: ***
@ 0x7fd432e28330 (unknown)
@ 0x7fd4325a5767 (unknown)
@ 0x7fd43256c444 (unknown)
@ 0x7fd432644370 (unknown)
@ 0x7fd4325cf193 (unknown)
@ 0x7fd43261b3b7 (unknown)
@ 0x7fd4107d6da4 array_str
@ 0x7fd4325d258a (unknown)
@ 0x7fd4325d277a (unknown)
@ 0x7fd4108ab3dc gentype_repr
@ 0x7fd432624da0 (unknown)
@ 0x82abf9 paddle::py::repr()
@ 0x569eb1 paddle::IndexScanner::fill()
@ 0x56a2c1 paddle::SequenceScanner::fill()
@ 0x56d3fc paddle::PyDataProvider2::getNextBatchInternal()
@ 0x563982 paddle::DataProvider::getNextBatch()
@ 0x69b437 paddle::Trainer::trainOnePass()
@ 0x69ecc7 paddle::Trainer::train()
@ 0x53bf73 main
@ 0x7fd43144cf45 (unknown)
@ 0x5475b5 (unknown)
@ 0x0 (unknown)

/usr/local/bin/paddle: line 81: 592 Segmentation fault (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}

Can anyone provide any suggestions or examples? Thank you!

qingqing01 · 2016-09-19T02:24:21Z

I think the yield format is not correct.

processData should yield one sample at a time. For dense_vector_sequence the format should be
[[f, ...], [f, ...], ...], in which each [f, ...] is one time step. For integer_value_sequence the format should be [i, i, ...]. You can refer to the PyDataProvider2 documentation.

I'm not sure about your pickled data format, so sorry that I can't give you an example of processData.
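
The two per-sample formats described above can be sketched with plain Python lists. The helper name check_sample and the toy values are hypothetical and not part of PaddlePaddle; the checks only mirror the shapes stated in this comment, and the code is written for Python 3.

```python
def check_sample(seq, label, input_dim):
    """Return True if (seq, label) matches the shapes described above.

    seq   -> [[f, ...], [f, ...], ...]: one inner list per time step,
             each of length input_dim, holding built-in floats.
    label -> [i, i, ...]: a flat list of built-in ints.
    """
    seq_ok = isinstance(seq, list) and all(
        isinstance(step, list)
        and len(step) == input_dim
        and all(type(f) is float for f in step)
        for step in seq
    )
    # The exact type comparison is deliberately strict: it rejects
    # int-like objects that are not the built-in int type.
    label_ok = isinstance(label, list) and all(type(i) is int for i in label)
    return seq_ok and label_ok


# Toy sample: 3 time steps, 4 features per step, and a label of length 2.
sample_seq = [[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8],
              [0.9, 1.0, 1.1, 1.2]]
sample_label = [7, 3]

print(check_sample(sample_seq, sample_label, input_dim=4))
```

Running a check like this on a few samples before yielding them can catch a wrong nesting level or a wrong element type before the trainer does.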

reyoung · 2016-09-19T02:36:55Z

It seems that the label vector is not a list of int, so this line is invoked.

ganji15 · 2016-09-19T03:14:17Z

@qingqing01
Here is the code that generates my pickled data:

def gen_dataset(name, idxs):
        vals = []
        labels = []
        for idx in idxs:
            # data['x'][idx].transpose() is a numpy.array with shape (time_step, 11), and each
            # data['x'][idx].transpose() has a different time_step.
            # I then call tolist() to turn the matrix into a list of lists => [[f, ...], ..., [f, ...]]
            vals.append(data['x'][idx].transpose().tolist())
            # Similarly, list(data['y'][idx]) is a 1-dimensional integer vector [i, ...] of variable length
            labels.append(list(data['y'][idx]))
        cPickle.dump((vals, labels), open(name, 'wb'))

So, I'm not sure how to process my data to match 'dense_vector_sequence' type and 'integer_value_sequence' type.

ganji15 · 2016-09-19T03:59:55Z

@reyoung I think you are right, but how does this error occur?

The following is my code:

dataprovider.py

def hook(settings, input_dim, num_class, is_train, **kwargs):
    settings.input_types = [
                dense_vector_sequence(int(input_dim)),
                integer_value_sequence(int(num_class))]
    settings.is_train = is_train

@provider(init_hook=hook)
def processData(settings, file_name):
    seqs, labels = cPickle.load(open(file_name, 'rb'))
    indexs = list(range(len(labels)))
    if settings.is_train:
        random.shuffle(indexs)
    for i in indexs:
        seq = seqs[i]    
        label = labels[i]  
        yield seq, label

my pickled data format

def gen_dataset(name, idxs):
        vals = []
        labels = []
        for idx in idxs:
            # data['x'][idx].transpose() is a numpy.array with shape (time_step, 11), and each
            # data['x'][idx].transpose() has a different time_step.
            # I then call tolist() to turn the matrix into a list of lists => [[f, ...], ..., [f, ...]]
            vals.append(data['x'][idx].transpose().tolist())
            # Similarly, list(data['y'][idx]) is a 1-dimensional integer vector [i, ...] of variable length
            labels.append(list(data['y'][idx]))
        cPickle.dump((vals, labels), open(name, 'wb'))

reyoung · 2016-09-19T04:35:50Z

Please print the types of labels, label, and label[0] in your dataprovider.

print type(labels), type(label), type(label[0])

ganji15 · 2016-09-19T04:41:27Z

@reyoung

In [3]: seqs, lbs = cPickle.load(open('data/ocr_train.pkl', 'rb'))

In [4]: type(seqs)
Out[4]: list

In [5]: type(seqs[0])
Out[5]: list

In [6]: type(seqs[0][0])
Out[6]: list

In [7]: type(seqs[0][0][0])
Out[7]: float

In [8]: type(lbs[0][0])
Out[8]: numpy.int32

In [9]: type(lbs[0])
Out[9]: list

In [10]: lbs[0]
Out[10]: [70, 75, 4, 9, 31]

reyoung · 2016-09-19T04:42:07Z

@ganji15 numpy.int32 is not a Python int object. Please cast it to int.

map(int, lbs[0])
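
A small sketch of that cast, written so that it also runs on Python 3, where map returns an iterator rather than a list (hence the list comprehension). The label values are copied from the output above; the variable names are illustrative only.

```python
import numpy as np

# Rebuild a label list like the one shown above: a Python list whose
# elements are NumPy int32 scalars rather than built-in ints.
raw_label = list(np.array([70, 75, 4, 9, 31], dtype=np.int32))

# Cast every element to a built-in int before yielding it from the provider.
label = [int(i) for i in raw_label]

print(type(raw_label[0]), type(label[0]))
```

Calling .tolist() on the NumPy array itself, instead of wrapping it in list(...), also yields built-in ints, so the cast could equally be done when the data is pickled.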

ganji15 · 2016-09-19T04:51:37Z

@reyoung
It works! Thank you very much!

reyoung · 2016-09-19T04:55:02Z

@ganji15 NumPy types will be supported in a few days.

Thanks for your attention.

alvations · 2016-11-07T06:09:30Z

I have managed to feed numpy objects into Paddle by using something like np.array.tolist():

from paddle.trainer.PyDataProvider2 import *

import numpy as np

UNK_IDX = 2
START = "<s>"
END = "<e>"

def _get_ids(s, dictionary):
    words = s.strip().split()
    return [dictionary[START]] + \
           [dictionary.get(w, UNK_IDX) for w in words] + \
           [dictionary[END]]

def hook(settings, src_dict, trg_dict, file_list, **kwargs):
    # Some code ...
    # A numpy matrix that corresponds to the src (row) and target (column) vocabulary
    settings.thematrix = np.random.rand(len(src_dict), len(trg_dict))
    # ...
    settings.slots = [integer_value_sequence(len(settings.src_dict)),
                      dense_vector_sequence(len(settings.src_dict)),
                      integer_value_sequence(len(settings.trg_dict))]
    # ...

@provider(init_hook=hook, pool_size=50000)
def process(settings, file_name):
    # ...
    for line in f:
        src_seq, trg_seq = line.strip().split('\t')
        src_ids = _get_ids(src_seq, settings.src_dict)
        trg_words = trg_seq.split()
        trg_ids = [settings.trg_dict.get(w, UNK_IDX)
                   for w in trg_words]
        trg_ids = [settings.trg_dict[START]] + trg_ids
        # Yield one sample per input line: ids, one matrix row per source id, target ids.
        yield src_ids, settings.thematrix[src_ids].tolist(), trg_ids

Somehow the vectors can't seem to get past the first batch, and Paddle throws this error:

~/Paddle/demo/rowrow$ bash train.sh 
I1104 18:59:42.636052 18632 Util.cpp:151] commandline: /home/ltan/Paddle/binary/bin/../opt/paddle/bin/paddle_trainer --config=train.conf --save_dir=/home/ltan/Paddle/demo/rowrow/model --use_gpu=true --num_passes=100 --show_parameter_stats_period=1000 --trainer_count=4 --log_period=10 --dot_period=5 
I1104 18:59:46.503566 18632 Util.cpp:126] Calling runInitFunctions
I1104 18:59:46.503810 18632 Util.cpp:139] Call runInitFunctions done.
[WARNING 2016-11-04 18:59:46,847 default_decorators.py:40] please use keyword arguments in paddle config.
[INFO 2016-11-04 18:59:46,856 networks.py:1125] The input order is [source_language_word, target_language_word, target_language_next_word]
[INFO 2016-11-04 18:59:46,857 networks.py:1132] The output order is [__cost_0__]
I1104 18:59:46.871026 18632 Trainer.cpp:170] trainer mode: Normal
I1104 18:59:46.871906 18632 MultiGradientMachine.cpp:108] numLogicalDevices=1 numThreads=4 numDevices=4
I1104 18:59:46.988584 18632 PyDataProvider2.cpp:247] loading dataprovider dataprovider::process
[INFO 2016-11-04 18:59:46,990 dataprovider.py:15] src dict len : 45661
[INFO 2016-11-04 18:59:47,316 dataprovider.py:26] trg dict len : 422
I1104 18:59:47.347944 18632 PyDataProvider2.cpp:247] loading dataprovider dataprovider::process
[INFO 2016-11-04 18:59:47,348 dataprovider.py:15] src dict len : 45661
[INFO 2016-11-04 18:59:47,657 dataprovider.py:26] trg dict len : 422
I1104 18:59:47.658279 18632 GradientMachine.cpp:134] Initing parameters..
I1104 18:59:49.244287 18632 GradientMachine.cpp:141] Init parameters done.
F1104 18:59:50.485621 18632 PythonUtil.h:213] Check failed: PySequence_Check(seq_) 
*** Check failure stack trace: ***
    @     0x7f71f521adaa  (unknown)
    @     0x7f71f521ace4  (unknown)
    @     0x7f71f521a6e6  (unknown)
    @     0x7f71f521d687  (unknown)
    @           0x54dac9  paddle::DenseScanner::fill()
    @           0x54f1d1  paddle::SequenceScanner::fill()
    @           0x5543cc  paddle::PyDataProvider2::getNextBatchInternal()
    @           0x5779b2  paddle::DataProvider::getNextBatch()
    @           0x6a01f7  paddle::Trainer::trainOnePass()
    @           0x6a3b57  paddle::Trainer::train()
    @           0x53a2b3  main
    @     0x7f71f4426f45  (unknown)
    @           0x545ae5  (unknown)
    @              (nil)  (unknown)
/home/ltan/Paddle/binary/bin/paddle: line 81: 18632 Aborted                 (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}

More details are at http://stackoverflow.com/questions/40421248/why-is-paddle-throwing-errors-when-feeding-in-a-dense-vector-sequence-to-a-seqto and the data and code that I use to run train.sh are at https://github.com/alvations/rowrow .

Is it just that numpy vectors are not supported yet? Or does Paddle not yet support any dense vector sequence (even a list of lists of floats) as a native Python object?

reyoung · 2016-11-07T06:33:33Z

1. numpy is supported.
2. dense_vector_sequence is a sequence of dense_vector, so the data should look like [[f, f, f], [f, f, f]].

reyoung · 2016-11-07T06:33:46Z

@alvations Please open another issue for a new question.

reyoung · 2016-11-07T06:39:01Z

It seems that you should use dense_vector instead of dense_vector_sequence, because the settings.thematrix[src_ids] is just a vector of float.

alvations · 2016-11-07T06:47:26Z

Sorry for the inconvenience. I've created a new issue #369.

settings.thematrix[src_ids] should return a matrix (a vector of vectors) that fits the dense_vector_sequence [[f, f, f], [f, f, f], ...] structure, right?

>>> import numpy as np
>>> x = np.random.rand(10,5) # 10 rows, 5 columns
>>> x
array([[ 0.71414965,  0.45273671,  0.37954461,  0.04298937,  0.65297758],
       [ 0.71330836,  0.93355837,  0.91250145,  0.73036384,  0.00237625],
       [ 0.27265885,  0.01207583,  0.10584876,  0.64541483,  0.42509224],
       [ 0.15477619,  0.5713811 ,  0.71976755,  0.00669505,  0.7747009 ],
       [ 0.07513192,  0.20092001,  0.30176491,  0.98289236,  0.60552273],
       [ 0.4454395 ,  0.19612705,  0.47249998,  0.81235983,  0.35272056],
       [ 0.48687432,  0.91080766,  0.77938878,  0.45750021,  0.98119178],
       [ 0.70029773,  0.00784268,  0.56423129,  0.40237047,  0.86712586],
       [ 0.31193082,  0.60600517,  0.18091819,  0.3627252 ,  0.85459444],
       [ 0.32658941,  0.51335506,  0.29290611,  0.74307929,  0.87390234]])
>>> x[[0,2,5,8]] # get 4 rows
array([[ 0.71414965,  0.45273671,  0.37954461,  0.04298937,  0.65297758],
       [ 0.27265885,  0.01207583,  0.10584876,  0.64541483,  0.42509224],
       [ 0.4454395 ,  0.19612705,  0.47249998,  0.81235983,  0.35272056],
       [ 0.31193082,  0.60600517,  0.18091819,  0.3627252 ,  0.85459444]])
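
The same point can be checked on shapes directly. This is a sketch with a hypothetical 10x5 matrix and index list, not the actual settings.thematrix from the provider above:

```python
import numpy as np

matrix = np.random.rand(10, 5)   # 10 rows, 5 columns
src_ids = [0, 2, 5, 8]           # hypothetical word ids

rows = matrix[src_ids]           # indexing with a list selects whole rows
as_lists = rows.tolist()         # nested Python lists of built-in floats

print(rows.shape)                # (4, 5)
print(len(as_lists), len(as_lists[0]), type(as_lists[0][0]))
```

Indexing a 2-D array with a list of row ids returns a 2-D array with one row per id, and tolist() turns it into a list of lists of floats, which is the nesting dense_vector_sequence is described as expecting.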


reyoung closed this as completed Sep 19, 2016

qingqing01 added the question label Sep 21, 2016

