How to pass a 2-dimensional sequence to LSTM? #90
Comments
I think the yield format is not correct. I'm not quite sure about your pickled data format, so sorry for not giving you examples.
@qingqing01

```python
def gen_dataset(name, idxs):
    vals = []
    labels = []
    for idx in idxs:
        # data['x'][idx].transpose() is a numpy.array with shape (time_step, 11),
        # and each sample has a different time_step.
        # tolist() transforms the matrix into a list of lists => [[f, ...], ..., [f, ...]]
        vals.append(data['x'][idx].transpose().tolist())
        # Similarly, list(data['y'][idx]) is a 1-dimensional integer vector [i, ...] of varying length
        labels.append(list(data['y'][idx]))
    cPickle.dump((vals, labels), open(name, 'wb'))
```

So I'm not sure how to process my data to match the 'dense_vector_sequence' and 'integer_value_sequence' types.
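For what it's worth, the expected shapes can be checked with plain Python before pickling. This is only a sketch under the assumption that a `dense_vector_sequence` slot wants a list of float lists and an `integer_value_sequence` slot wants a list of Python ints; the arrays `x` and `y` below are hypothetical stand-ins for `data['x'][idx].transpose()` and `data['y'][idx]`:

```python
import numpy as np

# Hypothetical stand-ins for data['x'][idx].transpose() and data['y'][idx]
x = np.random.rand(4, 11)                # 4 time steps, 11 features each
y = np.array([3, 1, 2], dtype=np.int32)  # variable-length label vector

seq = x.tolist()              # dense_vector_sequence: list of lists of Python floats
label = [int(v) for v in y]   # integer_value_sequence: list of Python ints

assert isinstance(seq, list) and all(isinstance(row, list) for row in seq)
assert all(isinstance(f, float) for f in seq[0])
assert all(isinstance(v, int) for v in label)
```

Note that `tolist()` converts the numpy scalars to native Python floats, while `list(...)` on an int array keeps `numpy.int32` elements, which is why the explicit `int(...)` cast is shown.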
@reyoung I think you are right, but how does this error occur? The following is my code:

dataprovider.py

```python
def hook(settings, input_dim, num_class, is_train, **kwargs):
    settings.input_types = [
        dense_vector_sequence(int(input_dim)),
        integer_value_sequence(int(num_class))]
    settings.is_train = is_train

@provider(init_hook=hook)
def processData(settings, file_name):
    seqs, labels = cPickle.load(open(file_name, 'rb'))
    indexs = list(range(len(labels)))
    if settings.is_train:
        random.shuffle(indexs)
    for i in indexs:
        seq = seqs[i]
        label = labels[i]
        yield seq, label
```

my pickled data format

```python
def gen_dataset(name, idxs):
    vals = []
    labels = []
    for idx in idxs:
        # data['x'][idx].transpose() is a numpy.array with shape (time_step, 11),
        # and each sample has a different time_step.
        # tolist() transforms the matrix into a list of lists => [[f, ...], ..., [f, ...]]
        vals.append(data['x'][idx].transpose().tolist())
        # Similarly, list(data['y'][idx]) is a 1-dimensional integer vector [i, ...] of varying length
        labels.append(list(data['y'][idx]))
    cPickle.dump((vals, labels), open(name, 'wb'))
```
Please print the types of `labels`, `label`, and `label[0]` in your dataprovider.
```python
In [3]: seqs, lbs = cPickle.load(open('data/ocr_train.pkl', 'rb'))
In [4]: type(seqs)
In [5]: type(seqs[0])
In [6]: type(seqs[0][0])
In [7]: type(seqs[0][0][0])
In [8]: type(lbs[0][0])
In [9]: type(lbs[0])
In [10]: lbs[0]
```
@ganji15 `numpy.int32` is not a Python `int` object. Please cast it to `int`: `map(int, lbs[0])`
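The distinction above is easy to verify. A minimal sketch, where `lbs0` is a hypothetical stand-in for the `lbs[0]` array from the session above:

```python
import numpy as np

lbs0 = np.array([5, 2, 9], dtype=np.int32)  # hypothetical stand-in for lbs[0]

# Elements of an int32 array are numpy.int32, which is not a Python int
assert not isinstance(lbs0[0], int)

# Casting each element as suggested above yields real Python ints
label = list(map(int, lbs0))
assert all(isinstance(v, int) for v in label)
```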
@ganji15 numpy will be supported in a few days. Thanks for your attention.
I have managed to feed numpy objects into Paddle by using something like:

```python
from paddle.trainer.PyDataProvider2 import *
import numpy as np

UNK_IDX = 2
START = "<s>"
END = "<e>"

def _get_ids(s, dictionary):
    words = s.strip().split()
    return [dictionary[START]] + \
           [dictionary.get(w, UNK_IDX) for w in words] + \
           [dictionary[END]]

def hook(settings, src_dict, trg_dict, file_list, **kwargs):
    # Some code ...
    # A numpy matrix whose rows correspond to the src vocabulary
    # and columns to the target vocabulary
    settings.thematrix = np.random.rand(len(src_dict), len(trg_dict))
    # ...
    settings.slots = [integer_value_sequence(len(settings.src_dict)),
                      dense_vector_sequence(len(settings.src_dict)),
                      integer_value_sequence(len(settings.trg_dict))]
    # ...

@provider(init_hook=hook, pool_size=50000)
def process(settings, file_name):
    # ...
    for line in f:
        src_seq, trg_seq = line.strip().split('\t')
        src_ids = _get_ids(src_seq, settings.src_dict)
        trg_words = trg_seq.strip().split()
        trg_ids = [settings.trg_dict.get(w, UNK_IDX)
                   for w in trg_words]
        trg_ids = [settings.trg_dict[START]] + trg_ids
        yield src_ids, settings.thematrix[src_ids].tolist(), trg_ids
```

Somehow the vectors can't seem to get past the first batch and Paddle throws this error:

```
~/Paddle/demo/rowrow$ bash train.sh
I1104 18:59:42.636052 18632 Util.cpp:151] commandline: /home/ltan/Paddle/binary/bin/../opt/paddle/bin/paddle_trainer --config=train.conf --save_dir=/home/ltan/Paddle/demo/rowrow/model --use_gpu=true --num_passes=100 --show_parameter_stats_period=1000 --trainer_count=4 --log_period=10 --dot_period=5
I1104 18:59:46.503566 18632 Util.cpp:126] Calling runInitFunctions
I1104 18:59:46.503810 18632 Util.cpp:139] Call runInitFunctions done.
[WARNING 2016-11-04 18:59:46,847 default_decorators.py:40] please use keyword arguments in paddle config.
[INFO 2016-11-04 18:59:46,856 networks.py:1125] The input order is [source_language_word, target_language_word, target_language_next_word]
[INFO 2016-11-04 18:59:46,857 networks.py:1132] The output order is [__cost_0__]
I1104 18:59:46.871026 18632 Trainer.cpp:170] trainer mode: Normal
I1104 18:59:46.871906 18632 MultiGradientMachine.cpp:108] numLogicalDevices=1 numThreads=4 numDevices=4
I1104 18:59:46.988584 18632 PyDataProvider2.cpp:247] loading dataprovider dataprovider::process
[INFO 2016-11-04 18:59:46,990 dataprovider.py:15] src dict len : 45661
[INFO 2016-11-04 18:59:47,316 dataprovider.py:26] trg dict len : 422
I1104 18:59:47.347944 18632 PyDataProvider2.cpp:247] loading dataprovider dataprovider::process
[INFO 2016-11-04 18:59:47,348 dataprovider.py:15] src dict len : 45661
[INFO 2016-11-04 18:59:47,657 dataprovider.py:26] trg dict len : 422
I1104 18:59:47.658279 18632 GradientMachine.cpp:134] Initing parameters..
I1104 18:59:49.244287 18632 GradientMachine.cpp:141] Init parameters done.
F1104 18:59:50.485621 18632 PythonUtil.h:213] Check failed: PySequence_Check(seq_)
*** Check failure stack trace: ***
    @       0x7f71f521adaa  (unknown)
    @       0x7f71f521ace4  (unknown)
    @       0x7f71f521a6e6  (unknown)
    @       0x7f71f521d687  (unknown)
    @             0x54dac9  paddle::DenseScanner::fill()
    @             0x54f1d1  paddle::SequenceScanner::fill()
    @             0x5543cc  paddle::PyDataProvider2::getNextBatchInternal()
    @             0x5779b2  paddle::DataProvider::getNextBatch()
    @             0x6a01f7  paddle::Trainer::trainOnePass()
    @             0x6a3b57  paddle::Trainer::train()
    @             0x53a2b3  main
    @       0x7f71f4426f45  (unknown)
    @             0x545ae5  (unknown)
    @                (nil)  (unknown)
/home/ltan/Paddle/binary/bin/paddle: line 81: 18632 Aborted (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}
```

More details (and the data + code I used to run it) are on http://stackoverflow.com/questions/40421248/why-is-paddle-throwing-errors-when-feeding-in-a-dense-vector-sequence-to-a-seqto
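Since the failed check is `PySequence_Check(seq_)` inside `DenseScanner::fill()`, one way to debug is to sanity-check every value yielded for the dense slot. The helper below is only a rough stand-in for the kind of value that check accepts (a list of equal-length lists of Python floats); it mirrors, but does not reproduce, Paddle's internal C++ validation, and `thematrix`/`src_ids` are illustrative:

```python
import numpy as np

def looks_like_dense_sequence(value):
    # Rough stand-in for what a dense_vector_sequence slot needs:
    # a non-empty list of equal-length lists of Python floats.
    if not isinstance(value, list) or not value:
        return False
    if not all(isinstance(row, list) for row in value):
        return False
    width = len(value[0])
    return all(len(row) == width and all(isinstance(f, float) for f in row)
               for row in value)

thematrix = np.random.rand(6, 4)   # hypothetical (src_vocab, trg_vocab) matrix
src_ids = [0, 2, 5]

assert not looks_like_dense_sequence(thematrix[src_ids])       # raw ndarray fails
assert looks_like_dense_sequence(thematrix[src_ids].tolist())  # nested lists pass
```

A check like this, applied just before `yield`, would at least narrow down whether the crash comes from the value's type or from a slot-dimension mismatch in the config.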
@alvations Please open another issue for the new question.
It seems that you should use …
Sorry for the inconvenience. I've created a new issue #369.

```python
>>> import numpy as np
>>> x = np.random.rand(10,5) # 10 rows, 5 columns
>>> x
array([[ 0.71414965,  0.45273671,  0.37954461,  0.04298937,  0.65297758],
       [ 0.71330836,  0.93355837,  0.91250145,  0.73036384,  0.00237625],
       [ 0.27265885,  0.01207583,  0.10584876,  0.64541483,  0.42509224],
       [ 0.15477619,  0.5713811 ,  0.71976755,  0.00669505,  0.7747009 ],
       [ 0.07513192,  0.20092001,  0.30176491,  0.98289236,  0.60552273],
       [ 0.4454395 ,  0.19612705,  0.47249998,  0.81235983,  0.35272056],
       [ 0.48687432,  0.91080766,  0.77938878,  0.45750021,  0.98119178],
       [ 0.70029773,  0.00784268,  0.56423129,  0.40237047,  0.86712586],
       [ 0.31193082,  0.60600517,  0.18091819,  0.3627252 ,  0.85459444],
       [ 0.32658941,  0.51335506,  0.29290611,  0.74307929,  0.87390234]])
>>> x[[0,2,5,8]] # get 4 rows
array([[ 0.71414965,  0.45273671,  0.37954461,  0.04298937,  0.65297758],
       [ 0.27265885,  0.01207583,  0.10584876,  0.64541483,  0.42509224],
       [ 0.4454395 ,  0.19612705,  0.47249998,  0.81235983,  0.35272056],
       [ 0.31193082,  0.60600517,  0.18091819,  0.3627252 ,  0.85459444]])
```
I found that the given examples can process 1-dimensional sequence data with an RNN by passing the input sequence to an embedding layer, which transforms it into a 2-dimensional data sequence (a word-vector matrix, I guess).
However, sometimes the input sequence is itself 2-dimensional, such as (time_step, 1-dimensional feature_vector) <=> [[f, f, f, f], [f, f, f, f], ..., [f, f, f, f]]. So how can I directly feed a 2-dimensional sequence to an RNN/LSTM?
I tried the following method, but it failed.
DataProvider.py
Error Information
```
I0919 00:57:43.780038 592 Util.cpp:138] commandline: /usr/local/bin/../opt/paddle/bin/paddle_trainer --config=OcrRecognition.py --dot_period=10 --log_period=10 --test_all_data_in_one_period=1 --use_gpu=1 --gpu_id=0 --trainer_count=1 --num_passes=100 --save_dir=./model
I0919 00:57:44.160080 592 Util.cpp:113] Calling runInitFunctions
I0919 00:57:44.160356 592 Util.cpp:126] Call runInitFunctions done.
[WARNING 2016-09-19 00:57:44,193 default_decorators.py:40] please use keyword arguments in paddle config.
[WARNING 2016-09-19 00:57:44,194 default_decorators.py:40] please use keyword arguments in paddle config.
[INFO 2016-09-19 00:57:44,195 networks.py:1122] The input order is [ocr_seq, label]
[INFO 2016-09-19 00:57:44,195 networks.py:1129] The output order is [ctc]
I0919 00:57:44.197438 592 Trainer.cpp:169] trainer mode: Normal
I0919 00:57:44.198976 592 PyDataProvider2.cpp:219] loading dataprovider dataprovider::processData
I0919 00:57:44.201690 592 PyDataProvider2.cpp:219] loading dataprovider dataprovider::processData
I0919 00:57:44.201730 592 GradientMachine.cpp:134] Initing parameters..
I0919 00:57:44.205186 592 GradientMachine.cpp:141] Init parameters done.
*** Aborted at 1474217866 (unix time) try "date -d @1474217866" if you are using GNU date ***
PC: @       0x7fd4325a5767  (unknown)
*** SIGSEGV (@0x10) received by PID 592 (TID 0x7fd433536840) from PID 16; stack trace: ***
    @       0x7fd432e28330  (unknown)
    @       0x7fd4325a5767  (unknown)
    @       0x7fd43256c444  (unknown)
    @       0x7fd432644370  (unknown)
    @       0x7fd4325cf193  (unknown)
    @       0x7fd43261b3b7  (unknown)
    @       0x7fd4107d6da4  array_str
    @       0x7fd4325d258a  (unknown)
    @       0x7fd4325d277a  (unknown)
    @       0x7fd4108ab3dc  gentype_repr
    @       0x7fd432624da0  (unknown)
    @             0x82abf9  paddle::py::repr()
    @             0x569eb1  paddle::IndexScanner::fill()
    @             0x56a2c1  paddle::SequenceScanner::fill()
    @             0x56d3fc  paddle::PyDataProvider2::getNextBatchInternal()
    @             0x563982  paddle::DataProvider::getNextBatch()
    @             0x69b437  paddle::Trainer::trainOnePass()
    @             0x69ecc7  paddle::Trainer::train()
    @             0x53bf73  main
    @       0x7fd43144cf45  (unknown)
    @             0x5475b5  (unknown)
    @                  0x0  (unknown)
/usr/local/bin/paddle: line 81: 592 Segmentation fault (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}
```
Can anyone provide any suggestions or examples? Thank you!