Post by jmewasiuk on Nov 26, 2016 19:53:06 GMT
Hi all,
So I have a preliminary model in TensorFlow for a multi-layer (stacked) LSTM network.
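To give a concrete picture of what I mean, here's a minimal sketch of that kind of stacked-LSTM graph (sizes and variable names are placeholders, not the final model):

```python
import tensorflow as tf

# Placeholder sizes -- swap in whatever we settle on.
vocab_size, embed_dim, hidden_units, num_layers = 8000, 128, 256, 2

inputs = tf.placeholder(tf.int32, [None, None])        # [batch, time] word ids
target = tf.placeholder(tf.int32, [None])              # next-word id per example

embedding = tf.get_variable("embedding", [vocab_size, embed_dim])
embedded = tf.nn.embedding_lookup(embedding, inputs)   # [batch, time, embed_dim]

# Stack LSTM cells into one multi-layer cell and unroll it over time.
cells = [tf.nn.rnn_cell.BasicLSTMCell(hidden_units, state_is_tuple=True)
         for _ in range(num_layers)]
stacked = tf.nn.rnn_cell.MultiRNNCell(cells, state_is_tuple=True)
outputs, _ = tf.nn.dynamic_rnn(stacked, embedded, dtype=tf.float32)

# Project the final timestep onto the vocabulary for next-word logits.
w = tf.get_variable("proj_w", [hidden_units, vocab_size])
b = tf.get_variable("proj_b", [vocab_size])
logits = tf.matmul(outputs[:, -1, :], w) + b
```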
Where I'm at now is formatting the input data.
An LSTM network basically processes sequences of chunks.
So for our case I figure that
scene (conversation) == whole sequence
each line == a chunk
so a scene is a sequence of lines
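Concretely, I'm picturing one parsed scene as something like this (speakers and words are made-up examples):

```python
# One scene == a sequence of (speaker, tokenized line) chunks.
scene = [
    ("ALICE", ["hey", "did", "you", "see", "that"]),
    ("BOB",   ["see", "what"]),
    ("ALICE", ["the", "thing", "outside"]),
]
```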
I think we want to create inputs (x's) and targets (t's) like so:
1. The t's depend on which character in the script we're training for, so we need a "target_id" -> the character's name should be good.
2. Format our sequence training so we train from the start to the finish of a scene (conversation).
Let tn = # of lines spoken by target_id (the character)
Let lti = length (in words) of line i spoken by target_id
Then for each scene we will have a number of "subscenes" that train the model on the probability of choosing the next word from our vocabulary given the sequence of words already used in the scene.
So... (not sure if I got all the indices accurately... let me know if you don't quite understand the sequence)
subscene 1: x1 = the chunks (lines) of words up to target_id's 1st line, t1 = 1st word of target_id's 1st line
subscene 2: x2 = x1 + t1, t2 = 2nd word of target_id's 1st line
...
subscene k-1: x_k-1 = the chunks (lines) of words up to target_id's i'th line plus the first lti - 1 words of that line, t_k-1 = the lti'th (last) word of target_id's i'th line
subscene k: x_k = the chunks (lines) of words up to target_id's (i+1)'th line, t_k = 1st word of target_id's (i+1)'th line
...
Stop when we reach the end of target_id's last line in the scene.
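Here's a rough sketch of how I'd build those (x, t) pairs for one scene, assuming the scene structure from above (a list of (speaker, words) tuples); make_subscenes is just a name I'm using here:

```python
def make_subscenes(scene, target_id):
    """Build (x, t) subscene pairs for one scene and one target character."""
    pairs = []
    context = []  # every word spoken so far in the scene, by any character
    for speaker, words in scene:
        if speaker == target_id:
            # One subscene per word of the target's line: predict word j
            # from the scene so far plus the first j words of this line.
            for j, word in enumerate(words):
                pairs.append((context + words[:j], word))
        context = context + words
    return pairs
```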
I figure with this type of training, we can give our model some line of text and then generate a number of lines (sentences) of varying lengths, up to some maximum, by looping: feed in the sequence of words generated so far and append the next predicted word. For a stopping condition, we track the probability of the last appended word and stop generating once the probability of every candidate word falls below some threshold, indicating we've reached "end of thought".
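A sketch of that generation loop, assuming a predict_next(seq) function that returns the model's probability distribution over the vocabulary (e.g. the softmax over the logits above); the threshold value is arbitrary:

```python
def generate_line(seed_words, predict_next, stop_prob=0.05, max_len=50):
    """Append predicted words until no candidate word is likely enough."""
    seq = list(seed_words)
    for _ in range(max_len):
        probs = predict_next(seq)  # distribution over the vocabulary
        best = max(range(len(probs)), key=lambda w: probs[w])
        if probs[best] < stop_prob:
            break                  # "end of thought" -- nothing is likely
        seq.append(best)
    return seq[len(seed_words):]   # just the newly generated words
```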
That all said, our raw input data needs to be consistently formatted so we can automate the separation of scenes, lines, and the characters who spoke them.
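For example, if we settle on a format like "NAME: dialogue" per spoken line with a blank line between scenes (just a proposal, not something we've agreed on), the parser could be as simple as:

```python
def parse_script(text):
    """Split raw script text into scenes of (speaker, words) chunks."""
    scenes, scene = [], []
    for raw in text.splitlines():
        raw = raw.strip()
        if not raw:              # blank line ends the current scene
            if scene:
                scenes.append(scene)
                scene = []
            continue
        speaker, _, line = raw.partition(":")
        scene.append((speaker.strip(), line.split()))
    if scene:
        scenes.append(scene)
    return scenes
```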
Forgot to add: I'm still working on coding up the per-epoch training loop for the model I have set up.
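Roughly, I'm aiming for something like this (building on the graph sketch above; batches, subscene_pairs, and epochs are placeholders for code that doesn't exist yet):

```python
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=logits, labels=target))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(epochs):
        # Sequences in a batch would need padding to a common length.
        for x_batch, t_batch in batches(subscene_pairs):  # hypothetical batcher
            _, l = sess.run([train_op, loss],
                            feed_dict={inputs: x_batch, target: t_batch})
```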