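The model we will use for this project is very similar to the one I wrote about in my article "Text Summarization with Amazon Reviews" (https://medium.com/towards-data-science/text-summarization-with-amazon-reviews-41801c2210b); both are seq2seq models. I have added a few extra lines of code so that the architecture and hyperparameters can be tuned with grid search and the results can be analyzed with TensorBoard. If you want a more detailed walkthrough of how to add TensorBoard to your code, see "Predicting Movie Review Sentiment with TensorFlow and TensorBoard" (https://medium.com/@Currie32/predicting-movie-review-sentiment-with-tensorflow-and-tensorboard-53bf16af0acf).
This article focuses on how to prepare the data for the model, and I will also discuss a few of the model's other features. We will use Python 3 and TensorFlow 1.1 for this project. The data consists of twenty popular books from Project Gutenberg. If you are interested in expanding this project to make it more accurate, there are hundreds of books you can download from Project Gutenberg. It would also be really interesting to see how good a spell checker people can build with this model.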
Spellin is difficult, whch is wyh you need to study everyday.
Spelling is difficult, which is why you need to study everyday.
The first days of her existence in th country were vrey hard for Dolly.
The first days of her existence in the country were very hard for Dolly.
Thi is really something impressiv thaat we should look into right away!
This is really something impressive that we should look into right away!
    import os

    def load_book(path):
        """Load a book from its file path and return its text."""
        input_file = os.path.join(path)
        with open(input_file) as f:
            book = f.read()
        return book
    from os import listdir
    from os.path import isfile, join
    path = './books/'
    book_files = [f for f in listdir(path) if isfile(join(path, f))]
    book_files = book_files[1:]  # skip the first directory entry
    # Load the text of each book
    books = []
    for book in book_files:
        books.append(load_book(path+book))
    # Report the word count of each book
    for i in range(len(books)):
        print("There are {} words in {}.".format(len(books[i].split()), book_files[i]))
    import re

    def clean_text(text):
        '''Remove unwanted characters and extra spaces from the text'''
        text = re.sub(r'\n', ' ', text)
        text = re.sub(r'[{}@_*>()\\#%+=\[\]]', '', text)
        text = re.sub('a0', '', text)
        text = re.sub('\'92t', '\'t', text)
        text = re.sub('\'92s', '\'s', text)
        text = re.sub('\'92m', '\'m', text)
        text = re.sub('\'92ll', '\'ll', text)
        text = re.sub('\'91', '', text)
        text = re.sub('\'92', '', text)
        text = re.sub('\'93', '', text)
        text = re.sub('\'94', '', text)
        text = re.sub(r'\.', '. ', text)
        text = re.sub(r'\!', '! ', text)
        text = re.sub(r'\?', '? ', text)
        text = re.sub(' +', ' ', text)  # Removes extra spaces
        return text
    The vocabulary contains 78 characters.
    [' ', '!', '"', '$', '&', "'", ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<eos>', '<go>', '<pad>', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
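A vocabulary like the one printed above can be built by collecting every distinct character in the cleaned books and then adding the three special tokens. The sketch below is one way to do that, not necessarily how the project does it: the helper name `build_vocab` and the toy input are assumptions for illustration, while `vocab_to_int` matches the name used by the later code.

```python
def build_vocab(texts, special_tokens=('<pad>', '<eos>', '<go>')):
    """Map every distinct character in `texts`, plus the special tokens, to an integer."""
    vocab_to_int = {}
    for text in texts:
        for character in text:
            if character not in vocab_to_int:
                vocab_to_int[character] = len(vocab_to_int)
    # Special tokens are multi-character strings, so they cannot clash with characters
    for token in special_tokens:
        vocab_to_int[token] = len(vocab_to_int)
    int_to_vocab = {index: character for character, index in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab

# Toy input standing in for the cleaned books (assumption)
vocab_to_int, int_to_vocab = build_vocab(["Spelling is hard."])
print("The vocabulary contains {} characters.".format(len(vocab_to_int)))
print(sorted(vocab_to_int))
```

Keeping the reverse mapping `int_to_vocab` around makes it easy to turn the model's integer output back into text.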
Today is a lovely day. I want to go to the beach. (This will be split into two input sentences)
Is today a lovely day? I want to go to the beach. (This will be one long input sentence)
    # Split each book into sentences on '. ' and restore the period
    sentences = []
    for book in clean_books:
        for sentence in book.split('. '):
            sentences.append(sentence + '.')
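The length filter that follows works on `int_sentences`, which this section does not define. A minimal sketch, assuming it is the list of sentences with each character replaced by its index in `vocab_to_int`; the toy sentences and toy vocabulary here are hypothetical.

```python
# Hypothetical toy data; in the project these come from the cleaned books
sentences = ["Hi there.", "Go."]
vocab_to_int = {ch: i for i, ch in enumerate(sorted(set("".join(sentences))))}
int_to_vocab = {i: ch for ch, i in vocab_to_int.items()}

# Replace every character with its integer index
int_sentences = []
for sentence in sentences:
    int_sentences.append([vocab_to_int[character] for character in sentence])

# Decoding the integers recovers the original text
print("".join(int_to_vocab[i] for i in int_sentences[1]))
```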
    # Keep only the sentences whose length is within the chosen bounds
    max_length = 92
    min_length = 10

    good_sentences = []

    for sentence in int_sentences:
        if min_length <= len(sentence) <= max_length:
            good_sentences.append(sentence)
    from sklearn.model_selection import train_test_split

    # Hold out 15% of the sentences for testing
    training, testing = train_test_split(good_sentences,
                                         test_size = 0.15,
                                         random_state = 2)
    # Sort the sentences by length so each batch holds sentences of similar length
    training_sorted = []
    testing_sorted = []

    for i in range(min_length, max_length+1):
        for sentence in training:
            if len(sentence) == i:
                training_sorted.append(sentence)
        for sentence in testing:
            if len(sentence) == i:
                testing_sorted.append(sentence)
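The nested loops above scan the data once for every possible length. Because Python's `sorted` is stable, sorting by `len` should give the same ordering in a single call, provided every sentence already lies between `min_length` and `max_length` (which the earlier filter ensures). The sketch below checks this on made-up data; both function names are hypothetical.

```python
def sort_by_length_loops(sentences, min_length, max_length):
    """Reference version: one pass over the data per possible length."""
    ordered = []
    for i in range(min_length, max_length + 1):
        for sentence in sentences:
            if len(sentence) == i:
                ordered.append(sentence)
    return ordered

def sort_by_length(sentences):
    """Single stable sort; sentences of equal length keep their relative order."""
    return sorted(sentences, key=len)

# Made-up integer sentences with lengths between 1 and 3
toy = [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10], [11, 12]]
print(sort_by_length(toy) == sort_by_length_loops(toy, 1, 3))
```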
    # Lowercase letters used when inserting random characters
    letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
               'n','o','p','q','r','s','t','u','v','w','x','y','z',]
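    import numpy as np

    def noise_maker(sentence, threshold):
        '''Add spelling errors to a sentence of integer-encoded characters.
        Each character is kept unchanged with probability `threshold`.'''
        noisy_sentence = []
        i = 0
        while i < len(sentence):
            random = np.random.uniform(0,1,1)
            # Keep the character unchanged
            if random < threshold:
                noisy_sentence.append(sentence[i])
            else:
                new_random = np.random.uniform(0,1,1)
                # About a third of the time: swap this character with the next one
                if new_random > 0.67:
                    if i == (len(sentence) - 1):
                        # The last character has nothing to swap with, so draw again
                        continue
                    else:
                        noisy_sentence.append(sentence[i+1])
                        noisy_sentence.append(sentence[i])
                        i += 1
                # About a third of the time: insert a random lowercase letter
                elif new_random < 0.33:
                    random_letter = np.random.choice(letters, 1)[0]
                    noisy_sentence.append(vocab_to_int[random_letter])
                    noisy_sentence.append(sentence[i])
                # Otherwise: drop this character
                else:
                    pass
            i += 1
        return noisy_sentence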
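To get a feel for the kind of errors this produces, here is a simplified character-level stand-in that works on plain strings instead of integer sequences. It is only a sketch: the name `add_typos` is hypothetical, it uses the standard `random` module instead of NumPy, and a swap drawn at the last character drops that character instead of drawing again.

```python
import random
import string

def add_typos(text, threshold, rng):
    """Keep each character with probability `threshold`;
    otherwise swap it with the next one, insert a letter before it, or delete it."""
    noisy = []
    i = 0
    while i < len(text):
        if rng.random() < threshold:
            noisy.append(text[i])
        else:
            choice = rng.choice(['swap', 'insert', 'delete'])
            if choice == 'swap' and i < len(text) - 1:
                noisy.append(text[i + 1])
                noisy.append(text[i])
                i += 1
            elif choice == 'insert':
                noisy.append(rng.choice(string.ascii_lowercase))
                noisy.append(text[i])
            # 'delete' (or a swap at the last character) adds nothing
        i += 1
    return ''.join(noisy)

rng = random.Random(0)  # seeded so repeated runs give the same typos
print(add_typos("Spelling is difficult.", 0.9, rng))
```

A threshold close to 1 leaves most characters alone, while a lower threshold makes the sentences harder to correct.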
    def get_batches(sentences, batch_size, threshold):
        '''Yield batches of (noisy inputs, clean targets, input lengths, target lengths).
        New noise is generated every time a batch is produced.'''
        for batch_i in range(0, len(sentences)//batch_size):
            start_i = batch_i * batch_size
            sentences_batch = sentences[start_i:start_i + batch_size]

            # Create the misspelled input sentences
            sentences_batch_noisy = []
            for sentence in sentences_batch:
                sentences_batch_noisy.append(
                    noise_maker(sentence, threshold))

            # Append <eos> to the target sentences. New lists are built so the
            # original sentences do not gain an extra <eos> on every epoch.
            sentences_batch_eos = []
            for sentence in sentences_batch:
                sentences_batch_eos.append(
                    sentence + [vocab_to_int['<eos>']])

            # pad_sentence_batch is defined elsewhere in the project
            pad_sentences_batch = np.array(
                pad_sentence_batch(sentences_batch_eos))
            pad_sentences_noisy_batch = np.array(
                pad_sentence_batch(sentences_batch_noisy))

            # Lengths are measured after padding
            pad_sentences_lengths = []
            for sentence in pad_sentences_batch:
                pad_sentences_lengths.append(len(sentence))

            pad_sentences_noisy_lengths = []
            for sentence in pad_sentences_noisy_batch:
                pad_sentences_noisy_lengths.append(len(sentence))

            yield (pad_sentences_noisy_batch,
                   pad_sentences_batch,
                   pad_sentences_noisy_lengths,
                   pad_sentences_lengths)
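`get_batches` relies on a helper, `pad_sentence_batch`, that is not shown in this section. A minimal sketch of what it might look like, assuming it pads every sentence with the `<pad>` index up to the longest sentence in the batch and reads a global `vocab_to_int`; the toy vocabulary below is hypothetical.

```python
# Hypothetical toy vocabulary; the project's vocab_to_int has 78 entries
vocab_to_int = {'<pad>': 0, '<eos>': 1, 'a': 2, 'b': 3, 'c': 4}

def pad_sentence_batch(sentence_batch):
    """Pad each sentence with <pad> so all sentences in the batch share one length."""
    max_sentence = max(len(sentence) for sentence in sentence_batch)
    return [sentence + [vocab_to_int['<pad>']] * (max_sentence - len(sentence))
            for sentence in sentence_batch]

batch = [[2, 3, 1], [4, 1]]
padded = pad_sentence_batch(batch)
print(padded)  # [[2, 3, 1], [4, 1, 0]]
```

Padding to the longest sentence in the batch, instead of a global maximum, is why sorting the sentences by length beforehand keeps the amount of padding small.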
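That's all for this project! Although the results are encouraging, this model still has some limitations. I would really appreciate it if someone could expand on this model or improve its design! If you do, please post about it in the comments. One idea for a new design is to apply Facebook AI Research's new CNN model (https://code.facebook.com/posts/1978007565818999/a-novel-approach-to-neural-machine-translation/?utm_campaign=Artificial%2BIntelligence%2Band%2BDeep%2BLearning%2BWeekly&utm_medium=email&utm_source=Artificial_Intelligence_and_Deep_Learning_Weekly_13), which achieves state-of-the-art translation results.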