AI News, Overview

Overview

where the tags are obtained from a supervised or unsupervised morphological tagger. Example of a target file with Snowball segmentation: чтобы|ы|чтоб восстановить|ить|восстанов поддержку|у|поддержк латинской|ой|латинск америки|и|америк –|NULL|– и|NULL|и понизить|ить|пониз популярность|ость|популярн чавеса|а|чавес –|NULL|– администрации|ии|администрац буша|а|буш нужно|о|нужн гораздо|о|горазд больше|е|больш ,|NULL|, чем|NULL|чем короткий|ий|коротк визит|ит|виз .|NULL|.
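As a rough illustration of this format, here is a minimal Python sketch for reading such a target file; the file name and the handling of the NULL placeholder are assumptions on my part, not part of the original pipeline.

# Minimal sketch (the file name and NULL convention are assumptions):
# each whitespace-separated token is a surface|suffix|stem triple.
def parse_segmented_line(line):
    triples = []
    for token in line.strip().split():
        surface, suffix, stem = token.split('|')
        if suffix == 'NULL':  # punctuation and unsegmented tokens carry no suffix
            suffix = ''
        triples.append((surface, suffix, stem))
    return triples

with open('target.snowball.ru', encoding='utf-8') as f:
    for line in f:
        print(parse_segmented_line(line))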

google/sentencepiece

The number of merge operations is a BPE-specific parameter and is not applicable to other segmentation algorithms, including unigram, word, and character.
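For instance, a minimal sketch with the SentencePiece Python wrapper (newer versions accept keyword arguments; the corpus path and model prefixes are placeholders) shows that the same vocab_size option is used regardless of model type, rather than a merge count:

import sentencepiece as spm

# Sketch: SentencePiece is configured with a final vocabulary size rather than
# a number of BPE merge operations, so the same option applies to every model
# type.  'corpus.txt' and the model prefixes are placeholders.
for model_type in ('unigram', 'bpe'):
    spm.SentencePieceTrainer.train(
        input='corpus.txt',
        model_prefix='m_' + model_type,
        vocab_size=8000,
        model_type=model_type,
    )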

For example, the information about whether there is a space between “World” and “.” is dropped from the tokenized sequence, since e.g. Tokenize(“World.”) == Tokenize(“World .”). SentencePiece, by contrast, treats the input text just as a sequence of Unicode characters.

To handle whitespace as a basic token explicitly, SentencePiece first escapes it with the meta symbol '▁' (U+2581), so that, for example, 'Hello World.' becomes 'Hello▁World.'.

Then this text is segmented into small pieces, for example [Hello] [▁Wor] [ld] [.]. Since the whitespace is preserved in the segmented text, we can detokenize it without any ambiguity.
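A quick round trip, sketched below with the Python wrapper and assuming an already trained model file m.model, shows that the '▁' escaping makes detokenization lossless:

import sentencepiece as spm

# Sketch assuming an already trained model file 'm.model'.
sp = spm.SentencePieceProcessor(model_file='m.model')

pieces = sp.encode('Hello World.', out_type=str)
print(pieces)             # e.g. ['▁Hello', '▁Wor', 'ld', '.']
print(sp.decode(pieces))  # 'Hello World.' -- the whitespace is recovered exactly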

Subword regularization virtually augments the training data with on-the-fly subword sampling, which helps to improve both the accuracy and the robustness of NMT models.
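The sketch below illustrates that sampling behaviour with the Python wrapper, assuming a trained unigram model ('m_unigram.model' is a placeholder file name); the enable_sampling, alpha, and nbest_size arguments are those of newer sentencepiece releases:

import sentencepiece as spm

# Sketch of on-the-fly subword sampling; a trained unigram model is assumed.
sp = spm.SentencePieceProcessor(model_file='m_unigram.model')

for _ in range(3):
    # enable_sampling draws a different segmentation on each call; alpha is the
    # smoothing parameter and nbest_size=-1 samples from all candidates.
    print(sp.encode('Hello World.', out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))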

For more detail, see the Python module. The following tools and libraries are required to build SentencePiece. On Ubuntu, autotools can be installed with apt-get. The name of the protobuf library differs between Ubuntu releases.

On Ubuntu 14.04 LTS (Trusty Tahr), Ubuntu 16.04 LTS (Xenial Xerus), and Ubuntu 17.10 (Artful Aardvark) and later, the appropriate package therefore varies. On OS X you can use brew. If you want to use a self-prepared protobuf library, specify the protobuf prefix before the build; on OS X/macOS, replace the last command accordingly.

By default, SentencePiece uses the unknown (<unk>), BOS (<s>), and EOS (</s>) tokens, which have the ids 0, 1, and 2 respectively.
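This can be checked directly from Python; the sketch below assumes a trained model file m.model:

import sentencepiece as spm

# Sketch ('m.model' is a placeholder): inspect the reserved pieces and ids.
sp = spm.SentencePieceProcessor(model_file='m.model')
print(sp.unk_id(), sp.bos_id(), sp.eos_id())                    # 0 1 2 by default
print(sp.id_to_piece(0), sp.id_to_piece(1), sp.id_to_piece(2))  # <unk> <s> </s>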

spm_encode accepts a --vocabulary and a --vocabulary_threshold option so that spm_encode will only produce symbols which also appear in the vocabulary (with at least some frequency).

Assuming that L1 and L2 are the two languages (source and target), train the shared SentencePiece model and get the resulting vocabulary for each. The shuffle command is used just in case, because spm_train loads only the first 10M lines of the corpus by default.
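A hedged sketch of that workflow in Python follows; the file names are placeholders, and the per-language vocabulary is collected here by simply counting pieces, which only approximates what spm_encode's vocabulary generation does:

import collections
import sentencepiece as spm

# Sketch (file names are placeholders): train one shared model on both
# languages, then collect a per-language vocabulary by counting which pieces
# occur on each side.  input_sentence_size / shuffle_input_sentence mirror the
# shuffling concern mentioned above.
spm.SentencePieceTrainer.train(
    input='corpus.l1,corpus.l2',
    model_prefix='shared',
    vocab_size=16000,
    input_sentence_size=10000000,
    shuffle_input_sentence=True,
)

sp = spm.SentencePieceProcessor(model_file='shared.model')

def piece_counts(path):
    counts = collections.Counter()
    with open(path, encoding='utf-8') as f:
        for line in f:
            counts.update(sp.encode(line, out_type=str))
    return counts

l1_vocab = piece_counts('corpus.l1')
l2_vocab = piece_counts('corpus.l2')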

Step by Step, A Tutorial on How to Feed Your Own Image Data to Tensorflow

        print('%d files were found under the current folder.' % len(current_folder_filename_list))
        print("Please note that only files ending with '*.tfrecord' will be loaded!")
        tfrecord_list = list_tfrecord_file(current_folder_filename_list)
        if len(tfrecord_list) != 0:
            for list_index in xrange(len(tfrecord_list)):
                print(tfrecord_list[list_index])
        else:
            print('Cannot find any tfrecord files, please check the path.')
    return tfrecord_list

def main():
    tfrecord_list = tfrecord_auto_traversal()

if __name__ == '__main__':
    main()

With this program in place, we no longer need to list all the tfrecord files manually.

# All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the 'License');
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an 'AS IS' BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from datetime import datetime
import os
import random
import sys
import threading

import numpy as np
import tensorflow as tf
from PIL import Image

tf.app.flags.DEFINE_string('train_directory', './', 'Training data directory')
tf.app.flags.DEFINE_string('validation_directory', '', 'Validation data directory')
tf.app.flags.DEFINE_string('output_directory', './', 'Output data directory')
tf.app.flags.DEFINE_integer('train_shards', 4, 'Number of shards in training TFRecord files.')
tf.app.flags.DEFINE_integer('validation_shards', 0, 'Number of shards in validation TFRecord files.')
tf.app.flags.DEFINE_integer('num_threads', 4, 'Number of threads to preprocess the images.')

# The labels file contains a list of valid labels. It is assumed to contain
# entries such as:
#   dog
#   cat
#   flower
# where each line corresponds to a label.

        self._decode_jpeg_data = tf.placeholder(dtype=tf.string)
        self._decode_jpeg = tf.image.decode_jpeg(self._decode_jpeg_data, channels=3)

    def png_to_jpeg(self, image_data):
        return self._sess.run(self._png_to_jpeg,
                              feed_dict={self._png_data: image_data})

    def decode_jpeg(self, image_data):
        image = self._sess.run(self._decode_jpeg,
                               feed_dict={self._decode_jpeg_data: image_data})
        assert len(image.shape) == 3
        assert image.shape[2] == 3
        return image

def _is_png(filename):
    '''Determine if a file contains a PNG format image.

    image = coder.decode_jpeg(image_data)
    print(tf.Session().run(tf.shape(image)))
    # image = tf.Session().run(tf.image.resize_image_with_crop_or_pad(image, 128, 128))
    # image_data = tf.image.encode_jpeg(image)
    # img = Image.fromarray(image, 'RGB')
    # img.save(os.path.join('./re_steak/' + str(i) + '.jpeg'))
    # i = i + 1

    # Check that image converted to RGB
    assert len(image.shape) == 3
    height = image.shape[0]
    width = image.shape[1]
    assert image.shape[2] == 3
    return image_data, height, width

def _process_image_files_batch(coder, thread_index, ranges, name, filenames,
                               texts, labels, num_shards):
    '''Processes and saves list of images as TFRecord in 1 thread.

    num_threads = len(ranges)
    assert not num_shards % num_threads
    num_shards_per_batch = int(num_shards / num_threads)
    shard_ranges = np.linspace(ranges[thread_index][0], ranges[thread_index][1],
                               num_shards_per_batch + 1).astype(int)
    num_files_in_thread = ranges[thread_index][1] - ranges[thread_index][0]

    counter = 0
    for s in xrange(num_shards_per_batch):
        # Generate a sharded version of the file name, e.g. 'train-00002-of-00010'
        shard = thread_index * num_shards_per_batch + s
        output_filename = '%s-%.2d-of-%.2d.tfrecord' % (name, shard, num_shards)
        output_file = os.path.join(FLAGS.output_directory, output_filename)
        writer = tf.python_io.TFRecordWriter(output_file)

        shard_counter = 0
        files_in_shard = np.arange(shard_ranges[s], shard_ranges[s + 1], dtype=int)
        for i in files_in_shard:
            filename = filenames[i]
            label = labels[i]
            text = texts[i]

            image_buffer, height, width = _process_image(filename, coder)

            example = _convert_to_example(filename, image_buffer, label,
                                          text, height, width)
            writer.write(example.SerializeToString())
            shard_counter += 1
            counter += 1
            print(counter)

            if not counter % 1000:
                print('%s [thread %d]: Processed %d of %d images in thread batch.' %
                      (datetime.now(), thread_index, counter, num_files_in_thread))
                sys.stdout.flush()

        print('%s [thread %d]: Wrote %d images to %s' %
              (datetime.now(), thread_index, shard_counter, output_file))
        sys.stdout.flush()
        shard_counter = 0

    print('%s [thread %d]: Wrote %d images to %d shards.' %
          (datetime.now(), thread_index, counter, num_files_in_thread))
    sys.stdout.flush()

def _process_image_files(name, filenames, texts, labels, num_shards):
    '''Process and save list of images as TFRecord of Example protos.

            'image/encoded': tf.FixedLenFeature([], tf.string),
            'image/height': tf.FixedLenFeature([], tf.int64),
            'image/width': tf.FixedLenFeature([], tf.int64),
            'image/filename': tf.FixedLenFeature([], tf.string),
            'image/class/label': tf.FixedLenFeature([], tf.int64),
        })
    image_encoded = features['image/encoded']
    image_raw = tf.image.decode_jpeg(image_encoded, channels=3)
    current_image_object = image_object()
    # Cropped image with size FLAGS.image_height x FLAGS.image_width (e.g. 299x299)
    current_image_object.image = tf.image.resize_image_with_crop_or_pad(
        image_raw, FLAGS.image_height, FLAGS.image_width)
    # current_image_object.image = tf.cast(image_crop, tf.float32) * (1. / 255) - 0.5
    current_image_object.height = features['image/height']        # height of the raw image
    current_image_object.width = features['image/width']          # width of the raw image
    current_image_object.filename = features['image/filename']    # filename of the raw image
    current_image_object.label = tf.cast(features['image/class/label'], tf.int32)  # label of the raw image
    return current_image_object

filename_queue = tf.train.string_input_producer(
    tfrecord_auto_traversal(),
    shuffle=True)

current_image_object = read_and_decode(filename_queue)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    print("Write cropped and resized images to the folder './resized_image'")
    for i in range(FLAGS.image_number):  # number of examples in your tfrecord
        pre_image, pre_label = sess.run([current_image_object.image, current_image_object.label])
        img = Image.fromarray(pre_image, 'RGB')
        if not os.path.isdir('./resized_image/'):
            os.mkdir('./resized_image')
        img.save(os.path.join('./resized_image/class_' + str(pre_label) + '_Index_' + str(i) + '.jpeg'))
        if i % 10 == 0:
            print('%d of %d images have been processed!' % (i, FLAGS.image_number))

Tips on Building Neural Machine Translation Systems

by Graham Neubig (Nara Institute of Science and Technology / Carnegie Mellon University). This tutorial explains some practical tips on how to train a neural machine translation system.

Note that this will not cover the theory behind NMT in detail, nor is it a survey of all the work on neural MT; rather, it will show you how to use lamtram and demonstrate some of the things you have to do to build a system that actually works well (focusing on the ones implemented in my toolkit).

Enter the nmt-tips directory and make a link to the directory in which you installed lamtram. Machine translation is a method for translating from a source sequence F with words f_1, ..., f_J to a target sequence E with words e_1, ..., e_I.

It is also common to generate h_j using bidirectional neural networks, where we run one forward RNN that reads from left-to-right, and another backward RNN that reads from right to left, then concatenate the representations for each word.

Next, we generate a word in the output by performing a softmax over the target vocabulary to predict the probability of each word, parameterized by Φ_esm. We then pick the word with the highest probability and update the hidden state with this predicted value. This process continues until a special 'end of sentence' symbol is chosen for e'_i.
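A toy sketch of that greedy decoding loop is given below; the softmax parameters, hidden-state update, and vocabulary are stand-ins of my own, not lamtram's internals:

import numpy as np

# Toy sketch of the greedy decoding loop described above.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def greedy_decode(g, softmax_params, update_hidden, eos_id, max_len=100):
    output = []
    for _ in range(max_len):
        p = softmax(softmax_params @ g)  # probability of every target word
        e_i = int(np.argmax(p))          # pick the most probable word
        if e_i == eos_id:                # stop at the end-of-sentence symbol
            break
        output.append(e_i)
        g = update_hidden(g, e_i)        # feed the prediction back into the decoder
    return output

# Toy usage: a 5-word vocabulary, hidden size 4, end-of-sentence id 0.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 4))
print(greedy_decode(rng.normal(size=4), W,
                    update_hidden=lambda g, e: np.tanh(g + W[e]), eos_id=0))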

This is done by maximizing the log likelihood of the training data, or equivalently minimizing the negative log likelihood. The standard way we do this minimization is stochastic gradient descent (SGD), where we calculate the gradient of the negative log probability for a single example <F, E>

and then update the parameters based on an update rule. The most standard update rule simply subtracts the gradient of the negative log likelihood multiplied by a learning rate γ. Let's try to do this with lamtram.
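Before moving to lamtram, here is a toy numpy sketch of that update rule; the gradient values are made up for illustration:

import numpy as np

# Toy sketch of the SGD update: theta <- theta - gamma * gradient, where the
# gradient of the negative log likelihood here is just a made-up vector.
def sgd_step(theta, grad, gamma=0.1):
    return theta - gamma * grad

theta = np.zeros(3)
grad = np.array([0.5, -1.0, 0.25])
print(sgd_step(theta, grad))  # [-0.05   0.1   -0.025]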

First, make a directory to hold the model, then train it with the following commands. Here, model_type indicates that we want to train an encoder-decoder, and train_src and train_trg indicate the source and target training data files.

If training is going well, we will see log output in which ppl reports perplexity on the training set, which is equal to the exponent of the per-word negative log probability. For perplexity, lower is better, so if it is decreasing we are learning something.
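For example, a tiny sketch of that relationship with made-up per-word log probabilities:

import math

# Sketch: perplexity is the exponent of the per-word negative log probability.
# The log probabilities below are made up.
word_log_probs = [-2.3, -0.7, -1.9, -0.2]
ppl = math.exp(-sum(word_log_probs) / len(word_log_probs))
print(ppl)  # roughly 3.6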

If we set NUM_WORDS equal to 256 and re-run the previous command, then looking at the w/s (words per second) figure on the right side of the log we can see that we're processing data 4 times faster than before, nice!

Try re-running the following command. You'll probably find that the perplexity drops significantly faster than when using the standard SGD update (after the first epoch, I had a perplexity of 287 with standard SGD and 233 with Adam).

In order to express this, attention calculates a 'context vector' c_i that is used as input to the softmax in addition to the decoder state. This context vector is defined as the sum of the input sequence vectors h_j, weighted by an attention vector a_i. There are a number of ways to calculate the attention vector a_i (described in detail below), but all follow a basic pattern: calculate an attention score α_{i,j} for every word as a function of g_i and h_j, then use a softmax function to convert the score vector α_i into an attention vector a_i that sums to one.
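The toy sketch below walks through that computation with a plain dot-product score; lamtram supports several scoring functions, so this is only the simplest instance of the pattern:

import numpy as np

# Toy sketch: score each source vector h_j against the decoder state g_i with a
# dot product, softmax the scores into attention weights a_i, then take the
# weighted sum of the h_j as the context vector c_i.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def context_vector(g_i, H):
    alpha = H @ g_i       # attention scores, one per source word
    a_i = softmax(alpha)  # attention vector, sums to one
    return a_i @ H, a_i   # context vector c_i and the weights

H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # three source vectors h_j
c_i, a_i = context_vector(np.array([0.5, 1.5]), H)
print(a_i, c_i)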

When I ran these, I got a perplexity of 19 for the encoder-decoder, and a perplexity of 11 for the attentional model, demonstrating that it's a bit easier for the attentional model to model the training corpus correctly.

Note here that we're actually getting perplexities that are much worse for the test set than we did on the training set (I got train/test perplexities of 19/118 for the encdec model and 11/112 for the encatt model).

Next, let's try to actually generate translations of the input using the following command (and likewise for the attentional model, swapping encdec for encatt). We can then measure the accuracy of this model using the BLEU score (Papineni et al., 2002), which measures the similarity between the translation generated by the model and a reference translation created by a human (data/test.en). This gave me a BLEU score of 1.76 for encdec and 2.17 for encatt, which shows that we're getting something.

Then we can re-train the attentional model using this new data. This greatly helps our accuracy on the test set: when I measured the perplexity and BLEU score, this gave me 58 and 3.32 respectively, a bit better than before!

(You may also want to increase the number of training epochs to 20 or so to really witness how much the model overfits in the later stages of training.) You'll notice that now, after every pass over the training data, we measure the perplexity on the development set, and the model is written out only when that perplexity reaches its best value yet.

Without a lexicon, when an unknown word is predicted in the target, the NMT system will find the word in the source sentence with the highest alignment weight a_j and map it into the target as-is.

The advantage of this method is that the lexicon is fast to train, and also contains information about what words can be translated into others in an efficient manner, making it easier for the MT system to learn correct translations, particularly of rare words.

'alpha' is a parameter that adjusts the strength of the lexicon, where a smaller value indicates that more weight will be put on the lexicon probabilities. In my run, this improves our perplexity from 57 to 37 and the BLEU score from 2.48 to 8.83, nice!

To improve search (and hopefully translation accuracy), we can use 'beam search', which, instead of considering only the single best next word, considers the k best hypotheses at every time step i.

k can be set with the --beam option during decoding, so let's try this with our best model so far, replacing the two instances of BEAM in the command with values such as 1, 2, 3, or 5.

There are a number of ways to fix this problem, but the easiest is adding a 'word penalty' wp which multiplies the probability of the sentence by the constant 'e^{wp}' every time an additional word is added.

wp can be set using the --word_pen option of lamtram, so let's try setting a few different values and measuring the BLEU score at a beam width of 10. We can see that as we increase the word penalty, we get hypotheses of more reasonable length, which also improves the BLEU score a little.
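A small sketch of why this works, with two made-up candidate translations: adding wp per word (i.e. multiplying the probability by e^{wp} per word) gradually shifts the preference toward longer hypotheses:

# Sketch: a word penalty wp adds wp to the log probability once per word
# (equivalently, multiplies the probability by e^{wp} per word).  The two
# candidate scores below are made up.
def penalized_log_prob(log_prob, length, wp):
    return log_prob + wp * length

short = {'log_prob': -4.0, 'length': 3}
long_ = {'log_prob': -6.5, 'length': 8}
for wp in (0.0, 0.5, 1.0):
    s = penalized_log_prob(short['log_prob'], short['length'], wp)
    l = penalized_log_prob(long_['log_prob'], long_['length'], wp)
    print('wp=%.1f prefers the %s hypothesis' % (wp, 'longer' if l > s else 'shorter'))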

These can be changed using the --layers option of lamtram, which defaults to lstm:100:1: the first field selects LSTM networks (which tend to work pretty well), the second is the width, and the third is the depth.

The way this works is that if we have two probability distributions pe_i^{(1)} and pe_i^{(2)} from multiple models, we can calculate the next probability by interpolating them either linearly or log-linearly. Performing ensembling at test time in lamtram is simple: in --models_in, we simply add two different model options separated by a pipe.
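The two interpolation schemes can be sketched with toy distributions as follows (the numbers are made up, and a uniform 0.5/0.5 weighting is assumed):

import numpy as np

# Sketch of the two interpolation schemes with two made-up distributions over a
# four-word vocabulary and uniform 0.5/0.5 weights.
pe1 = np.array([0.7, 0.1, 0.1, 0.1])
pe2 = np.array([0.4, 0.4, 0.1, 0.1])

linear = 0.5 * pe1 + 0.5 * pe2

log_linear = np.exp(0.5 * np.log(pe1) + 0.5 * np.log(pe2))
log_linear /= log_linear.sum()  # renormalize after mixing in log space

print(linear, log_linear)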

The following are a few extra methods that can be pretty useful in some cases, but that I won't test here. As mentioned before, when dealing with small data we need to worry about overfitting, and some ways to address this are early stopping and learning rate decay.

Sometimes in these cases you'll want to evaluate the accuracy of your system more frequently than once per pass over the corpus, so try specifying the --eval_every NUM_SENTENCES option, where NUM_SENTENCES is the number of sentences after which you'd like to evaluate on the dev set.

This is a very fast-moving field, so this guide might be obsolete within a few months of writing (or even already!), but hopefully it has helped you learn the basics to get started, start reading papers, and come up with your own methods and applications.

Word2Vec and FastText Word Embedding with Gensim

A traditional way of representing words is the one-hot vector, which is essentially a vector with only one target element being 1 and the others being 0.

Namely, you should expect the one-hot vectors for words starting with “a” to have their target “1” at a lower index, while those for words beginning with “z” have their “1” at a higher index.
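A minimal sketch of that layout, with a made-up four-word vocabulary:

import numpy as np

# Sketch of the alphabetic one-hot layout: words are sorted, so a word starting
# with 'a' gets its 1 at a lower index than a word starting with 'z'.
vocab = sorted(['zebra', 'apple', 'dog', 'cute'])
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(vocab)             # ['apple', 'cute', 'dog', 'zebra']
print(one_hot['apple'])  # [1. 0. 0. 0.]
print(one_hot['zebra'])  # [0. 0. 0. 1.]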

For instance, in the sentence “I have a cute dog”, the input would be “a”, whereas the output is “I”, “have”, “cute”, and “dog”, assuming the window size is 5.

The network contains one hidden layer whose dimension is equal to the embedding size, which is smaller than the input/output vector size.

At the end of the output layer, a softmax activation function is applied so that each element of the output vector describes how likely a specific word will appear in the context.
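In practice, Gensim hides this network behind a single class. The sketch below uses a toy corpus and gensim 4.x parameter names (older releases use 'size' instead of 'vector_size'), so it is only a minimal illustration of training a skip-gram model:

from gensim.models import Word2Vec

# Sketch with a toy corpus; sg=1 selects the skip-gram architecture.
sentences = [['i', 'have', 'a', 'cute', 'dog'],
             ['he', 'is', 'a', 'nice', 'guy'],
             ['she', 'is', 'a', 'wise', 'queen']]
model = Word2Vec(sentences, vector_size=10, window=5, min_count=1, sg=1)
print(model.wv['dog'])  # the learned 10-dimensional embedding for 'dog'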

The vectors obtained by subtracting two related words sometimes express a meaningful concept such as gender or verb tense, as shown in the following figure (dimensionality reduced).
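A toy sketch of the classic king - man + woman analogy with made-up 3-dimensional embeddings (real embeddings are learned, not hand-written like these):

import numpy as np

# Toy sketch: the offset king - man + woman lands closest to queen.
emb = {
    'king':  np.array([0.8, 0.7, 0.1]),
    'man':   np.array([0.6, 0.2, 0.1]),
    'queen': np.array([0.7, 0.8, 0.9]),
    'woman': np.array([0.5, 0.3, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb['king'] - emb['man'] + emb['woman']
print(max(emb, key=lambda w: cosine(emb[w], target)))  # 'queen'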

To compute the word representation for the word “a”, we need to feed these two examples, “He is a nice guy” and “She is a wise queen”, into the neural network and take the average of the values in the hidden layer.

Easy Image Classification with Tensorflow

In this coding tutorial, learn how to use Google's Tensorflow machine learning framework to develop a simple image classifier with object recognition and neural ...

Live and (Machine) Learn: Cognitive Services and Vue.js : Build 2018

The life we live online increasingly informs the way we live offline as well. Businesses live and die through algorithms like SEO, humans are sorted in ...

Opening Keynote (GDD India '17)

Hear about the latest news and updates to Google's developer products and platforms in the GDD India '17 keynote. Pankaj Gupta, Anitha Vijayakumar, Tal ...

WebGuild March 2007: Product Team Analytics

Users now control what to watch, where, and when to watch it. Google's acquisition of YouTube validated online video as a revolutionary new advertising ...

Fiori Training with SAP WebIDE | Fiori Tutorial | Fiori Learning Map

SAP Fiori Training With WebIDE Please contact us for SAP UI5+Fiori training @ or write to me for details ..

Microsoft Azure OpenDev 10.2017

Welcome to the 2nd edition of our community-focused, recurring online series designed to showcase open source technologies and customer solutions on ...