AI News, Alternative Finance Data for Emerging Markets: Natural Language Processing (Part III)
- On Sunday, September 30, 2018
The following was written by CTO Malcolm Kapuza, who dives into some example code showing how alternative finance data in the Capital Finder is categorized, using machine learning and natural language processing.
In the second part, I discussed in detail where we source our different data points and how we turn this data into one coherent and intuitive view of developing world alternative finance.
By the end of this blog, you will learn all of the high-level steps for creating your own data pipeline, including some of the more detailed steps of using Machine Learning to build a Natural Language Classifier.
The HTML tells us a surprising amount about the website in question. For example, HTML holds information about:

You may be surprised by how much depth HTML adds to your understanding of a webpage. HTML forms the basis for how websites are search engine optimized, and therefore categorized, throughout the web. We rank these different HTML aspects according to our view of their importance.
The text_from_html method uses Beautiful Soup’s built-in findAll method to find all text, then uses the tag_visible method to filter out any content that is not meant to be read by the website visitor.
In this case, the reason that we care only about the visible text is that we want to ensure that our Natural Language Classifier is reading the page in the same way that a normal website user would be.
As an exercise, however, think about how you might modify the code to account for the difference between a title tag and a paragraph tag, for instance.
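The visible-text filtering described above can be sketched as follows. This is not the original code: it swaps Beautiful Soup for Python's standard-library `html.parser` so that it runs without extra packages, and the `VisibleTextParser` class and `HIDDEN_TAGS` set are names assumed for illustration. Only `text_from_html` borrows its name from the method described above.

```python
from html.parser import HTMLParser

# Tags whose contents a normal visitor never reads (an assumed list).
HIDDEN_TAGS = {"script", "style", "head", "title"}


class VisibleTextParser(HTMLParser):
    """Collects text that sits outside of hidden tags."""

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0  # how many hidden tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # HTML comments never reach this method, so they are skipped for free.
        text = data.strip()
        if text and self._hidden_depth == 0:
            self.chunks.append(text)


def text_from_html(body):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = VisibleTextParser()
    parser.feed(body)
    parser.close()
    return " ".join(parser.chunks)
```

One way to approach the exercise would be to record the enclosing tag alongside each chunk in `handle_data`, so that later steps can weight a title differently from a paragraph.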
One of our most straightforward use cases for NLP is categorizing our capital providers by funding type, and we have found it to be dramatically more effective than our paid analysts by comparing the rate of false positive vs.
We gathered our training set via a painstaking process of viewing and reviewing a subset of 6000 of our capital providers until we were certain that it was 100% accurate.
If your model fits the training dataset and also fits the test dataset, then you can be confident that there is minimal overfitting and that it is properly generalized to the population.
If not, it means that your model is overfitting the training data and you may need to either tweak some parameters or increase the size and quality of your training data.
The validation dataset helps ensure that you are not overfitting your hyperparameters, in the same way that the test data helps ensure that you are not overfitting your model parameters.
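As a rough illustration of the three-way split discussed above, here is a minimal sketch using only the standard library. The function name `split_dataset`, the 70/15/15 default proportions, and the fixed seed are all assumptions made for the example; the post does not state which proportions were actually used.

```python
import random


def split_dataset(documents, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle documents and split them into train, validation and test lists.

    The fractions are illustrative defaults, not values from the original post.
    """
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must leave room for a test set")

    docs = list(documents)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(docs)

    n_train = int(len(docs) * train_frac)
    n_val = int(len(docs) * val_frac)

    train = docs[:n_train]
    validation = docs[n_train:n_train + n_val]
    test = docs[n_train + n_val:]  # everything that is left over
    return train, validation, test
```

Shuffling before slicing matters: if the documents are stored grouped by provider type, an unshuffled split could leave one type entirely out of the test set.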
A full explanation of what is going on under the hood is out of the scope of this tutorial, but the classifier essentially uses a series of guesses and checks to determine the main differences between the categories.
The generator saves us from having to hold our large datasets in memory: it retrieves our texts on demand without tying up resources.
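A generator of this kind might look like the sketch below. The directory layout (one folder per provider, each holding a `text.txt` file) is an assumption based on the file names mentioned in this post, and `iter_texts` is a hypothetical name.

```python
import os


def iter_texts(root_dir):
    """Yield (provider_name, text) pairs one at a time.

    Assumes root_dir contains one sub-folder per provider, each with a
    text.txt file. Only one document is held in memory at any moment.
    """
    for provider_name in sorted(os.listdir(root_dir)):
        text_path = os.path.join(root_dir, provider_name, "text.txt")
        if not os.path.isfile(text_path):
            continue  # skip folders that have no scraped text
        with open(text_path, encoding="utf-8") as handle:
            yield provider_name, handle.read()
```

Because nothing is read until the caller asks for the next item, a loop such as `for name, text in iter_texts("providers"):` touches one file per iteration rather than loading the whole corpus up front.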
We are interested in identifying common bigrams because these are very powerful features within our text data that help us determine overall meaning.
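To make the idea of common bigrams concrete, here is a simplified sketch. It uses a plain frequency count from the standard library rather than whichever phrase-detection package the original code relied on, so it stands in for the technique and does not reproduce it. The names `common_bigrams` and `merge_bigrams`, and the `min_count` threshold, are assumptions.

```python
from collections import Counter


def common_bigrams(token_lists, min_count=2):
    """Return the set of adjacent word pairs seen at least min_count times."""
    counts = Counter()
    for tokens in token_lists:
        # zip pairs each token with its right-hand neighbour.
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def merge_bigrams(tokens, bigrams):
    """Join known bigrams into single tokens, e.g. 'venture_capital'."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # both words were consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

Treating a pair as a single token lets a classifier see "venture capital" as one feature instead of two unrelated words.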
You will find with a lot of NLP work that the bulk of the heavy lifting is gathering, cleaning and processing the data, and that common packages handle the nuances of the Natural Language Processing itself.

We can then test some common bigrams that should appear.

Now that we have our bigrams trained, we are going to create a dictionary to store the provider name, provider type, and text.
In this example, to keep things simple, we store a text file called provider_type.txt with the name of the provider type in the same directory as text.txt.
Next, we gather the bigrams from within the list of tokens. Finally, we add the tokens to our provider dictionary and add that dictionary to our list of documents, which will later be split into our test and training datasets.
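The document-building step can be sketched as follows. The file names `text.txt` and `provider_type.txt` come from the description above; the function name `load_documents`, the dictionary keys, and the very simple regular-expression tokenizer are assumptions, and the bigram step is left out to keep the example short.

```python
import os
import re


def load_documents(root_dir):
    """Build one dictionary per provider folder found under root_dir."""
    documents = []
    for name in sorted(os.listdir(root_dir)):
        text_path = os.path.join(root_dir, name, "text.txt")
        type_path = os.path.join(root_dir, name, "provider_type.txt")
        if not (os.path.isfile(text_path) and os.path.isfile(type_path)):
            continue  # skip incomplete folders

        with open(text_path, encoding="utf-8") as handle:
            text = handle.read()
        with open(type_path, encoding="utf-8") as handle:
            provider_type = handle.read().strip()

        # Crude tokenizer: lowercase runs of ASCII letters only. A real
        # pipeline would likely use a proper NLP tokenizer here.
        tokens = re.findall(r"[a-z]+", text.lower())

        documents.append(
            {"name": name, "type": provider_type, "text": text, "tokens": tokens}
        )
    return documents
```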
For our features, we have chosen to use the 100 most common words, excluding any words three characters or fewer, across all documents.
Once we have a list of features for each document, we can compare these features in order to determine which features are most applicable to each provider type (MFI or VC).
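A minimal sketch of this feature selection, assuming each document is a dictionary with a `tokens` list. The function names and the `contains(word)` feature naming are illustrative choices, not confirmed details of the original code.

```python
from collections import Counter


def top_words(documents, n=100, min_length=4):
    """Return the n most common words across all documents.

    Words of three characters or fewer are ignored (min_length=4).
    """
    counts = Counter(
        token
        for doc in documents
        for token in doc["tokens"]
        if len(token) >= min_length
    )
    return [word for word, _ in counts.most_common(n)]


def document_features(tokens, word_features):
    """Map each chosen word to True/False depending on its presence."""
    present = set(tokens)  # set lookup is faster than scanning the list
    return {"contains(%s)" % word: word in present for word in word_features}
```

Each document then becomes a fixed-size dictionary of booleans, which is the kind of input many classifiers expect.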
It is somewhat difficult to grasp that 62 lines of code in this example are dedicated to cleaning, processing and preprocessing data and that only 1 line is used for actually training the classifier, but this is the nature of data science.
Now it is time to test our new classifier on our test data. This tells us how accurately the classifier labels documents that it did not see during training.
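Accuracy here simply means the share of test documents that receive the correct label. The sketch below assumes a hypothetical `classify` callable that maps a feature dictionary to a label; it does not depend on any particular classifier library.

```python
def accuracy(classify, labeled_featuresets):
    """Return the fraction of (features, label) pairs classified correctly."""
    if not labeled_featuresets:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(
        1 for features, label in labeled_featuresets if classify(features) == label
    )
    return correct / len(labeled_featuresets)


# Usage with a toy rule standing in for a trained classifier (hypothetical).
def toy_classify(features):
    return "VC" if features.get("contains(equity)") else "MFI"


test_set = [
    ({"contains(equity)": True}, "VC"),
    ({"contains(equity)": False}, "MFI"),
    ({"contains(equity)": True}, "MFI"),  # the toy rule gets this one wrong
    ({"contains(equity)": False}, "MFI"),
]
print(accuracy(toy_classify, test_set))  # 3 of 4 correct, so 0.75
```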
An example of a slightly deeper analysis you can perform is the following, which prints the provider name, provider type and first 50 tokens.
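An analysis along these lines could be sketched as below. The `classify` callable (taking a token list and returning a label) and the dictionary keys `name`, `type` and `tokens` are assumptions made for the example, and the original code may have been organized differently.

```python
def misclassified(classify, documents, preview_length=50):
    """Return details of every document whose predicted type is wrong.

    Each row holds the provider name, the expected type, the predicted
    type, and the first preview_length tokens of the document.
    """
    rows = []
    for doc in documents:
        predicted = classify(doc["tokens"])
        if predicted != doc["type"]:
            rows.append(
                (doc["name"], doc["type"], predicted, doc["tokens"][:preview_length])
            )
    return rows


def print_misclassified(classify, documents):
    """Print one line per misclassified provider for manual inspection."""
    for name, expected, predicted, preview in misclassified(classify, documents):
        print(name, expected, predicted, " ".join(preview))
```

Reading the token previews by eye is often the quickest way to spot systematic problems in the input data.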
We can clearly see in the following output a couple of reasons why some of the capital providers are misclassified. For example, the first two are written in foreign languages.
If you’re interested in working on these kinds of projects please check our careers site, or email me at malcolm [at] alliedcrowds.com!