Introduction to Machine Learning with Python's Scikit-learn

In this post, we'll be doing a step-by-step walkthrough of a basic machine learning project, geared toward people with some knowledge of programming (preferably Python), but who don’t have much experience with machine learning.

By the end of this post, you'll understand what machine learning is and how it can help you, and you'll be able to build your own machine learning classifiers for any dataset you want.

As our example project, we'll build a classifier that decides whether a headline is clickbait. Traditionally, to write such a classifier, we would manually inspect hundreds of clickbait headlines and try to identify patterns that differentiate them from "good" (non-clickbait) headlines.

This is a time-consuming process that requires an expert to create the rules, and it requires a lot of code maintenance, because we would probably need to continuously update and modify the rules.

You can edit the example code in any text editor or IDE of your choice and run it using your IDE's integrated Python interpreter or from the command line.

You can install the dependencies we need, Jupyter and scikit-learn, using pip. If this doesn't work as expected, refer to the Jupyter installation guide and the scikit-learn installation docs.

You can download the clickbait.txt dataset by navigating to the file in your web browser or by running the following command in your shell: Let's take a quick look at the data to understand what our goal is.

If you open up the clickbait.txt file in your favourite editor, you'll see that the first few lines look like this: Each line contains a headline, followed by a tab character (\t), followed by either a 1 or a 0.

Our machine learning algorithm should look at the headline’s text and decide whether to assign a positive label, 1, to indicate that the headline looks like clickbait, or a negative label, 0, to indicate that the headline looks normal.
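The post's original code cells aren't included in this excerpt, so here is one way the file could be read into parallel lists of headlines and labels, just to make the format concrete (the variable names are my own, not necessarily the post's):

    # Read the tab-separated file into two parallel lists.
    with open("clickbait.txt", encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]

    headlines = [line.split("\t")[0] for line in lines]
    labels = [int(line.split("\t")[1]) for line in lines]

    print(headlines[0], labels[0])  # a headline and its 1 (clickbait) or 0 (normal) label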

You can start the Jupyter notebook server by running the following command in your shell, which should open your web browser and navigate to the Jupyter web application automatically: You can now create a new Python notebook by selecting "New".

Change the newly created notebook's name to clickbait_classifier by clicking on the title (indicated by the top red rectangle in the image below).

If we want to see results at a specific stage, or edit some code we wrote previously, there's no need to run the entire script again, because the local variables of each cell are persisted and made available to other cells.

Specifically, we'll split the dataset that we looked at above into two pieces — a large portion to train our classifier, and a smaller portion to evaluate it.

Add the following code to the next cell and run as usual: You should see the following output: The first label corresponds to the first headline, the second label to the second headline, and so on.
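That cell isn't reproduced here; a minimal sketch of a split that matches the sizes mentioned in this post (10,000 examples in total, 2,000 held out for testing), reusing the headlines and labels lists from the sketch above:

    # Train on the first 8,000 examples and hold the last 2,000 out for evaluation.
    train_headlines, train_labels = headlines[:8000], labels[:8000]
    test_headlines, test_labels = headlines[8000:], labels[8000:]

    print(train_labels[:5])  # one label per headline, in the same order as train_headlines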

The classifier will never see the test set’s labels, but it will try to predict which headlines are clickbait based on the patterns it learned from the training set.

Run the following code in a new cell to see how big our dataset is: This should produce output indicating that our dataset contains 10,000 examples.

First, we need a way to translate our text data into a matrix of numbers, as machine learning algorithms work by performing mathematical calculations to separate numerical points in multidimensional space.

We'll call fit_transform on our training data, which means that our vectorizer will assume that the words found in our training set represent all of the vocabulary that we're interested in.

We can create our train and test vectors by running the following code in a new cell: Now that we have vectors, we can train the classifier to look for patterns in our train set.
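Neither cell is reproduced in this excerpt; a sketch of the vectorizing step using scikit-learn's CountVectorizer, which is one common choice (whether the post used this exact vectorizer is an assumption):

    from sklearn.feature_extraction.text import CountVectorizer

    # Learn the vocabulary from the training headlines and turn each headline into a
    # vector of word counts; the test set is transformed with the same vocabulary.
    vectorizer = CountVectorizer()
    train_vectors = vectorizer.fit_transform(train_headlines)
    test_vectors = vectorizer.transform(test_headlines)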

This is an iterative and computationally expensive process that can take a very long time for large datasets, but it should complete in a second or less on our small dataset.

Run the following code in a new cell: The predictions variable now contains an array of labels — one label for each headline in our test set.
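The training cell isn't shown here; a sketch using a linear support vector machine, which is a reasonable choice for short-text classification (the specific estimator is an assumption, not necessarily the one used in the post):

    from sklearn.svm import LinearSVC

    classifier = LinearSVC()
    classifier.fit(train_vectors, train_labels)      # look for patterns in the training set
    predictions = classifier.predict(test_vectors)   # one predicted label per test headline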

We can look at the first five headlines in our test set (which our classifier didn't see during training time) by running the following code in a new cell: This should produce the following output: We can see what our classifier thinks about each of these headlines by running the following in a new cell: There is some randomness used as part of the training algorithm, so your results may differ slightly.

Run the following code in a new cell: This should output the following, confirming that the classifier got all five cases correct: We can compute the accuracy score of all of the test cases by using the accuracy_score function.

This is simply the number of times that the classifier was correct (when the label it predicted was the same as the label we provided) divided by the total number of labels (2,000 for our test set).

We can get this number by running the following code in a new cell: Again, your results might differ slightly due to the randomness used during training, but you should see something similar to the output below: This shows that our classifier got 96 percent of the test cases correct.
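The cell that computes this isn't reproduced here; a minimal sketch using the accuracy_score function mentioned above:

    from sklearn.metrics import accuracy_score

    # Fraction of test headlines whose predicted label matches the label we provided.
    print(accuracy_score(test_labels, predictions))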

We saw how easy the high-level scikit-learn library is to use, and we evaluated our classifier with both data from the original dataset and new data from the BuzzFeed homepage.

If you want to see more practical text classification projects, take a look at my Yelp Dataset Challenge GitHub repository, where I show how to do sentiment analysis and authorship attribution using online reviews.

HTML and CSS

To create an OMO website, I suggest the following: Today, the W3C (World Wide Web Consortium, @ http://www.w3c.org) maintains the specifications of HTML and CSS (and many other related web technologies).

HTML markup tags perform these functions: The purpose of a markup language is to relieve the content provider from worrying about the actual appearance of the document.

The author merely indicates (via markup tags) the semantic meaning of the words and sentences (such as paragraph, heading, emphasis, and strong), and leaves it to the browser to interpret the markup and render the document for display on the screen.

The content provider focuses on the document contents, while the graphic designer concentrates on the view and presentation.

Nowadays, HTML should be used solely to mark up the contents, while its companion technology, CSS (Cascading Style Sheet), should be used to define the presentation of the document, so as to separate content and presentation.

These are common pitfalls in older HTML documents that you should avoid: HTML documents can be created with a wide range of tools, from simple plain-text editors (such as Windows' Notepad or Mac's TextEdit) to sophisticated WYSIWYG authoring tools (e.g., DreamWeaver).

All of these presentation attributes are concerned with appearance rather than content, and have been deprecated in HTML 4 in favor of style sheets.

Because HTML documents are textual and self-explanatory, comments are less important (but still nice to have) for describing the various parts of the document.

Elements can be classified as: In brief, a block element is always rectangular in shape, while an inline element spans a continuous run of characters.

Note that the line breaks in the HTML codes are treated as white spaces and do not translate to new lines in the display.

By default, the horizontal rule (<hr>) spans the full width (100%) of the screen, is 1 point in size, and has a shading effect for a 3D appearance.

The contents of <pre>...</pre> container tags are treated as pre-formatted, i.e., white space, tabs, and newlines are preserved and not ignored.

<div> block elements (together with their inline counterpart <span>) are extensively used in modern web pages to mark out a rectangular block (or a span of text).

Before HTML5, authors typically used <div> elements to structure a document into various sections and apply formatting styles. This is less than desirable, as <div> itself carries no semantic meaning.

The new HTML5 semantic elements are: <header>, <footer>, <nav>, <section>, <article>, <summary>, <details>, <aside>, <figure>, <figcaption>, and <main>.

The <header> and <footer> elements can be used to mark up the header and footer of a web page, in place of the less semantic pre-HTML5 <div id|class="header"|"footer">.

The <article> element is used to mark up an independent and self-contained article, such as a news story, which could have its own header, footer, and content sections.

The <aside> element can be used to introduce related content, typically formatted as a floating sidebar alongside the main text.

The <main> element (introduced in HTML 5.1) marks the main content of a web page, excluding the header, footer, and navigation menu.
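To make this concrete, here is a small sketch of how these semantic elements might structure a page (the contents are placeholders, not taken from the original tutorial):

    <body>
      <header>Site title and logo</header>
      <nav>Main navigation menu</nav>
      <main>
        <article>
          <header>Article headline</header>
          <p>Article text ...</p>
          <footer>Author and date</footer>
        </article>
        <aside>Related links, shown as a sidebar</aside>
      </main>
      <footer>Copyright notice</footer>
    </body>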

Logical-style formatting tags describe the meaning of a piece of text (e.g., emphasis, strong, code), whereas physical-style formatting tags define its physical or typographical appearance (e.g., bold, italic, teletype).

This is because physical styles deal with the appearance, which should be defined in style sheet, so as to separate the content and presentation.

If the full text is given in the title attribute of the opening tag, it will be shown as a tooltip when you point your mouse at the element.

<span> elements are extensively used in modern web pages to mark out a run of text, primarily for applying styles.

These elements are hardly used due to poor browser support, but presented here for completeness.

The commonly used entity references are as follows (there are many, many more; refer to the HTML reference - I like the arrows, Greek symbols, and the mathematical notations).

HTML supports three types of lists: ordered list, unordered list and definition list.

You can place a list inside another list (called nested lists) by writing a complete list definition under an <li> element.

Attributes such as cellpadding (which defines the spacing between the content of a cell and its boundaries, in pixels) are often used in older HTML pages but are now deprecated.

Rule of Thumb: Always use relative URLs for referencing documents in the same server for portability (i.e., when you move from a test server to a production server).

This is similar to the implicit anchor name setup via the id attribute described earlier.

The color of the border is given by the link (unvisited), vlink (visited), and alink (active) attributes of the <body> tag.

The original aim of HTML is to let the content providers concentrate on the contents of the document and leave the appearance to be handled by the browsers.

Authors mark up the document contents using markup tags (such as <p>, <h1>, <ul>, <table>, <img>) to indicate their semantic meaning ("This is a paragraph", "This is heading level 1", "This is an unordered list", "This is a table", "This is an image").

Many markup tags and attributes were created for marking the appearance and display styles (e.g., <font>, <center>, align, color, bgcolor, link, alink, and vlink are concerned with appearance in font, color, and alignment) rather than the meaning of the contents.

Furthermore, over the years, we have engaged graphic designers to work on the appearance and leave the content providers to focus on the contents.

The W3C (World Wide Web Consortium @ www.w3c.org) responded to the need to separate a document's content and presentation by introducing a style sheet language called CSS (Cascading Style Sheet) for presentation, and by removing the presentation tags and attributes from HTML.

It allows web graphic designers to spice up the web pages, so that the content providers can focus on the document contents with HTML.

A style rule is used to control the appearance of HTML elements such as their font properties (e.g., type face, size and weight), color properties (e.g., background and foreground colors), alignment, margin, border, padding, and positioning.

The browser follows a certain cascading order in finalizing a style to format the HTML element in a predictable fashion.

Using your browser's developer tools, you can select (inspect) an HTML element and see all the cascading style rules applied to that element from all sources (inline, embedded, external), and how the rules were merged and conflicts resolved.

the name-value pairs are separated by spaces, as follows: There are three places where you can define style rules: To apply inline style to an HTML element, include the list of style properties in the style attribute of the opening tag.

For example: Take note that the name and value are separated by a colon ':', and the name:value pairs are separated by semicolons ';', as specified in the CSS syntax.
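For instance, a hypothetical paragraph with an inline style might look like this (the property values are arbitrary):

    <p style="color:navy; font-size:12pt">This paragraph is styled inline.</p>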

Inline style defeats the stated goal of style sheets, which is to separate the document’s content and presentation.

Hence, inline styles should be avoided and used only sparingly for touching up a document, e.g., setting the column width of a particular table.

For example, we define these style rules in a file called "TestExternal.css": This HTML document references the external style sheet via the <link> element.
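The example files themselves are not reproduced here; a minimal sketch of what TestExternal.css and the <link> element might look like (the rules are placeholders):

    /* TestExternal.css */
    h1 { color: maroon; }
    p  { font-family: Arial, sans-serif; }

    <!-- In the HEAD section of the HTML document -->
    <link rel="stylesheet" type="text/css" href="TestExternal.css">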

The main advantage of external style sheets is that the same set of styles can be applied to all HTML pages in your website to ensure uniformity in presentation.

Besides the <link> element, you can also use CSS's @import directive to link to an external style sheet, as follows: @import is part of the CSS language.

Style conflict on an element arises: If a property is not defined for an element and is inheritable, it will be inherited from the nearest ancestor.

The inline style (applied to a specific tag via style attribute) overrides the internal style (defined in <style>) and external style sheet (defined via <link>).

You can override all the cascading rules by appending the special declaration !important to a property value. To use CSS to style your website for a good and consistent look and feel, you need to properly structure and partition your web pages.

They can be used to create partitions in an HTML document to represent logical sections (such as header, content, footer, highlight text, and so on).

A CSS ID-selector, which begins with a '#' followed by an id value, selects a unique element in the document (because an id value is supposed to be unique).

The CSS file "MyStyle.css": As illustrated in the previous example, a CSS selector can select a set of HTML elements based on (a) tag name, (b) id attribute, (c) class attribute.

In addition, you can write a CSS selector to select elements based on a combination of tag, id, and class, and much more.

The syntax is: Example: You can apply the same style definitions to multiple selectors by separating the selectors with commas ','.

Example: You can define a style rule that takes effect only when a tag occurs within a certain contextual structure, e.g., descendant, immediate-child, first-child, sibling, etc.

To create a descendant selector, list the tags in their hierarchical order, with no commas separating them (commas are meant for grouping selectors).

The Generic-Class Selector, which begins with a dot '.' followed by the classname, selects all elements with the given classname, regardless of the tag name.

(This is the same restriction as for identifiers in most programming languages.) The ID-selector, which begins with a '#' followed by the id value, selects the specific element with the given unique id value.
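A short sketch pulling these selector forms together (the class and id names are hypothetical):

    p         { line-height: 1.5; }     /* tag selector: all <p> elements */
    .note     { background: yellow; }   /* generic class selector: any element with class="note" */
    p.note    { font-style: italic; }   /* tag plus class: only <p class="note"> */
    #banner   { text-align: center; }   /* ID selector: the unique element with id="banner" */
    h1, h2    { color: maroon; }        /* grouped selectors sharing one rule */
    #banner p { margin: 0; }            /* descendant selector: <p> anywhere inside #banner */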

Example: CSS defines a number of pseudo-classes for anchor elements <a>, namely, a:link (unvisited link), a:visited (visited link), a:focus (on focus), a:hover (mouse pointer hovers over), a:active (clicked or active link).

Example: Notes: Another example: The anchor pseudo-classes can be combined with ID-selectors in a descendant selector, so that the appearance of links differs across the different divisions of the document. The selectors p:first-line and p:first-letter select the first line and the first letter of a <p> element, respectively.

For example, CSS3 introduces these pseudo-class child selectors: The :not(S) selector lets you select elements that do not meet the criterion in selector S.
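A sketch of these pseudo-class selectors in use (the colors and class names are arbitrary):

    a:link          { color: blue; }                 /* unvisited link */
    a:visited       { color: purple; }               /* visited link */
    a:hover         { text-decoration: underline; }  /* mouse pointer hovers over the link */
    a:active        { color: red; }                  /* link being clicked */
    #menu a:hover   { color: green; }                /* anchor pseudo-class combined with an ID selector */
    p:first-letter  { font-size: 200%; }             /* first letter of each <p> */
    li:not(.ad)     { color: black; }                /* :not(S): list items without class "ad" */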

For a complete list of the style properties, you may refer to the CSS specification (@ W3C), or online reference (@ http://www.w3schools.com/cssref/default.asp).

Color can be expressed as: The most important color properties are color and background-color: Many CSS properties, such as width, height, margin, border, padding, font-size and line-height, require a length measurement.

There are two types of length measurements: relative (to another length property) and absolute (e.g., inches, centimeters, millimeters).

The absolute units are: The relative units are: There shall be no space between the number and the unit, as space is used to separate multiple values.

Take note that the % and em measurements are relative to another element (percentage values are always relative, e.g., 50% of something).

Every block element (such as <p>, <div>, and <h1> to <h6>) is always rectangular in shape and exhibits the so-called box model, with four virtual rectangles wrapped around its "content area", representing the content area, padding, border, and margin, as illustrated below.

As illustrated in the box model diagram, the margin pushes the border (and content) away with a transparent background showing the parent (having the effect of pushing the element away from its parent), whereas the padding spaces the content away from the border and shows the element's own background.

Take note that the width and height that you set for an element specify its content area only, excluding the margin, border, and padding.

Each of the rectangular bounds has four sides, and can be individually referred to as xxx-top, xxx-right, xxx-bottom, and xxx-left in a clockwise manner, where xxx could be margin, border or padding.

As mentioned earlier, CSS length measurement requires a proper unit, e.g., width:400px or width:80%.

The margin, border, and padding related properties are: Margin, border, padding, and width are NOT inherited by an element's descendants.

Example: [TODO] For most of the block elements (e.g., <div>, <p>), the default of width:auto sets the width to the width of the parent minus its own margin, border and padding.

Example: [TODO] The browser automatically adjusts the margin-right to fill the container's width if the sum of the element's width and its left and right margin/border/padding does not add up to the full width of the containing element.

Example: [TODO] To center a block element, set both margin-left and margin-right to auto (the browser divides the remaining width equally between the left and right margins).
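A sketch of a fixed-width, centered block that exercises the box model properties described above (the selector name and measurements are arbitrary):

    #content {
      width: 600px;              /* width of the content area only */
      padding: 10px;             /* space between the content and the border */
      border: 1px solid black;   /* drawn around the padding */
      margin: 20px auto;         /* 20px top/bottom; auto left/right centers the block */
    }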

Example: [TODO] The frequently used text properties are: The background-related properties are: The background properties have a one-line shorthand notation, with the order shown below: In all of the above, the term background refers to the background of the selected elements (not necessarily the entire window).

There are two types of image maps: client-side and server-side. To create a client-side image map, you define hot regions over an image, each linked to a target document.

A client-side image map can be used as a navigation bar at the top of the page, instead of using individual images.
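A sketch of a client-side image map used as a navigation bar (the image, coordinates, and target pages are hypothetical):

    <img src="navbar.png" alt="Navigation bar" usemap="#navmap">
    <map name="navmap">
      <area shape="rect" coords="0,0,100,40"   href="home.html"     alt="Home">
      <area shape="rect" coords="100,0,200,40" href="products.html" alt="Products">
      <area shape="rect" coords="200,0,300,40" href="contact.html"  alt="Contact">
    </map>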

With a server-side image map, when the image is clicked, the (x, y) position of the click is sent to the server as query parameters.

Client-side image maps are much more popular (and recommended) than server-side image maps because: You can divide the browser's window into multiple regions called frames.

You can use the <base> tag (in the HEAD section) to set up a global target reference: As mentioned, frames have gone out of favor these days.

An iframe is a frame that can be embedded within a regular HTML page; it contains a separate and complete HTML document, specified via its src attribute.

The browser supports the execution of client-side programs via a built-in processor or plug-in (e.g., the Java JRE Plug-in, Flash player plug-in).

The following tags can be used in the HEAD section: The <base> tag declares the base URL for all the links in this document via the "href" attribute, and specifies the default target name via the "target" attribute.

A favicon (aka favorite icon, shortcut icon, or URL icon) is a file containing a small 16x16 icon.

Alternatively, you can use a simple imaging tool (such as MS Paint) to create a small image and then submit it to an online converter to generate a favicon file.

Use the <meta> tag to include meta information about the document, such as keywords, author, expiration date, and page generator.

The server includes this header in the HTTP response message when the page is downloaded: the browser, in response to this response header, redirects to the given URL after 3 seconds.

The server will include this response header in the response message when the page is downloaded. The <style> tag is used for embedded style declarations, covered earlier.

You can remove the box from the normal flow and specify its location with respect to either its containing element (position:absolute) or the browser window (position:fixed).

For non-static positioned elements, the new position is specified via top, left, bottom, right, width, height properties: The default position:static positions the element according to the normal flow of the page, in the order that is read by the browser.

To absolutely position an element in a containing element (other than <body>), declare the containing element relative without any movement, e.g., container { position:relative }.
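A sketch of that pattern, with arbitrary selector names and offsets:

    #container {
      position: relative;   /* establishes the reference box without moving the element */
    }
    #badge {
      position: absolute;   /* removed from the normal flow */
      top: 10px;            /* offsets measured from #container's box */
      right: 10px;
    }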

You can create animations (such as a bouncing ball or falling snow) by absolutely positioning (and repeatedly re-positioning) images on the browser's screen.

The float: left|right|none property lets you push an element to the left or right edge of its containing element.

If many images are floated together (say, to the left), the second image will be pushed against the right edge of the first image, and so on, as long as there is available horizontal space; otherwise, it wraps down to the next row.

For example, we can float many thumbnail images to the left as follows: To turn off the float, use the clear property and specify which side (left, right, or both) does not allow a floating element.
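A sketch of floated thumbnails with a cleared element after them (the class names are hypothetical):

    img.thumbnail {
      float: left;     /* each thumbnail lines up beside the previous one */
      margin: 5px;
    }
    .after-gallery {
      clear: both;     /* no floating elements allowed on either side */
    }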

The CSS property float: left|right|none (also applicable to <img>) floats the iframe to the left or right margin of the browser's window.

The clip property sets a rectangular mask in the form rect(top, right, bottom, left), where top, right, bottom, and left are relative to the containing block.

Web Scraping 101 with Python

After you're done reading, check out my follow-up to this post here.

Yea, yea, I know I said I was going to write more on pandas, but recently I've had a couple friends ask me if I could teach them how to scrape data.

While they said they were able to find a ton of resources online, all assumed some level of knowledge already.

If you mouse over that line in your browser's dev tools, you'll notice that it highlights the entire section of category links we want.

Now that we know which section holds all the links we want, let's write some code to find that section, and then grab all of the links within the <dd> elements.
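The post's code isn't reproduced in this excerpt; here is a minimal sketch of the idea using requests and BeautifulSoup (the URL and the attributes used to locate the section are placeholders, not the ones from the original post):

    import requests
    from bs4 import BeautifulSoup

    url = "http://example.com/best-of-categories"   # placeholder URL
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # Find the element that holds the category links, then grab every link inside its <dd> tags.
    section = soup.find("dl", {"id": "categories"})   # placeholder locator
    category_links = [dd.a["href"] for dd in section.find_all("dd") if dd.a]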

Hopefully this code is relatively easy to follow, but if not, here's what we're doing: Now that we have our list of category links, we'd better start going through them to get our winners and runners-up.

Now imagine you've written many more functions to scrape this data - maybe one to get addresses and another to get paragraphs of text about the winner - you've likely repeated those same two lines of code in these functions and you now have to remember to make changes in four different places.

When you notice that you've written the same lines of code a couple times throughout your script, it's probably a good idea to step back and think if there's a better way to structure that piece.
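In this case, the usual fix is a small helper that wraps the repeated fetch-and-parse lines, something along these lines (the function name is made up for illustration):

    def get_soup(url):
        """Fetch a page and return it parsed as a BeautifulSoup object."""
        response = requests.get(url)
        return BeautifulSoup(response.text, "html.parser")

Each scraping function can then start with a single call such as soup = get_soup(some_url), so any change to the fetching logic happens in one place.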

We'll have to change our other functions a bit now, but it's pretty minor - we just need to replace our duplicated lines with the following: Now that we have our main functions written, we can write a script to output the data however we'd like.

While both tasks are somewhat outside of my intentions for this post, if there's interest, let me know in the comments and I'd be happy to write more.

State to Adopt 2018 Building Codes

The state of South Carolina has published an Intent to Adopt the 2018 International Codes and the 2017 National Electrical Code.

You may now schedule your inspections online and receive the results of your inspection via email, your smartphone or tablet!

*Please remember to request inspections no later than 3 pm on the business day prior to the day you need the inspection. Beginning January 1st, 2018, all builders will be required to utilize the portal to request inspection services.

Our department has been assigned scores of 3 for Commercial Construction and 4 for Residential Construction, with 1 being perfect and 9 indicating no code enforcement.