AI News, High versus low-level data science
- On Thursday, March 8, 2018
High versus low-level data science
By high level, I mean data science not done by statisticians, but by decision makers accessing, digging into, and understanding summary (dashboard) data to quickly make a business decision with immediate financial impact.
About the problem and data set
The problem studied here was solved by the decision maker (a data scientist), using high-level data only, that is, highly summarized data.
The data scientist in question monitors carefully selected compound web traffic metrics (KPIs, usually ratios) and noticed a spectacular drop in one of them. Access to highly granular (low-level) data was not easy to get, but the carefully selected and crafted dashboard summaries were sufficient to detect and address the issue with a one-week turnaround, via a number of tests described in the next section.
But we did use session duration, number of pages, and conversions, per day and per referral, probing the summary data 2-3 times per day to check the results of a number of tests and fine-tunings; in short, to check and quantify the impact on performance.
It was initially believed that the performance drop (click conversion falling from above 20% to below 10%) was due to click fraud, though we did not rule out a reporting glitch.
We assumed it could be an advertiser trying to deplete budgets from its competitors, hitting all data science related keywords and advertisers (we've proved in the past, with a real test, that you can indeed eliminate all your competitors on Google AdWords).
So we created another, smaller campaign - let's call it campaign B - with a slightly different mix of keywords, and especially using a redirect URL (www.datascienceworld.com) rather than the actual destination URL www.datasciencecentral.com, just in case the perpetrator was targeting us in particular.
Surprisingly, even with the new campaign B and its fewer keywords, our daily budget was not depleted (we could not reach our daily ad spend target; we were well below it), and the ads were still showing on Google searches as usual for our target keywords.
We are also confident, based on our observations, that the problem will fix itself on Google AdWords, and that we will be able to return to higher levels of advertising (with Google) after careful testing and permanent monitoring. Finally, we optimized our bidding to maximize total conversions (on Google AdWords), rather than total clicks.
This bias is a bigger issue for digital publishers that own multiple channels, each one having its own domain name rather than a sub-domain, and where visit paths typically span multiple channels (e.g., a visit starting on one channel's domain and continuing on another).
Figure 2: Google Analytics reports session duration 40% below real value

These zero-second, one-page visits (as well as 10-minute, two-page or fifty-page visits) scare digital publishers and advertisers, as they look like artificial traffic.
Correctly measuring duration of web sessions, via imputation
We focus here on measuring the duration of one-page sessions only (currently measured as zero seconds), as this will solve the whole problem. The idea is to impute the missing durations by extrapolating from comparable sessions. The extrapolation step works as follows (see Figure 2).
The bucketization step (sometimes called multivariate binning) consists of identifying metrics (and combinations of 2-3 metrics) with high predictive power, then combining and binning them appropriately to reduce intra-bucket variance while keeping the buckets large enough.
And as in all inference processes, don't forget the very important cross-validation step. You can go one step further by defining a similarity metric between data buckets, and extrapolate using multiple, similar buckets, rather than just the target bucket alone.
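The bucketization-and-extrapolation idea above can be sketched in a few lines of Python. The field names, the referral-only bucketing, and the per-page averaging are illustrative assumptions for this sketch, not the author's actual pipeline.

```python
# Hypothetical sketch: impute the duration of one-page sessions (recorded
# as zero) from the average per-page time of multi-page sessions in the
# same bucket. Here a bucket is simply the referral; in practice you would
# combine and bin 2-3 predictive metrics.
from collections import defaultdict

def impute_one_page_durations(sessions):
    """sessions: list of dicts with 'referral', 'pages', 'duration' (seconds)."""
    # Bucketization step: collect per-page times from multi-page sessions.
    per_page = defaultdict(list)
    for s in sessions:
        if s['pages'] > 1:
            per_page[s['referral']].append(s['duration'] / s['pages'])

    # Extrapolation step: assign each zero-duration one-page session the
    # average per-page time of its bucket (left at zero if bucket is empty).
    imputed = []
    for s in sessions:
        if s['pages'] == 1 and s['duration'] == 0:
            rates = per_page.get(s['referral'])
            est = sum(rates) / len(rates) if rates else 0
            imputed.append({**s, 'duration': est})
        else:
            imputed.append(s)
    return imputed
```

A similarity metric between buckets, as suggested above, would replace the single `per_page[referral]` lookup with a weighted average over several similar buckets.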
Who's downloading pirated papers? Everyone
Just as spring arrived last month in Iran, Meysam Rahimi sat down at his university computer and immediately ran into a problem: how to get the scientific papers he needed.
Purchasing the papers was going to cost $1000 this week alone—about as much as his monthly living expenses—and he would probably need to read research papers at this rate for years to come.
“Publishers give nothing to the authors, so why should they receive anything more than a small amount for managing the journal?”

Many academic publishers offer programs to help researchers in poor countries access papers, but only one, called Share Link, seemed relevant to the papers that Rahimi sought.
I asked her for the data because, in spite of the flurry of polarized opinion pieces, blog posts, and tweets about Sci-Hub and what effect it has on research and academic publishing, some of the most basic questions remain unanswered: Who are Sci-Hub’s users, where are they, and what are they reading?
After establishing contact through an encrypted chat system, she worked with me over the course of several weeks to create a data set for public release: every download event over the 6-month period starting 1 September 2015, including the digital object identifier (DOI) for every paper.
(The data set and details on how it was analyzed are freely accessible.) Server log data for the website Sci-Hub from September 2015 through February 2016 paint a revealing portrait of its users and their diverse interests.
The United States is the fifth largest downloader after Russia, and a quarter of the Sci-Hub requests for papers came from the 34 members of the Organization for Economic Cooperation and Development, the wealthiest nations with, supposedly, the best journal access.
In October last year, a New York judge ruled in favor of Elsevier, decreeing that Sci-Hub infringes on the publisher’s legal rights as the copyright holder of its journal content, and ordered that the website desist.
Elbakyan declined to say exactly how she obtains the papers, but she did confirm that it involves online credentials: the user IDs and passwords of people or institutions with legitimate access to journal content.
“I cannot confirm the exact source of the credentials,” Elbakyan told me, “but can confirm that I did not send any phishing emails myself.” So by design, Sci-Hub’s content is driven by what scholars seek.
It has news articles from scientific journals—including many of mine in Science—as well as copies of open-access papers, perhaps because of confusion on the part of users or because they are simply using Sci-Hub as their all-in-one portal for papers.
The flow of Sci-Hub activity over time reflects the working lives of researchers, growing over the course of each day and then ebbing—but never stopping—as night falls.
(There is an 18-day gap in the data starting 4 November 2015 when the domain sci-hub.org went down and the server logs were improperly configured.) By the end of February, the flow of Sci-Hub papers had risen to its highest level yet: more than 200,000 download requests per day.
The GWU press office responded defensively, sending me to an online statement that the university recently issued about the impact of journal subscription rate hikes on its library budget.
Researchers in Argentina may have trouble obtaining some specialty journals, she notes, but “most of them have no problem accessing big journals because the government pays the subscription at all the public universities around the country.” Even for journals to which the university has access, Sci-Hub is becoming the go-to resource, says Gil Forsyth, another GWU engineering Ph.D. student.
The GWU library system “offers a document delivery system specifically for math, physics, chemistry, and engineering faculty,” I was told by Maralee Csellar, the university’s director of media relations.
“Graduate students who want to access an article from the Elsevier system should work with their department chair, professor of the class, or their faculty thesis adviser for assistance.”

The intense Sci-Hub activity in East Lansing reveals yet another motivation for using the site.
Bill Hart-Davidson, MSU’s associate dean for graduate education, suggests that the likely answer is “text-mining,” the use of computer programs to analyze large collections of documents to generate data.
“It suggests an almost complete failure to provide a path of access for these researchers.” He works for a company that publishes some of the most heavily downloaded content on Sci-Hub and requested anonymity so he could speak candidly.
For researchers at institutions that cannot afford access to journals, he says, the publishers “need to make subscription or purchase more reasonable for them.” Richard Gedye, the director of outreach programs for STM, the International Association of Scientific, Technical and Medical Publishers, disputes this.
Institutions in the developing world that take advantage of the publishing industry’s outreach programs “have the kind of breadth of access to peer-reviewed scientific research that is pretty much the equivalent of typical institutions in North America or Europe.” And for all the researchers at Western universities who use Sci-Hub instead, the anonymous publisher lays the blame on librarians for not making their online systems easier to use and educating their researchers.
The authentication systems that university researchers must use to read subscription journals from off campus, and even sometimes on campus with personal computers, “are there to enforce publisher restrictions,” she says.
Although Sci-Hub helps a great many researchers, he notes, it may also carry a “strategic cost” for the open-access movement, because publishers may take advantage of “confusion” over the legality of open-access scholarship in general and clamp down.
“Lawful open access forces publishers to adapt,” he says, whereas “unlawful open access invites them to sue instead.” Even if arrested, Elbakyan says Sci-Hub will not go dark.
If she runs out of credentials for pirating fresh content, that gap will grow again, however—and publishers and universities are constantly devising new authentication schemes that she and her supporters will need to outsmart.
Google Analytics: 21 Inaccurate Traffic Sources, Setup Mistakes …and Fixes
Although we’d like to believe that search traffic is everyone who discovered us through a keyword search, there are two problems here: branded searches and hidden keyword data.
Here’s what happens: Anytime you search using Google and you see an https:// (rather than http://, with no ‘s’) in the address bar of the browser, the phrase for which you searched will not be shared with the website owner if you click on a link in Google search results.
The keyphrase will appear as “(not provided).” Keyword data is also hidden for anyone using the search field in the Firefox browser, anyone using the Chrome “Omnibox” (the combined address bar and search field), and anyone logged into any Google product.
But in practice, it often includes traffic driven from social media efforts, email and in some cases, traffic from one page on your site to another.
Google is working on better segmentation of social traffic, but so far, social traffic isn’t even part of the traffic sources overview report!
Fix 1: Some social media tools can be set up to add campaign tracking code to shortened links.
If you consistently tweet or post from a service like the pro version of HootSuite, you can have this code added automatically to all shortened links.
If someone opens your newsletter in a webmail service such as Gmail or Yahoo! Mail, the traffic is recorded as a referral, even though the real traffic source was a newsletter, something you'd like to track separately.
Your email service provider may have a feature that makes this easy, or you can use this guide on how to use the Google Analytics URL builder.

Link issues often lead to inaccurate referral traffic data.
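For illustration, here is a minimal Python sketch of what the URL builder produces: a link tagged with the standard utm_source, utm_medium, and utm_campaign parameters. The URL and campaign names below are placeholders, not values from this article.

```python
# Build a campaign-tagged URL with the standard Google Analytics
# UTM parameters, so a newsletter click is attributed to the campaign
# rather than lumped into referral or direct traffic.
from urllib.parse import urlencode

def tag_url(url, source, medium, campaign):
    params = urlencode({
        'utm_source': source,      # where the link is published, e.g. the newsletter
        'utm_medium': medium,      # the channel, e.g. 'email'
        'utm_campaign': campaign,  # a name you pick for this send
    })
    # Append with '?' or '&' depending on whether the URL already has a query.
    sep = '&' if '?' in url else '?'
    return url + sep + params

tag_url('http://www.site.com/page', 'newsletter', 'email', 'march-issue')
# → 'http://www.site.com/page?utm_source=newsletter&utm_medium=email&utm_campaign=march-issue'
```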
When you create a link from one page to another, you can add the full address (www.site.com/page) or just add the relative address (/page without the www.site.com).
Here is the true definition of direct traffic: All traffic for which a referrer wasn’t specified. In other words, direct traffic isn’t just people who entered your address into a browser, it’s any visit that isn’t from a link on a website or search engine and doesn’t have campaign code.
Fix: Some apps will automatically shorten links, adding a referrer by redirecting the visitor through another address (official Twitter apps send traffic through the t.co address).
But no matter how carefully you tweet and post for better tracking, you can’t make the rest of your fans and followers do the same.
As with the referral traffic / web mail problem, some of the traffic from that newsletter you sent will be opened in email programs like Outlook or Mail.
But if the tracking code has problems, it doesn’t put the visit back into referring traffic, it puts it into direct traffic.
If you don’t have filters properly set up, it’s possible that a significant portion of your traffic is coming from you, your office, or your team.
It’s really a unique device, so if one person visits your site from three devices (including the one in their pocket), then Analytics will tell you that three unique people visited.
If you set up too many goals, including event tracking goals, you’ll artificially inflate your overall conversion rate.
If, for example, you separate all mobile traffic to a separate profile, you can’t easily compare it to your other traffic data.
Profiles are meant to segment out completely different types of visits from your normal marketing stats, such as traffic within a login area.

One last reason why Google Analytics isn’t giving you the full picture: lag.
You can help by making sure things are set up properly. We’re hoping to make this article a comprehensive list of problems with traffic sources.
The token bucket is an algorithm used in packet switched computer networks and telecommunications networks.
It can be used to check that data transmissions, in the form of packets, conform to defined limits on bandwidth and burstiness (a measure of the unevenness or variations in the traffic flow).
It can also be used as a scheduling algorithm to determine the timing of transmissions that will comply with the limits set for the bandwidth and burstiness: see network scheduler.
The token bucket algorithm is based on an analogy of a fixed capacity bucket into which tokens, normally representing a unit of bytes or a single packet of predetermined size, are added at a fixed rate.
When a packet is to be checked for conformance to the defined limits, the bucket is inspected to see if it contains sufficient tokens at that time.
If so, the appropriate number of tokens, e.g. equivalent to the length of the packet in bytes, are removed ('cashed in'), and the packet is passed, e.g., for transmission.
A conforming flow can thus contain traffic with an average rate up to the rate at which tokens are added to the bucket, and have a burstiness determined by the depth of the bucket.
This burstiness may be expressed in terms of either a jitter tolerance, i.e. how much sooner a packet might conform (e.g. arrive or be transmitted) than would be expected from the limit on the average rate, or a burst tolerance or maximum burst size, i.e. how much more than the average level of traffic might conform in some finite period.
The token bucket algorithm can be conceptually understood as follows:
- A token is added to the bucket every 1/r seconds.
- The bucket can hold at most b tokens; if a token arrives when the bucket is full, it is discarded.
- When a packet of n bytes arrives, if at least n tokens are in the bucket, n tokens are removed and the packet is passed.
- If fewer than n tokens are available, no tokens are removed and the packet is considered non-conforming.
Implementers of this algorithm on platforms lacking the clock resolution necessary to add a single token to the bucket every 1/r seconds may prefer an alternative formulation: given the ability to update the token bucket every S milliseconds, the number of tokens to add every S milliseconds is r·S/1000.
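Putting these pieces together, here is a minimal Python sketch of a token bucket meter using the periodic-update formulation; the class and parameter names are ours, not taken from any particular implementation, and rates are in bytes per second.

```python
# Minimal token bucket meter: tokens (one per byte) accumulate at a fixed
# rate up to the bucket depth; a packet conforms if enough tokens are
# available, in which case the tokens are 'cashed in'.
class TokenBucket:
    def __init__(self, rate_bytes_per_sec, depth_bytes):
        self.rate = rate_bytes_per_sec   # token fill rate r
        self.depth = depth_bytes         # bucket depth b (maximum burst size)
        self.tokens = depth_bytes        # start with a full bucket

    def update(self, s_milliseconds):
        # Periodic update: tokens to add every S milliseconds = r * S / 1000,
        # capped at the bucket depth.
        self.tokens = min(self.depth,
                          self.tokens + self.rate * s_milliseconds / 1000)

    def conforms(self, packet_bytes):
        # Conformance check: remove tokens and pass the packet if possible;
        # otherwise the packet is non-conforming (to be dropped, queued,
        # or reduced in priority) and the bucket is left unchanged.
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes
            return True
        return False
```

A burst of up to `depth_bytes` conforms immediately, after which the flow is limited to the long-run average `rate_bytes_per_sec`, matching the behavior described above.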
In traffic policing, nonconforming packets may be discarded (dropped) or may be reduced in priority (for downstream traffic management functions to drop if there is congestion).
Traffic policing and traffic shaping are commonly used to protect the network against excess or excessively bursty traffic, see bandwidth management and congestion avoidance.
Traffic shaping is commonly used in the network interfaces in hosts to prevent transmissions being discarded by traffic management functions in the network.
The token bucket algorithm is directly comparable to one of the two versions of the leaky bucket algorithm described in the literature. This comparable version of the leaky bucket is described on the relevant Wikipedia page as the leaky bucket algorithm as a meter.
This is a mirror image of the token bucket, in that conforming packets add fluid, equivalent to the tokens removed by a conforming packet in the token bucket algorithm, to a finite capacity bucket, from which this fluid then drains away at a constant rate, equivalent to the process in which tokens are added at a fixed rate.
There is, however, another version of the leaky bucket algorithm, described on the relevant Wikipedia page as the leaky bucket algorithm as a queue.
This is a special case of the leaky bucket as a meter, which can be described by the conforming packets passing through the bucket.
The leaky bucket as a queue is therefore applicable only to traffic shaping, and does not, in general, allow the output packet stream to be bursty, i.e. it is jitter free.
The hierarchical token bucket (HTB) is a faster replacement for the class-based queueing (CBQ) queuing discipline in Linux. It is useful to limit a client's download/upload rate so that the limited client cannot saturate the total bandwidth.
In HTB, rate means the guaranteed bandwidth available for a given class and ceil is short for ceiling, which indicates the maximum bandwidth that class is allowed to consume.
Any bandwidth used between rate and ceil is borrowed from a parent class, hence the suggestion that rate and ceil be the same in the top-level class.
Hierarchical Token Bucket implements a classful queuing mechanism for the Linux traffic control system, and provides rate and ceil to allow the user to control the absolute bandwidth granted to particular classes of traffic, as well as the ratio by which extra bandwidth (up to ceil) is distributed when it becomes available.
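As a toy illustration of the rate/ceil semantics (not the actual Linux HTB code), the sketch below allocates a parent's bandwidth: each class first receives its guaranteed rate, and spare parent bandwidth is then lent out without letting any class exceed its ceil.

```python
# Illustrative HTB-style allocation: 'rate' is the guaranteed bandwidth,
# 'ceil' the maximum a class may consume by borrowing from the parent.
def htb_allocate(parent_capacity, classes):
    """classes: list of dicts with 'rate', 'ceil', 'demand' (same units)."""
    # First pass: every class gets its guaranteed rate (capped by demand).
    alloc = [min(c['demand'], c['rate']) for c in classes]
    spare = parent_capacity - sum(alloc)
    # Second pass: lend spare parent bandwidth, respecting each ceil.
    for i, c in enumerate(classes):
        extra = min(spare, c['demand'] - alloc[i], c['ceil'] - alloc[i])
        if extra > 0:
            alloc[i] += extra
            spare -= extra
    return alloc
```

For example, with a 10-unit parent and two children both guaranteed 4 units, a busy child with ceil 10 can borrow what an idle sibling leaves unused, which is exactly the borrowing behavior described above.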