AI News, The most powerful idea in data science


If you take an introductory statistics course, you’ll learn that a datapoint can be used to generate inspiration or to test a theory, but never both.

Even the odd apopheny — from the term apophenia (the human tendency to mistakenly perceive connections and meaning between unrelated things) — can get your creative juices flowing.

Occasionally, the facts you have aren’t the same as the facts you wish you had. (I’ve never mislaid my financial records, but I imagine the United States government wouldn’t be delighted with me if my response to losing them was to use the data imputation techniques I learned in grad school to pay my taxes statistically.)

When you don’t possess all the information required for the decision you’d love to make, you’ll need to navigate uncertainty as you try to pick a reasonable course of action.

Unfortunately, you’ll also find patterns in your data that don’t generalize beyond it — that’s the big challenge at the heart of data science: how not to wind up less informed as a result of looking at data.

The whole rhetoric of statistical hypothesis testing hinges on surprise, and it’s in bad taste to pretend to be surprised by a pattern you already know is in your data.

If you use up your dataset in your quest for inspiration, you can’t use it again to rigorously test the theory it inspired (no matter how much mathemagical jiu-jitsu you whip out, since math is never a counterspell to basic common sense).

If you have only one dataset, you’re forced to ask yourself: “Do I meditate in a closet, set up all my statistical testing assumptions, and then carefully take a rigorous approach so I can take myself seriously?”

If the pattern that inspired you in the first place also exists in the data that didn’t have a chance to influence your opinions, that’s a more promising vote in favor of the pattern being a general thing in the cat litter box you scooped your data from.

Some projects still have that problem today, especially in medical research (I used to be in neuroscience, so I have a lot of respect for how hard it is to work with small datasets), but many of you have so much data that you need to hire engineers just to move it all around… what’s your excuse?!

To take advantage of the best idea in data science, all you have to do is make sure you keep some test data out of reach of prying eyes, then let your analysts go wild on the rest. (I explain why in Machine Learning is Automated Inspiration.)

Anti-Pattern: Sharing page objects, using your UI to log in, and not taking shortcuts.

Given a button that we want to interact with, let’s investigate how we could target it. Targeting the element by tag, class, or ID is very volatile and highly subject to change.
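As a sketch, suppose the button under test looks like the hypothetical markup in the comment below; each of these selectors is fragile in its own way:

```javascript
// Markup under test (hypothetical):
// <button id="main" class="btn btn-large" data-cy="submit">Submit</button>

// Volatile selectors, coupled to structure and styling:
cy.get('button')           // breaks as soon as another button appears
cy.get('.btn.btn-large')   // breaks when CSS classes are refactored
cy.get('#main')            // breaks if the ID is renamed
```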

You may swap out the element, you may refactor CSS and update IDs, or you may add or remove classes that affect the style of the element.

Instead, adding the data-cy attribute to the element gives us a targeted selector that’s only used for testing.

The data-cy attribute will not change from CSS style or JS behavioral changes, meaning it’s not coupled to the behavior or styling of an element.
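Targeting the same hypothetical button through its dedicated test attribute is stable across refactors:

```javascript
// Survives styling and behavioral changes:
cy.get('[data-cy=submit]').click()
```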

When determining a unique selector, the Selector Playground will automatically prefer elements with `data-cy`, `data-test`, or `data-testid` attributes. After reading the above rules, you may be wondering: if I should always use data attributes, then when should I use cy.contains()?

A simple rule of thumb is to ask yourself: if the content of the element changed, would you want the test to fail?
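Concretely (assuming a hypothetical Submit button): if the visible label matters to the user, select by content; if only the behavior matters, select by the data attribute:

```javascript
// The test SHOULD fail if the button's label changes:
cy.contains('Submit').click()

// The test should NOT care about the label, only the behavior:
cy.get('[data-cy=submit]').click()
```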

If you’re familiar with Cypress commands already but find yourself using const, let, or var, then you’re typically trying to do one of two things: store and compare values, or share values between tests and hooks. For working with either of these patterns, please read our Variables and Aliases guide.
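Instead of const/let/var, work with yielded values via .then() or aliases. A sketch (the data-cy=username selector is hypothetical):

```javascript
// Compare or reuse a value inside a .then() callback
cy.get('[data-cy=username]')
  .invoke('text')
  .then((text) => {
    expect(text).to.not.be.empty
  })

// Or alias the value and retrieve it later in the same test
cy.get('[data-cy=username]').invoke('text').as('username')
cy.get('@username').then((username) => {
  cy.log(`logged in as ${username}`)
})
```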

You may want to access 3rd party servers in several situations. Initially, you may be tempted to use cy.visit() or use Cypress to traverse to the 3rd party login window.

However, you should never use your UI or visit a 3rd party site when testing. Let’s look at why, and at a few strategies for dealing with these situations.

For instance, if you try to test Google, Google will automatically detect that you are not a human and instead of giving you an OAuth login screen, they will make you fill out a captcha.

Additionally, testing through an OAuth provider is mutable: you first need a real user on their service, and then modifying anything on that user might affect other tests downstream.
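A common strategy is to log in programmatically against your own server with cy.request(), skipping both your UI and the provider's pages. A sketch, assuming a hypothetical /login endpoint that sets a session cookie:

```javascript
// Hypothetical custom command: log in via the API, not the UI
Cypress.Commands.add('loginByApi', (username, password) => {
  cy.request('POST', '/login', { username, password })
    .its('status')
    .should('eq', 200)
})

// In a test: the session cookie is already set, so just visit the page
cy.loginByApi('jane@example.com', 's3cret')
cy.visit('/dashboard')
```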

Typically, when going through scenarios like user registration or forgotten passwords, your server schedules an email to be delivered.

The easiest way to check that this happened is likely with a unit or integration test at the server level and not at the end-to-end level.

You only need to do one thing to know whether you’ve coupled your tests incorrectly, or if one test is relying on the state of a previous one: change the test to it.only and run it by itself. If it can run on its own and pass, your tests aren’t coupled.
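The fix for coupled tests is to reset state in a beforeEach hook rather than relying on what earlier tests left behind. A sketch, where the db:reset task name and the selectors are assumptions:

```javascript
describe('Dashboard', () => {
  beforeEach(() => {
    cy.task('db:reset')    // hypothetical task that truncates the test database
    cy.visit('/dashboard') // every test starts from the same known state
  })

  it('shows an empty list on a fresh database', () => {
    cy.get('[data-cy=todo-item]').should('have.length', 0)
  })
})
```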

This approach is ideal because we are resetting the state between each test, ensuring nothing from previous tests leaks into subsequent ones.

Best Practice: Add multiple assertions and don’t worry about it. We’ve seen many users writing a separate test for every single assertion. While technically this runs fine, it is really excessive and not performant.

Because nearly every command has a default assertion (and can therefore fail), even by limiting your assertions you’re not saving yourself anything because any single command could implicitly fail.
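As a sketch (the data-cy=header selector and the text are hypothetical), compare one assertion per test with a single test that asserts several things:

```javascript
// Excessive: a separate test for every assertion
it('is an h1', () => {
  cy.get('[data-cy=header]').should('match', 'h1')
})
it('says Todos', () => {
  cy.get('[data-cy=header]').should('contain', 'Todos')
})

// Better: one test, multiple chained assertions
it('renders the header correctly', () => {
  cy.get('[data-cy=header]')
    .should('match', 'h1')
    .and('contain', 'Todos')
})
```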

We see many of our users adding code to an after or afterEach hook in order to clean up the state generated by the current test(s).

Unlike other testing tools, when your tests end you are left with your working application at the exact point where your test finished.

This enables you to write partial tests that drive your application step by step, writing your test and application code at the same time.

This means your application will behave identically while it is running Cypress commands or when you manually work with it after a test ends.

In order to debug your application or write a partial test, you would always be left commenting out your custom cy.logout() command.

The idea goes like this: After each test I want to ensure the database is reset back to 0 records so when the next test runs, it is run with a clean state.

If, hypothetically, you have written this command because it has to run before the next test does, then the absolute worst place to put it is in an after or afterEach hook.

Because if you refresh Cypress in the middle of the test - you will have built up partial state in the database, and your custom cy.resetDb() function will never get called.
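In code, the contrast looks like this (cy.resetDb() stands in for the custom reset command discussed above):

```javascript
// Anti-pattern: cleanup in an after hook; skipped if the run is interrupted
afterEach(() => {
  cy.resetDb()
})

// Best practice: reset before each test, so every test starts clean
beforeEach(() => {
  cy.resetDb()
})
```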

Let’s imagine the following example. Waiting after cy.request() is unnecessary, since the cy.request() command will not resolve until it receives a response from your server.
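A sketch of the anti-pattern (the seed URL is hypothetical):

```javascript
// Anti-pattern: arbitrary wait after a request
cy.request('http://localhost:8080/db/seed')
cy.wait(5000) // unnecessary: cy.request() only resolves once the response arrives

// Better: chain assertions directly off the response
cy.request('http://localhost:8080/db/seed').its('status').should('eq', 200)
```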

Trying to start a web server from cy.exec() or cy.task() causes all kinds of problems, and you can’t reliably shut the process down in an after hook either: there is no guarantee that after hooks will run (for example, if you refresh Cypress mid-test, the hook is skipped and the orphaned server keeps its port). Instead, start your server before running Cypress and shut it down after Cypress exits.

Having a baseUrl set gives you the added bonus of seeing an error if your server is not running during cypress open at the specified baseUrl.
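A minimal sketch of setting it (cypress.config.js, Cypress 10+ style; the port is an assumption about your dev server):

```javascript
// cypress.config.js
const { defineConfig } = require('cypress')

module.exports = defineConfig({
  e2e: {
    // hypothetical local dev server; adjust to your setup
    baseUrl: 'http://localhost:3000',
  },
})
```

With baseUrl set, cy.visit('/') resolves against it, and cypress open warns immediately if nothing is listening at that address.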


Is manual pattern recognition a valid scientific tool if followed up by statistical analysis?

Only if the follow-up analysis is done on fresh data; otherwise, what you describe is a bad idea because it strongly biases the p-values.

However, it does happen and it was a factor in many big cases within the replication crisis in social science, including the work of Brian Wansink and Amy Cuddy.

The topic itself has been discussed a tremendous amount in recent years, and you can find more through googling 'p-hacking', 'fishing for significance', or Andrew Gelman's favoured terms 'researcher degrees of freedom' and 'garden of forking paths' (these encompass a great deal more than the behaviour you describe, though).