AI News, Engineering Data Science at Automattic

Engineering Data Science at Automattic

Most data scientists have to write code to analyze data or build products.

Hence, data scientists tend to come from various backgrounds, and it is common to encounter data scientists with no formal training in computer science or software engineering.

For example, in a post on software development skills for data scientists, Trey Causey describes the following imaginary dialogue between a software engineer and a data scientist: While this is an extreme example, it is a good demonstration of the knowledge gaps of new Type A data scientists.

However, in the past few months we have improved our use of version control in the following ways: To make it easier to share code across projects, we converted our collection of reusable scripts to a private Conda package.

In other cases, the shared code was just assumed to be at a specific local path, which made it hard to reproduce results if the shared code changed in ways that broke the dependent project.
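One way to picture the benefit of a versioned package over a shared local path is a check that the environment holds the exact version a project was developed against. The sketch below is only an illustration under assumptions: the package name `shared-ds-utils` and the version `1.4.2` are hypothetical placeholders, not details from the post, and the check uses Python's standard `importlib.metadata` rather than anything Conda-specific.

```python
from importlib import metadata
from typing import Optional

# Hypothetical name and version of a shared internal package. Both are
# placeholders for illustration, not names taken from the article.
SHARED_PACKAGE = "shared-ds-utils"
PINNED_VERSION = "1.4.2"


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pin(package: str, pinned: str) -> str:
    """Describe whether the installed package matches the pinned version."""
    found = installed_version(package)
    if found is None:
        return f"{package} is not installed; expected {pinned}"
    if found != pinned:
        return f"{package} {found} is installed; expected {pinned}"
    return f"{package} {pinned} matches the pin"


if __name__ == "__main__":
    print(check_pin(SHARED_PACKAGE, PINNED_VERSION))
```

A project that records its pin this way fails loudly when the shared code has moved on, instead of silently producing different results.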

While code reviews are known to be one of the most effective ways of improving code quality, their benefits extend beyond finding issues in the code that’s being reviewed.

Some of those benefits were listed by Kerry Liu (one of Automattic’s JavaScript engineers) in an internal guide to code reviews that is also available as a blog post: In addition to improving the quality of our projects, code reviews and our other efforts in adopting best practices address the main problem identified in Trey Causey’s article, that “many new data scientists don’t know how to effectively collaborate.”

As with all things that we do, we are committed to never stop learning, and strive to further improve our processes.

To get the answers, I asked Dr. Nicole Forsgren, director of organizational performance and analytics at Chef Software, and Ohad Assulin, chief data scientist at Hewlett Packard Enterprise Software, to explain what data scientists actually do and how you as a software engineer can work effectively with them—and perhaps add a few of those in-demand data science skills to your own CV.

While the traditional BI role was typically more database-centric, often analyzing offline data, data scientists tend to have a stronger background in statistics, predictive analytics techniques, and the implementation of algorithms on real-time or near-real-time data.

To understand what data science means for software developers, you need to understand the answers to three questions: To make your SDLC process more efficient, Forsgren says, you need to think about your goal and keep in mind that performance and effectiveness are best measured at the team level, rather than at the individual level.

If that data isn’t available, the data scientist will need to work with the developers and the operations engineers to make that data available by getting access to the source code repository (such as Git).

But a good data scientist first takes a step back, asking, “What are the questions I can ask?” and, “What data do I need to answer them?” The data scientist may need to ask developers to add hooks to capture additional data, if the existing production data is insufficient.

When data scientists are developing software, they could be writing anything from pseudo-code to fully productized code, for things from data collection to number crunching to visualizing and presenting the results.

If you’re asking for insight into the kinds of problems on which they can help or an analysis based on data, you’ll get a report or presentation expressed in plain business language that all stakeholders can understand.

Assulin says that data scientists must give the consumer something that’s easy to work with, whether in the form of a library or microservice, that integrates easily with the main product’s code.

However, Assulin cautions that a lone data scientist may be limited by not having anyone else to bounce ideas off of, and the highly mathematical nature of the code can make code reviews difficult.

Startups with a well-defined data problem should include their data scientists in the teams, whereas larger organizations with a variety of problems and data will do better with a team of data scientists who can support one another while providing data science services to the rest of the organization.

One of Assulin’s roles is to educate business analysts and product owners on techniques and analytical tools that are available to the business, such as explaining how the data scientist can make predictions about the future based on past history, gather insights with data clustering, or make recommendations based on user behavior.

Forsgren agrees and adds that you should also ask business-related questions, such as, “How do you see a data scientist adding value to my business?” The first data scientist to join your business should have initiative and understand what value he or she can bring.

Less experienced candidates should be able to cite at least some contribution to a data science project, for example as part of a data science boot camp or university-level project.

Assulin says that when he’s recruiting data scientists, his baseline is a computer science degree, or at least significant experience in software development, because data scientists are expected to write production-level code that is part of the product.

As a developer with a clear understanding of data science concepts and how data scientists work, you'll be positioned to collaborate with data scientists while expanding your own expertise in this growing discipline.

A Beginner’s Guide to Data Engineering — Part I

The more experienced I become as a data scientist, the more convinced I am that data engineering is one of the most critical and foundational skills in any data scientist’s toolkit.

In an earlier post, I pointed out that a data scientist’s capability to convert data into value is largely correlated with the stage of her company’s data infrastructure as well as how mature its data warehouse is.

Furthermore, many of the great data scientists I know are not only strong in data science but are also strategic in leveraging data engineering as an adjacent discipline to take on larger and more ambitious projects that are otherwise not reachable.

Given that I am now a huge proponent for learning data engineering as an adjacent discipline, you might find it surprising that I had the completely opposite opinion a few years ago — I struggled a lot with data engineering during my first job, both motivationally and emotionally.

Instead, my job was much more foundational: to maintain critical pipelines that tracked how many users visited our site, how much time each reader spent reading content, and how often people liked or retweeted articles.

I was thrown into the wild west of raw data, far away from the comfortable land of pre-processed, tidy .csv files, and I felt unprepared and uncomfortable working in an environment where this is the norm.

Over time, I discovered the concept of instrumentation, hustled with machine-generated logs, parsed many URLs and timestamps, and most importantly, learned SQL (yes, in case you were wondering, my only exposure to SQL prior to my first job was Jennifer Widom’s awesome MOOC).

Nowadays, I understand that counting carefully and intelligently is what analytics is largely about, and this type of foundational work is especially important when we live in a world filled with constant buzzwords and hype.
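To make "counting carefully" concrete, here is a minimal sketch of turning raw log lines into daily unique-visitor counts per page. Everything in it is assumed for illustration: the log layout (`<ISO timestamp> <user id> <URL>`), the sample lines, and the function names are made up, and real instrumentation logs are usually messier.

```python
from collections import Counter
from datetime import datetime
from urllib.parse import urlparse

# Made-up log lines in an assumed "<ISO timestamp> <user id> <URL>" layout.
SAMPLE_LOG = [
    "2018-01-08T09:15:02 u1 https://example.com/articles/42?ref=home",
    "2018-01-08T09:15:40 u2 https://example.com/articles/42",
    "2018-01-08T09:20:11 u1 https://example.com/articles/42",
    "2018-01-08T23:59:59 u1 https://example.com/articles/7",
    "2018-01-09T00:00:01 u3 https://example.com/articles/42?ref=email",
    "not a valid line",
]


def parse_line(line):
    """Return (date, user_id, url_path), or None for a malformed line."""
    parts = line.split()
    if len(parts) != 3:
        return None
    raw_ts, user_id, url = parts
    try:
        ts = datetime.strptime(raw_ts, "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        return None
    # Drop the query string so tracking parameters do not split the counts.
    return ts.date(), user_id, urlparse(url).path


def daily_unique_visitors(lines):
    """Count distinct users per (day, page), skipping malformed lines."""
    seen = set()
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        day, user_id, path = parsed
        key = (day, path, user_id)
        if key not in seen:  # a repeat visit by the same user counts once
            seen.add(key)
            counts[(day.isoformat(), path)] += 1
    return counts


if __name__ == "__main__":
    for (day, path), n in sorted(daily_unique_visitors(SAMPLE_LOG).items()):
        print(day, path, n)
```

The choices that look trivial here (what counts as the same page, whether a repeat visit counts twice, what to do with a malformed line) are exactly where careless counting goes wrong.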

Among the many advocates who pointed out the discrepancy between the grinding aspect of data science and the rosier depictions that media sometimes portrayed, I especially enjoyed Monica Rogati’s call out, in which she warned against companies who are eager to adopt AI: This framework puts things into perspective.

One recipe for disaster is for a startup to hire as its first data contributor someone who specializes only in modeling but has little or no experience in building the foundational layers that are the prerequisite of everything else (I called this “The Hiring Out-of-Order Problem”).

Even for modern courses that encourage students to scrape, prepare, or access raw data through public APIs, most of them do not teach students how to properly design table schemas or build data pipelines.

Maxime Beauchemin, the original author of Airflow, characterized data engineering in his fantastic post The Rise of Data Engineer: Among the many valuable things that data engineers do, one of their highly sought-after skills is the ability to design, build, and maintain data warehouses.

Below are a few specific examples that highlight the role of data warehousing for different companies in various stages: Without these foundational warehouses, every activity related to data science becomes either too expensive or not scalable.

To understand this flow more concretely, I found the following picture from Robinhood’s engineering blog very useful: While all ETL jobs follow this common pattern, the actual jobs themselves can be very different in usage, utility, and complexity.

Here is a very simple toy example of an Airflow job: The example above simply prints the date in bash every day after waiting for a second to pass after the execution date is reached, but real-life ETL jobs can be much more complex.

Another ETL can take in some experiment configuration file, compute the relevant metrics for that experiment, and finally output p-values and confidence intervals in a UI to inform us whether the product change is preventing user churn.
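As a rough illustration of that second kind of job, the sketch below walks through extract, transform, and load steps for a churn experiment in plain Python. It is a sketch under assumptions, not the pipeline described in the post: the data is made up, the two-sided two-proportion z-test and the normal-approximation 95% interval are one common choice of statistics, and the "load" step simply appends to a list where a real job would write to a table or UI backend.

```python
import math


def extract():
    """Return made-up (variant, churned) rows standing in for a warehouse read."""
    rows = [("control", True)] * 120 + [("control", False)] * 880
    rows += [("treatment", True)] * 90 + [("treatment", False)] * 910
    return rows


def transform(rows):
    """Compute churn rates, a two-sided p-value, and a 95% CI for the difference."""
    counts = {"control": [0, 0], "treatment": [0, 0]}  # [churned, total]
    for variant, churned in rows:
        counts[variant][0] += int(churned)
        counts[variant][1] += 1
    (x_c, n_c), (x_t, n_t) = counts["control"], counts["treatment"]
    p_c, p_t = x_c / n_c, x_t / n_t
    diff = p_t - p_c

    # Two-proportion z-test with a pooled standard error under the null.
    # Assumes both groups are non-empty and the pooled rate is not 0 or 1.
    pooled = (x_c + x_t) / (n_c + n_t)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail

    # Normal-approximation 95% interval with an unpooled standard error.
    se_unpooled = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    margin = 1.96 * se_unpooled
    return {
        "churn_control": p_c,
        "churn_treatment": p_t,
        "diff": diff,
        "p_value": p_value,
        "ci_low": diff - margin,
        "ci_high": diff + margin,
    }


def load(result, sink):
    """Append the result to an in-memory sink (a stand-in for a real store)."""
    sink.append(result)


if __name__ == "__main__":
    dashboard_rows = []
    load(transform(extract()), dashboard_rows)
    print(dashboard_rows[0])
```

Keeping the three steps as separate functions mirrors how a workflow framework would schedule them as separate tasks.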

Regardless of the framework that you choose to adopt, a few features are important to consider: Naturally, as someone who works at Airbnb, I really enjoy using Airflow and I really appreciate how it elegantly addresses a lot of the common problems that I encountered during data engineering work.

Computer Science VS Software Engineering — Which Major Is Best For You?

Two of the most common questions my audience asks me are: And… In this article, I’ll answer this and give you my own quick analysis on these majors.

For each major, here are some of the titles alumni hold, and where they work: As you can see, there isn’t a huge difference between the types of jobs you can get.

They both cover a few fundamental computer science courses, and a few math courses in linear algebra and calculus.

The core computer science requirements are similar as well, ranging over algorithms, data structures, and operating systems.

At this particular university (University of Waterloo), with this particular set of program requirements, Computer Science is a better major if you want to be a software engineer.

Just for simplicity, let’s suppose that you are hoping to get one of the highest paying jobs (~$100,000 USD / year) as a software engineer in North America.

These jobs are typically at large software companies (think Microsoft, Google, Amazon, etc.) or at medium-sized, high-growth companies (think Dropbox, Lyft, Snapchat, Pinterest, etc.).

Typically, what they look for in a software engineer candidate is the ability to write solid code and build interesting projects, as well as computer science fundamentals including data structures and algorithms.

I think the best way to cultivate this skill set is by quickly learning computer science fundamentals, and spending your own time practicing solving problems and writing code.

I’m sure there are some benefits to learning software engineering fundamentals (project management, design, testing, etc.).

This article should be a good starting point, but you should still take a look at the program requirements at the university you’re interested in attending.

Software Engineer (Applied Data Science) - Revenue Science, Ads Marketplace

Who We Are: The ads marketplace team is responsible for placing each and every ad that Twitter serves.

Our team implements and builds software frameworks for the revenue marketplace, optimizes the ad delivery engine and manages demand/supply by employing software engineering and applied data science skills.

You will build high-quality software at scale, experiment, make data-driven decisions, optimize for impact, measure our product funnels, and apply machine learning and data science.

The small teams of talented, passionate people in which you’ll work will include engineers and data scientists from across the revenue engineering organization.

4 Things Data Scientists Should Learn From Software Engineers

Version control systems (VCS) are great tools not only because they let us share and sync the same code (or even just files) between different team members.

These bad practices have a double effect. On the one hand, the data scientist feels that the tool doesn’t bring enough value, while needing to invest a lot of time in making it work (complicated merges, code that has changed in unexpected ways, and more).

While doing ML research, I think that there are two main places where automation is crucial for doing efficient research: In regular software development, there is a clear distinction between development efforts and testing efforts.

While having bugs in your research code might not impact the functionality of your system, it could lead you to wrong conclusions and non-optimal solutions (which are actually much harder to discover than functional failures, as they require monitoring the quality of your model in production over time).
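As a small illustration of testing research code, the sketch below pairs a feature-scaling helper with two checks on properties that should always hold. The function and test names are hypothetical and not taken from the article. The tests are plain functions with `assert` statements, which a runner such as pytest would collect, though they can also be called directly.

```python
import math


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0:
        # A constant feature carries no signal; avoid dividing by zero.
        return [0.0 for _ in values]
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


def test_standardize_has_zero_mean_and_unit_variance():
    out = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
    mean = sum(out) / len(out)
    variance = sum((v - mean) ** 2 for v in out) / len(out)
    assert math.isclose(mean, 0.0, abs_tol=1e-12)
    assert math.isclose(variance, 1.0, rel_tol=1e-12)


def test_standardize_handles_constant_input():
    assert standardize([3.0, 3.0, 3.0]) == [0.0, 0.0, 0.0]


if __name__ == "__main__":
    test_standardize_has_zero_mean_and_unit_variance()
    test_standardize_handles_constant_input()
    print("all checks passed")
```

A silent scaling bug of this kind would not crash anything. It would only shift model quality, which is why a cheap property check is worth having.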

For the research phase, I do think that keeping clear and standard coding formats could save a lot of time, whether for a peer who needs to use code that somebody else wrote, or for a single data scientist who consumes their own code.

Such code format standards may include: a clear project structure, standard function and variable naming conventions, modular functional code (instead of a single very long script), comments in place, removal of redundant dependencies, consistent code indentation, and more.

How to Make $100/hr as a Freelance Software Developer (No Degree Required)

Start learning Python by building projects in under 5 minutes today, even if you're a complete beginner...

Sergii Khomenko - From Data Science to Production - deploy, scale, enjoy!

PyData Amsterdam 2016. Data cleaning is the first step of every Data Science project. Next one does Data Science. The talk covers a missing step of ...

Nitin Borwankar | Applying machine learning to software development to reduce bugs

PyData SF 2016. This talk shows how we can reduce risk of failure in ...

Top 10 High Paying Software Jobs - Check Out What It takes ?

Watch this video to find out the highest paying software jobs. If you're a software engineer, database ...

How I Got a Job at Google as a Software Engineer (without a Computer Science Degree!)

How to get a job at Google: Here are the 6 steps I personally used for getting a job at Google as a software engineer (without a computer science degree).

Karolina Alexiou - Patterns for Collaboration between Data Scientists And Software Engineers

The talk is going to present, with examples, how a software engineer team can work together with data scientists (both in-house and external ...

Python for data science

The Python language combines human-friendly syntax, awesome libraries, and computational chops into one of the most powerful languages in the world today.

DSDJ Mentoring Call 5-24-18

Here are the questions (timepoints marked below) that Kyle and Jean-Sebastien (one of our stellar DSDJ students) went over in last week's live mentoring ...