More than 70 percent of the tweets analyzed in the study appeared to have been put out by robots, also known as bots, whose use to influence public opinion and sell products while posing as real people is coming under increased scrutiny.

The findings come amid recent announcements by Twitter that it would remove suspicious and fake accounts by the millions and introduce new mechanisms to identify and fight spam and abuse on its platform, among other measures.

The study -- one of the first known to rely on geocoded tweets to investigate perceptions of e-cigarettes -- raises important questions about misinformation regarding public health issues and potential covert marketing of nicotine-based products.

Martinez said agencies and public health organizations must be far more attuned to conversations happening in the social media domain if they are to be effective in communicating information to the general public.

'The lack of awareness and need to voice a public health position on e-cigarettes represents a vital opportunity to continue winning gains for tobacco control and prevention efforts through health communication interventions targeting e-cigarettes,' the team wrote in the paper.

After observing anomalies in the dataset, namely confusing and illogical posts about e-cigarettes and vaping, the team reviewed user types and decided to reclassify them -- specifically identifying accounts that appeared to be operated by robots.

JMIR Publications

Following sustained increases in the use of e-cigarettes by adults from 2010 to 2013 [1], the prevalence of adult e‑cigarette use plateaued at 3.7% in 2014 and was reported to be much higher among current cigarette smokers (15.9%) [2].

Despite the slight decline in the use of e‑cigarettes by youth from 2014 to 2015, e‑cigarettes remain the most commonly used tobacco product among middle and high school students in the United States, with 16.0% reporting current use in 2015 [3,4].

Although the long-term health effects of e‑cigarette use are largely unknown, e‑cigarettes commonly contain nicotine, which has negative effects on the adolescent brain [5], along with a range of other chemicals that are harmful to human health [6-10].

Although advertisements for tobacco products have been banned on television since 1971 in the United States, e‑cigarette advertising via television, magazines, outdoor, radio, and Web-based channels increased dramatically between 2010 and 2013.

In addition to traditional advertising platforms, e‑cigarette–related information and promotional material are widely available through e‑cigarette user forums, Web-based marketing, branded websites, and user-generated content on social media sites such as Twitter and YouTube [17,18].

The majority of youth (81%) and adults (74%) in the United States use some form of social media [19-21], and the microblogging platform Twitter has more than 316 million active users creating more than 500 million brief posts (called tweets) daily [22].

Another study found that although the majority of Twitter users engaged in social media conversations about e‑cigarettes are not affiliated with the e‑cigarette industry, e‑cigarette proponents (ie, e‑cigarette marketing or manufacturing representatives, advocates, and enthusiasts) tweet more frequently and are more likely to highlight favorable aspects of e‑cigarette use [27].

Monitoring trends in e‑cigarette–related social media activity requires timely assessment of the content of posts and the types of users generating the content to inform regulatory and surveillance efforts.

In 2016, the Food and Drug Administration (FDA) finalized a rule extending the agency’s authority to regulate e‑cigarettes, which includes federal provisions requiring companies that sell e‑cigarettes to include warning statements about nicotine on advertising and promotional materials, including content on digital/social media.

Furthermore, as public health researchers continue to use social media data to track and understand emerging issues concerning e‑cigarettes, they will need to be able to distinguish between the content from individuals who may be the target of Web-based e‑cigarette advertising (eg, young adults) and the content from e‑cigarette companies, marketers, or spammers who may be posting content for commercial purposes.

Previous studies have used a range of techniques to identify Twitter accounts that are purely automated (robots), human-assisted automated (cyborgs), or organic (ie, individuals) [28] and to distinguish between promotional and nonpromotional tweets [25,29].

Less is known about identifying the diversity of user types responsible for generating e‑cigarette–related content on Twitter, including vape proponents, promotional marketers, automated spammers, public health agencies, news organizations, and individuals.

In a recent study of tweets about e‑cigarettes, Lazard and colleagues [26] analyzed clusters of e‑cigarette topics (eg, marketing-focused personal experience) to categorize tweets as being generated by marketers, individual users, or e‑cigarette proponents.

However, this assessment was based on a review of the topics being discussed (eg, personal experience about e‑cigarette use must be posted by individual users) and was not informed by analysis of user handles that were tweeting the content.

For example, Lazard and colleagues reported that tweets about e‑cigarette policy bans (a common topic cluster identified in the study) were posted by e‑cigarette proponents opposing the ban, but these tweets could have been posted by policy makers announcing or promoting the ban.

A comprehensive search syntax was developed with 158 keywords, including terms such as ecig, vape, and ejuice, as well as popular e‑cigarette brands and hashtags, which resulted in approximately 11.5 million e‑cigarette–related tweets from 2.6 million unique users.
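
As a rough illustration of keyword-based retrieval, the sketch below filters tweet text against a small term list. Only the terms ecig, vape, and ejuice come from the text above; the matching rule (case-insensitive, keyword at the start of a word, optional leading hashtag), the function names, and the sample tweets are assumptions for illustration and do not reproduce the study's 158-keyword syntax.

```python
import re

# Three keywords named in the text; the full 158-term list is not reproduced.
KEYWORDS = ["ecig", "vape", "ejuice"]

# Assumed rule: case-insensitive, keyword begins a word, optional leading '#',
# so tokens such as "#vapelife" and "Ecigs" both count as matches.
PATTERN = re.compile(
    r"(?<!\w)#?(?:" + "|".join(map(re.escape, KEYWORDS)) + r")\w*",
    re.IGNORECASE,
)


def matched_terms(text):
    """Return the keyword-bearing tokens found in a tweet, lowercased."""
    return [m.group(0).lower() for m in PATTERN.finditer(text)]


def is_ecig_related(text):
    """True if the tweet contains at least one keyword match."""
    return PATTERN.search(text) is not None


# Hypothetical tweets, for illustration only.
sample = ["Trying a new #VapeLife flavor", "Landscape photography tips"]
related = [t for t in sample if is_ecig_related(t)]
```

A prefix rule like this one trades precision for recall: it picks up plurals and compound hashtags, but a production search syntax would likely also need exclusion terms to screen out unrelated matches.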

Using a grounded theory approach informed by literature review and guidance from subject matter experts, a protocol was developed for categorizing Twitter users who tweet about e-cigarettes according to the following types: (1) individual, (2) vaper enthusiast, (3) informed agency, (4) marketer, and (5) spammer (see Table 1).

Studies have shown that linguistic content of social media posts is particularly useful because it illustrates the topics of interest to a user and provides information about their lexical usage that may be predictive of certain user types [32,33].

The non-e-cigarette–related tweets were included because most of the user types examined in this study (eg, news media agencies, individuals, and public health agencies) do not tweet about e-cigarettes alone.

Then, to evaluate the performance of our tuned GBRT model and the marginal impact of our derived features in improving class differentiation, two separate models were run—one composed of metadata features alone and the other composed of both metadata and derived features.

These two separate models were used to evaluate the marginal impact of adding derived features, because metadata features for user profiles and tweets are easily obtainable, whereas derived features are more labor intensive to create.

To better understand the contribution of each variable to the modeling outcome, each variable was evaluated using Gini importance, a measure commonly used in ensembles of decision trees to quantify a variable’s impact in predicting a label; it takes into account the estimated error of randomly labeling an observation according to the known label distribution [41].
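
The quantity underlying Gini importance can be sketched in a few lines. The example below computes Gini impurity for a set of labels and the impurity decrease produced by one split; the function names and the four-user node are hypothetical, and this is a minimal illustration of the general measure rather than the study's implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Chance of mislabeling a random item when labels are drawn from the
    observed label distribution: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Gini decrease obtained by splitting `parent` into `left` and `right`,
    with each child weighted by its share of the parent's samples."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Hypothetical tree node: four users, separated perfectly by one split.
parent = ["marketer", "marketer", "individual", "individual"]
left, right = parent[:2], parent[2:]
decrease = impurity_decrease(parent, left, right)
```

In tree ensembles, a feature's importance is typically obtained by summing such decreases over every split that uses the feature, across all trees, and normalizing so that the importances of all features sum to 1.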

Several derived data features were also important, including original tweet raw keyword counts (3.7%), profile description keyword count (3.3%), and original tweet cosine similarity mean (3.2%).
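
One plausible reading of a "cosine similarity mean" feature is the average pairwise similarity among a user's tweets, which tends to be high for accounts that post near-identical messages. The sketch below computes this with bag-of-words counts; that interpretation, the whitespace tokenization, and the function names are all assumptions, since the exact feature definition is not given in this text.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(tweets):
    """Average cosine similarity over all pairs of a user's tweets.
    Returns 0.0 when there are fewer than two tweets."""
    pairs = list(combinations(tweets, 2))
    if not pairs:
        return 0.0
    return sum(cosine_similarity(x, y) for x, y in pairs) / len(pairs)


# Hypothetical accounts: one repeats itself, one posts varied content.
repetitive = ["buy vape now", "buy vape now", "buy vape now"]
varied = ["buy vape now", "lovely weather today", "reading a good book"]
```

Under this reading, the repetitive account scores close to 1 and the varied account scores 0, because its tweets share no words.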

The rates of precision and recall for most user types ranged from 78% to 92%, which was well above the baseline dummy classification and serves as a new baseline for the future user type classification of users who tweet about e‑cigarette content on Twitter.

Although using metadata features alone in user classification demonstrates performance gains over dummy classification, the results of this study suggest that including additional tweet-derived features that capture tweeting behavior significantly improves the model performance—an overall F1 score gain of 10.6%—beyond metadata features alone.
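
For readers less familiar with these metrics, the sketch below computes per-class precision, recall, and F1, and compares a model's predictions against a dummy baseline. The labels and predictions are hypothetical, the baseline shown is a simple majority-class predictor, and the macro average is one assumed way of forming an overall F1 score; the study's own baseline and averaging scheme may differ.

```python
from collections import Counter


def per_class_scores(y_true, y_pred, label):
    """Precision, recall, and F1 for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the labels present in y_true."""
    labels = sorted(set(y_true))
    return sum(per_class_scores(y_true, y_pred, lab)[2] for lab in labels) / len(labels)


def majority_baseline(y_true):
    """Dummy classifier: always predict the most frequent label."""
    most_common = Counter(y_true).most_common(1)[0][0]
    return [most_common] * len(y_true)


# Hypothetical labels and model predictions, for illustration only.
y_true = ["individual", "individual", "individual", "marketer", "marketer", "spammer"]
y_model = ["individual", "individual", "marketer", "marketer", "marketer", "spammer"]

model_f1 = macro_f1(y_true, y_model)
baseline_f1 = macro_f1(y_true, majority_baseline(y_true))
```

The same comparison applies to two trained models: scoring a metadata-only model and a metadata-plus-derived-features model on the same held-out labels gives the F1 gain attributable to the derived features.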

Second, vaper enthusiasts are an evolving group of individuals, and their tweeting behavior may therefore vary more than other established user types such as informed agencies (eg, news media and health agencies).

A vaper enthusiast was defined as a user whose primary objective is to promote but not sell e‑cigarette/vaping products, whereas a marketer was defined as a user whose primary objective is to market and sell e‑cigarette/vaping products.

The distinction of promoting but not selling may have been too subtle to pick up, as vaper enthusiasts promote e‑cigarettes by using similar strategies that marketers employ to sell products, such as sharing information about new products, promoting giveaways, and posting product reviews.

With the rise of social influencer marketing, where brands incentivize influencers to promote products or subcultures on social media, it is possible that vaper enthusiast messaging may represent commercial marketing interests.

The vagueness and ambiguity that were observed between the feature spaces of the vaper enthusiast and marketer classes warrant additional research that examines potential relationships between vaper enthusiasts and e-cigarette commercial entities.

In fact, in their study, Kavuluru and Sabbir [27] classified e‑cigarette proponents as “tweeters who represent e‑cigarette sales or marketing agencies, individuals who advocate e‑cigarettes, or tweeters who specifically identify themselves as vapers in their profile bio.”

For example, the FDA has the authority to regulate claims made by e‑cigarette companies and will need to monitor e‑cigarette brand social media handles to ensure that they comply with regulatory policies (eg, not making cessation claims, posting warning statements about the harmful effects of nicotine) [43].

Previous studies examining Twitter metadata and linguistic features to predict sociodemographic characteristics of users (eg, gender and age) have extracted up to 3200 tweets per handle, but other researchers have also found that having more than 100 tweets per handle did not necessarily improve the model performance [34].

As social media data are increasingly being used in applied fields such as public health, we need to consider how to balance the resources to conduct this type of analysis with a high level of accuracy and methodological rigor against timeliness and usefulness of the data to inform surveillance and regulatory efforts.

