AI News: Uses of NoSQL databases in data science

Uses of NoSQL databases in data science

Consider, try, and perhaps even use multiple databases.

See http://www.slideshare.net/shift8/mongodb-machine-learning for one example. Of course, the speed in that case may not be fast enough for your needs, but it is something that's possible.

Performance too of course, but I'm trying to discount benchmarks and focus more on other concerns.

If you want to argue performance ...I don't think you're gonna show me a SQL database that's faster there =) ...and graph databases have some really amazing applications depending on what you need to do.

18 free and widely used Open Source NoSQL Databases

NoSQL is a new breed of database management systems that fundamentally differ from relational database systems.

These databases do not require tables with a fixed set of columns, avoid JOINs and typically support horizontal scaling.

Here is a list of free and widely used NoSQL databases.

MongoDB: This highly scalable and agile NoSQL database is a high-performing system.

It also provides full index support, high availability across WANs and LANs with easy replication, horizontal scaling, rich document-based queries, and flexible data processing and aggregation, along with training, support, and consultation.
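As a rough illustration of the document-based queries and aggregation mentioned above, here is a minimal sketch using the pymongo client. It assumes a MongoDB server running locally; the database, collection, and field names are invented for the example.

```python
from pymongo import MongoClient, ASCENDING

# Assumes a local MongoDB instance; all names below are illustrative only.
client = MongoClient("mongodb://localhost:27017")
events = client["demo_db"]["events"]

events.insert_one({"user": "alice", "action": "click", "count": 1})
events.insert_one({"user": "alice", "action": "view", "count": 3})
events.create_index([("user", ASCENDING)])   # secondary index on a document field

# A document-based query...
for doc in events.find({"user": "alice", "action": "click"}):
    print(doc)

# ...and a simple aggregation: total count per user.
pipeline = [{"$group": {"_id": "$user", "total": {"$sum": "$count"}}}]
print(list(events.aggregate(pipeline)))
```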

This system helps you run atomic operations such as incrementing a value inside a hash, appending to a string, and computing set intersection, union, and difference.
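The source does not name the database in this entry, but the operations it lists read like those of a Redis-style key-value store. A minimal sketch, assuming the redis-py client and a Redis-compatible server running locally; all key names are invented for the example.

```python
import redis

# Assumes a local Redis-compatible server; key names are illustrative only.
r = redis.Redis(host="localhost", port=6379)

r.hincrby("user:42", "clicks", 1)              # increment a value inside a hash
r.append("notes:42", "viewed item 7; ")        # append to a string
r.sadd("viewed:today", "item7", "item9")
r.sadd("viewed:yesterday", "item9", "item3")

print(r.sinter("viewed:today", "viewed:yesterday"))  # set intersection
print(r.sunion("viewed:today", "viewed:yesterday"))  # set union
print(r.sdiff("viewed:today", "viewed:yesterday"))   # set difference
```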

This application makes scaling extremely easy by providing out-of-the-box support for replication, multi-tenancy, and sharding.

It provides easy and predictable scaling and lets users quickly test, prototype, and deploy applications, simplifying development.

It provides programmers with a flexible, object-oriented network structure and allows them to enjoy all the benefits of a fully transactional database.

Other features include a Java API for easy client access, configurable and automatic table sharding, Bloom filters, block caches, and much more.

With this, you can store, sort, and retrieve data in your applications at very high speed and with low storage and memory overhead.

It is a data platform characterized by fault tolerance and linear scalability, along with best-in-class replication support.

It provides automatic partitioning of data, transparent handling of server failure, pluggable serialization, independence of nodes, and versioning of data items, along with support for data distribution across multiple data centers.

This is an application and database framework which gives you the power to bring object-oriented design back to web development.

This is an open source system which helps in easy development of full web applications that are based on a graph foundation.

Information Systems for Business and Beyond

Upon successful completion of this chapter, you will be able to:

You have already been introduced to the first two components of information systems: hardware and software.

Qualitative data is descriptive. “Ruby Red,” the color of a 2013 Ford Focus, is an example of qualitative data. A number can be qualitative too: if I tell you my favorite number is 5, that is qualitative data because it is descriptive, not the result of a measurement or mathematical calculation.

By adding the context – that the numbers represent the count of students registering for specific classes – I have converted data into information.

In order to do this, the system must be able to take data, put the data into context, and provide tools for aggregation and analysis. A database is designed for just such a purpose.

For example, a database that contains information about students should not also hold information about company stock prices. Databases are not always digital – a filing cabinet, for instance, might be considered a form of database.

In the example below, we have a table of student information, with each row representing a student and each column representing one piece of information about the student.

This key is the unique identifier for each record in the table. To help you understand these terms further, let’s walk through the process of designing a database.

After interviewing several people, the design team learns that the goal of implementing the system is to give better insight into how the university funds clubs.

Using this information, the design team determines the tables that need to be created. Now that the design team has determined which tables to create, they need to define the specific information that each table will hold. This requires identifying the fields that will be in each table.

However, a primary key cannot change, so this would mean that if students changed their e-mail address we would have to remove them from the database and then re-insert them – not an attractive proposition. Our solution is to create a value for each student — a user ID — that will act as a primary key.

In simple terms, to normalize a database means to design it in a way that: 1) reduces duplication of data between tables and 2) gives the table as much flexibility as possible.

For example, to track memberships, a simple solution might have been to create a Members field in the Clubs table and then just list the names of all of the members there. However, this design would mean that if a student joined two clubs, then his or her information would have to be entered a second time.

In this design, when a student joins their first club, we first must add the student to the Students table, where their first name, last name, e-mail address, and birth year are entered.

For example, if the design team were asked to add functionality to the system to track faculty advisors to the clubs, we could easily accomplish this by adding a Faculty Advisors table (similar to the Students table) and then adding a new field to the Clubs table to hold the Faculty Advisor ID.

Some of the more common data types are text, numbers, dates and times, currency, and Boolean (yes/no) values. There are two important reasons that we must properly define the data type of a field. First, a data type tells the database what functions can be performed with the data.

While this may not seem like a big deal, if our table ends up holding 50,000 names, we are allocating 50 * 50,000 = 2,500,000 bytes for storage of these values.

While a spreadsheet does allow you to define what kinds of values can be entered into its cells, a database provides more intuitive and powerful ways to define the types of data that go into each field, reducing possible errors and allowing for easier analysis.

An in-depth description of how SQL works is beyond the scope of this introductory text, but these examples should give you an idea of the power of using SQL to manipulate relational data.
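The SQL examples themselves did not survive into this excerpt, so here is a small sketch of what they might look like, using Python's built-in sqlite3 module and the normalized Students/Clubs design discussed above. The exact table and field names are assumptions based on the chapter's running example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized design sketched from the chapter's example: a Students table,
# a Clubs table, and a Memberships table linking the two.
cur.executescript("""
CREATE TABLE Students (
    user_id    INTEGER PRIMARY KEY,   -- the created value acting as primary key
    first_name TEXT, last_name TEXT, email TEXT, birth_year INTEGER
);
CREATE TABLE Clubs (
    club_id   INTEGER PRIMARY KEY,
    club_name TEXT
);
CREATE TABLE Memberships (
    user_id INTEGER REFERENCES Students(user_id),
    club_id INTEGER REFERENCES Clubs(club_id)
);
""")

cur.execute("INSERT INTO Students VALUES (1, 'Alice', 'Nguyen', 'alice@example.edu', 1995)")
cur.execute("INSERT INTO Clubs VALUES (10, 'Chess Club')")
cur.execute("INSERT INTO Memberships VALUES (1, 10)")

# A query joining the three tables: which students belong to which clubs?
cur.execute("""
    SELECT s.first_name, s.last_name, c.club_name
    FROM Students s
    JOIN Memberships m ON m.user_id = s.user_id
    JOIN Clubs c       ON c.club_id = m.club_id
""")
print(cur.fetchall())
```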

The hierarchical database model, popular in the 1960s and 1970s, connected data together in a hierarchy, allowing for a parent/child relationship between data.

For a relational database to work properly, it is important that only one person be able to manipulate a piece of data at a time, a concept known as record-locking.

A NoSQL database can work with data in a looser way, allowing for a more unstructured environment, communicating changes to the data over time to all the servers that are part of the database.

The term scale here refers to a database getting larger and larger, being distributed on a larger number of computers connected via a network.

Understanding the best tools and techniques to manage and analyze these large data sets is a problem that governments and businesses alike are trying to solve.

As organizations have begun to utilize databases as the centerpiece of their operations, the need to fully understand and leverage the data they are collecting has become more and more apparent.

Further, organizations also want to analyze data in a historical sense: How does the data we have today compare with the same set of data this time last month, or last year?

The concept of the data warehouse is simple: extract data from one or more of the organization’s databases and load it into the data warehouse (which is itself another database) for storage and analysis.
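As a rough sketch of this extract-transform-load idea (not any particular vendor's tooling), the following uses two in-memory SQLite databases to stand in for an operational database and a warehouse; all table and column names are invented.

```python
import sqlite3

# Stand-ins for an operational database and a data warehouse.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (id INTEGER, amount REAL, placed_on TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 19.99, "2024-01-05"), (2, 5.00, "2024-01-05"), (3, 12.50, "2024-01-06")])

warehouse.execute("CREATE TABLE daily_sales (day TEXT, total REAL)")

# Extract from the source database...
rows = source.execute("SELECT placed_on, amount FROM orders").fetchall()

# ...transform (here, aggregate order amounts per day)...
totals = {}
for day, amount in rows:
    totals[day] = totals.get(day, 0.0) + amount

# ...and load into the warehouse for storage and analysis.
warehouse.executemany("INSERT INTO daily_sales VALUES (?, ?)", sorted(totals.items()))
print(warehouse.execute("SELECT * FROM daily_sales").fetchall())
```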

A data warehouse should be designed to meet several criteria. There are two primary schools of thought when designing a data warehouse: bottom-up and top-down.

The bottom-up approach starts by creating small data warehouses, called data marts, to solve specific business problems.

The top-down approach suggests that we should start by creating an enterprise-wide data warehouse and then, as specific business needs are identified, create smaller data marts from the data warehouse.

Organizations find data warehouses quite beneficial for a number of reasons.

Data mining is the process of analyzing data to find previously unknown trends, patterns, and associations in order to make decisions.

For example, a grocery chain may already have some idea that buying patterns change after it rains and want to get a deeper understanding of exactly what is happening.

In today’s digital world, it is becoming easier than ever to take data from disparate sources and combine them to do new forms of analysis.

These firms combine publicly accessible data with information obtained from the government and other sources to create vast warehouses of data about people and companies that they can then sell.

The term business intelligence is used to describe the process that organizations use to take data they are collecting and analyze it in the hopes of obtaining a competitive advantage.

Knowledge management is the process of formalizing the capture, indexing, and storing of the company’s knowledge in order to benefit from the experiences and insights that the company has captured during its existence.

A database management system (DBMS) is a software application that is used to create and manage databases, and can take the form of a personal DBMS, used by one person, or an enterprise DBMS that can be used by multiple users.

Uses of NoSQL databases in data science


What does that mean for the developer in the trenches, faced with the task of solving a specific problem when there are a dozen confusing choices and no obvious winner?

It's often hard for them to take that next step and imagine how their specific problems could be solved in a way that's worth the trouble and risk.

Yes, this is part of the research for my webinar on December 14th, but I'm a huge believer that people learn best by example, so if we can come up with real, specific examples I think that will really help people visualize how they can make the best use of all these new product choices in their own systems.

VoltDB, as a relational database, is not traditionally placed in the NoSQL camp, but its design perspective is so radically far from Oracle-style systems that I feel it belongs much more to the NoSQL tradition.

The Log: What every software engineer should know about real-time data's unifying abstraction

I joined LinkedIn about six years ago at a particularly interesting time.

We were just beginning to run up against the limits of our monolithic, centralized database and needed to start the transition to a portfolio of specialized distributed systems.

This has been an interesting experience: we built, deployed, and run to this day a distributed graph database, a distributed search backend, a Hadoop installation, and a first and second generation key-value store.

Sometimes called write-ahead logs or commit logs or transaction logs, logs have been around almost as long as computers and are at the heart of many distributed data systems and real-time application architectures.

You can't fully understand databases, NoSQL stores, key-value stores, replication, Paxos, Hadoop, version control, or almost any software system without understanding logs.

In this post, I'll walk you through everything you need to know about logs, including what a log is and how to use logs for data integration, real-time processing, and system building.

A file is an array of bytes, a table is an array of records, and a log is really just a kind of table or file where the records are sorted by time.
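To make that definition concrete, here is a toy append-only log in plain Python; it is only a sketch of the idea, not how any production system implements it.

```python
import time

class Log:
    """A toy append-only log: records ordered by time, addressed by offset."""
    def __init__(self):
        self.records = []

    def append(self, value):
        offset = len(self.records)                      # offsets only ever grow
        self.records.append((offset, time.time(), value))
        return offset

    def read(self, from_offset=0):
        return self.records[from_offset:]               # replay from any position

log = Log()
log.append({"event": "page_view", "user": "alice"})
log.append({"event": "click", "user": "bob"})
for offset, ts, value in log.read():
    print(offset, value)
```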

Every programmer is familiar with another definition of logging—the unstructured error messages or trace info an application might write out to a local file using syslog or log4j.

This approach quickly becomes unmanageable when many services and servers are involved, and the purpose of the logs shifts to being an input to queries and graphs for understanding behavior across many machines—something for which English text in files is not nearly as appropriate as the kind of structured log described here.

To make this atomic and durable, a database uses a log to write out information about the records it will be modifying before applying the changes to all the various data structures it maintains.

Oracle has productized the log as a general data subscription mechanism for non-Oracle data subscribers with its XStreams and GoldenGate products, and similar facilities in MySQL and PostgreSQL are key components of many data architectures.

For example a program whose output is influenced by the particular order of execution of threads or by a call to gettimeofday or some other non-repeatable thing is generally best considered as non-deterministic.

The purpose of the log here is to squeeze all the non-determinism out of the input stream to ensure that each replica processing this input stays in sync.

One of the beautiful things about this approach is that the time stamps that index the log now act as the clock for the state of the replicas—you can describe each replica by a single number, the timestamp for the maximum log entry it has processed.
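A minimal sketch of that idea: two replicas apply the same ordered log of key-value updates, and each replica's state can be summarized by the offset of the last entry it has applied.

```python
class Replica:
    """Applies log entries in order; its progress is described by one number."""
    def __init__(self):
        self.state = {}
        self.position = -1          # offset of the last log entry applied

    def apply(self, offset, entry):
        key, value = entry          # deterministic: same log -> same state
        self.state[key] = value
        self.position = offset

shared_log = [("x", 1), ("y", 2), ("x", 3)]   # the shared, ordered input

fast, slow = Replica(), Replica()
for offset, entry in enumerate(shared_log):
    fast.apply(offset, entry)
for offset, entry in enumerate(shared_log[:2]):
    slow.apply(offset, entry)

print(fast.position, fast.state)   # 2 {'x': 3, 'y': 2}
print(slow.position, slow.state)   # 1 {'x': 1, 'y': 2}  (behind, but not diverged)
```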

For example, we can log the incoming requests to a service, or the state changes the service undergoes in response to requests, or the transformation commands it executes.

Logical logging means logging not the changed rows but the SQL commands that lead to the row changes (the insert, update, and delete statements).

A slight modification of this, called the "primary-backup model", is to elect one replica as the leader and allow this leader to process requests in the order they arrive and log out the changes to its state from processing the requests.

With Paxos, this is usually done using an extension of the protocol called "multi-paxos", which models the log as a series of consensus problems, one for each slot in the log.

My suspicion is that our view of this is a little bit biased by the path of history, perhaps due to the few decades in which the theory of distributed computing outpaced its practical application.

I suspect we will end up focusing more on the log as a commoditized building block irrespective of its implementation, in the same way we often talk about a hash table without bothering to get into the details of whether we mean the murmur hash with linear probing or some other variant.

There is a sense in which the log is the more fundamental data structure: in addition to creating the original table you can also transform it to create all kinds of derived tables.

The magic of the log is that if it is a complete log of changes, it holds not only the contents of the final version of the table, but also allows recreating all other versions that might have existed.

You will note that in version control systems, as in other distributed stateful systems, replication happens via the log: when you update, you pull down just the patches and apply them to your current snapshot.

In the remainder of this article I will try to give a flavor of what a log is good for that goes beyond the internals of distributed computing or abstract distributed computing models.

In each case, the usefulness of the log comes from the simple function that the log provides: producing a persistent, re-playable record of history.

You don't hear much about data integration in all the breathless interest and hype around the idea of big data, but nonetheless, I believe this mundane problem of "making the data available" is one of the more valuable things an organization can focus on.

The base of the pyramid involves capturing all the relevant data, being able to put it together in an applicable processing environment (be that a fancy real-time query system or just text files and python scripts).

Once these basic needs of capturing data in a uniform way are taken care of it is reasonable to work on infrastructure to process this data in various ways—MapReduce, real-time query systems, etc.

It's worth noting the obvious: without a reliable and complete data flow, a Hadoop cluster is little more than a very expensive and difficult to assemble space heater.

In my experience, most organizations have huge holes in the base of this pyramid—they lack reliable complete data flow—but want to jump directly to advanced data modeling techniques.

In web systems, this means user activity logging, but also the machine-level events and statistics required to reliably operate and monitor a data center's worth of machines.

This data is at the heart of the modern web: Google's fortune, after all, is generated by a relevance pipeline built on clicks and impressions—that is, events.

The second trend comes from the explosion of specialized data systems that have become popular and often freely available in the last five years.

A data source could be an application that logs out events (say clicks or page views), or a database table that accepts modifications.

A batch system such as Hadoop or a data warehouse may consume only hourly or daily, whereas a real-time query system may need to be up-to-the-second.

Neither the originating data source nor the log has knowledge of the various data destination systems, so consumer systems can be added and removed with no change in the pipeline.

The consumer system need not concern itself with whether the data came from an RDBMS, a new-fangled key-value store, or was generated without a real-time query system of any kind.
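A hedged sketch of that decoupling using the kafka-python client (the broker address at localhost:9092 and the topic names are assumptions for illustration): the producer publishes to a topic without knowing its consumers, and any destination system can subscribe later under its own group.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Assumes the kafka-python package and a broker at localhost:9092.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# The producer knows only the topic, not who will consume from it.
producer.send("page-views", {"user": "alice", "page": "/jobs/7"})
producer.flush()

# A downstream system attaches independently, with its own consumer group.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    group_id="search-indexer",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.offset, message.value)
    break   # just show the first message in this sketch
```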

By itself, publish-subscribe doesn't imply much more than indirect addressing of messages—if you compare any two messaging systems promising publish-subscribe, you find that they guarantee very different things, and most models are not useful in this domain.

That isn't the end of the story of mastering data flow: the rest of the story is around metadata, schemas, compatibility, and all the details of handling data structure and evolution.

One of the earliest pieces of infrastructure we developed was a service called databus that provided a log caching abstraction on top of our early Oracle tables to scale subscription to database changes so we could feed our social graph and search indexes.

Having little experience in this area, we naturally budgeted a few weeks for getting data in and out, and the rest of our time for implementing fancy prediction algorithms.

Worse, any time there was a problem in any of the pipelines, the Hadoop system was largely useless—running fancy algorithms on bad data just produces more bad data.

If we captured all the structure we needed, we could make Hadoop data loads fully automatic, so that no manual effort was expended adding new data sources or handling schema changes—data would just magically appear in HDFS and Hive tables would automatically be generated for new data sources with the appropriate columns.

The idea is that adding a new data system—be it a data source or a data destination—should create integration work only to connect it to a single pipeline instead of each consumer of data.

This experience led me to focus on building Kafka to combine what we had seen in messaging systems with the log concept popular in databases and distributed system internals.

For a long time, Kafka was a little unique (some would say odd) as an infrastructure product—neither a database nor a log file collection system nor a traditional messaging system.

The similarity goes right down to the way partitioning is handled, data is retained, and the fairly odd split in the Kafka API between high- and low-level consumers.

For those not in the know, the data warehousing methodology involves periodically extracting data from source databases, munging it into some kind of understandable form, and loading it into a central data warehouse.

A data warehouse is a piece of batch query infrastructure which is well suited to many kinds of reporting and ad hoc analysis, particularly when the queries involve simple counting, aggregation, and filtering.

But having a batch system be the only repository of clean complete data means the data is unavailable for systems requiring a real-time feed—real-time processing, search indexing, monitoring systems, etc.

The incentives are not aligned: data producers are often not very aware of the use of the data in the data warehouse and end up creating data that is hard to extract or requires heavy, hard to scale transformation to get into usable form.

Of course, the central team never quite manages to scale to match the pace of the rest of the organization, so data coverage is always spotty, data flow is fragile, and changes are slow.

This means that as part of their system design and implementation they must consider the problem of getting data out and into a well structured form for delivery to the central pipeline.

The data warehouse team handles only the simpler problem of loading structured feeds of data from the central log and carrying out transformation specific to their system.

This point about organizational scalability becomes particularly important when one considers adopting additional data systems beyond a traditional data warehouse.

Worse, the ETL processing pipeline built to support database loads is likely of no use for feeding these other systems, making bootstrapping these pieces of infrastructure as large an undertaking as adopting a data warehouse.

By contrast, if the organization had built out feeds of uniform, well-structured data, getting any new system full access to all data requires only a single bit of integration plumbing to attach to the pipeline.

The typical approach to activity data in the web industry is to log it out to text files where it can be scraped into a data warehouse or into Hadoop for aggregation and querying.

Worse, the systems that we need to interface with are now somewhat intertwined—the person working on displaying jobs needs to know about many other systems and features and make sure they are integrated properly.

The job display page now just shows a job and records the fact that a job was shown along with the relevant attributes of the job, the viewer, and any other useful facts about the display of the job.

Each of the other interested systems—the recommendation system, the security system, the job poster analytics system, and the data warehouse—all just subscribe to the feed and do their processing.

Using a log as a universal integration mechanism is never going to be more than an elegant fantasy if we can't build a log that is fast, cheap, and scalable enough to make this practical at scale.

At LinkedIn we are currently running over 60 billion unique message writes through Kafka per day (several hundred billion if you count the writes from mirroring between datacenters).

Instead, the guarantees that we provide are that each partition is order preserving, and Kafka guarantees that appends to a particular partition from a single sender will be delivered in the order they are sent.
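For example (again assuming kafka-python and a local broker; the topic and key are illustrative), sending all of one user's events with the same key keeps them in one partition, so they are delivered in the order they were sent.

```python
from kafka import KafkaProducer

# Assumes a broker at localhost:9092; topic and key names are invented.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    # Same key -> same partition -> per-partition ordering is preserved.
    producer.send("user-activity", key=b"user-42", value=f"event-{i}".encode())
producer.flush()
```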

Batching occurs from client to server when sending data, in writes to disk, in replication between servers, in data transfer to consumers, and in acknowledging committed data.

The cumulative effect of these optimizations is that you can usually write and read data at the rate supported by the disk or network, even while maintaining data sets that vastly exceed memory.

If you are a fan of late 90s and early 2000s database literature or semi-successful data infrastructure products, you likely associate stream processing with efforts to build a SQL engine or "boxes and arrows" interfaces for event-driven processing.

There is no inherent reason you can't process the stream of data from yesterday or a month ago using a variety of different languages to express the computation.

Data collection at the time was inherently batch oriented: it involved riding around on horseback and writing down records on paper, then transporting this batch of records to a central location where humans added up all the counts.

These days, when you describe the census process one immediately wonders why we don't keep a journal of births and deaths and produce population counts either continuously or with whatever granularity is needed.

But as these processes are replaced with continuous feeds, one naturally starts to move towards continuous processing to smooth out the processing resources needed and reduce latency.

When data is collected in batches, it is almost always due to some manual step or lack of digitization or is a historical relic left over from the automation of some non-digital process.

Seen in this light, it is easy to have a different view of stream processing: it is just processing which includes a notion of time in the underlying data being processed and does not require a static snapshot of the data, so it can produce output at a user-controlled frequency instead of waiting for the "end" of the data set to be reached.

Companies building stream processing systems focused on providing processing engines to attach to real-time data streams, but it turned out that at the time very few people actually had real-time data streams.

Actually, very early in my career at LinkedIn, a company tried to sell us a very cool stream processing system, but since all our data was collected in hourly files at that time, the best application we could come up with was to pipe the hourly files into the stream system at the end of the hour!

The exception actually proves the rule here: finance, the one domain where stream processing has met with some success, was exactly the area where real-time data streams were already the norm and processing had become the bottleneck.

It turns out that the log solves some of the most critical technical problems in stream processing, which I'll describe, but the biggest problem that it solves is just making data available in real-time multi-subscriber data feeds.

The most interesting aspect of stream processing has nothing to do with the internals of a stream processing system, but instead has to do with how it extends our idea of what a data feed is from the earlier data integration discussion.

Indeed, using a centralized log in this fashion, you can view all the organization's data capture, transformation, and flow as just a series of logs and processes that write to them.

A stream processor need not have a fancy framework at all: it can be any process or set of processes that read and write from logs, but additional infrastructure and support can be provided for helping manage processing code.

If processing proceeds in an unsynchronized fashion it is likely to happen that an upstream data producing job will produce data more quickly than another downstream job can consume it.

One might, for example, want to enrich an event stream (say a stream of clicks) with information about the user doing the click—in effect joining the click stream to the user account database.

Invariably, this kind of processing ends up requiring some kind of state to be maintained by the processor: for example, when computing a count, you have the count so far to maintain.

This gives us exactly the tool to be able to convert streams to tables co-located with our processing, as well as a mechanism for handling fault tolerance for these tables.

This gives us a generic mechanism for keeping co-partitioned state in arbitrary index types local to the incoming stream data.
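A toy sketch of a stateful stream processor in this style: it keeps a local count table alongside the incoming stream and writes every state change to a changelog, so the table can be rebuilt by replay after a failure.

```python
# Incoming stream partition (already keyed by user in this sketch).
clicks = [("alice", "/jobs/7"), ("bob", "/jobs/9"), ("alice", "/jobs/2")]

counts = {}        # local "table" co-located with the processing
changelog = []     # log of state changes, kept for fault tolerance

for user, page in clicks:
    counts[user] = counts.get(user, 0) + 1
    changelog.append((user, counts[user]))   # record the new value for this key

# Recovery: replaying the changelog recreates the local table exactly.
recovered = {}
for user, value in changelog:
    recovered[user] = value
assert recovered == counts
print(counts)   # {'alice': 2, 'bob': 1}
```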

For keyed data, though, a nice property of the complete log is that you can replay it to recreate the state of the source system (potentially recreating it in another system).

By doing this, we still guarantee that the log contains a complete backup of the source system, but now we can no longer recreate all previous states of the source system, only the more recent ones.
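A minimal sketch of that kind of compaction: keep only the latest record for each key, which preserves a complete backup of the current state while discarding older versions.

```python
def compact(entries):
    """Keep only the most recent record per key from an ordered log."""
    latest = {}
    for offset, (key, value) in enumerate(entries):
        latest[key] = (offset, value)          # later offsets overwrite earlier ones
    return sorted((off, key, val) for key, (off, val) in latest.items())

entries = [("x", 1), ("y", 2), ("x", 3), ("y", 4), ("z", 5)]
print(compact(entries))   # [(2, 'x', 3), (3, 'y', 4), (4, 'z', 5)]
```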

There is an analogy here between the role a log serves for data flow inside a distributed database and the role it serves for data integration in a larger organization.

But these issues can be addressed by a good system: it is possible for an organization to have a single Hadoop cluster, for example, that contains all the data and serves a large and diverse constituency.

So there is already one possible simplification in the handling of data that has become possible in the move to distributed systems: coalescing lots of little instances of each system into a few big clusters.

This is clearly not a story relevant to end users, who presumably care more about the API than how it is implemented, but it might be a path towards getting the simplicity of the single system in a more diverse and modular world that continues to evolve.

If the implementation time for a distributed system goes from years to weeks because reliable, flexible building blocks emerge, then the pressure to coalesce into a single monolithic system disappears.

This is exactly the part that should vary from system to system: for example, a full-text search query may need to query all partitions whereas a query by primary key may only need to query a single node responsible for that key's data.

The serving nodes store whatever index is required to serve queries (for example a key-value store might have something like a btree or sstable, a search system would have an inverted index).

The client can get read-your-write semantics from any node by providing the timestamp of a write as part of its query—a serving node receiving such a query will compare the desired timestamp to its own index point and if necessary delay the request until it has indexed up to at least that time to avoid serving stale data.
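A toy version of that read-your-writes trick: the serving node tracks how far it has indexed the log, and a read that carries the offset of an earlier write is delayed until the node has indexed at least that far.

```python
class ServingNode:
    """Indexes an ordered log of (key, value) writes and serves keyed reads."""
    def __init__(self, log):
        self.log = log
        self.index = {}
        self.index_point = -1       # highest log offset applied so far

    def index_next(self):
        if self.index_point + 1 < len(self.log):
            self.index_point += 1
            key, value = self.log[self.index_point]
            self.index[key] = value

    def read(self, key, min_offset):
        while self.index_point < min_offset:   # wait rather than serve stale data
            self.index_next()
        return self.index.get(key)

log = [("x", 1), ("x", 2)]
node = ServingNode(log)
write_offset = 1                       # offset the client got back for its write of x=2
print(node.read("x", write_offset))    # 2, never the stale value 1
```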

I find this view of systems as factored into a log and a query API to be very revealing, as it lets you separate the query characteristics from the availability and consistency aspects of the system.

These systems feed off a database (using Databus as a log abstraction or off a dedicated log from Kafka) and provide a particular partitioning, indexing, and query capability on top of that data stream.

In fact, it is quite common to have a single data feed (whether a live feed or a derived feed coming from Hadoop) replicated into multiple serving systems for live serving.

None of these systems need to have an externally accessible write API at all; Kafka and databases are used as the system of record, and changes flow to the appropriate query systems through that log.

Everyone seems to use different terms for the same things, so it is a bit of a puzzle to connect the database literature to the distributed systems stuff to the various enterprise software camps to the open source world.

Bigtable, BigQuery, and iCharts for ingesting and visualizing data at scale (Google Cloud Next '17)

Bigtable is a low-latency, high-throughput NoSQL analytical database. BigQuery is a high-performance data warehouse with a SQL API. In this talk, Carter Page ...

Introduction to Neo4j Graph Platform

Jonny Cheetham, Neo4j.

Baron Schwartz: MySQL, SQL, NoSQL, and Open Source in 2014 and Beyond

Google Tech Talk February 13, 2014 (more info below) Presented by Baron Schwartz (Co-Founder & CEO of VividCortex) ABSTRACT Predictions are hard to ...

17-02-nosql-overview.mp4


Introduction to big data: tools that deliver deep insights (Google Cloud Next '17)

In this video, William Vambenepe explains some big data basics and provides a survey-level introduction to Google Cloud's big data tools, including Google ...

Cloud’s analytics tools for collecting & visualizing game telemetry & data (Google Cloud Next ‘17)

In this video, Indranil Chakraborty, Venkat Rapaka, and José Ugia share an overview of the products that will help you build a complete IoT solution on Google ...

Running open-source Databases on Google Cloud Platform (Google Cloud Next '17)

Learn about the various options for running open-source databases on GCP, both self-managed and fully-managed. We will also do a deep dive with Quizlet ...

Digital Transformation and the Journey to a Highly Connected Enterprise

Jeff Morris, from the Neo4j product team, discusses the connected data paradigm and the trends in digital transformation.

Application modernization with Microsoft Azure | G100

Bringing billions of eBay listings together using Cloud Bigtable to support new shopping experiences

Googler Misha Brukman and Leon Stein from eBay discuss how Cloud Bigtable handles eBay's global catalog with billions of listings, scaling to hundreds of ...