Data is now being generated at an astounding rate. While it took from the dawn of civilization to 2003 to produce 5 exabytes of information, we now produce that same volume in just two days. Billions of connected devices, ranging from PCs and smartphones to sensors such as RFID readers and traffic cameras, are now contributing to this flood of structured and unstructured data.
This flood of Big Data is bringing disruptive changes to the payments industry. Daily operations produce massive numbers of real-time transactions, and each transaction record carries a large amount of historical data. The huge volume and dimensionality of business data make it difficult to run real-time analytics on standalone processors, to manage the huge storage resources required, and to accept and process data from different sources in different formats. Advanced analytics can help firms sift this data to find the actionable information needed for business success.
Big Data technology makes it possible to process and analyze large amounts of data and to track spending patterns from day to day, which helps identify opportunities for business growth. While Big Data enables faster payment operations, it also adds to the volume and variety of data the business must manage.
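Tracking day-to-day spending patterns reduces, at its simplest, to aggregating transaction amounts by date. The sketch below illustrates this with a few hypothetical records (the field layout and values are assumptions, not a real payments schema):

```python
from collections import defaultdict

# Hypothetical transaction records: (date, category, amount)
transactions = [
    ("2024-05-01", "groceries", 42.50),
    ("2024-05-01", "fuel", 60.00),
    ("2024-05-02", "groceries", 18.25),
    ("2024-05-02", "dining", 35.00),
]

def daily_spend(records):
    """Aggregate total spend per day to expose day-to-day patterns."""
    totals = defaultdict(float)
    for date, _category, amount in records:
        totals[date] += amount
    return dict(totals)

print(daily_spend(transactions))  # {'2024-05-01': 102.5, '2024-05-02': 53.25}
```

At production scale the same aggregation would run as a distributed job rather than an in-memory loop, but the shape of the computation is identical.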
If real-time analytics is performed on Big Data using standalone processors, much of its potential will go unrealized. Parallelized, multi-core processors can provide better performance, but these resources significantly increase IT investment.
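The gain from parallelizing across cores can be sketched with Python's standard library. The workload and chunk sizes below are illustrative assumptions, not a payments benchmark; the point is only the split-compute-combine pattern:

```python
from concurrent.futures import ProcessPoolExecutor

def chunk_sum(chunk):
    """CPU-bound work on one slice of the data."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    """Split the data across worker processes and combine partial results."""
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(chunk_sum, chunks))

if __name__ == "__main__":
    data = list(range(10_000))
    # The parallel result matches the single-processor computation.
    assert parallel_sum_of_squares(data) == sum(x * x for x in data)
```

Each added worker raises throughput but also hardware cost, which is the trade-off the text describes.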
Using cloud technology for Big Data deployments lowers IT investment costs. It allows businesses to scale up or down quickly by adding or removing resources as needed, paying only for the capacity used. The cloud has made it far easier to implement Big Data applications that deliver high performance at speed. The remaining challenge lies in analyzing the related operational data for business intelligence.
The explosion of data from payments systems spans both structured and unstructured formats. NoSQL databases such as MongoDB® can accept data from different sources and provide faster performance than the alternatives. In its latest version (3.0), this database software supports the WiredTiger open source storage engine, which excels at processing read-and-insert workloads as well as more complex update workloads.
The WiredTiger product features document-level locking and excellent data compression, facilitating even faster MongoDB operations. The data compression function can be used to conserve disk space, an important capability in managing Big Data volumes. Either of its two compression algorithms, snappy and zlib, can be selected to match the organization's data-storage requirements.
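For a self-managed deployment, the storage engine and block compressor are selected in mongod's YAML configuration file. A minimal excerpt (the dbPath is an illustrative assumption):

```yaml
# mongod.conf (excerpt): enable WiredTiger with zlib block compression
storage:
  dbPath: /var/lib/mongodb
  engine: wiredTiger
  wiredTiger:
    collectionConfig:
      blockCompressor: zlib   # or "snappy" (the default), per storage needs
```

snappy favors lower CPU overhead, while zlib trades more CPU for a higher compression ratio.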
Apache Spark™ and Apache Tez™ are trending technologies that enhance the optimization of Big Data for payment platforms. Apache Spark is an open source, general data-processing framework in the Apache Hadoop ecosystem that makes it easy to develop fast, end-to-end Big Data applications combining batch, streaming, and interactive analytics on all the data. It comes with an in-memory processing engine that provides an SQL interface on top of NoSQL databases. This interface enables the execution of SQL JOIN operations over MongoDB data and the generation of real-time analytic reports.
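The kind of JOIN such an interface executes can be sketched in plain SQL. Here Python's built-in sqlite3 stands in for the Spark SQL engine, and the table and column names are illustrative assumptions; in Spark the same query shape would run over DataFrames loaded from MongoDB through a connector:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE payments  (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
    INSERT INTO payments  VALUES (1, 120.0), (1, 30.0), (2, 75.5);
""")

# Join per-customer payment totals back to customer records --
# the same shape of query an SQL-on-NoSQL layer would run.
rows = conn.execute("""
    SELECT c.name, SUM(p.amount) AS total
    FROM customers AS c
    JOIN payments  AS p ON p.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(rows)  # [('Asha', 150.0), ('Ben', 75.5)]
```

The value of the SQL interface is exactly this: analysts keep familiar relational queries while the underlying store remains a document database.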
Apache Tez™ is an extensible framework for building high-performance batch and interactive data-processing applications, coordinated by YARN in Apache Hadoop (aka MapReduce 2.0, or MRv2). It improves on the MapReduce paradigm by dramatically increasing its speed while maintaining MapReduce's ability to scale to petabytes (PB) of data. Apache Tez also provides the "fit-to-purpose" freedom to create highly optimized data-processing applications that offer an advantage over end-user-facing engines such as MapReduce and Apache Spark. Its customizable execution architecture allows users to express complex computations as dataflow graphs, permitting dynamic performance optimizations based on real information about the data and the resources required to process it. It also delivers greatly improved query performance, up to 100X faster than Apache Hive™ on MapReduce, and a highly scalable SQL interface that can execute queries scaling from TB to PB.
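The dataflow-graph idea can be illustrated with Python's standard graphlib. This is a conceptual sketch of expressing a computation as a DAG of stages (the stage names are hypothetical), not the Tez API itself:

```python
from graphlib import TopologicalSorter

# Hypothetical stages of a payments pipeline as a dataflow graph:
# each key lists the stages it depends on.
dag = {
    "ingest":    set(),
    "cleanse":   {"ingest"},
    "enrich":    {"ingest"},
    "aggregate": {"cleanse", "enrich"},
    "report":    {"aggregate"},
}

# A valid execution order; stages with no mutual dependency
# ("cleanse" and "enrich") could run in parallel.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

An engine that sees the whole graph, as Tez does, can schedule independent stages concurrently and re-plan at runtime, which is the source of the performance gains the text describes.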
Unstructured data is ubiquitous. In fact, today most individuals and organizations conduct their lives, as well as their business activities, by processing unstructured data, with their minds and the aid of their smart devices. As with structured data, unstructured data can be generated by machines or people. Some examples of machine-generated unstructured data include satellite images, scientific sensor readings, photographs, and video recordings. People, more specifically customers and employees, contribute text documents and email messages, social media content (i.e. contributions to channels such as Facebook, Twitter, and LinkedIn), mobile-device data, and ad-hoc website content (e.g. uploads to YouTube, Flickr, or Instagram).
To assist data managers, companies can overlay, or impose, structure on these unstructured data sources. The following is a step-by-step approach to that end.