Data Quality (DQ) is the first step towards achieving strong data governance in any payment system or financial data warehouse/lake. Ensuring DQ in the big data space is notably more challenging than implementing DQ in a traditional DW/RDBMS.
Traditional systems offer more mechanisms to control or correct data during capture or storage, whereas big data systems face specific challenges in course correction because of the sheer volume and variety of data they ingest.
A big data lake usually holds a large volume and variety of data from many business systems, so implementing DQ there requires close collaboration from all business stakeholders. Ensuring DQ in the big data space is essential for accurate simulations, meaningful predictions, and precise reports.
There are some critical checks that any DQ activity involves.
These critical DQ checks should be implemented across the major data categories listed below:
A few examples of DQ challenges…
There are many more complex challenges, and they require data engineers to design optimized solutions without hurting business SLAs or cost.
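As an illustration only (not tied to any specific product), the critical checks a DQ activity typically involves, such as completeness, uniqueness, and validity, can be sketched as simple rule functions over incoming payment records. The field names, currency rule, and thresholds here are hypothetical assumptions:

```python
# Minimal sketch of common DQ checks over payment-like records.
# Field names (txn_id, amount, currency) are hypothetical examples.
import re

ISO_CURRENCY = re.compile(r"^[A-Z]{3}$")  # e.g. "USD", "EUR"

def check_completeness(records, required_fields):
    """Return records missing any required field (null or empty)."""
    return [r for r in records
            if any(r.get(f) in (None, "") for f in required_fields)]

def check_uniqueness(records, key_field):
    """Return records whose key value has already been seen (duplicates)."""
    seen, dupes = set(), []
    for r in records:
        key = r.get(key_field)
        if key in seen:
            dupes.append(r)
        seen.add(key)
    return dupes

def check_validity(records):
    """Return records with a non-positive amount or malformed currency code."""
    bad = []
    for r in records:
        amount, currency = r.get("amount"), r.get("currency", "")
        if (not isinstance(amount, (int, float)) or amount <= 0
                or not ISO_CURRENCY.match(currency)):
            bad.append(r)
    return bad
```

In a real pipeline these rules would typically run inside a distributed engine (e.g. as Spark transformations) rather than over in-memory lists, but the rule logic stays the same shape.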
Poor data, or the absence of data quality checks, will impact critical business decisions: analytics will not provide accurate insights to a payments business that needs the analytics wheel turning constantly to achieve its goals (sales, growing the customer base, regulatory compliance, etc.).
Giving DQ greater importance in a payments big data ecosystem is a promising path to the key data quality benefits, but only when you have the right tools to extract the most accurate information.
DQ issues arise primarily from bad data entry by humans, junk data coming from source systems, a lack of data standards across the board, and a lack of ownership. These are the major contributors to poor data, and they hit downstream applications and the data warehouse hard, because each downstream layer spends significant effort on course correction, which can lead to failure in achieving business goals.
The approach to addressing poor data runs on two parallel tracks: business and IT work together to fix root-cause issues at the source system, while the big data warehouse keeps surfacing further data quality issues, so the cycle continues until all prioritized/critical data quality is ensured by all stakeholders.
The first approach is time consuming and may take years, depending on the list of prioritized issues, the number of applications, and the time each cycle takes to fix a DQ issue.
An alternate approach is less complex and less time consuming: the scope of DQ within the data warehouse, especially a big data (Hadoop) warehouse, has huge potential to address DQ issues, given the right framework or tool that performs on-the-fly DQ checks without hurting business delivery SLAs.
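One way to picture such on-the-fly checks is a gate in the load path that quarantines failing records instead of halting the whole load, so clean data keeps flowing and the delivery SLA is preserved. This is a sketch under assumed record and rule shapes, not any specific product's API:

```python
# Sketch: an in-flight DQ gate that splits a batch into clean and
# quarantined records, so the load continues without blocking.

def dq_gate(records, rules):
    """Apply each named rule to every record; a record failing any rule
    is quarantined together with the names of the rules it failed."""
    clean, quarantined = [], []
    for r in records:
        failed = [name for name, rule in rules.items() if not rule(r)]
        if failed:
            quarantined.append({"record": r, "failed_rules": failed})
        else:
            clean.append(r)
    return clean, quarantined

# Hypothetical rules for a payments feed.
rules = {
    "has_txn_id": lambda r: bool(r.get("txn_id")),
    "amount_positive": lambda r: isinstance(r.get("amount"), (int, float))
                                 and r["amount"] > 0,
}
```

The quarantine output (record plus failed rule names) is what feeds the correction cycle back to the source-system owners described above.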
While some use cases in the big data ecosystem may not emphasize data quality, many do require strict data quality management to realize the full potential of analytics.
Some of the key steps that should be taken or agreed upon are listed below:
When it comes to DQ in the big data world, the factors below should be analyzed before starting to implement data quality:
Some products on the market come with DQ modules; evaluate the factors above against each tool to find the one that fits your project's use case.