Mike Dwell
2022-01-29 05:37:14 UTC
Hi there,
I have been banging my head against the wall trying to figure out a good real-world solution to a challenging problem that a friend asked me about.
Could you please give some pointers?
Let's say we want to assess the data quality of Company A's big data. Due to security, privacy, and workload concerns, it's impossible to view or access A's whole data repository (data lake or data ocean).
We can only request a sample of Company A's big data, and then hopefully apply some quality-assessment toolkit to analyze it.
My question is: how should such a data sample be drawn? What requirements should we set for it?
Moreover, Company A may "optimize" or "decorate" the sample data that it gives out. What might be a good scheme or mechanism design that prevents such "optimization" or "decoration"?
Could anybody please give some pointers?
Thanks a lot!