Data-Blind ML: Building privacy-aware machine learning models without direct data access
Traditional Machine Learning (ML) pipeline development requires the ML practitioner to access the data directly in order to analyze, clean, and preprocess it, and then to develop, train, and evaluate an ML model. When the data owner has no infrastructure for in-house development, such pipelines are outsourced. The data is often subject to privacy constraints, which make outsourcing laborious and potentially expensive: contracts must be drafted and infrastructure upgraded, among other things. Existing approaches rely either on anonymization, which does not fully protect against identity disclosure, or on synthetic data generation, which requires expertise that is not necessarily available to the organization. In this paper, we present Data-Blind ML, an automated framework, powered by synthetic generative learning and distributed computing paradigms, which enables an organization to outsource the development and training of ML models without sharing any sample from the real dataset. In addition, the framework allows the ML practitioner to get feedback on the model's performance on the real data without accessing it directly.
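The workflow described above can be illustrated with a minimal sketch. It is not the Data-Blind ML implementation: the per-class Gaussian generator is a deliberately simple stand-in for the generative model and carries no privacy guarantee, the nearest-centroid classifier stands in for the practitioner's model, and every function name (`fit_generator`, `sample_synthetic`, `train_nearest_centroid`, `evaluate_on_real`) is hypothetical. The point is only the separation of roles: the practitioner sees synthetic samples, and the owner returns nothing but an aggregate score computed on the real data.

```python
import random
import statistics


def fit_generator(rows, labels):
    """Owner side: summarize real data as per-class, per-feature (mean, stdev).

    Assumption: a toy stand-in for a real generative model.
    """
    params = {}
    for label in sorted(set(labels)):
        cols = list(zip(*(r for r, l in zip(rows, labels) if l == label)))
        params[label] = [(statistics.mean(c), statistics.stdev(c)) for c in cols]
    return params


def sample_synthetic(params, n_per_class, rng):
    """Owner side: draw synthetic rows; only these leave the organization."""
    rows, labels = [], []
    for label, feats in params.items():
        for _ in range(n_per_class):
            rows.append([rng.gauss(m, s) for m, s in feats])
            labels.append(label)
    return rows, labels


def train_nearest_centroid(rows, labels):
    """Practitioner side: train using synthetic samples only."""
    centroids = {}
    for label in sorted(set(labels)):
        cols = zip(*(r for r, l in zip(rows, labels) if l == label))
        centroids[label] = [statistics.mean(c) for c in cols]

    def predict(row):
        # Return the label whose centroid is closest in squared distance.
        return min(
            centroids,
            key=lambda l: sum((a - b) ** 2 for a, b in zip(row, centroids[l])),
        )

    return predict


def evaluate_on_real(predict, rows, labels):
    """Owner side: score the model on real data, return only the accuracy."""
    hits = sum(predict(r) == l for r, l in zip(rows, labels))
    return hits / len(rows)


# Illustrative "real" dataset held by the owner (generated here for the demo).
rng = random.Random(0)
real_rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
real_rows += [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(200)]
real_labels = [0] * 200 + [1] * 200

params = fit_generator(real_rows, real_labels)            # owner
syn_rows, syn_labels = sample_synthetic(params, 200, rng)  # owner -> practitioner
model = train_nearest_centroid(syn_rows, syn_labels)      # practitioner
accuracy = evaluate_on_real(model, real_rows, real_labels)  # owner -> practitioner
print(f"accuracy on real data: {accuracy:.3f}")
```

In this sketch the only values crossing the boundary are the synthetic rows and the final accuracy; in a real deployment the model would be shipped to the owner's environment for evaluation rather than called in the same process.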