Industry Financial Services
Specialization Or Business Function
Technical Function Analytics
Technology & Tools Machine Learning Frameworks (TensorFlow)
Here is the problem statment:
1. Given a set of feature vectors where each vector represents features found in a given column of data, cluster these columns into a set of discrete domains based on the similarity of the feature vectors. I'd like to use Latent Dirichlet Allocation as the clustering technique.
2. Next, given a set of graphs, where each graph represents a relational database schema, each node represents a table in a schema, and each edge represents a table relationship, construct a set of domain graphs. Table nodes contain one or more columns. Columns are mapped feature vectors. These were mapped to domains in step #1. Domain graphs are representations of the common domain relationships found in the source graphs. Each node is a domain. Each edge is a relationship derived from the underlying column table graph. The intent is for each of these domain-graphs to represent different commonly occurring conceptual data models. One model might represent the related domains found in an HR system representing an employee record. Another might represent customer information as found in a CRM system. And another might have elements representing both customers and employees such as what might be found in a Sales Compensation application database.
3. Refine the domain classification algorithm used in step one by considering adjacent domains found in one or more domain graphs. For example, compute the probability that a column having values ranging from 300 to 800 is a FICO score, given the column is part of a graph that commonly includes a FICO score data element. In other words, P (C1 = FICO-DOMAIN | C1 is part of DG1) is the probability, given that column C1 is part of domain graph DG1, and that DG1 has some probability of including a FICO-DOMAIN. We want to improve our ability to resolve columns having very similar features but represent very different data types. Columns having ranges of numeric values or timestamps are typically the most problematic.
4. Design a method using TensorFlow to combine independently derived domain-graph models (e.g. #2 above being computed in parallel by geographically distributed data profiling processes), such that the domain and domain graph representations (steps #1 and #3 above) eventually converge on a common set of domains and domain-graphs.
Deliverables
Dataset Provided
Please provide your approach for the four problems above and tell us how many hours it will take you to complete this project.
Matching Providers