
Data Engineer Consultant to Implement a Business Intelligence Pipeline

Industry: Pharmaceutical and Life Sciences, Healthcare, Hi-Tech

Specialization or Business Function

Technical Function: Data Management, Data Engineering

Technology & Tools: Big Data and Cloud (Amazon Elastic MapReduce, Apache Hadoop, Apache Spark), Data Warehouse Appliances, Programming Languages and Frameworks (Python)

WORK IN PROGRESS

Project Description

We are a 3-year-old start-up that offers consumers a ground-breaking, personalized and preventative approach to improving their health.  

We are seeking an experienced Data Engineer to consult on, build, and document a workflow for our Business Intelligence pipeline. Success criteria include 100% automation and the ability to transition the pipeline to our competent staff of [primarily software] engineers.

The business processes data from multiple sources on a daily basis to fuel its growing health and wellness platform. The platform currently runs on Spark within AWS, and we are strongly considering using AWS EMR.

This is a 2-4 month position depending on the depth of analysis and development. 

Project applicants will need to: 

  1. Assess the extract process from our set of ~100 relational tables on Postgres into an AWS S3 bucket. Ideally this process will extract only the data that has changed since the previous extraction (see the sketch after this list). Total daily data processed would be less than 1 GB. 
  2. Recommend, build, and document a solution for both a staging and production environment. 
  3. In addition to the Postgres extraction, the process must cover two more data sources: MixPanel (our data partner that collects usage data for our mobile apps and dashboards) and custom CSV data stored on SharePoint. 
  4. Preferably, the extraction solution will include instructions for easily adding further data sources. 
  5. Implement and document an EMR job that translates the extractions into a new S3 bucket that can be read by our data warehouse platform. The current setup runs on a single server, and we would like to leverage the distributed capabilities of EMR. 
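
A rough sketch of the incremental extract in item 1, assuming each table carries an updated_at watermark column (an assumption); the bucket, DSN, and table names below are placeholders:

    import csv
    import io
    from datetime import datetime, timezone

    import boto3
    import psycopg2

    # Placeholder names; the real bucket, DSN, watermark column, and table list are assumptions.
    BUCKET = "example-bi-raw"
    WATERMARK_COLUMN = "updated_at"

    def extract_changed_rows(conn, table, last_run):
        """Pull only the rows modified since the previous extraction."""
        with conn.cursor() as cur:
            cur.execute(
                f"SELECT * FROM {table} WHERE {WATERMARK_COLUMN} > %s",
                (last_run,),
            )
            columns = [d[0] for d in cur.description]
            rows = cur.fetchall()
        return columns, rows

    def write_to_s3(table, columns, rows, run_ts):
        """Write the delta as CSV under a per-table, per-run S3 prefix."""
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(columns)
        writer.writerows(rows)
        key = f"postgres/{table}/{run_ts:%Y-%m-%d}/delta.csv"
        boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue().encode("utf-8"))

    if __name__ == "__main__":
        run_ts = datetime.now(timezone.utc)
        last_run = datetime(2019, 1, 1, tzinfo=timezone.utc)  # would come from a small state store
        conn = psycopg2.connect("dbname=app host=localhost")  # placeholder DSN
        for table in ["users", "orders"]:  # placeholder table names
            columns, rows = extract_changed_rows(conn, table, last_run)
            write_to_s3(table, columns, rows, run_ts)

The last-run timestamp would need to live in a small state store (for example a DynamoDB item or an S3 object) so that reruns stay idempotent.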

Costs should be considered as part of this exercise, and alternatives evaluated where appropriate. All translations currently exist in Python scripts and could ideally be reused. 
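
The existing Python translations could plausibly be wrapped as small PySpark jobs on EMR; a minimal sketch, where translate stands in for one of the current scripts and the bucket paths are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Placeholder bucket/prefix names; the real layout is an assumption.
    RAW = "s3://example-bi-raw/postgres/users/"
    CURATED = "s3://example-bi-curated/users/"

    def translate(df):
        """Stand-in for one of the existing Python translation scripts."""
        return df.withColumn("extracted_date", F.current_date())

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("bi-translate-users").getOrCreate()
        raw = spark.read.option("header", "true").csv(RAW)
        translate(raw).write.mode("overwrite").parquet(CURATED)
        spark.stop()

Writing the curated output as Parquet keeps the downstream warehouse choice open, since both Snowflake and Redshift can load Parquet from S3.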

We currently use Snowflake to load the data from the S3 bucket and make it available in PowerBI to build and run reports. Either continue to use this existing infrastructure or recommend a more cost-effective and/or scalable solution. 
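
If Snowflake is retained, the load step is typically a COPY INTO from an external stage over the curated S3 bucket; a sketch using the Snowflake Python connector, with the account, stage, and table names all placeholders:

    import snowflake.connector

    # All identifiers below (account, warehouse, database, stage, table) are placeholders.
    conn = snowflake.connector.connect(
        account="example_account",
        user="etl_user",
        password="...",  # would come from a secrets manager, not source code
        warehouse="BI_WH",
        database="ANALYTICS",
        schema="PUBLIC",
    )

    try:
        cur = conn.cursor()
        # Assumes an external stage already points at the curated S3 bucket.
        cur.execute("""
            COPY INTO users
            FROM @curated_s3_stage/users/
            FILE_FORMAT = (TYPE = PARQUET)
            MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        """)
    finally:
        conn.close()

PowerBI would then read the loaded table directly from Snowflake, as it does today.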

We require expertise (>3 years of experience) with the following technologies: 

  • Postgres 
  • Elastic MapReduce (EMR) 
  • Spark/Hadoop 
  • Snowflake (or similar Data Warehousing software) 
  • Python (or similar language for loading data) 

 

Milestones 

  1. Single use case
  • Execute the HIPAA compliance agreement and take the 20-minute HIPAA training.
  • Configure the existing environment.
  • Collect the necessary permissions.
  • Build a single Lambda job that extracts a single table (via a view), transforms the data using an existing .sql transform (if possible), and dumps the data into Redshift (or an agreed-upon alternative). A sketch of this job follows the milestones.
  • Verify that the data is available from PowerBI.
  2. Single use case with incremental data
  • Implement views on top of the Postgres tables to protect against a PII/PHI data breach and against schema changes breaking the pipeline (or an alternative approach to address this problem).
  • Ensure that only modified data is copied to Redshift.
  • Estimate the duration of a full ETL run and determine whether the 15-minute limit on AWS Lambda is appropriate.
  3. Migrate all existing necessary transforms to the new architecture (verify with Ben whether any appropriate schema changes should be made). This includes all three data sources (Postgres, MixPanel, and the SharePoint CSV data).
  4. Build a parallel staging environment and develop a release process for Lambda functions and/or Redshift schema changes (agreed upon by Mark and the dev team).
  5. Scheduling/automation of the pipeline
  • Error handling to Sentry/Sumo.
  • Determine the proper cadence (hourly?).
  • Build a test script to verify the success of the full ETL (prod and staging).
  6. Documentation of procedures for future changes
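
For reference, a minimal sketch of the single-table Lambda job from milestone 1; the connection strings, bucket, view, and IAM role are placeholders, and the Redshift COPY assumes the cluster already has a role with read access to the bucket:

    import boto3
    import psycopg2

    # Placeholder values throughout; real settings would come from environment variables or a config store.
    SOURCE_DSN = "dbname=app host=postgres.internal"
    S3_BUCKET = "example-bi-staging"
    REDSHIFT_DSN = "dbname=analytics host=redshift.internal port=5439"
    IAM_ROLE = "arn:aws:iam::123456789012:role/example-redshift-copy"

    def handler(event, context):
        # 1. Extract from the Postgres view (the view hides PII/PHI and absorbs schema changes).
        with psycopg2.connect(SOURCE_DSN) as src, src.cursor() as cur:
            cur.execute("SELECT * FROM reporting.users_v")
            rows = cur.fetchall()

        # 2. Stage the extract in S3 as pipe-delimited text.
        body = "\n".join("|".join(str(v) for v in row) for row in rows)
        key = "staging/users/extract.txt"
        boto3.client("s3").put_object(Bucket=S3_BUCKET, Key=key, Body=body.encode("utf-8"))

        # 3. Load into Redshift (or the agreed-upon alternative) via COPY.
        with psycopg2.connect(REDSHIFT_DSN) as dst, dst.cursor() as cur:
            cur.execute(
                f"COPY analytics.users FROM 's3://{S3_BUCKET}/{key}' "
                f"IAM_ROLE '{IAM_ROLE}' DELIMITER '|'"
            )
        return {"rows": len(rows)}

The milestone-2 incremental variant would add a WHERE clause keyed on the watermark column, as in the extraction sketch earlier in this posting.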

Project Overview

  • Posted: December 10, 2018
  • Planned Start: January 02, 2019
  • Delivery Date: March 05, 2019
  • Preferred Location: Seattle, Washington, United States
