
Data Engineer Consultant to Implement a Business Intelligence Pipeline

Industry: Pharmaceutical and Life Sciences, Healthcare, Hi-Tech

Specialization or Business Function

Technical Function: Data Management, Data Engineering

Technology & Tools: Big Data and Cloud (Amazon Elastic MapReduce, Apache Hadoop, Apache Spark), Data Warehouse Appliances, Programming Languages and Frameworks (Python)

WORK IN PROGRESS

Project Description

We are a 3-year-old start-up that offers consumers a ground-breaking, personalized and preventative approach to improving their health.  

We are seeking an experienced Data Engineer to consult on, build, and document a workflow for our Business Intelligence pipeline. Success criteria include 100% automation and the ability to transition the pipeline to our competent staff of [primarily software] engineers.

The business processes data from multiple sources on a daily basis to fuel its growing health and wellness platform. The platform currently runs on Spark within AWS, and we are strongly considering using AWS EMR.

This is a 2-4 month position depending on the depth of analysis and development. 

Project applicants will need to: 

  1. Assess the extract process from our set of ~100 relational tables on Postgres into an AWS S3 bucket. Ideally this process will extract only the data that has changed since the previous extraction (see the sketch after this list). Total daily data processed would be less than 1 GB. 
  2. Recommend, build, and document a solution for both a staging and production environment. 
  3. In addition to the Postgres extraction, the process must cover two more data sources: MixPanel (our data partner that collects usage data for our mobile apps and dashboards) and custom CSV data stored on SharePoint. 
  4. Preferably, the extraction solution will include instructions for easily adding further data sources. 
  5. Implement and document an EMR job that translates the extractions into a new S3 bucket that can be read by our data warehouse platform. The current setup runs on a single server, and we would like to leverage the distributed capabilities of EMR. 
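
A rough sketch of the incremental extract in item 1, assuming each table carries an updated_at watermark column (an assumption); the bucket, DSN, and table names below are placeholders:

    import csv
    import io
    from datetime import datetime, timezone

    import boto3
    import psycopg2

    # Placeholder names; the real bucket, DSN, watermark column, and table list are assumptions.
    BUCKET = "example-bi-raw"
    WATERMARK_COLUMN = "updated_at"

    def extract_changed_rows(conn, table, last_run):
        """Pull only the rows modified since the previous extraction."""
        with conn.cursor() as cur:
            cur.execute(
                f"SELECT * FROM {table} WHERE {WATERMARK_COLUMN} > %s",
                (last_run,),
            )
            columns = [d[0] for d in cur.description]
            rows = cur.fetchall()
        return columns, rows

    def write_to_s3(table, columns, rows, run_ts):
        """Write the delta as CSV under a per-table, per-run S3 prefix."""
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(columns)
        writer.writerows(rows)
        key = f"postgres/{table}/{run_ts:%Y-%m-%d}/delta.csv"
        boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue().encode("utf-8"))

    if __name__ == "__main__":
        run_ts = datetime.now(timezone.utc)
        last_run = datetime(2019, 1, 1, tzinfo=timezone.utc)  # would come from a small state store
        conn = psycopg2.connect("dbname=app host=localhost")  # placeholder DSN
        for table in ["users", "orders"]:  # placeholder table names
            columns, rows = extract_changed_rows(conn, table, last_run)
            write_to_s3(table, columns, rows, run_ts)

The last-run timestamp would need to live in a small state store (for example a DynamoDB item or an S3 object) so that reruns stay idempotent.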

Costs should be considered as part of this exercise, and alternatives evaluated where appropriate. All translations currently exist in Python scripts and could ideally be reused. 
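
The existing Python translations could plausibly be wrapped as small PySpark jobs on EMR; a minimal sketch, where translate stands in for one of the current scripts and the bucket paths are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Placeholder bucket/prefix names; the real layout is an assumption.
    RAW = "s3://example-bi-raw/postgres/users/"
    CURATED = "s3://example-bi-curated/users/"

    def translate(df):
        """Stand-in for one of the existing Python translation scripts."""
        return df.withColumn("extracted_date", F.current_date())

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("bi-translate-users").getOrCreate()
        raw = spark.read.option("header", "true").csv(RAW)
        translate(raw).write.mode("overwrite").parquet(CURATED)
        spark.stop()

Writing the curated output as Parquet keeps the downstream warehouse choice open, since both Snowflake and Redshift can load Parquet from S3.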

We currently use Snowflake to load the data from the S3 bucket and make it available in PowerBI to build and run reports. Either continue to use this existing infrastructure or recommend a more cost-effective and/or scalable solution. 
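
If Snowflake is retained, the load step is typically a COPY INTO from an external stage over the curated S3 bucket; a sketch using the Snowflake Python connector, with the account, stage, and table names all placeholders:

    import snowflake.connector

    # All identifiers below (account, warehouse, database, stage, table) are placeholders.
    conn = snowflake.connector.connect(
        account="example_account",
        user="etl_user",
        password="...",  # would come from a secrets manager, not source code
        warehouse="BI_WH",
        database="ANALYTICS",
        schema="PUBLIC",
    )

    try:
        cur = conn.cursor()
        # Assumes an external stage already points at the curated S3 bucket.
        cur.execute("""
            COPY INTO users
            FROM @curated_s3_stage/users/
            FILE_FORMAT = (TYPE = PARQUET)
            MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        """)
    finally:
        conn.close()

PowerBI would then read the loaded table directly from Snowflake, as it does today.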

We require expertise (>3 years of experience) with the following technologies: 

  • Postgres 
  • Elastic MapReduce (EMR) 
  • Spark/Hadoop 
  • Snowflake (or similar Data Warehousing software) 
  • Python (or similar language for loading data) 

 

Milestones 

  1. Single use case
  • Execute the HIPAA compliance agreement and take the 20-minute HIPAA training.
  • Configure the existing environment.
  • Collect the necessary permissions.
  • Build a single Lambda job that extracts a single table (via a view), transforms the data using an existing .sql transform (if possible), and dumps the data into Redshift (or an agreed-upon alternative). A sketch of this job follows the milestones.
  • Verify that the data is available from PowerBI.
  2. Single use case with incremental data
  • Implement views on top of the Postgres tables to protect against a PII/PHI data breach and against schema changes breaking the pipeline (or an alternative approach to address this problem).
  • Ensure that only modified data is copied to Redshift.
  • Estimate the duration of a full ETL run and determine whether the 15-minute limit on AWS Lambda is appropriate.
  3. Migrate all existing necessary transforms to the new architecture (verify with Ben whether any appropriate schema changes should be made). This includes all three data sources (Postgres, MixPanel, and the SharePoint CSV data).
  4. Build a parallel staging environment and develop a release process for Lambda functions and/or Redshift schema changes (agreed upon by Mark and the dev team).
  5. Scheduling/automation of the pipeline
  • Error handling to Sentry/Sumo.
  • Determine the proper cadence (hourly?).
  • Build a test script to verify the success of the full ETL (prod and staging).
  6. Documentation of procedures for future changes
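
For reference, a minimal sketch of the single-table Lambda job from milestone 1; the connection strings, bucket, view, and IAM role are placeholders, and the Redshift COPY assumes the cluster already has a role with read access to the bucket:

    import boto3
    import psycopg2

    # Placeholder values throughout; real settings would come from environment variables or a config store.
    SOURCE_DSN = "dbname=app host=postgres.internal"
    S3_BUCKET = "example-bi-staging"
    REDSHIFT_DSN = "dbname=analytics host=redshift.internal port=5439"
    IAM_ROLE = "arn:aws:iam::123456789012:role/example-redshift-copy"

    def handler(event, context):
        # 1. Extract from the Postgres view (the view hides PII/PHI and absorbs schema changes).
        with psycopg2.connect(SOURCE_DSN) as src, src.cursor() as cur:
            cur.execute("SELECT * FROM reporting.users_v")
            rows = cur.fetchall()

        # 2. Stage the extract in S3 as pipe-delimited text.
        body = "\n".join("|".join(str(v) for v in row) for row in rows)
        key = "staging/users/extract.txt"
        boto3.client("s3").put_object(Bucket=S3_BUCKET, Key=key, Body=body.encode("utf-8"))

        # 3. Load into Redshift (or the agreed-upon alternative) via COPY.
        with psycopg2.connect(REDSHIFT_DSN) as dst, dst.cursor() as cur:
            cur.execute(
                f"COPY analytics.users FROM 's3://{S3_BUCKET}/{key}' "
                f"IAM_ROLE '{IAM_ROLE}' DELIMITER '|'"
            )
        return {"rows": len(rows)}

The milestone-2 incremental variant would add a WHERE clause keyed on the watermark column, as in the extraction sketch earlier in this posting.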

Project Overview

  • Posted: December 10, 2018
  • Planned Start: January 02, 2019
  • Delivery Date: March 05, 2019
  • Preferred Location: Seattle, Washington, United States
