Project Description
Context:
We are an early-stage tech company focused on combining survey and physical data to create accurate, early forecasts and metrics for a global customer base.
Task:
- A government organization has been monitoring and forecasting a key economic activity for more than 100 years. Its monthly outlook reports have become a critical source of truth for the markets.
- The task at hand is to forecast these government reports at both the national and subnational level at least one week ahead of time, refreshed daily.
- Final models will be assessed on accuracy and earliness across a 10+ year time window (backtesting; a minimal sketch follows below)
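For concreteness, below is a minimal sketch of the kind of expanding-window backtest implied by this assessment. Every column name, feature, and data value in it is a hypothetical stand-in (synthetic monthly data), not part of the actual deliverables; the real backtest would substitute the government report series and the engineered predictors.

```python
# Minimal expanding-window backtest sketch on synthetic monthly data.
# All names (report_date, feat_a, feat_b, target) are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
dates = pd.date_range("2003-01-01", "2016-12-31", freq="MS")  # monthly report dates
df = pd.DataFrame({
    "report_date": dates,
    "feat_a": rng.normal(size=len(dates)),
    "feat_b": rng.normal(size=len(dates)),
})
df["target"] = 2 * df["feat_a"] - df["feat_b"] + rng.normal(scale=0.1, size=len(dates))

features = ["feat_a", "feat_b"]
scores = []
# Expanding window: train on everything before each cutoff year,
# predict that year, and record the out-of-sample error.
for cutoff_year in range(2008, 2017):
    train = df[df["report_date"].dt.year < cutoff_year]
    test = df[df["report_date"].dt.year == cutoff_year]
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(train[features], train["target"])
    pred = model.predict(test[features])
    scores.append({"year": cutoff_year, "mae": mean_absolute_error(test["target"], pred)})

print(pd.DataFrame(scores))
```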
Materials:
- All data are available at the state and national levels on a daily basis for 2003-2016 unless otherwise specified:
- Government forecasts at the state and national level, at a monthly cadence (this is what is being forecasted in this pilot work)
- Government end-of-year actuals at the state and national levels (this is what the government is forecasting)
- Survey-based data on a weekly basis that is relevant to the target outcome and available at the state level
- Earth data features at the state and sub-state level for key physical variables, formatted for tabular ingest; hundreds of predictors on a daily basis (a data-alignment sketch follows this list)
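To illustrate how these materials might be combined into one modeling table, the following is a minimal sketch that aligns monthly report dates with the most recent weekly survey observation available at least one week earlier. All column names and values (report_date, obs_date, survey_condition, gov_forecast, the state codes) are hypothetical; the same as-of join pattern would apply to the daily earth-data features.

```python
# Sketch of aligning monthly report dates with the latest weekly survey values
# available at least one week before each report (all names/values hypothetical).
import pandas as pd

# Monthly government report dates per state (the forecast target).
reports = pd.DataFrame({
    "state": ["IA", "IA", "NE", "NE"],
    "report_date": pd.to_datetime(["2010-06-10", "2010-07-12", "2010-06-10", "2010-07-12"]),
    "gov_forecast": [168.0, 165.0, 172.0, 171.0],
})

# Weekly survey observations per state.
survey = pd.DataFrame({
    "state": ["IA"] * 6 + ["NE"] * 6,
    "obs_date": pd.to_datetime(
        ["2010-05-30", "2010-06-06", "2010-06-13",
         "2010-06-20", "2010-06-27", "2010-07-04"] * 2
    ),
    "survey_condition": [0.71, 0.72, 0.74, 0.73, 0.70, 0.69,
                         0.66, 0.68, 0.69, 0.70, 0.71, 0.72],
})

# Forecasts must be issued at least one week ahead, so only use
# observations dated on or before report_date - 7 days.
reports["cutoff"] = reports["report_date"] - pd.Timedelta(days=7)

aligned = pd.merge_asof(
    reports.sort_values("cutoff"),
    survey.sort_values("obs_date"),
    left_on="cutoff",
    right_on="obs_date",
    by="state",
    direction="backward",
)
print(aligned[["state", "report_date", "gov_forecast", "obs_date", "survey_condition"]])
```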
Timeframe & Project Plan:
- Overall aim is productionized pilot models by the end of July; ideally, work starts as soon as possible
- Four-week timeline, assuming one full-time resource
- Weekly milestones, with a project go/no-go decision at the end of week one
Ramp Up | week 1
- 2-3 meetings with the team to understand the domain / challenge
- Data onboarding for all of the materials
- Discussion of workplan
- Data exploration
- Variable selection
- Modeling plan
- Outline of white paper (2-3 pages)
- Very rough, first prototype models (initial results)
Preliminary Models | week 2
- Analysis of early results; backtesting prepared
- Prototype models
- Prioritization for model refinements
- Pre-engineering for putting into production
Model Improvements | week 3
- Revised prototype models
- Updated analysis and backtesting prepared
- Pre-engineering for putting into production
- Draft white paper (2-3 pages)
Model Improvements | week 4
- Final set of changes / permutations
- Placing models into production alongside engineering
- Final QA and backtesting
- White paper finalized (2-3 pages)
Logistics:
Location: Preference for onsite, with flexibility for video / remote work; strong preference for a roughly overlapping timezone
Engagement: Aim is a full-time engagement of 40 hours per week
Tools: Python (scikit-learn) and/or R; the collaborator should be a master of one or both of these frameworks
Modeling: Deep experience in machine learning-based predictive modeling and time series; the candidate should have long applied experience with models such as Random Forest, SVM, Cubist, and GBM (a minimal illustrative example follows this section)
Data engineering: Our team will deliver large, structured data cubes (flat files) for modeling; the candidate should be comfortable handling at-scale data challenges; that said, local-machine execution should be adequate (no obvious need for distributed or high-performance compute)
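As a rough illustration of the expected scikit-learn workflow, and not a prescribed implementation, the sketch below fits a GBM-style regressor on synthetic tabular data and evaluates it with time-ordered splits so no fold trains on future rows. Every variable, shape, and hyperparameter here is a placeholder.

```python
# Illustrative GBM fit with time-ordered cross-validation on synthetic data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(42)
n = 500                               # e.g. daily rows for one state (placeholder)
X = rng.normal(size=(n, 20))          # stand-in for the hundreds of real predictors
y = X[:, 0] * 1.5 - X[:, 1] + rng.normal(scale=0.2, size=n)

model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3)
cv = TimeSeriesSplit(n_splits=5)      # respects time order; avoids look-ahead leakage
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)
```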