Big Data Operations Specialists are required to participate in the operations and deployment of NoSQL databases. They will also play a pivotal role in the operation of big data pipelines, both on-premises and in the cloud. TalentCloud members should be able to serve as senior members who focus on the availability, reliability, and sustainability of data platform components. Other abilities and experiences include:
- Experts in this TalentCloud should be able to work closely with the Data Platform and DevOps teams to enable faster deployment of data-driven applications
- Expertise in state-of-the-art tools and frameworks to build scalable and efficient solutions for data management, data pre-processing, and data set building
- Experience in deploying databases and data pipelines end to end into production environments
- The data platform may consist of large Hadoop, Spark, HBase (or other NoSQL database), and Kafka clusters, on premises or in the cloud
Responsibilities
- Deploy data pipelines and testing frameworks to development, QA, staging, and production environments
- Monitor, maintain, provision, upgrade, and troubleshoot Hadoop, HBase, Spark, and Kafka systems to support a complex Data Pipeline Platform
- Participate in an on-call rotation, responding to alerts and system issues for Hadoop, HBase, Kafka, and more
- Troubleshoot, repair, and recover from hardware or software failures
- Identify and resolve faults, inconsistencies, and systemic issues. Coordinate and communicate with impacted constituencies
- Manage user access and resource allocations for the Data Pipeline Platform
- Develop tools to automate routine day-to-day tasks such as security patching, software upgrades, and hardware allocation. Utilize automated system monitoring tools to verify the integrity and availability of all hardware, server resources, and critical processes (a minimal monitoring sketch follows this list)
- Engage other teams during outages or planned maintenance
- Administer development, test, QA, and production servers
- Triage outages and defects and resolve them within established SLAs
- Work with Application Developers and Solution Architects to identify opportunities to improve the operational and supportability model as part of continuously improving and maturing CDA’s Production Operations function
- Develop and monitor a milestone-based schedule of Production Readiness activities
- Monitor the stability and performance of the production environment after major releases and provide insight into improvement opportunities for future releases
- Recommend cluster upgrades and ensure reliable functionality for CDA customers
- Perform cluster and system performance tuning
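For illustration, here is a minimal sketch of the kind of automated health check this role involves: polling the Hadoop NameNode's JMX endpoint and alerting when the number of live DataNodes drops below a threshold. The hostname, the port (9870 is the Hadoop 3.x web UI default), and the threshold are assumptions, not details of any specific platform.

```python
# Minimal health-check sketch (assumed host/port/threshold, not a real cluster).
import json
import sys
from urllib.request import urlopen

# FSNamesystemState is a standard NameNode JMX bean exposing NumLiveDataNodes.
NAMENODE_JMX = ("http://namenode.example.com:9870/jmx"
                "?qry=Hadoop:service=NameNode,name=FSNamesystemState")
MIN_LIVE_DATANODES = 3  # assumed alert threshold

def live_datanodes(url: str) -> int:
    """Fetch the JMX bean and return the NumLiveDataNodes metric."""
    with urlopen(url, timeout=10) as resp:
        beans = json.load(resp)["beans"]
    return beans[0]["NumLiveDataNodes"]

if __name__ == "__main__":
    live = live_datanodes(NAMENODE_JMX)
    if live < MIN_LIVE_DATANODES:
        print(f"ALERT: only {live} live DataNodes", file=sys.stderr)
        sys.exit(1)  # nonzero exit lets a scheduler or Nagios raise the alert
    print(f"OK: {live} live DataNodes")
```

A check like this would typically run from cron or a Nagios-style scheduler, which is how the automated monitoring responsibility above is usually wired up.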
Required Skills
- Relevant experience in implementing, troubleshooting, and supporting the Unix/Linux operating system, with concrete knowledge of system administration/internals
- Relevant experience in scripting/writing/modifying code for monitoring/deployment/automation in one of the following (or comparable): Python, Shell, Go, Perl, Java, C
- Relevant experience with any of the following technologies: Hadoop HDFS, YARN/MapReduce, HBase, Kafka (a consumer-lag sketch follows this list)
- Relevant experience with any of the following technologies: Puppet, Chef, Ansible, or an equivalent configuration management tool
- Familiarity with TCP/IP networking (DNS, DHCP, HTTP, etc.)
- Strong written and oral communication skills with the ability to interface with technical and non-technical stakeholders at various levels of the organization
- Beneficial skills and experience (if you don’t have all of them, you can learn them at Xandr):
- Experience with JVM and GC tuning is a plus
- Regular expression fluency (see the log-parsing sketch after this list)
- Experience with Nagios or similar monitoring tools
- Experience with data collection/graphing tools like Cacti, Ganglia, Graphite, and Grafana
- Experience with tcpdump, Wireshark (formerly Ethereal), tshark, and other packet capture and analysis tools
- Demonstrated ability to quickly adapt, learn new skill sets, and understand operational challenges; a self-starter
- Strong analytical, problem-solving, negotiation, and organizational skills with a clear focus under pressure
- Must be proactive with proven ability to execute multiple tasks simultaneously
- Resourceful and results-oriented, with the ability to get things done and overcome obstacles
- Excellent interpersonal skills, including relationship building with a diverse, global, cross-functional team
- Proficient in SQL and in creating ETL processes (see the ETL sketch after this list)
- Previous experience building or deploying efficient large-scale data collection, storage, and processing pipelines
- Knowledge of database systems, big data concepts, and cluster computing frameworks (e.g. Spark, Hadoop, or other tools)
- Experience working in a cloud machine learning environment, including the deployment of models to production
- Experience with Agile, Continuous Integration, Continuous Deployment, Test Driven Development, Git
- Understanding of time, RAM, and I/O scalability aspects of data science applications (e.g. CPU and GPU acceleration, operations on sparse arrays, model serialization and caching)
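To illustrate the Kafka experience called for above, here is a minimal consumer-lag check. It assumes the third-party kafka-python package (pip install kafka-python); the broker address, consumer group, and topic name are placeholders.

```python
# Minimal consumer-lag sketch; broker, group, and topic are placeholders.
from kafka import KafkaConsumer, TopicPartition

BROKERS = "broker.example.com:9092"
GROUP = "etl-pipeline"  # assumed consumer group
TOPIC = "events"        # assumed topic

consumer = KafkaConsumer(bootstrap_servers=BROKERS, group_id=GROUP,
                         enable_auto_commit=False)
partitions = [TopicPartition(TOPIC, p)
              for p in consumer.partitions_for_topic(TOPIC)]
end_offsets = consumer.end_offsets(partitions)  # latest offset per partition

# Lag = latest broker offset minus the group's committed offset.
for tp in partitions:
    committed = consumer.committed(tp) or 0
    print(f"partition {tp.partition}: lag={end_offsets[tp] - committed}")
consumer.close()
```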
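For the regular-expression item, a small sketch that parses a log4j-style line (the format Hadoop and Kafka services commonly emit) using named groups; the sample line and pattern are illustrative only.

```python
# Parse "YYYY-MM-DD HH:MM:SS,mmm LEVEL message" log lines with named groups.
import re

LOG_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+"
    r"(?P<level>INFO|WARN|ERROR|FATAL)\s+(?P<msg>.*)$"
)

line = "2024-05-01 12:34:56,789 ERROR Failed to replicate block blk_1073741825"
match = LOG_PATTERN.match(line)
if match:
    print(match.group("ts"), match.group("level"), match.group("msg"))
```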
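And for the SQL/ETL item, a minimal extract-transform-load sketch using only the Python standard library; the table, column names, and file paths are illustrative, and a production pipeline would load into a real warehouse rather than SQLite.

```python
# Minimal ETL sketch: extract from CSV, transform fields, load into SQLite.
import csv
import sqlite3

def run_etl(csv_path: str, db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS events (user_id TEXT, event TEXT)")
    with open(csv_path, newline="") as f:
        # Transform step: trim ids and normalize event names to lowercase.
        rows = [(r["user_id"].strip(), r["event"].lower())
                for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    run_etl("events.csv", "warehouse.db")  # illustrative paths
```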