Connect and model distributed data sets of medium complexity to build repositories, such as data warehouses and data lakes, using appropriate technologies.
Manage data-related work across medium to large data sets, whether structured, unstructured, or streaming: extraction, transformation, curation, modelling, building data pipelines, identifying the right tools, and writing SQL/Java/Python code.
Contribute to the Community of Practice/Center of Excellence to create and enhance standards and best practices.
- Partner with Data Architect to enhance/maintain optimal data pipeline architecture aligned to published standards
- Assemble data sets of medium complexity to meet functional and non-functional requirements
- Design and implement internal process improvements: automating manual processes, optimizing data delivery, re-designing infrastructure for greater scalability, etc.
- Build the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of data sources
- Document data sources in the enterprise data catalog with metadata, lineage, and classification information
- Develop low-complexity aggregations and algorithms needed for reporting and analytics
- Build analytics tools that utilize the data pipeline to provide actionable insights into customer acquisition, operational efficiency and other key business performance metrics
- Work with stakeholders, including domain leads and teams, to assist with data-related technical issues and support their data infrastructure needs
- Ensure the technology footprint adheres to data security policies and procedures related to encryption, obfuscation, and role-based access
- Create data tools for analytics and data scientist team members
- Knowledge of data and analytics frameworks supporting data lakes, warehouses, marts, reporting, etc.
- Define data retention policies, monitor performance, and advise on any necessary infrastructure changes based on functional and non-functional requirements
- In-depth knowledge of the data engineering discipline
- Extensive experience working with Big Data tools and building data solutions for advanced analytics
- Minimum of 5 years' hands-on experience with a strong data background
- Solid programming skills in Java, Python and SQL
- Clear hands-on experience with database systems: Hadoop ecosystem, cloud technologies (e.g. AWS, Azure, Google), in-memory database systems (e.g. HANA, Hazelcast), traditional RDBMS (e.g. Teradata, SQL Server, Oracle), and NoSQL databases (e.g. Cosmos DB, MongoDB, DynamoDB)
- Practical knowledge across data extraction and transformation tools: traditional ETL tools (e.g. Informatica, Databricks) as well as more recent big data tools
- Experience with ML training workloads, including concurrent I/O and API calls, parallel dataset processing for training, and Python coding for multiple GPU cores
- Expertise in Docker and Kubernetes (k8s), and familiarity with YAML for deployments
- Background in programming, databases and/or big data technologies OR
- BS/MS in software engineering, computer science, economics or other engineering fields