**Responsibilities**:
- Design, develop, and implement data pipelines for ingesting, pre-processing, and transforming diverse data types (HTML, images, PDF, audio, video) for Generative AI model training and inference.
- Engineer data for vector databases (e.g., Pinecone, Redis, Chroma) and large language models (LLMs such as GPT-4 and Claude 2.0) for tasks like text summarization, entity extraction, and classification (a minimal pipeline sketch follows this list).
- Build and maintain efficient data storage solutions, including data lakes, warehouses, and databases suited to large-scale Generative AI datasets.
- Implement data security and governance policies to safeguard the privacy and integrity of sensitive data used in Generative AI projects.
- Collaborate with data scientists and engineers to understand data requirements for Generative AI models and translate them into efficient data pipelines.
- Monitor and optimize data pipelines for performance, scalability, and cost-effectiveness.
- Build analytical tools that utilize the data pipeline to provide actionable insights into key business performance metrics, including operational efficiency and customer acquisition.
- Collaborate with stakeholders across the Executive, Product, Data, and Design teams to support their data infrastructure needs and assist with data-related technical issues.
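
To illustrate the kind of ingestion work described above, here is a minimal Python sketch of a pipeline that chunks documents, embeds them, and upserts the vectors into a vector store. The chunk sizes, the hash-based `embed_text()` stand-in, and the `InMemoryIndex` class are illustrative assumptions only; a real pipeline would call an embedding model and a vector-database SDK (e.g., Pinecone or Chroma) instead.

```python
"""Minimal sketch of a document-ingestion pipeline for a vector store.

The chunking parameters, the hash-based embed_text() stand-in, and the
InMemoryIndex class are illustrative assumptions; a real pipeline would
call an embedding model and a vector-database SDK (e.g., Pinecone, Chroma).
"""
import hashlib
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Chunk:
    doc_id: str
    chunk_id: str
    text: str


def chunk_document(doc_id: str, text: str, max_chars: int = 1000, overlap: int = 100) -> List[Chunk]:
    """Split raw text into overlapping character windows ready for embedding."""
    chunks: List[Chunk] = []
    step = max_chars - overlap
    for start in range(0, max(len(text), 1), step):
        piece = text[start:start + max_chars]
        if not piece.strip():
            continue
        chunk_id = hashlib.sha1(f"{doc_id}:{start}".encode()).hexdigest()[:12]
        chunks.append(Chunk(doc_id=doc_id, chunk_id=chunk_id, text=piece))
    return chunks


def embed_text(text: str) -> List[float]:
    """Toy hash-based vector so the sketch runs without a model; replace with a real embedding call."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:16]]


class InMemoryIndex:
    """Stand-in for a vector-database client exposing an upsert(vectors=...) method."""

    def __init__(self) -> None:
        self.vectors: dict = {}

    def upsert(self, vectors: List[dict]) -> None:
        for item in vectors:
            self.vectors[item["id"]] = item


def ingest(docs: Iterable[Tuple[str, str]], index: InMemoryIndex) -> None:
    """Chunk, embed, and upsert (doc_id, text) pairs into the index."""
    for doc_id, text in docs:
        for chunk in chunk_document(doc_id, text):
            index.upsert(vectors=[{
                "id": chunk.chunk_id,
                "values": embed_text(chunk.text),
                "metadata": {"doc_id": chunk.doc_id, "text": chunk.text},
            }])


if __name__ == "__main__":
    index = InMemoryIndex()
    ingest([("doc-1", "Example text extracted from a PDF or HTML page. " * 50)], index)
    print(f"indexed {len(index.vectors)} chunks")
```
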
**Qualifications**:
- Bachelor's degree in computer science, data science, statistics, or a related field, or equivalent experience.
- 6+ years of proven experience in data engineering, including ETL, SQL, databases, JSON data, data pipeline development, building data platforms, and data storage technologies.
- 2+ years of experience building and maintaining data pipelines for machine learning projects.
- Strong understanding of data structures, data modeling principles, data quality measures, and data security best practices, with experience in transforming, cleaning, and organizing unstructured data.
- High proficiency in Python, SQL, and scripting languages.
- Experience with continuous integration/deployment (CI/CD) for large data pipelines, plus familiarity with containerization technologies (e.g., Docker) and orchestration tools (e.g., Kubernetes) for scalable and efficient model deployment.
- Familiarity with implementing data and/or machine learning algorithms in production systems (e.g., AWS SageMaker, GCP Datalab, or custom implementations).
- Hands-on experience with cloud platforms (e.g., OCI, AWS, GCP, Azure) for data storage and processing, along with Generative AI services such as OCI Generative AI, Azure AI Services, or Amazon Bedrock.
- Strong problem-solving skills and the ability to analyze data and design solutions to complex data issues.
- Familiarity with the modern ETL stack (Airflow, dbt, Snowflake), data streaming frameworks (Kafka, Kinesis), vector databases (e.g., Pinecone, Redis, Chroma), and OpenSearch/Elasticsearch (a minimal Airflow sketch follows this list).
- Understanding of large language models (LLMs such as GPT-4 and Claude 2.0) and their use for tasks like text summarization, entity extraction, and classification.
- Excellent communication skills and the ability to convey complex technical concepts to non-technical stakeholders.
- Ability to work independently and collaboratively in a fast-paced environment.
- Practical knowledge of Agile project management and software development methodologies such as Scrum and SAFe.
- Experience working with globally distributed teams.
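
As a small illustration of the ETL stack mentioned above, below is a minimal Airflow DAG sketch wiring a daily extract → transform → load run. The task bodies, the `dag_id`, and the schedule are illustrative assumptions, and the `schedule` argument assumes Airflow 2.4+ (older versions use `schedule_interval`); treat it as a sketch rather than a reference implementation.

```python
"""Minimal sketch of a daily extract -> transform -> load DAG in Airflow.

All task bodies are placeholders; the dag_id and schedule are illustrative.
Assumes Airflow 2.4+ (use `schedule_interval` on older versions).
"""
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Pull raw records from the source system (placeholder).
    print("extracting raw data")


def transform(**context):
    # Clean, deduplicate, and reshape the records (placeholder).
    print("transforming data")


def load(**context):
    # Write the transformed records to the warehouse (placeholder).
    print("loading to warehouse")


with DAG(
    dag_id="example_daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```
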