We’re building a scalable data platform that processes and analyzes massive volumes of events and connections every day, forming the foundation for advanced cybersecurity research models. You’ll work across batch and real-time streaming, complex data modeling, high-performance databases, and production workflows that support ML pipelines, LLMs, and RAG.
What you’ll do
- Architect & build the data architecture: data lakes, warehouses, and data pipelines
- Build and manage data integration pipelines that ingest large-scale, complex datasets from diverse sources, ensuring data quality and consistency
- Own data orchestration: automate and schedule workflows end to end
- Drive database performance: indexing, aggregations, partitioning, and high-performance query execution
- Ensure scalability & reliability under load for real-world enterprise data volumes
- Partner with Data Science to integrate ML and LLM workflows, including real-time serving of RAG pipelines and ML-driven product features
- Collaborate with engineers and cybersecurity experts to align the platform with business and research goals
- Stay ahead of new tech in big data, cloud, and security to continuously improve the platform
- Maintain high engineering standards with strong testing and CI/CD practices
Requirements
- Proven experience designing and building large-scale data systems from scratch on AWS, GCP, or Azure
- Strong backend engineering background in complex, data-driven systems
- Hands-on expertise in data modeling: schemas that support both real-time and batch workloads
- Experience with streaming technologies: Kafka, Flink, or Spark Streaming
- Deep understanding of modern databases (SQL, NoSQL, NewSQL, and graph), including performance optimization at scale
- Knowledge of data security, encryption, and governance best practices
- Strong focus on code quality, reliability, testing, and CI/CD
- Experience with microservices, Kubernetes, and Terraform – an advantage
- Experience with orchestration tools such as Dagster or Apache Airflow – an advantage