Open source project for data preparation of LLM application builders
Updated Feb 28, 2025 - HTML
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Beginner data engineering project - batch edition
Spark library for generalized K-Means clustering. Supports general Bregman divergences. Suitable for clustering probabilistic data, time series data, high dimensional data, and very large data.
Apache Spark™ and Scala Workshops
Companion to Learning Hadoop and Learning Spark courses on LinkedIn Learning
Toolkit for Apache Spark ML: feature clean-up, feature importance calculation suite, information gain selection, distributed SMOTE, model selection and training, hyperparameter optimization and selection, and model interpretability.
A concise resource repository for machine learning
Teaching Materials for Distributed Statistical Computing (teaching materials for big data distributed computing)
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka and Cassandra
Streaming data changes to a Data Lake with Debezium and Delta Lake pipeline
Spark algorithms for building k-nn graphs
Ansible roles to deploy Kubernetes, JupyterHub, Jupyter Enterprise Gateway and Spark on a Kubernetes cluster
Kaggle's Predict Future Sales competition project (TOP 15 solution as of March 2020)
Big Data workshop in Spanish
Adds a notification panel to your Laravel Spark Kiosk, allowing you to send notifications to users.
Lecture: Big Data
Created by Matei Zaharia
Released May 26, 2014