# apache-spark
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Here are 1,122 public repositories matching this topic...
Coolplay Spark: Spark source code analysis, Spark libraries, and more
Updated May 26, 2019 - Scala
Interactive and Reactive Data Science using Scala and Spark.
Updated Mar 31, 2021 - JavaScript
Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
python
scala
apache-spark
pytorch
keras-tensorflow
bigdl
distributed-deep-learning
deep-neural-network
analytics-zoo
Updated Oct 15, 2021 - Jupyter Notebook
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
Updated Aug 16, 2021 - Java
nopcoder commented on Oct 14, 2021
Currently, the documentation version drop-down lists the label "Latest" first and the other versions in ascending order.
GoEddie commented on Dec 30, 2019
This issue tracks the implementation of the ML features: https://spark.apache.org/docs/latest/ml-features
Bucketizer has been implemented in dotnet/spark#378, but more features remain to be implemented.
- Feature Extractors
  - TF-IDF
  - Word2Vec (dotnet/spark#491)
  - CountVectorizer (https://github.com/dotnet/spark/p
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
kubernetes
spark
apache-spark
kubernetes-operator
kubernetes-controller
kubernetes-crd
google-cloud-dataproc
Updated Oct 15, 2021 - Go
Apache Spark docker image
Updated Sep 15, 2021 - Dockerfile
A curated list of awesome Apache Spark packages and resources.
Updated Aug 25, 2021
PySpark + Scikit-learn = Sparkit-learn
Updated Dec 31, 2020 - Python
(Deprecated) Scikit-learn integration package for Apache Spark
Updated Dec 3, 2019 - Python
C# and F# language binding and extensions to Apache Spark
streaming
spark
apache-spark
csharp
fsharp
bigdata
dataset
spark-streaming
eventhubs
mapreduce
dataframe
rdd
dstream
mobius
kafka-streaming
near-real-time
Updated Jan 29, 2021 - C#
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
python
airflow
spark
apache-spark
scheduler
s3
data-engineering
data-lake
warehouse
redshift
data-migration
livy
etl-framework
apache-airflow
emr-cluster
etl-pipeline
etl-job
data-engineering-pipeline
airflow-dag
goodreads-data-pipeline
Updated Mar 9, 2020 - Python
R interface for Apache Spark
Updated Oct 7, 2021 - R
Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.
Updated Jan 24, 2017 - Scala
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
data-science
machine-learning
spark
apache-spark
deep-learning
hadoop
tensorflow
keras
keras-models
optimization-algorithms
data-parallelism
distributed-optimizers
Updated Jul 25, 2018 - Python
Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end is now at https://github.com/apache/spark/
Updated Jan 8, 2020 - Scala
Papers and reading material on streaming systems
streaming
apache-spark
storm
stream-processing
spark-streaming
dataflow
flink
heron
drizzle
millwheel
s4
streaming-engine
spe
stream-processing-engine
Updated Mar 31, 2018
A command-line tool for launching Apache Spark clusters.
Updated Jun 13, 2021 - Python
REST web service for the true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models
Updated Feb 22, 2021 - Java
This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
Updated Apr 15, 2021 - Scala
A list about Apache Kafka
infrastructure
kafka
apache-spark
stream-processing
apache-kafka
kafka-streams
data-processing
data-pipeline
streaming-data
Updated Jul 27, 2021
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
python
vagrant
data-science
data
machine-learning
airflow
kafka
spark
apache-spark
analytics
machine-learning-algorithms
python3
amazon-ec2
python-3
apache-kafka
amazon-web-services
predictive-analytics
agile-data
data-syndrome
agile-data-science
Updated Sep 8, 2021 - Jupyter Notebook
This is the development repository of SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task metrics data.
Updated Jun 15, 2021 - Scala
The Internals of Spark Structured Streaming
Updated May 23, 2021
A boilerplate for writing PySpark Jobs
Updated Mar 30, 2021 - Python
Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.
Updated Sep 14, 2015 - Shell
Created by Matei Zaharia
Released May 26, 2014
- Repository: apache/spark
- Website: spark.apache.org
- Wikipedia: Wikipedia
MLflow Roadmap Item
This is an MLflow Roadmap item that has been prioritized by the MLflow maintainers. We're seeking help with the implementation of roadmap items tagged with the help wanted label. For requirements clarifications and implementation questions, or to request a PR review, please tag @BenWilson2 in your communications related to this issue.
Proposal Summary
Includ