big-data
Here are 1,888 public repositories matching this topic...
I was going through the existing enhancement issues again and thought it'd be nice to collect ideas for spaCy plugins and related projects. There are always people in the community who are looking for new things to build, so here's some inspiration
If you have questions about the projects I suggested,
There's no published benchmark for IOPS on S3 storage.
Would it be possible to post this alongside the other benchmarks?
S3 storage would be a super cheap way to get started because it's serverless (thus more folks would potentially use gun.js).
Thank you for the useful service. I would like to see more Auth/ABAC for startup usage, right now I'm using a centralized database because it's uncle
FileSystemContext in presto-raptor can now be replaced by HdfsContext given presto-hive-metastore has been separated into a standalone module.
AFAICT they are equivalent. Found a usage of PyObject_str here and it looks like the optimization isn't made in other places where we just do str(x).
I was happy to see that the usage of PyUnicode_Join was unnecessa
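As a side note on the shortcut being discussed (a minimal illustration, not the project's own code): in CPython, str() on an object that is already an exact str returns the very same object, which is exactly the fast path PyObject_Str takes, so plain str(x) already benefits from it.

```python
# Minimal illustration (assumes CPython): calling str() on an exact
# str instance returns the same object -- no copy is made -- which is
# the fast path PyObject_Str provides.
s = "hello world"
print(str(s) is s)   # True: identity is preserved for exact str

class Tagged(str):
    """A str subclass; str() must normalize it to an exact str."""
    pass

t = Tagged("hello")
print(str(t) is t)   # False: subclasses take the slow path and get copied
print(str(t))        # "hello"
```

This is why replacing str(x) with a direct PyObject_Str call is usually a wash for exact strings.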
Description
I installed a new 6-node cluster, enabled authentication, and added 5 nodes through Fauxton.
When I run Verify CouchDB Installation from Fauxton, I see an error in the Replication check:
Error: unauthorized to access or create database http://0.0.0.0:5984/verifytestdb_replicate/
And on one of the nodes I see an error:
[error] 2019-12-22T16:05:37.312700Z couchdb@s2dfw.domain.net <0.26254.18
The pipeline spec docs say that the input field is required, but it isn't for spouts.
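For illustration, a minimal spout pipeline spec of the kind being described might look like this (the pipeline name, image, and command are hypothetical placeholders, not taken from the docs in question):

```json
{
  "pipeline": { "name": "my-spout" },
  "transform": {
    "image": "my-org/my-spout-image:latest",
    "cmd": ["/bin/my-spout"]
  },
  "spout": {}
}
```

Note the absence of any input field, which is what the docs currently mark as required.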
The docs have a great intro that explains the technology build-up that led to inventing Stream, but then it stops without explaining how Stream uses Cassandra + Redis (plus a Celery message queue?) to solve this problem. (For all I know it doesn't.)
As a developer, a quick explanation of how this framework solves the
Fields that can appear in both the request and the response can't be searched reliably. For example, adding content-type to both request and response headers creates a single http.content-type expression, and it's unknown which one it actually searches. We should probably create http.request.content-type and http.response.content-type, or something similar.
The workaround for now is:
[custom-fields]
http.request.content-type=db:http.reque
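A sketch of what the proposed split might look like in the same [custom-fields] style (the db field names below are hypothetical placeholders, not taken from the source):

```ini
[custom-fields]
# Hypothetical split: one expression per direction instead of a
# single ambiguous http.content-type (db names are placeholders)
http.request.content-type=db:<request-header-db-field>
http.response.content-type=db:<response-header-db-field>
```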
Hazelcast currently ships with java.util.logging as the default. Besides badly formatted default output, it also makes synchronized calls for each log message, which incurs some performance cost and might lead to unexpected behaviour (e.g. hiding data races).
Spark 2.3 officially supports running on Kubernetes, while our "Run on Kubernetes" guide is still based on a special version of Spark 2.2, which is out of date. We need to:
- update that document to Spark 2.3
- release the corresponding Docker images.
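For reference, a submission against Spark 2.3's native Kubernetes scheduler looks roughly like this (the API server URL, container image, and jar path are placeholders to adapt):

```shell
# Sketch of a Spark 2.3 native-Kubernetes submission; the master URL,
# container image, and jar path below are placeholders.
bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```

The main change from the Spark 2.2 fork is the k8s:// master URL and the spark.kubernetes.container.image property.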
Today, the Hadoop integration tools for Vespa support Hadoop and Pig for feeding and querying Vespa. The Pig feeder is a thin wrapper around the Vespa HTTP client.
We should support feeding directly from Spark as well, to avoid Spark pipelines having to write
- Reference documentation
1) Requirements for custom reports:
Simplify analysts' work and free up front-end development capacity: "Type SQL, Get Chart".
CBoard is currently positioned, like Tableau, as a professional reporting engine:
drag-and-drop interactive analysis.
- When evaluating open-source options, we looked at many of the community's
I have noticed a small error in the documentation around S3 configurations:
https://docs.delta.io/latest/delta-storage.html#amazon-s3
On the read side, it should be load, not save:
spark.read.format("delta").load("s3a://<your-s3-bucket>/<path>/<to>/<delta-table>")
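For contrast, the corresponding write call is the one that uses save (same placeholder path; this is the standard DataFrameWriter pattern, sketched here rather than quoted from the docs page):

```python
# Write counterpart (uses save, unlike the read above which uses load);
# df is an existing DataFrame, placeholders mirror the line above.
df.write.format("delta").save("s3a://<your-s3-bucket>/<path>/<to>/<delta-table>")
```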
Also, I have successfully tested Delta 0.5.0 with on-premise S3 (https://min.io).
There were some quirks around the


"Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, but also deliver this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easi