Commit b57ffb9
Parent: 7c98e7b

docs: Update README

Among other things, it gives a higher-level overview of the project.

Signed-off-by: Lalith Suresh <lalith@feldera.com>

3 files changed: 130 additions & 134 deletions
CONTRIBUTING.md

Lines changed: 44 additions & 5 deletions
@@ -87,7 +87,7 @@ notification when you git push.
 
 ### Merging a pull request
 
-Since we run benchmarks as part of the CI, it'sa good practice to preserve the commit IDs of the feature branch
+Since we run benchmarks as part of the CI, it's a good practice to preserve the commit IDs of the feature branch
 we've worked on (and benchmarked). Unfortunately, [the github UI does not have support for this](https://github.com/community/community/discussions/4618)
 (it only allows rebase, squash and merge commits to close PRs).
 Therefore, it's recommended to merge PRs using the following git CLI invocation:
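The actual invocation is elided at this hunk boundary (the next hunk's context line shows `git push upstream main`), so the following is only a self-contained sketch of why a fast-forward merge is preferred here: unlike squash or rebase merges, it keeps the feature branch's benchmarked commit IDs intact. All repository and branch names are illustrative.

```shell
# Throwaway-repo demo: a fast-forward merge preserves the feature branch's
# commit IDs, so CI benchmark results stay attached to the same commits.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q -b main
git -c user.email=dev@example.com -c user.name=dev commit -q --allow-empty -m "init"
git checkout -qb feature
git -c user.email=dev@example.com -c user.name=dev commit -q --allow-empty -m "benchmarked change"
feature_id=$(git rev-parse HEAD)
git checkout -q main
git merge --ff-only -q feature      # fast-forward: no new merge commit is created
echo "HEAD equals benchmarked feature commit: $([ "$(git rev-parse HEAD)" = "$feature_id" ] && echo yes)"
```

A squash or rebase merge would rewrite the commits, producing new IDs and orphaning the benchmark history.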
@@ -101,7 +101,7 @@ git push upstream main
 
 ### Code Style
 
-Execute the following command to make `git commit` check the code for formatting issues before commit. It is not yet applied to the sql compiler.
+Execute the following command to make `git push` check the code for formatting issues.
 
 ```shell
 GITDIR=$(git rev-parse --git-dir)
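The hunk above truncates the hook setup after the `GITDIR` line, so the command the project actually installs is not shown here. The snippet below is a hypothetical sketch of the general pattern (a pre-push hook running a `cargo fmt` check is assumed, not confirmed by this diff):

```shell
# Hypothetical sketch: install a pre-push hook that runs a rustfmt check.
# The project's real hook contents are truncated in the hunk above.
set -e
cd "$(mktemp -d)" && git init -q     # throwaway repo for the demo
GITDIR=$(git rev-parse --git-dir)
cat > "$GITDIR/hooks/pre-push" <<'EOF'
#!/bin/sh
# Reject the push if Rust sources are not rustfmt-clean (illustrative check).
exec cargo fmt --all -- --check
EOF
chmod +x "$GITDIR/hooks/pre-push"
ls "$GITDIR/hooks/pre-push"
```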
@@ -122,6 +122,44 @@ When opening a new issue, try to roughly follow the commit message format conven
 
 # For developers
 
+## Building DBSP from sources
+
+DBSP is implemented in Rust and uses Rust's `cargo` build system. The SQL
+to DBSP compiler is implemented in Java and uses `maven` as its build system.
+
+You can build the rust sources by running the following at the top level of this tree.
+
+```
+cargo build
+```
+
+To build the SQL to DBSP compiler, run the following from `sql-to-dbsp-compiler/SQL-compiler`:
+
+```
+mvn package
+```
+
+If you want to develop DBSP without installing the required toolchains
+locally, you can use Github Codespaces; from
+https://github.com/feldera/dbsp, click on the green `<> Code` button,
+then select Codespaces and click on "Create codespace on main".
+
+## Learning the DBSP Rust code
+
+To learn how the DBSP core works, we recommend starting with the tutorial.
+
+From the project root:
+
+```
+cargo doc --open
+```
+
+Then search for `dbsp::tutorial`.
+
+Another good place to start is the `circuit::circuit_builder` module documentation,
+or the examples folder. For more sophisticated examples, try looking
+at the `nexmark` benchmark in the `benches` directory.
+
 ## Running Benchmarks against DBSP
 
 The repository has a number of benchmarks available in the `benches` directory that provide a comparison of DBSP's performance against a known set of tests.
@@ -156,9 +194,10 @@ An extensive blog post about the implementation of Nexmark in DBSP:
 
 ## Updating the pipeline manager database schema
 
-Here are some guidelines when contributing code that affects the Pipeline Manager's DB schema.
+The pipeline manager serves as the API server for Feldera. It persists API state in a Postgres DB instance.
+Here are some guidelines when contributing code that affects this database's schema.
 
-* We use SQL migrations to apply the schema to a live database to faciliate upgrades. We use [refinery](https://github.com/rust-db/refinery) to manage migrations.
+* We use SQL migrations to apply the schema to a live database to facilitate upgrades. We use [refinery](https://github.com/rust-db/refinery) to manage migrations.
 * The migration files can be found in `crates/pipeline_manager/migrations`
 * Do not modify an existing migration file. If you want to evolve the schema, add a new SQL or rust file to the migrations folder following [refinery's versioning and naming scheme](https://docs.rs/refinery/latest/refinery/#usage). The migration script should update an existing schema as opposed to assuming a clean slate. For example, use `ALTER TABLE` to add a new column to an existing table and fill that column for existing rows with the appropriate defaults.
-* If you add a new migration script `V{i}`, add tests for migrations from `V{i-1} to V{i}`. For example, add tests that invoke the pipeline manager APIs before and after the migration.
+* If you add a new migration script `V{i}`, add tests for migrations from `V{i-1}` to `V{i}`. For example, add tests that invoke the pipeline manager APIs before and after the migration.
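The guidelines above can be sketched as a hypothetical new migration following refinery's `V{i}__name.sql` convention. The version number, table, and column here are illustrative only, not the project's actual schema:

```shell
# Hypothetical refinery migration (names illustrative, not the real schema).
set -e
cd "$(mktemp -d)"                    # sandbox; in the real tree, work at the repo root
mkdir -p crates/pipeline_manager/migrations
cat > crates/pipeline_manager/migrations/V3__add_created_at.sql <<'EOF'
-- Evolve the existing schema in place; never assume a clean slate.
ALTER TABLE pipeline ADD COLUMN created_at TIMESTAMP;
UPDATE pipeline SET created_at = NOW();
EOF
ls crates/pipeline_manager/migrations
```

Note the migration uses `ALTER TABLE` plus a backfilling `UPDATE` rather than `CREATE TABLE`, matching the "update an existing schema" rule above.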

README.md

Lines changed: 85 additions & 129 deletions
@@ -1,55 +1,109 @@
-# Database Stream Processor
-
-Database Stream Processor (DBSP) is a framework for computing over data streams
-that aims to be more expressive and performant than existing streaming engines.
+# The Feldera Continuous Analytics Platform
 
 [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
 [![CI workflow](https://github.com/feldera/dbsp/actions/workflows/ci.yml/badge.svg)](https://github.com/feldera/dbsp/actions)
 [![codecov](https://codecov.io/gh/feldera/dbsp/branch/main/graph/badge.svg?token=0wZcmD11gt)](https://codecov.io/gh/feldera/dbsp)
 [![nightly](https://github.com/feldera/dbsp/actions/workflows/containers.yml/badge.svg)](https://github.com/feldera/dbsp/actions/workflows/containers.yml)
 
-## DBSP Mission Statement
-
-Computing over streaming data is hard. Streaming computations operate over
-changing inputs and must continuously update their outputs as new inputs arrive.
-They must do so in real time, processing new data and producing outputs with a
-bounded delay.
+The [Feldera Continuous Analytics Platform](https://www.feldera.com/), or Feldera in short, is a
+fast computational engine for *continuous analytics* over data in-motion. It
+allows users to build data pipelines as SQL programs that are continuously
+evaluated as new data arrives from various sources. What makes Feldera
+[unique](#theory) is its ability to *evaluate arbitrary SQL programs
+incrementally*, making it more expressive and performant than existing
+alternatives like streaming engines.
+
+With Feldera, software engineers and data scientists building data pipelines
+are not exposed to the complexities of querying changing data, an otherwise
+notoriously hard problem. Instead, they can express their
+computations as declarative queries and have Feldera evaluate
+these queries incrementally, correctly and efficiently.
 
-We believe that software engineers and data scientists who build streaming data
-pipelines should not be exposed to this complexity. They should be able to
-express their computations as declarative queries and use a streaming engine to
-evaluate these queries correctly and efficiently. DBSP aims to be such an
-engine. To this end we set the following high-level objectives:
+To this end we set the following high-level objectives:
 
-1. **Full SQL support and more.** While SQL is just the first of potentially
-many DBSP frontends, it offers a reference point to characterize the
-expressiveness of the engine. Our goal is to support the complete SQL syntax
-and semantics, including joins and aggregates, correlated subqueries, window
-functions, complex data types, time series operators, UDFs, etc. Beyond
-standard SQL, DBSP supports recursive queries, which arise for instance in graph
-analytics problems.
+1. **Full SQL support and more.** Our goal is to support the complete SQL
+syntax and semantics, including joins and aggregates, correlated subqueries,
+window functions, complex data types, time series operators, UDFs, and
+recursive queries.
 
-1. **Scalability in multiple dimensions.** The engine scales with the number and
-complexity of queries, streaming data rate and the amount of state the system
-maintains in order to process the queries.
+1. **Scalability in multiple dimensions.** The engine scales with the number
+and complexity of queries, input data rate and the amount of state the
+system maintains in order to process the queries.
 
 1. **Performance out of the box.** The user should be able to focus on the
-business logic of their application, leaving it to the system to evaluate this
-logic efficiently.
+business logic of their application, leaving it to the system to evaluate
+this logic efficiently.
+
+## Architecture
+
+With Feldera, users create data pipelines out of SQL programs and data
+connectors. An SQL program comprises tables and views. Connectors feed data to
+input tables in a program or receive outputs computed by views. Example
+connectors we currently support are Kafka, Redpanda and an HTTP API to push/pull
+directly to and from tables/views. We are working on more connectors such as
+ones for database CDC streams. Let us know of any connectors you'd like us to
+cover.
+
+Feldera fundamentally operates on changes to data, i.e., inserts and deletes to
+tables. This model covers all kinds of data in-motion use cases, like
+insert-only streams of event, log, HTTP and timeseries data, as well as changes
+to traditional databases extracted via CDC streams.
+
+The following diagram shows Feldera's architecture.
+
+![Feldera Architecture](architecture.svg)
+
+## What's in this repository?
+
+This repository comprises all the building blocks to run continuous analytics pipelines using Feldera.
+
+* [web UI](web-ui): a web interface for writing SQL, setting up connectors, and managing pipelines.
+* [pipeline-manager](crates/pipeline_manager): serves the web UI and is the REST API server for building and managing data pipelines.
+* [dbsp](crates/dbsp): the core [engine](#theory) that allows us to evaluate arbitrary queries incrementally.
+* [SQL compiler](sql-to-dbsp-compiler): translates SQL programs into DBSP programs.
+* [connectors](crates/adapters/): to stream data in and out of Feldera pipelines.
+
+## Quick start
+
+First, make sure you have [Docker Compose](https://docs.docker.com/compose/) installed.
+
+Next, run the following command to download a Docker Compose file, and use it to bring up
+a DBSP deployment suitable for demos, development and testing:
+
+```text
+curl https://raw.githubusercontent.com/feldera/dbsp/main/deploy/docker-compose.yml | docker compose -f - --profile demo up
+```
+
+It can take some time for the container images to be downloaded. About ten seconds after that, the DBSP
+web interface will become available. Visit [http://localhost:8085](http://localhost:8085) in your browser
+to bring it up. We suggest going through our [demos](https://docs.feldera.io/docs/demos) next.
+
+Our [Getting Started](https://docs.feldera.io/docs/intro) guide has more detailed instructions on running the demo.
+
+## Documentation
+
+To learn more about Feldera, we recommend going through the [documentation](https://docs.feldera.io/docs/intro).
+
+* [Getting started](https://docs.feldera.io/docs/intro)
+* [UI tour](https://docs.feldera.io/docs/tour/)
+* [Demos](https://docs.feldera.io/docs/category/demos)
+* [SQL reference](https://docs.feldera.io/docs/sql/intro)
+* [API reference](https://docs.feldera.io/docs/api/rest/)
 
 ## Theory
 
-The above objectives can only be achieved by building on a solid mathematical
-foundation. The formal model that underpins our system, also called DBSP, is
+Feldera achieves its objectives by building on a solid mathematical
+foundation. The formal model that underpins our system, called DBSP, is
 described in the accompanying paper:
 
 - [Budiu, Chajed, McSherry, Ryzhyk, Tannen. DBSP: Automatic
 Incremental View Maintenance for Rich Query Languages, Conference on
 Very Large Databases, August 2023, Vancouver,
 Canada](https://github.com/feldera/dbsp/blob/main/docs/static/vldb23.pdf)
 
-- Here is the [video of a DBSP
-presentation](https://www.youtube.com/watch?v=iT4k5DCnvPU) at the 2023
+- Here is [a presentation about DBSP](https://www.youtube.com/watch?v=iT4k5DCnvPU) at the 2023
 Apache Calcite Meetup.
 
 The model provides two things:
@@ -59,105 +113,7 @@ queries built out of these operators, and precisely specifies how these queries
 must transform input streams to output streams.
 
 1. **Algorithm.** DBSP also gives an algorithm that takes an arbitrary query and
-generates a dataflow program that implements this query correctly (in accordance
+generates an incremental dataflow program that implements this query correctly (in accordance
 with its formal semantics) and efficiently. Efficiency here means, in a
 nutshell, that the cost of processing a set of input events is proportional to
 the size of the input rather than the entire state of the database.
-
-## DBSP Concepts
-
-DBSP unifies two kinds of streaming data: time series data and change data.
-
-- **Time series data** can be thought of as an infinitely growing log indexed by
-time.
-
-- **Change data** represents updates (insertions, deletions, modifications) to
-some state modeled as a table of records.
-
-In DBSP, a time series is just a table where records are only ever added and
-never removed or modified. As a result, this table can grow unboundedly; hence
-most queries work with subsets of the table within a bounded time window. DBSP
-does not need to wait for all data within a window to become available before
-evaluating a query (although the user may choose to do so): like all queries,
-time window queries are updated on the fly as new inputs become available. This
-means that DBSP can work with arbitrarily large windows as long as they fit
-within available storage.
-
-DBSP queries are composed of the following classes of operators that apply to
-both time series and change data:
-
-1. **Per-record operators** that parse, validate, filter, transform data streams
-one record at a time.
-
-1. The complete set of **relational operators**: select, project, join,
-aggregate, etc.
-
-1. **Recursion**: Recursive queries express iterative computations, e.g.,
-partitioning a graph into strongly connected components. Like all DBSP queries,
-recursive queries update their outputs incrementally as new data arrives.
-
-In addition, DBSP supports **windowing operators** that group time series data
-into time windows, including various forms of tumbling and sliding windows,
-windows driven by watermarks, etc.
-
-## Architecture
-
-The following diagram shows the architecture of the DBSP platform. Blocks
-with solid borders indicate existing components. Blocks with dashed borders
-are on our TODO list.
-
-```text
-[ASCII architecture diagram; alignment lost in extraction. It showed frontends (a SQL frontend and language bindings for Python etc. over an optimizer), I/O adapters (Kinesis, Kafka, PostgreSQL, ...), and a distributed scale-out runtime containing the DBSP core engine and persistent indexes.]
-```
-
-The DBSP core engine is written in Rust and provides a Rust API for building
-data-parallel dataflow programs by instantiating and connecting streaming
-operators. Developers can use this API directly to implement complex
-streaming queries. We are also developing a
-[compiler from SQL to DBSP](sql-to-dbsp-compiler) that
-enables engineers and data scientists to use the engine via a familiar
-query language. In the future, we will add DBSP bindings for languages
-like Python and Scala.
-
-At runtime, DBSP can consume inputs from and send outputs to
-event streams, e.g., Kafka, databases, e.g., Postgres, and data warehouses,
-e.g., Snowflake.
-
-The distributed runtime will extend DBSP's data-parallel execution model to
-multiple nodes for high availability and throughput.
-
-## Getting started
-
-DBSP is implemented in Rust and uses Rust's `cargo` build system. You
-can build everything with `cargo build` at the top level of this tree.
-If you want to do development without installing the Rust toolchain
-locally, you can use Github Codespaces: from
-https://github.com/feldera/dbsp, click on the green `<> Code` button,
-then select Codespaces and click on "Create codespace on main".
-
-To learn about using DBSP as a Rust programmer, start by reading the
-[tutorial]. Another good place to start is the
-[`circuit_builder`](`circuit::circuit_builder`) module documentation,
-or the examples folder. For more sophisticated examples, try looking
-at the `nexmark` benchmark in the `benches` directory.
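The efficiency claim in the Theory hunk above (cost proportional to the input delta, not the database state) can be illustrated with a toy sketch. This is not DBSP's actual API, just the idea of incremental view maintenance applied to a `COUNT(*)` aggregate:

```shell
# Toy illustration (not DBSP's API): an incrementally maintained COUNT(*) view.
# Each delta is absorbed by updating the stored aggregate, so the work done is
# proportional to the size of the change, not to the size of the table.
count=0
apply_delta() {          # $1 = signed number of rows inserted (+) or deleted (-)
  count=$((count + $1))
}
apply_delta 3            # three rows inserted
apply_delta 2            # two more rows inserted
apply_delta -1           # one row deleted
echo "COUNT(*) view = $count"    # prints "COUNT(*) view = 4"
```

A non-incremental engine would instead rescan every row on each change; here the same answer is maintained with constant work per delta.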

architecture.svg

Lines changed: 1 addition & 0 deletions