Astyanax docs by newkek · Pull Request #732 · apache/cassandra-java-driver

newkek · 2016-08-26T17:12:26Z

No description provided.

adutra · 2016-09-09T13:58:16Z

+## How Configuring the Java driver works
+
+The two basic components in the Java driver are the `Cluster` and the `Session`.
+The `Cluster` is the object to create first and on which to will apply all 


This needs to be rephrased: "on which to will apply"

newkek · 2016-09-13T10:05:12Z

Rebased on top of current 3.x and squashed.

olim7t

My main comment is about the Thrift to CQL schema.

The others are trivial, I can address them if you want.

olim7t · 2016-09-14T18:25:23Z

+*CQL columns* (the key column in Figure 1). The *“Column”* part of the Column-value
+component in a *Thrift Row*, becomes the *Clustering ColumnKey* in *CQL*, and can
+also be composed of multiple columns (in the figure, column1 is the only column 
+composing the *Clustering ColumnKey*).


In the thrift example, are "col1, col2, col3" the cell names, and "a, b, c" the cell values? If so, I don't agree with the second part of the schema, cell names should translate to different values of the clustering column.
See the "Dynamic Column family" example in this blog post: the thrift cell names (timestamps) become the values of the "time" CQL column. The text says

"map [...] the second component [of the CQL primary key, i.e. the clustering column] (time) to the internal cell name. And the last CQL3 column (url) will be mapped to the cell value"

It gets more complicated with multiple clustering columns and multiple non-primary key rows (see the "Composite" and "Non-compact tables" examples respectively), but the value of the CQL clustering column always goes into the name of the internal cell.

I think the example should also illustrate a case where a single Thrift wide row translates to multiple CQL rows, it seems like that's a common hurdle for people coming from Thrift.

That was tripping me up as well, initially I was conflating how things are stored on disk with the chart here. I think it would be more clear if the chart represented the mental models for Thrift and CQL and Thrift has 1 row per Partition with the cell names like:

'a:col2', 'a:col3', 'b:col2', 'b:col3'

Where the CQL chart would represent multiple rows within a partition with each Row being 1 line each within a partition, where thrift is 1 line of rows overall within the partition.

Initially the schema was like this: https://gist.github.com/newkek/4a0cbe91577886383aaa9ef89701cf03
I changed it to this version after a discussion with @adutra, I think it kind of represents more "the concept" even though it's not the exact memory representation. I don't mind changing it back as I believe it is rather subjective. Also noticing that I updated the schema without updating the explanation so actually currently both don't match.

I like the current version compared to the original diagram. The two suggestions I would have:

The column/cell names in the thrift diagram should be clusteringkeyvalue:columnname (i.e. a:col2, a:col3 and so on). The clustering key shouldn't be its own column.

I think it would be good to show what it would look like though if you had multiple rows per partition. The current diagram only has 1 cql row/unique cluster per partition, it would be good to demonstrate what it would look like if there were at least 2. (i.e. have two rows for key=1 where key=1,col1=a and key=1,col1=b). Hopefully that makes sense.

What is really confusing to me is that with the pre-3.0 storage engine, whether or not you use CQL or Thrift, the way data is stored is a 1:1 with how I visualize Thrift. Where with the 3.0 storage engine, it is nearly 1:1 with how I visualize CQL. That's why I think it is very easy for the reader to conflate the data model with how things are stored.

Maybe I'm interpreting that diagram wrong, but isn't there only 1 clustering value for the thrift diagram for each row (if Col : 1 is the clustering value)?

Hm, I see what you mean. Was it possible to have two "Col:1" for the same Row Key in Thrift? I don't see how I can do that, with Astyanax at least: I tried inserting two different values for the same Row Key and the same "Col:x", and one overwrites the other, since the "Col:" is part of the Clustering key.

Sorry, it's easy to conflate terms so I'm probably confusing things a bit.

Taking a step back to describe what I'm trying to say with an example data model. I was thinking of the table that you were sharing having a composite comparator that was CompositeType(Int32, UTF8Type) where Int32 is the value of Col1, and UTF8Type is the name of the column.

As an example, say you want to represent readings from a thermostat. The thermostat is identified by an integer id which will be our row key. A thermostat has 'readings' identified by a reading id, and each reading has a temp (for the current temperature), a mode (cooling, heating, off) and a setpoint (for target temperature).

Using cassandra-cli, I could create a schema like so:

    create keyspace ks;
    use ks;
    create column family readings
      with key_validation_class = Int32Type
      and comparator = 'CompositeType(Int32Type, UTF8Type)'
      and default_validation_class = UTF8Type;

I could get even fancier and provide column metadata, but leaving this as is to keep it simple.

I could then create multiple readings for an individual thermostat, i.e.:

    set readings[0]['0:temp'] = '72';
    set readings[0]['0:mode'] = 'heating';
    set readings[0]['0:setpoint'] = '73';
    set readings[0]['1:temp'] = '75';
    set readings[0]['1:mode'] = 'cooling';
    set readings[0]['1:setpoint'] = '70';
    set readings[1]['5:temp'] = '65';
    set readings[1]['5:mode'] = 'off';
    set readings[1]['5:setpoint'] = '70';

When reading back the data, I get:

    [default@ks] get readings[0];
    => (name=0:mode, value=heating, timestamp=1474483110889000)
    => (name=0:setpoint, value=73, timestamp=1474483119088000)
    => (name=0:temp, value=72, timestamp=1474483101433000)
    => (name=1:mode, value=cooling, timestamp=1474483137460000)
    => (name=1:setpoint, value=70, timestamp=1474483147380000)
    => (name=1:temp, value=75, timestamp=1474483127437000)

    [default@ks] get readings[1];
    => (name=5:mode, value=off, timestamp=1474483198322000)
    => (name=5:setpoint, value=70, timestamp=1474483205968000)
    => (name=5:temp, value=65, timestamp=1474483192274000)

In that example, a Thrift row for a given key holds multiple readings. In CQL, the row layout depends on the clustering, but I would preferably model this like:

    create table readings_cql (
      k int,
      reading_id int,
      mode text,
      setpoint float,
      temp float,
      primary key (k, reading_id)
    );

    -- insert data
    insert into readings_cql (k, reading_id, mode, setpoint, temp) values (0, 0, 'heating', 73, 72);
    insert into readings_cql (k, reading_id, mode, setpoint, temp) values (0, 1, 'cooling', 70, 75);
    insert into readings_cql (k, reading_id, mode, setpoint, temp) values (1, 5, 'off', 70, 65);

This would give me back:

    select * from readings_cql;

     k | reading_id | mode    | setpoint | temp
    ---+------------+---------+----------+------
     1 |          5 |     off |       70 |   65
     0 |          0 | heating |       73 |   72
     0 |          1 | cooling |       70 |   75

This looks more like your second (current) diagram.

Whereas the thrift table represented as a CQL table looks more like your first (https://gist.github.com/newkek/4a0cbe91577886383aaa9ef89701cf03):

    cqlsh> select * from ks.readings;

     key | column1 | column2  | value
    -----+---------+----------+---------
       1 |       5 |     mode |     off
       1 |       5 | setpoint |      70
       1 |       5 |     temp |      65
       0 |       0 |     mode | heating
       0 |       0 | setpoint |      73
       0 |       0 |     temp |      72
       0 |       1 |     mode | cooling
       0 |       1 | setpoint |      70
       0 |       1 |     temp |      75

So what the "One CQL Table" diagram looks like ultimately depends on the CQL schema. The stored data is similar; what CQL considers a 'Row' simply depends on the clustering.
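To make the "one Thrift wide row becomes several CQL rows" idea concrete, here is a minimal sketch in plain Java (no driver involved). It takes the cells written to `readings[0]` in the cassandra-cli example above and groups them by the first component of the cell name. The class and method names (`ThriftToCqlRows`, `toCqlRows`, `sampleCells`) are hypothetical and only illustrate the mental model; this is not how Cassandra or the driver implements the mapping.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: hypothetical names, not part of any driver API.
final class ThriftToCqlRows {

    /**
     * Groups Thrift cells named "<reading_id>:<column>" by reading_id.
     * Each resulting entry plays the role of one CQL row inside the partition.
     */
    public static Map<Integer, Map<String, String>> toCqlRows(Map<String, String> thriftCells) {
        Map<Integer, Map<String, String>> rows = new TreeMap<>();
        for (Map.Entry<String, String> cell : thriftCells.entrySet()) {
            // First composite component = clustering value, second = CQL column name.
            String[] parts = cell.getKey().split(":", 2);
            int readingId = Integer.parseInt(parts[0]);
            String column = parts[1];
            rows.computeIfAbsent(readingId, id -> new TreeMap<>()).put(column, cell.getValue());
        }
        return rows;
    }

    /** The six cells stored under row key 0 in the cassandra-cli example. */
    public static Map<String, String> sampleCells() {
        Map<String, String> cells = new LinkedHashMap<>();
        cells.put("0:mode", "heating");
        cells.put("0:setpoint", "73");
        cells.put("0:temp", "72");
        cells.put("1:mode", "cooling");
        cells.put("1:setpoint", "70");
        cells.put("1:temp", "75");
        return cells;
    }

    public static void main(String[] args) {
        // One Thrift row (key 0) yields one line per reading_id.
        toCqlRows(sampleCells()).forEach((readingId, columns) ->
                System.out.println("k=0, reading_id=" + readingId + " -> " + columns));
    }
}
```

The single Thrift row with six cells ends up as two entries, one per `reading_id`, which matches the two `k = 0` rows returned by `select * from readings_cql`.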

Comparing the data for readings (thrift) and readings_cql, the layout on disk is virtually the same (other than the values because my value type is different, and the ghost cell for CQL):

Thrift table:

    [default@ks] get readings[0];
    => (name=0:mode, value=heating, timestamp=1474483110889000)
    => (name=0:setpoint, value=73, timestamp=1474483119088000)
    => (name=0:temp, value=72, timestamp=1474483101433000)
    => (name=1:mode, value=cooling, timestamp=1474483137460000)
    => (name=1:setpoint, value=70, timestamp=1474483147380000)
    => (name=1:temp, value=75, timestamp=1474483127437000)

CQL table:

    [default@ks] get readings_cql[0];
    => (name=0:, value=, timestamp=1474483980386000)
    => (name=0:mode, value=68656174696e67, timestamp=1474483980386000)
    => (name=0:setpoint, value=42920000, timestamp=1474483980386000)
    => (name=0:temp, value=42900000, timestamp=1474483980386000)
    => (name=1:, value=, timestamp=1474483996207000)
    => (name=1:mode, value=636f6f6c696e67, timestamp=1474483996207000)
    => (name=1:setpoint, value=428c0000, timestamp=1474483996207000)
    => (name=1:temp, value=42960000, timestamp=1474483996207000)

With all that defined, what was confusing to me is the Thrift Diagram (One Thrift COLUMNFAMILY) which is the same for both diagrams.

In your diagram you have three colors:

red: Thrift: Row Key, CQL: Partition Key

blue: Thrift: Column Comparator, CQL: Clustering Column Value

orange: Thrift: Column Value, CQL: Non-Clustering Column Value

If I'm interpreting that correctly and 'Col1' is a clustering column, I wouldn't expect it to be represented as its own column for Thrift. Instead it would be part of the column name in the Thrift diagram, whereas it would not be in the CQL diagram.

I know I may be getting too semantic and thinking too much about the storage model, but I think it is hard to divorce thrift from the storage model, as to me they are the same. One of the older thrift -> CQL guides uses the storage model to describe CQL and how different column family schemas would be modeled with CQL:

CQL3 (the Cassandra Query Language) provides a new API to work with Cassandra. Where the legacy thrift API exposes the internal storage structure of Cassandra pretty much directly, CQL3 provides a thin abstraction layer over this internal structure.

Hopefully that makes sense, I had to revisit thrift to level set my thoughts.

Here's an example for inserting / retrieving data with that schema: https://gist.github.com/tolbertam/b12135fb4223896f14aa7c80426ae9a9

I'll admit that Composites are complicated, maybe a better example is a dynamic column family like the clicks example in this guide which maps to a cql table with 1 partition key, 1 clustering column, and 1 non-clustering column.

Thanks for the investigation. Yeah, you're right, Composite Column types are probably going to get more complicated. However, my first schema is not wrong, it's just basic: it exposes the storage model in thrift, but for a more basic use-case... As I said, I would rather let more experienced people explain the more complicated aspects of the data model change for CQL by linking to blog posts that were already written, rather than trying to explain it all myself again in these docs...
Btw, thanks for the code sample. I had tried the composite columns in Astyanax a little while ago but I couldn't get the CompositeColumn Serializer to work, so that's cool.

olim7t · 2016-09-14T18:29:15Z

+
+The two basic components in the Java driver are the `Cluster` and the `Session`.
+The `Cluster` is the object to create first, and on to which apply all
+global configuration options. Connecting to the `Cluster` creates a


suggestion: "on to which all global configuration options apply"

olim7t · 2016-09-14T18:35:46Z

+`Session`. Queries are executed through the `Session`.
+
+The `Cluster` object then is to be viewed as the equivalent of the `AstyanaxContext`
+object. ’Starting’ an `AstyanaxContext` object typically returns a `Keyspace`


[typo maniac warning] Surrounding "Starting" with two closing quotes looks ugly. If you use plain quotes in the source ('Starting'), documentor will do the right thing and generate an opening and closing quote. This works with double quotes as well.

olim7t · 2016-09-14T18:41:44Z

+
+### Connections pools internals
+Everything concerning the internal pools of connections to the *Cassandra nodes*
+will be gathered in the Java driver in the `PoolingOptions` :


Link to ../../../manual/pooling

olim7t · 2016-09-14T18:42:11Z

+
+Note that the *Java driver* allows multiple simultaneous requests on one single
+connection, as it is based upon the *Native protocol*, an asynchronous binary
+protocol that can handle up to 32768 simultaneous requests.


Link "native protocol" to ../../../manual/native_protocol

To my comment about not changing options, I think it would be good to emphasize that configuring pooling with the java driver is less important because it allows multiple requests on a connection. There shouldn't be a compelling reason to increase the number of connections in the general case except for very high throughputs (can link to ../../../manual/pooling/#tuning-protocol-v3-for-very-high-throughputs).
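The point about pooling mattering less can be made concrete with a little arithmetic. The sketch below is only an illustration in plain Java: `PoolCapacity` and `maxInFlightPerHost` are hypothetical names, not driver API. The sample numbers are taken from the `PoolingOptions` snippet in this PR (2 connections, 1024 requests per connection) and from the 32768 limit quoted in the diff above.

```java
// Illustration only: hypothetical helper, not part of the Java driver.
final class PoolCapacity {

    /** Per-connection limit of simultaneous requests, as quoted in the doc under review. */
    public static final int MAX_REQUESTS_PER_CONNECTION = 32768;

    /**
     * Upper bound on requests that can be in flight to one host:
     * number of connections times the per-connection request cap.
     */
    public static long maxInFlightPerHost(int connections, int maxRequestsPerConnection) {
        if (connections < 1) {
            throw new IllegalArgumentException("connections must be >= 1");
        }
        if (maxRequestsPerConnection < 1 || maxRequestsPerConnection > MAX_REQUESTS_PER_CONNECTION) {
            throw new IllegalArgumentException(
                    "maxRequestsPerConnection must be in [1, " + MAX_REQUESTS_PER_CONNECTION + "]");
        }
        return (long) connections * maxRequestsPerConnection;
    }

    public static void main(String[] args) {
        // Values from the PoolingOptions example in this PR.
        System.out.println(maxInFlightPerHost(2, 1024));
        // A single connection at the protocol limit.
        System.out.println(maxInFlightPerHost(1, MAX_REQUESTS_PER_CONNECTION));
    }
}
```

Even a couple of connections allow thousands of concurrent requests per host, which is why adding connections should rarely be needed outside very high throughput cases.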

olim7t · 2016-09-14T18:48:44Z

+to benefit from the *TokenAware* routing (the *Row key* in the *Java driver* is 
+referenced as *Routing Key*), unlike the *Astyanax* driver. 
+Some differences occur related to the different kinds of `Statements` the *Java
+driver* provides. Please see [this link](../../../manual/statements) for specific information.


It would be more appropriate to link to ../../../manual/load_balancing/#token-aware-policy, it details how to set the routing key for each statement type.

olim7t · 2016-09-14T18:51:27Z

+## Authentication
+
+Authentication settings are managed by the `AuthProvider` class in the *Java driver*.
+It can be highly customizable, but also comes with default simple implementations :


[typo maniac warning] remove space before colon

olim7t · 2016-09-14T18:53:58Z

+
+A lot more options are available in the different `XxxxOption`s classes, policies are
+also highly customizable since the base drivers implementations can easily be 
+extended and implement users specific actions.


suggestion: user-specific actions

olim7t · 2016-09-14T18:54:13Z

+provide enough insight.
+
+A lot more options are available in the different `XxxxOption`s classes, policies are
+also highly customizable since the base drivers implementations can easily be 


only one driver?

olim7t · 2016-09-14T18:57:08Z

+for (Row row : rs) {
+   String value = row.getString("value");
+}
+```


👍 love those examples

tolbertam · 2016-09-14T18:32:02Z

+
+* [Changes at the language level](language_level_changes/)
+* [Migrating Astyanax configurations to DataStax Java driver configurations](configuration/)
+* [Querying and retrieving results comparisons.](queries_and_result/)


This should be queries_and_results I think

tolbertam · 2016-09-14T18:37:53Z

+composing the *Clustering ColumnKey*).
+
+Here is the basic architectural concept of *CQL*, a detailed explanation and *CQL*
+examples can be found in this article : [http://www.planetcassandra.org/making-the-change-from-thrift-to-cql/].


This appears somewhat weird in the generated docs:

tolbertam · 2016-09-14T19:16:54Z

+*CQL columns* (the key column in Figure 1). The *“Column”* part of the Column-value
+component in a *Thrift Row*, becomes the *Clustering ColumnKey* in *CQL*, and can
+also be composed of multiple columns (in the figure, column1 is the only column 
+composing the *Clustering ColumnKey*).


That was tripping me up as well, initially I was conflating how things are stored on disk with the chart here. I think it would be more clear if the chart represented the mental models for Thrift and CQL and Thrift has 1 row per Partition with the cell names like:

'a:col2', 'a:col3', 'b:col2', 'b:col3'

Where the CQL chart would represent multiple rows within a partition with each Row being 1 line each within a partition, where thrift is 1 line of rows overall within the partition.

tolbertam

Looks great! Had a few minor suggestions.

tolbertam · 2016-09-14T19:42:01Z

+PoolingOptions poolingOptions =
+       new PoolingOptions()
+           .setMaxRequestsPerConnection(HostDistance.LOCAL, 1024)
+           .setCoreConnectionsPerHost(HostDistance.LOCAL, 2)


I think it would be better to use setConnectionsPerHost(HostDistance.LOCAL, 2, 3), since I think we should encourage using that method.

tolbertam · 2016-09-14T19:44:44Z

+
+Note that the *Java driver* allows multiple simultaneous requests on one single
+connection, as it is based upon the *Native protocol*, an asynchronous binary
+protocol that can handle up to 32768 simultaneous requests.


To my comment about not changing options, I think it would be good to emphasize that configuring pooling with the java driver is less important because it allows multiple requests on a connection. There shouldn't be a compelling reason to increase the number of connections in the general case except for very high throughputs (can link to ../../../manual/pooling/#tuning-protocol-v3-for-very-high-throughputs).

tolbertam · 2016-09-14T19:45:14Z

+*Java Driver :*
+
+```java
+SocketOptions so =


would be good to add note that timeouts should not be changed unless you are changing the timeouts in cassandra.yaml.

Hmm, not 100% agreeing with that. As per previous discussions, we have mentioned that some clients may want to "give up" on a request earlier than others, and in this case it does not impact the C* yaml... I think it's better to keep it vague here, and I've clearly mentioned at the beginning that "if you change these options it means you know what you're doing". I might repeat it here.

In the past we have advised users to not decrease SocketOptions#readTimeoutMillis and to depend on the cassandra side timeouts instead and in particular not to set readTimeoutMillis to a value less than the cassandra timeouts. If the client timeout is less than the cassandra timeout, you don't give RetryPolicy an opportunity to kick in.
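The rule of thumb in the last comment can be written down as a small check. This is a hedged sketch in plain Java: `TimeoutCheck` and `leavesRoomForRetryPolicy` are hypothetical names, and the millisecond values in `main` are illustrative assumptions rather than recommended settings. It only encodes the advice that the client read timeout should not be lower than the Cassandra-side timeout.

```java
// Illustration only: hypothetical helper, not part of the Java driver.
final class TimeoutCheck {

    /**
     * Returns true when the client-side read timeout is at least as long as the
     * server-side request timeout, so the server can report its own timeout
     * (giving the RetryPolicy a chance to react) before the client gives up.
     */
    public static boolean leavesRoomForRetryPolicy(int clientReadTimeoutMillis, int serverTimeoutMillis) {
        if (clientReadTimeoutMillis <= 0 || serverTimeoutMillis <= 0) {
            throw new IllegalArgumentException("timeouts must be positive");
        }
        return clientReadTimeoutMillis >= serverTimeoutMillis;
    }

    public static void main(String[] args) {
        // Assumed example values, in milliseconds.
        System.out.println(leavesRoomForRetryPolicy(12000, 5000)); // client waits longer than the server
        System.out.println(leavesRoomForRetryPolicy(2000, 5000));  // client gives up first
    }
}
```

A client that deliberately wants to give up early, as discussed above, would fail this check on purpose; the sketch describes the default advice, not a hard requirement.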

tolbertam · 2016-09-14T19:47:38Z

+Configuring a `Cluster` works with the *Builder* pattern. The `Builder` takes all
+the configurations into account before building the `Cluster`.
+
+Following are some examples of the most important configurations that were 


It would be good to add a general comment that you should depend on the default configuration unless you have a good reason. It'd be nice to have a sentence after each section saying "You shouldn't need to configure this unless....". I think users often change a lot of configuration options because they see it in examples, but in reality they shouldn't need to.

tolbertam · 2016-09-14T20:02:58Z

+this case, setting the CL on the `PreparedStatement`, causes the `BoundStatements` to 
+inherit the CL from the prepared statements they were prepared from. More
+informations about how `Statement`s work in the *Java driver* are detailed
+in the [“Queries and Result” section](../queries_and_results/).


Should be "Queries and Results"

tolbertam · 2016-09-14T20:03:48Z

+*Thrift* exposes *Keyspaces*, and these *Keyspaces* contain *Column Families*. A
+*ColumnFamily* contains *Rows* in which each *Row* has a list of an arbitrary number
+of column-values. With *CQL*, the data is **tabular**, *ColumnFamily* gets viewed
+as a *Table*, the **Table Rows** get a **fixed and finite number of named columns**.


In one case Table Rows is bolded for emphasis, and in another case it's underlined, is that intentional?

Yep, wanted to emphasize the Table Rows difference...

tolbertam · 2016-09-14T20:05:28Z

@@ -0,0 +1,106 @@
+# Queries and Results
+There are many ressources such as [this post][planetCCqlLink] or [this post][dsBlogCqlLink] to learn


ressources -> resources

tolbertam · 2016-09-14T20:15:02Z

+The *Java driver* executes CQL queries through the `Session`. 
+The queries can either be simple *CQL* Strings or represented in the form of 
+`Statement`s. The driver offers 4 kinds of statements, `SimpleStatement`, 
+`Prepared/BoundStatement`, `BuiltStatement`, `BatchStatement`. All necessary 


, and BatchStatement

tolbertam · 2016-09-14T20:15:14Z

+The queries can either be simple *CQL* Strings or represented in the form of 
+`Statement`s. The driver offers 4 kinds of statements, `SimpleStatement`, 
+`Prepared/BoundStatement`, `BuiltStatement`, `BatchStatement`. All necessary 
+information can be [found here](../../../manual/statements/) about the natures of the different


natures -> nature

tolbertam · 2016-09-14T20:15:52Z

+information can be [found here](../../../manual/statements/) about the natures of the different
+`Statement`s.
+
+As explained in [this documentation section](../../../manual/#running-queries),


suggest renaming 'this documentation section' to "the 'running queries' section"

newkek · 2016-09-20T11:29:25Z

Looks like I've addressed all comments except the schema. There seem to be 3 different suggestions there... Please note that I did not intend to go into much detail on the CQL-level changes; I wanted to stay "basic" and give a "general idea", since I have linked to other posts that explain those CQL changes much better and in more detail than I can, and it is not the intention of this doc to focus on CQL. I'd rather go back to the schema in https://gist.github.com/newkek/4a0cbe91577886383aaa9ef89701cf03 because it's closer to what CQLSH shows and what CQL queries look like.

olim7t · 2016-09-20T21:50:05Z

Yes I prefer your other schema too, because it illustrates the case where a single Thrift row translates to multiple CQL rows, which was the main concern I think.

olim7t · 2016-09-21T16:41:15Z

LGTM pending the schema. I saw a couple minor issues but I addressed them directly.

newkek · 2016-09-23T16:53:32Z

Ok so after private discussions with @tolbertam I pushed a revised version of the first schema in https://gist.github.com/newkek/4a0cbe91577886383aaa9ef89701cf03 that should be slightly clearer. Seems to me we're all done for that PR.

tolbertam · 2016-09-23T17:48:22Z

The changes look great! 👍

olim7t · 2016-09-26T20:46:51Z

Squashed and rebased on top of 3.0.x.

olim7t added this to the 3.0.4 milestone Aug 30, 2016

adutra reviewed Sep 9, 2016

adutra added the reviewed_AD label Sep 9, 2016

newkek force-pushed the astyanax-docs branch from 3b196fc to 087237d Compare September 13, 2016 10:04

newkek changed the base branch from 3.0 to 3.x September 13, 2016 10:04

olim7t suggested changes Sep 14, 2016

tolbertam reviewed Sep 14, 2016

tolbertam suggested changes Sep 14, 2016

olim7t approved these changes Sep 21, 2016

tolbertam approved these changes Sep 22, 2016

tolbertam added the reviewed_AT label Sep 22, 2016

olim7t mentioned this pull request Sep 26, 2016

Bad link #751

Closed

Astyanax upgrade guide.

f17c5c1

olim7t force-pushed the astyanax-docs branch from 0870983 to f17c5c1 Compare September 26, 2016 20:45

olim7t changed the base branch from 3.x to 3.0.x September 26, 2016 20:46

olim7t merged commit af819c5 into 3.0.x Sep 26, 2016

olim7t deleted the astyanax-docs branch September 26, 2016 20:50
