Astyanax docs #732
| ## How Configuring the Java driver works | ||
|
|
||
| The two basic components in the Java driver are the `Cluster` and the `Session`. | ||
| The `Cluster` is the object to create first and on which to will apply all |
This needs to be rephrased: "on which to will apply"
Force-pushed from 3b196fc to 087237d.

Rebased on top of current 3.x and squashed.
olim7t left a comment
My main comment is about the Thrift to CQL schema.
The others are trivial, I can address them if you want.
| *CQL columns* (the key column in Figure 1). The *“Column”* part of the Column-value | ||
| component in a *Thrift Row*, becomes the *Clustering ColumnKey* in *CQL*, and can | ||
| also be composed of multiple columns (in the figure, column1 is the only column | ||
| composing the *Clustering ColumnKey*). |
In the Thrift example, are "col1, col2, col3" the cell names, and "a, b, c" the cell values? If so, I don't agree with the second part of the schema: cell names should translate to different values of the clustering column.
See the "Dynamic Column family" example in this blog post: the Thrift cell names (timestamps) become the values of the "time" CQL column. The text says:
"map [...] the second component [of the CQL primary key, i.e. the clustering column] (time) to the internal cell name. And the last CQL3 column (url) will be mapped to the cell value"
It gets more complicated with multiple clustering columns and multiple non-primary key columns (see the "Composite" and "Non-compact tables" examples respectively), but the value of the CQL clustering column always goes into the name of the internal cell.
I think the example should also illustrate a case where a single Thrift wide row translates to multiple CQL rows; that seems to be a common hurdle for people coming from Thrift.
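To make the point concrete, here is a minimal Java sketch using plain collections only (no driver or Astyanax API). The class and method names are hypothetical, and the `userid` / `time` / `url` columns are assumed from the clicks example in the blog post: each Thrift cell name becomes the value of the clustering column, so one wide row yields one CQL row per cell.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: models one Thrift wide row in memory.
final class WideRowToCqlRows {

    /** One row of a CQL table with PRIMARY KEY (userid, time). */
    static final class CqlRow {
        final String userid; // partition key     <- Thrift row key
        final long time;     // clustering column <- Thrift cell name
        final String url;    // regular column    <- Thrift cell value

        CqlRow(String userid, long time, String url) {
            this.userid = userid;
            this.time = time;
            this.url = url;
        }
    }

    /** Turns the cells of a single Thrift row into one CQL row per cell. */
    static List<CqlRow> toCqlRows(String rowKey, Map<Long, String> cells) {
        // Sort by cell name, as the column comparator would.
        Map<Long, String> sorted = new TreeMap<>(cells);
        List<CqlRow> rows = new ArrayList<>();
        for (Map.Entry<Long, String> cell : sorted.entrySet()) {
            rows.add(new CqlRow(rowKey, cell.getKey(), cell.getValue()));
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<Long, String> cells = new TreeMap<>();
        cells.put(1000L, "http://example.com/a");
        cells.put(2000L, "http://example.com/b");
        for (CqlRow row : toCqlRows("user1", cells)) {
            System.out.println(row.userid + " | " + row.time + " | " + row.url);
        }
    }
}
```

The single wide row for `user1` comes back as two CQL rows sharing the same partition key.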
That was tripping me up as well; initially I was conflating how things are stored on disk with the chart here. I think it would be clearer if the chart represented the mental models for Thrift and CQL, with Thrift having 1 row per partition and cell names like:
'a:col2', 'a:col3', 'b:col2', 'b:col3'
The CQL chart would then show multiple rows within a partition, one line per row, whereas Thrift shows everything in the partition on a single line.
Initially the schema was like this: https://gist.github.com/newkek/4a0cbe91577886383aaa9ef89701cf03
I changed it to this version after a discussion with @adutra; I think it represents "the concept" better, even though it is not the exact memory representation. I don't mind changing it back, as I believe this is rather subjective. I also notice that I updated the schema without updating the explanation, so the two currently don't match.
I like the current version compared to the original diagram. The two suggestions I would have:
- The column/cell names in the Thrift diagram should be clusteringkeyvalue:columnname (i.e. a:col2, a:col3 and so on). The clustering key shouldn't be its own column.
- I think it would be good to show what it would look like if you had multiple rows per partition. The current diagram only has 1 CQL row (1 unique clustering value) per partition; it would be good to demonstrate what it looks like with at least 2 (i.e. have two rows for key=1, where key=1,col1=a and key=1,col1=b). Hopefully that makes sense.
What is really confusing to me is that with the pre-3.0 storage engine, whether you use CQL or Thrift, the way data is stored maps 1:1 to how I visualize Thrift, whereas with the 3.0 storage engine it is nearly 1:1 with how I visualize CQL. That's why I think it is very easy for the reader to conflate the data model with how things are stored.
Maybe I'm interpreting that diagram wrong, but isn't there only 1 clustering value for each row in the Thrift diagram (if Col : 1 is the clustering value)?
Hm, I see what you mean. Was it possible to have two "Col:1" for the same Row Key in Thrift? I don't see how I can do that, with Astyanax at least: I tried inserting 2 different values for the same Row Key and the same "Col:x", and one erases the other, since the "Col:" is part of the Clustering key.
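That overwrite behaviour can be sketched with a plain sorted map in Java. This is only a model, not Astyanax or driver code: the class name is hypothetical, and it assumes the two writes are applied in order (in Cassandra the write with the higher timestamp wins).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a single Thrift row: cells are keyed by cell name,
// so a second write to the same name replaces the first.
final class ThriftRowUpsert {
    private final Map<String, String> cells = new TreeMap<>();

    /** Writes a cell; an existing cell with the same name is overwritten. */
    void insert(String cellName, String value) {
        cells.put(cellName, value);
    }

    String get(String cellName) {
        return cells.get(cellName);
    }

    int cellCount() {
        return cells.size();
    }

    public static void main(String[] args) {
        ThriftRowUpsert row = new ThriftRowUpsert();
        row.insert("Col:1", "first");
        row.insert("Col:1", "second"); // same cell name: replaces "first"
        System.out.println(row.cellCount() + " cell, value=" + row.get("Col:1"));
    }
}
```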
Sorry, it's easy to conflate terms, so I'm probably confusing things a bit.
Taking a step back to describe what I'm trying to say with an example data model: I was thinking of the table that you were sharing as having a composite comparator of CompositeType(Int32, UTF8Type), where Int32 is the value of Col1 and UTF8Type is the name of the column.
As an example, say you want to represent readings from a thermostat. The thermostat is identified by an integer id, which will be our row key. A thermostat has 'readings' identified by a reading id, and each reading has a temp (for the current temperature), a mode (cooling, heating, off) and a setpoint (for the target temperature).
Using cassandra-cli, I could create a schema like so:
create keyspace ks;
use ks;
create column family readings
with key_validation_class = Int32Type
and comparator = 'CompositeType(Int32Type, UTF8Type)'
and default_validation_class = UTF8Type;
I could get even fancier and provide column metadata, but leaving this as is to keep it simple.
I could then create multiple readings for an individual thermostat, i.e.:
set readings[0]['0:temp'] = '72';
set readings[0]['0:mode'] = 'heating';
set readings[0]['0:setpoint'] = '73';
set readings[0]['1:temp'] = '75';
set readings[0]['1:mode'] = 'cooling';
set readings[0]['1:setpoint'] = '70';
set readings[1]['5:temp'] = '65';
set readings[1]['5:mode'] = 'off';
set readings[1]['5:setpoint'] = '70';
When reading back the data, I get:
[default@ks] get readings[0];
=> (name=0:mode, value=heating, timestamp=1474483110889000)
=> (name=0:setpoint, value=73, timestamp=1474483119088000)
=> (name=0:temp, value=72, timestamp=1474483101433000)
=> (name=1:mode, value=cooling, timestamp=1474483137460000)
=> (name=1:setpoint, value=70, timestamp=1474483147380000)
=> (name=1:temp, value=75, timestamp=1474483127437000)
[default@ks] get readings[1];
=> (name=5:mode, value=off, timestamp=1474483198322000)
=> (name=5:setpoint, value=70, timestamp=1474483205968000)
=> (name=5:temp, value=65, timestamp=1474483192274000)
In that example, for a Thrift row for a given key, I have multiple readings. In CQL, the concept of a row layout would depend on the clustering, but I would preferably model this like:
create table readings_cql (k int, reading_id int, mode text, setpoint float, temp float, primary key (k, reading_id));
-- insert data
insert into readings_cql (k, reading_id, mode, setpoint, temp) values (0, 0, 'heating', 73, 72);
insert into readings_cql (k, reading_id, mode, setpoint, temp) values (0, 1, 'cooling', 70, 75);
insert into readings_cql (k, reading_id, mode, setpoint, temp) values (1, 5, 'off', 70, 65);
This would give me back:
select * from readings_cql;
 k | reading_id | mode    | setpoint | temp
---+------------+---------+----------+------
 1 |          5 |     off |       70 |   65
 0 |          0 | heating |       73 |   72
 0 |          1 | cooling |       70 |   75
This looks more like your second (current) diagram.
Whereas the Thrift table represented as a CQL table looks more like your first (https://gist.github.com/newkek/4a0cbe91577886383aaa9ef89701cf03):
cqlsh> select * from ks.readings;
 key | column1 | column2  | value
-----+---------+----------+---------
   1 |       5 |     mode |     off
   1 |       5 | setpoint |      70
   1 |       5 |     temp |      65
   0 |       0 |     mode | heating
   0 |       0 | setpoint |      73
   0 |       0 |     temp |      72
   0 |       1 |     mode | cooling
   0 |       1 | setpoint |      70
   0 |       1 |     temp |      75
So what the "One CQL Table" diagram looks like ultimately depends on the CQL schema. The stored data is similar; what CQL considers a 'Row' simply depends on the clustering.
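As a rough illustration of that regrouping, here is a small Java sketch using only standard collections. The class and method names are hypothetical; it assumes cell names of the form `<reading_id>:<column>` as in the cassandra-cli output above, and groups the cells of one Thrift row into one entry per clustering value, which is what `readings_cql` exposes as separate rows.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: regroups composite-named cells into CQL-style rows.
final class CompositeCellsToRows {

    /** Groups cells named "<reading_id>:<column>" by their reading_id prefix. */
    static Map<Integer, Map<String, String>> groupByClustering(Map<String, String> cells) {
        Map<Integer, Map<String, String>> rows = new TreeMap<>();
        for (Map.Entry<String, String> cell : cells.entrySet()) {
            String name = cell.getKey();
            int sep = name.indexOf(':');
            int readingId = Integer.parseInt(name.substring(0, sep)); // clustering value
            String column = name.substring(sep + 1);                  // CQL column name
            rows.computeIfAbsent(readingId, id -> new TreeMap<>())
                .put(column, cell.getValue());
        }
        return rows;
    }

    public static void main(String[] args) {
        // Cells of the Thrift row with key 0, as returned by "get readings[0]".
        Map<String, String> cells = new LinkedHashMap<>();
        cells.put("0:mode", "heating");
        cells.put("0:setpoint", "73");
        cells.put("0:temp", "72");
        cells.put("1:mode", "cooling");
        cells.put("1:setpoint", "70");
        cells.put("1:temp", "75");
        groupByClustering(cells).forEach((readingId, columns) ->
                System.out.println("k=0, reading_id=" + readingId + " -> " + columns));
    }
}
```

Six Thrift cells become two CQL rows, one per `reading_id`.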
Comparing the data for readings (thrift) and readings_cql, the layout on disk is virtually the same (other than the values because my value type is different, and the ghost cell for CQL):
Thrift table:
[default@ks] get readings[0];
=> (name=0:mode, value=heating, timestamp=1474483110889000)
=> (name=0:setpoint, value=73, timestamp=1474483119088000)
=> (name=0:temp, value=72, timestamp=1474483101433000)
=> (name=1:mode, value=cooling, timestamp=1474483137460000)
=> (name=1:setpoint, value=70, timestamp=1474483147380000)
=> (name=1:temp, value=75, timestamp=1474483127437000)
CQL table:
[default@ks] get readings_cql[0];
=> (name=0:, value=, timestamp=1474483980386000)
=> (name=0:mode, value=68656174696e67, timestamp=1474483980386000)
=> (name=0:setpoint, value=42920000, timestamp=1474483980386000)
=> (name=0:temp, value=42900000, timestamp=1474483980386000)
=> (name=1:, value=, timestamp=1474483996207000)
=> (name=1:mode, value=636f6f6c696e67, timestamp=1474483996207000)
=> (name=1:setpoint, value=428c0000, timestamp=1474483996207000)
=> (name=1:temp, value=42960000, timestamp=1474483996207000)
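The hex values in the CQL listing are just the serialized column values: CQL `text` is stored as UTF-8 bytes and `float` as a 4-byte big-endian IEEE 754 number. A small Java sketch (hypothetical class name, standard library only) decodes them back to the values that were inserted:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes the hex cell values shown by cassandra-cli.
final class CellValueDecoder {

    static byte[] hexToBytes(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Decodes a CQL text value (UTF-8 bytes). */
    static String decodeText(String hex) {
        return new String(hexToBytes(hex), StandardCharsets.UTF_8);
    }

    /** Decodes a CQL float value (4-byte big-endian IEEE 754). */
    static float decodeFloat(String hex) {
        return Float.intBitsToFloat(Integer.parseUnsignedInt(hex, 16));
    }

    public static void main(String[] args) {
        System.out.println(decodeText("68656174696e67")); // mode of reading 0
        System.out.println(decodeFloat("42920000"));      // setpoint of reading 0
        System.out.println(decodeFloat("42900000"));      // temp of reading 0
    }
}
```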
With all that defined, what was confusing to me is the Thrift Diagram (One Thrift COLUMNFAMILY) which is the same for both diagrams.
In your diagram you have three colors:
- red: Thrift: Row Key, CQL: Partition Key
- blue: Thrift: Column Comparator, CQL: Clustering Column Value
- orange: Thrift: Column Value, CQL: Non-Clustering Column Value
If I'm interpreting that correctly and 'Col1' is a clustering column, I wouldn't expect it to be represented as its own column for Thrift. Instead it would be part of the column name in the Thrift diagram, whereas it would not be in the CQL diagram.
I know I may be getting too caught up in semantics and thinking too much about the storage model, but I think it is hard to divorce Thrift from the storage model; to me they are the same. One of the older Thrift -> CQL guides uses the storage model to describe CQL and how different column family schemas would be modeled with CQL:
CQL3 (the Cassandra Query Language) provides a new API to work with Cassandra. Where the legacy thrift API exposes the internal storage structure of Cassandra pretty much directly, CQL3 provides a thin abstraction layer over this internal structure.
Hopefully that makes sense; I had to revisit Thrift to level-set my thoughts.
Here's an example for inserting / retrieving data with that schema: https://gist.github.com/tolbertam/b12135fb4223896f14aa7c80426ae9a9
I'll admit that Composites are complicated; maybe a better example is a dynamic column family like the clicks example in this guide, which maps to a CQL table with 1 partition key, 1 clustering column, and 1 non-clustering column.
Thanks for the investigation. Yeah, you're right, Composite Column types are probably going to get more complicated. However, my first schema is not wrong, it's just basic: it exposes the storage model in Thrift, but for a simpler use case. As I said, I would rather let more experienced people explain the more complicated aspects of the data model change for CQL by linking to blog posts that were already written, rather than trying to explain it all myself again in these docs.
Btw, thanks for the code sample. I had tried composite columns in Astyanax a little while ago but couldn't get the CompositeColumn Serializer to work, so that's cool.
|
|
||
| The two basic components in the Java driver are the `Cluster` and the `Session`. | ||
| The `Cluster` is the object to create first, and on to which apply all | ||
| global configuration options. Connecting to the `Cluster` creates a |
suggestion: "on to which all global configuration options apply"
| `Session`. Queries are executed through the `Session`. | ||
|
|
||
| The `Cluster` object then is to be viewed as the equivalent of the `AstyanaxContext` | ||
| object. ’Starting’ an `AstyanaxContext` object typically returns a `Keyspace` |
[typo maniac warning] Surrounding "Starting" with two closing quotes looks ugly. If you use plain quotes in the source ('Starting'), documentor will do the right thing and generate an opening and closing quote. This works with double quotes as well.
|
|
||
| ### Connections pools internals | ||
| Everything concerning the internal pools of connections to the *Cassandra nodes* | ||
| will be gathered in the Java driver in the `PoolingOptions` : |
Link to ../../../manual/pooling
|
|
||
| Note that the *Java driver* allows multiple simultaneous requests on one single | ||
| connection, as it is based upon the *Native protocol*, an asynchronous binary | ||
| protocol that can handle up to 32768 simultaneous requests. |
Link "native protocol" to ../../../manual/native_protocol
To my comment about not changing options, I think it would be good to emphasize that configuring pooling with the Java driver is less important because it allows multiple requests on a connection. There shouldn't be a compelling reason to increase the number of connections in the general case, except for very high throughputs (can link to ../../../manual/pooling/#tuning-protocol-v3-for-very-high-throughputs).
| to benefit from the *TokenAware* routing (the *Row key* in the *Java driver* is | ||
| referenced as *Routing Key*), unlike the *Astyanax* driver. | ||
| Some differences occur related to the different kinds of `Statements` the *Java | ||
| driver* provides. Please see [this link](../../../manual/statements) for specific information. |
It would be more appropriate to link to ../../../manual/load_balancing/#token-aware-policy, it details how to set the routing key for each statement type.
| ## Authentication | ||
|
|
||
| Authentication settings are managed by the `AuthProvider` class in the *Java driver*. | ||
| It can be highly customizable, but also comes with default simple implementations : |
[typo maniac warning] remove the space before the colon
|
|
||
| A lot more options are available in the different `XxxxOption`s classes, policies are | ||
| also highly customizable since the base drivers implementations can easily be | ||
| extended and implement users specific actions. |
suggestion: user-specific actions
| provide enough insight. | ||
|
|
||
| A lot more options are available in the different `XxxxOption`s classes, policies are | ||
| also highly customizable since the base drivers implementations can easily be |
| for (Row row : rs) { | ||
| String value = row.getString("value"); | ||
| } | ||
| ``` |
|
|
||
| * [Changes at the language level](language_level_changes/) | ||
| * [Migrating Astyanax configurations to DataStax Java driver configurations](configuration/) | ||
| * [Querying and retrieving results comparisons.](queries_and_result/) |
This should be queries_and_results I think
| composing the *Clustering ColumnKey*). | ||
|
|
||
| Here is the basic architectural concept of *CQL*, a detailed explanation and *CQL* | ||
| examples can be found in this article : [http://www.planetcassandra.org/making-the-change-from-thrift-to-cql/]. |
tolbertam left a comment
Looks great! Had a few minor suggestions.
| PoolingOptions poolingOptions = | ||
| new PoolingOptions() | ||
| .setMaxRequestsPerConnection(HostDistance.LOCAL, 1024) | ||
| .setCoreConnectionsPerHost(HostDistance.LOCAL, 2) |
I think it would be better to use setConnectionsPerHost(HostDistance.LOCAL, 2, 3), since I think we should encourage using that method.
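To put numbers on that suggestion, the sketch below is plain arithmetic, not a driver API. The class and method names are hypothetical, and it assumes the pool has grown to its maximum size and that each connection is capped at the configured maximum number of requests. With at most 3 connections and 1024 requests per connection, one host can carry up to 3072 concurrent requests; a single connection at the protocol limit of 32768 quoted in the docs carries far more than that.

```java
// Hypothetical helper: upper bound on concurrent requests to one host.
final class PoolCapacity {

    /** Max in-flight requests per host = max connections x max requests per connection. */
    static int maxInFlightPerHost(int maxConnections, int maxRequestsPerConnection) {
        return Math.multiplyExact(maxConnections, maxRequestsPerConnection);
    }

    public static void main(String[] args) {
        // Values from the snippet under review: up to 3 connections, 1024 requests each.
        System.out.println(maxInFlightPerHost(3, 1024));
        // One connection at the per-connection limit stated in the docs.
        System.out.println(maxInFlightPerHost(1, 32768));
    }
}
```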
| *Java Driver :* | ||
|
|
||
| ```java | ||
| SocketOptions so = |
It would be good to add a note that timeouts should not be changed unless you are also changing the timeouts in cassandra.yaml.
Hmm, I don't 100% agree with that. As per previous discussions, some clients may want to "give up" on a request earlier than others, and in that case it does not impact the C* yaml. I think it's better to keep it vague here. I've clearly mentioned at the beginning that "if you change these options it means you know what you're doing"; I might repeat it here.
In the past we have advised users to not decrease SocketOptions#readTimeoutMillis and to depend on the cassandra side timeouts instead and in particular not to set readTimeoutMillis to a value less than the cassandra timeouts. If the client timeout is less than the cassandra timeout, you don't give RetryPolicy an opportunity to kick in.
| Configuring a `Cluster` works with the *Builder* pattern. The `Builder` takes all | ||
| the configurations into account before building the `Cluster`. | ||
|
|
||
| Following are some examples of the most important configurations that were |
It would be good to add a general comment that you should depend on the default configuration unless you have a good reason. It'd be nice to have a sentence after each section saying "You shouldn't need to configure this unless....". I think users often change a lot of configuration options because they see it in examples, but in reality they shouldn't need to.
| this case, setting the CL on the `PreparedStatement`, causes the `BoundStatements` to | ||
| inherit the CL from the prepared statements they were prepared from. More | ||
| informations about how `Statement`s work in the *Java driver* are detailed | ||
| in the [“Queries and Result” section](../queries_and_results/). |
Should be "Queries and Results"
| *Thrift* exposes *Keyspaces*, and these *Keyspaces* contain *Column Families*. A | ||
| *ColumnFamily* contains *Rows* in which each *Row* has a list of an arbitrary number | ||
| of column-values. With *CQL*, the data is **tabular**, *ColumnFamily* gets viewed | ||
| as a *Table*, the **Table Rows** get a **fixed and finite number of named columns**. |
In one case Table Rows is bolded for emphasis, and in another case it's underlined; is that intentional?
Yep, wanted to emphasize the Table Rows difference...
| @@ -0,0 +1,106 @@ | |||
| # Queries and Results | |||
| There are many ressources such as [this post][planetCCqlLink] or [this post][dsBlogCqlLink] to learn | |||
| The *Java driver* executes CQL queries through the `Session`. | ||
| The queries can either be simple *CQL* Strings or represented in the form of | ||
| `Statement`s. The driver offers 4 kinds of statements, `SimpleStatement`, | ||
| `Prepared/BoundStatement`, `BuiltStatement`, `BatchStatement`. All necessary |
| information can be [found here](../../../manual/statements/) about the natures of the different |
| `Statement`s. | ||
|
|
||
| As explained in [this documentation section](../../../manual/#running-queries), |
Suggest renaming 'this documentation section' to 'the "running queries" section'.

Looks like I've addressed all comments except the schema. There seem to be 3 different suggestions there... Please note that I did not intend to go into much detail on the CQL-level changes; I wanted to stay "basic" and give a "general idea", since I have linked to other posts that explain those CQL changes much better and in more detail than I can, and it is not the intention of this doc to focus on CQL. I'd rather go back to the schema in https://gist.github.com/newkek/4a0cbe91577886383aaa9ef89701cf03 because it's closer to what CQLSH shows and what CQL queries look like.

Yes I prefer your other schema too, because it illustrates the case where a single Thrift row translates to multiple CQL rows, which was the main concern I think.

LGTM pending the schema. I saw a couple minor issues but I addressed them directly.

Ok so after private discussions with @tolbertam I pushed a revised version of the first schema in https://gist.github.com/newkek/4a0cbe91577886383aaa9ef89701cf03 that should be slightly clearer. Seems to me we're all done for that PR.

The changes look great! 👍
Force-pushed from 0870983 to f17c5c1.

Squashed and rebased on top of 3.0.x.

No description provided.