modules/ROOT/pages/reading.adoc
@@ -606,3 +606,8 @@ Where:
* `<target_labels>` is the list of labels provided by the `relationship.target.labels` option
* `<relationship>` is the relationship type provided by the `relationship` option
* `<limit>` is the value provided via the `schema.flatten.limit` option

=== Performance considerations

If the schema is not specified, the Spark Connector uses sampling as explained xref:quickstart.adoc#_schema[here] and xref:architecture.adoc#_schema_considerations[here].
Since sampling is potentially an expensive operation, consider xref:quickstart.adoc#user-defined-schema[supplying your own schema].
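As a minimal sketch of what supplying your own schema can look like (the URL, label, and property names below are illustrative, not taken from this page), you can pass a Spark `StructType` to the reader so that sampling is skipped:

[source, scala]
----
import org.apache.spark.sql.types.{StructField, StructType, StringType}

// Hypothetical schema for :Person nodes exposing `name` and `surname` as strings.
val userSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("surname", StringType)
))

// Supplying the schema up front avoids the sampling step described above.
val df = spark.read.format("org.neo4j.spark.DataSource")
  .schema(userSchema)
  .option("url", "bolt://localhost:7687")
  .option("labels", ":Person")
  .load()
----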
modules/ROOT/pages/writing.adoc
@@ -189,6 +189,7 @@ Writing data to a Neo4j database can be done in three ways:

If you use the `query` option, the Spark Connector persists the entire Dataset by using the provided query.
The rows are sent to Neo4j in batches whose size is defined by the `batch.size` property, and your query is wrapped in an `UNWIND $events AS event` statement.
The `query` option supports both `CREATE` and `MERGE` clauses.
The `events` list is therefore the batch created from your Dataset.
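As a rough sketch of how such a query is passed to the writer (the connection URL, save mode, and column names here are illustrative and not prescribed by this section; see <<save-mode>> for the save mode rules):

[source, scala]
----
import org.apache.spark.sql.SaveMode

// `ds` is assumed to be a DataFrame with `name` and `surname` columns.
// Each batch of rows is bound to `events`, and each row is available as `event`.
ds.write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.Overwrite)
  .option("url", "bolt://localhost:7687")
  .option("batch.size", "5000")
  .option("query", "CREATE (n:Person {fullName: event.name + ' ' + event.surname})")
  .save()
----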

==== Considerations

* You must always specify the <<save-mode>>.

* You can use the `events` list in `WITH` statements as well.
For example, you can replace the query in the previous example with the following:
+
[source, cypher]
----
WITH event.name + ' ' + toUpper(event.surname) AS fullName
CREATE (n:Person {fullName: fullName})
----

* Subqueries that reference the `events` list in ``CALL``s are supported:
+
[source, cypher]
----
CALL {
  WITH event
  RETURN event.name + ' ' + toUpper(event.surname) AS fullName
}
CREATE (n:Person {fullName: fullName})
----

* If APOC is installed, APOC procedures and functions can be used:
+
[source, cypher]
----
CALL {
  WITH event
  RETURN event.name + ' ' + apoc.text.toUpperCase(event.surname) AS fullName
}
CREATE (n:Person {fullName: fullName})
----

* Although a `RETURN` clause is not forbidden, adding one does not have any effect on the query result.

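For illustration, one of the multi-line queries above can be passed through the same `query` option; the Scala triple-quoted string below is only a sketch, and the connection details are placeholders:

[source, scala]
----
import org.apache.spark.sql.SaveMode

// Sketch: pass the CALL-subquery example from the list above as the write query.
val writeQuery =
  """CALL {
    |  WITH event
    |  RETURN event.name + ' ' + toUpper(event.surname) AS fullName
    |}
    |CREATE (n:Person {fullName: fullName})""".stripMargin

ds.write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.Overwrite)
  .option("url", "bolt://localhost:7687")
  .option("query", writeQuery)
  .save()
----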
[[write-node]]
=== Node
@@ -351,6 +402,11 @@ Neo4j Connector for Apache Spark flattens the maps, and each map value is in its own property.

You can write a DataFrame to Neo4j by specifying source, target nodes, and relationships.

[WARNING]
====
To avoid deadlocks, always use a single partition (for example with `coalesce(1)`) before writing relationships to Neo4j.
====
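A minimal sketch of the advice above might look like the following; the DataFrame `relDf`, its columns, and the relationship options are placeholders, and the exact meaning of those options is covered in the rest of this section:

[source, scala]
----
import org.apache.spark.sql.SaveMode

// Sketch: collapse to a single partition before writing relationships,
// so that concurrent transactions do not lock the same nodes and deadlock.
relDf.coalesce(1)
  .write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.Overwrite)
  .option("url", "bolt://localhost:7687")
  .option("relationship", "BOUGHT")
  .option("relationship.save.strategy", "keys")
  .option("relationship.source.labels", ":Customer")
  .option("relationship.source.node.keys", "customerId:id")
  .option("relationship.target.labels", ":Product")
  .option("relationship.target.node.keys", "productId:id")
  .save()
----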

==== Overview

Before diving into the actual process, let's clarify the vocabulary first. Since this method of writing data to Neo4j is more complex and several combinations of options can be used, let's spend more time explaining it.
@@ -417,6 +473,7 @@ val originalDf = spark.read.format("org.neo4j.spark.DataSource")
@@ -624,6 +684,10 @@ You can set the optimization via `schema.optimization.type` option that works on
* `INDEX`: it creates only indexes on provided nodes.
* `NODE_CONSTRAINTS`: it creates only constraints on provided nodes.
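For example, a minimal sketch of enabling the constraint optimization might look like the following; the connection URL and node keys are placeholders:

[source, scala]
----
import org.apache.spark.sql.SaveMode

// Illustrative sketch: create the constraints on the Person node keys before writing.
ds.write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.Overwrite)
  .option("url", "bolt://localhost:7687")
  .option("labels", ":Person")
  .option("node.keys", "surname")
  .option("schema.optimization.type", "NODE_CONSTRAINTS")
  .save()
----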

[IMPORTANT]
The `schema.optimization.type` option cannot be used with the `query` option.
If you are using a <<write-query, custom Cypher query>>, you need to create indexes and constraints manually using the <<script-option, `script` option>>.

==== Index creation
@@ -646,12 +710,16 @@ Before the import starts, the following schema query is being created:
CREATE INDEX ON :Person(surname)
----

The name of the created index is `spark_INDEX_<LABEL>_<NODE_KEYS>`, where `<LABEL>` is the first label from the `labels` option and `<NODE_KEYS>` is a dash-separated sequence of one or more properties as specified in the `node.keys` option.
In this example, the name of the created index is `spark_INDEX_Person_surname`.
If the `node.keys` option were set to `"name,surname"` instead, the index name would be `spark_INDEX_Person_name-surname`.

The index is not recreated if it is already present.

==== Constraint creation

Below you can see an example of how to create constraints while you're creating nodes.

----
ds.write
@@ -670,8 +738,13 @@ Before the import starts, the code above creates the following schema query:
CREATE CONSTRAINT FOR (p:Person) REQUIRE (p.surname) IS UNIQUE
----

The name of the created constraint is `spark_NODE_CONSTRAINTS_<LABEL>_<NODE_KEYS>`, where `<LABEL>` is the first label from the `labels` option and `<NODE_KEYS>` is a dash-separated sequence of one or more properties as specified in the `node.keys` option.
In this example, the name of the created constraint is `spark_NODE_CONSTRAINTS_Person_surname`.
If the `node.keys` option were set to `"name,surname"` instead, the constraint name would be `spark_NODE_CONSTRAINTS_Person_name-surname`.

The constraint is not recreated if it is already present.

[[script-option]]
=== Script option

The script option allows you to execute a series of preparation scripts before Spark

`scriptResult` is the result from the last query contained within the `script` option,
that is `RETURN 36 AS age;`
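The full example this refers to is not shown in this excerpt; as a rough sketch only (connection details, query, and columns are illustrative), the last statement of the `script` option can be consumed through `scriptResult` in the write query:

[source, scala]
----
import org.apache.spark.sql.SaveMode

// Illustrative sketch: the last query of the `script` option (`RETURN 36 AS age`)
// is assumed to be exposed to the write query as `scriptResult`.
ds.write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.Overwrite)
  .option("url", "bolt://localhost:7687")
  .option("script", "RETURN 36 AS age")
  .option("query",
    "CREATE (n:Person {fullName: event.name + ' ' + event.surname, age: scriptResult[0].age})")
  .save()
----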

=== Performance considerations

Since writing is typically an expensive operation, make sure you write only the columns you need from the DataFrame.
For example, if the columns from the data source are `name`, `surname`, `age`, and `livesIn`, but you only need `name` and `surname`, you can do the following:

[source, scala]
----
ds.select(ds("name"), ds("surname"))
  .write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.ErrorIfExists)
  .option("url", "bolt://localhost:7687")
  .option("labels", ":Person:Customer")
  .save()
----
== Note about columns with Map type
When a DataFrame column is a map, the connector internally flattens the map, as Neo4j does not support this type for graph entity properties; so for a Spark job like this: