ETL: CSV to Neo4j
Table of Contents
Load data from CSV into Neo4j using Cypher (LOAD CSV command).
Used drivers:
- CSV
- Neo4j or Neo4jCsvNodes with Neo4jCsvRelationships
You can use Neo4jCsvNodes and Neo4jCsvRelationships to automatically generate Cypher for you or you can use Neo4j driver to execute custom Cypher. See examples below. Both examples import same data and the result in Neo4j is the same. Here are CSV files which are used in the examples.
Nodes
logicalName;type;subtype;comment;vendor;responsible
Server1;Server;Physical;HP server;HP;ITOPS
Server2;Server;Physical;HP server;HP;ITOPS
App1;Application modul;Application;IB application;Microsoft;MS support
Server3;Server;Physical;HP server;HP;ITOPS
App2;Application modul;Application;IB application;Microsoft;MS support
Relationships
sourceLogicalName;targetLogicalName;relationshipName;type;subtype
App1;Server1;App1-Server1;Physical;Uses
App2;Server2;App2-Server2;Physical;Uses
Import using Neo4j driver
This import uses the Neo4j driver which can execute a Cypher command. To import CSV we can use the LOAD CSV command.
Advantage: Changes in the name of properties or adding calculations or conditions to the Cypher is possible.
Disadvantage: You need to write long Cypher to set every column of CSV. If new columns will be added to the CSV file it's needed to change the Cypher in the ETL job also. If you want to set the same property names as CSV column names and don't want to map anything it's better to use Neo4jCsvNodes and Neo4jCsvRelationships drivers instead of Neo4j driver.
<!DOCTYPE etl SYSTEM "
https://scriptella.org/dtd/etl.dtd
">
<
etl
>
<
description
>Load CSV nodes and rels into Neo4j</
description
>
<
properties
>
csv.file1=/nodes.csv
csv.file2=/rels.csv
</
properties
>
<
connection
id
=
"neo4j"
driver
=
"neo4j"
url
=
"bolt://localhost:7687"
user
=
"neo4j"
password
=
"admin"
/>
<
script
connection-id
=
"neo4j"
>
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM 'file:$csv.file1' AS line FIELDTERMINATOR ';'
MERGE (n:Ci{logicalName:line.logicalName})
SET
n.logicalName=line.logicalName,
n.type=line.type,
n.subtype=line.subtype,
n.comment=line.comment,
n.vendor=line.vendor,
n.responsible=line.responsible,
n._created=timestamp()
</
script
>
<
script
connection-id
=
"neo4j"
>
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM 'file:$csv.file2' AS line FIELDTERMINATOR ';'
MATCH (n1:Ci{logicalName:line.sourceLogicalName}),(n2:Ci{logicalName:line.targetLogicalName}) MERGE (n1)-[r:RELATED{relationshipName:line.relationshipName}]->(n2)
SET
r.type=line.type,
r.subtype=line.subtype,
r._created=timestamp()
</
script
>
</
etl
>
Import using Neo4jCsvNodes and Neo4jCsvRelationships driver
This import uses Neo4jCsvNodes and Neo4jCsvRelationships drivers which automatically generate LOAD CSV Cypher for you in a background. There is no need to write Cypher to set every column of CSV.
Advantage: Short ETL script. There is no need to change ETL job if some columns are added into CSV.
Disadvantage: Cypher is generated in a background so you can't modify this Cypher to add conditions or something else. But you can use Neo4j driver after Neo4jCsvNodes and Neo4jCsvRelationships drivers to execute some custom Cypher and modify data in Neo4j.
<!DOCTYPE etl SYSTEM "
https://scriptella.org/dtd/etl.dtd
">
<
etl
>
<
description
>Load CSV nodes and rels into Neo4j</
description
>
<
properties
>
csv.file1=/nodes.csv
csv.file2=/rels.csv
</
properties
>
<
connection
id
=
"nodeImport"
driver
=
"neo4jCsvNodes"
url
=
"bolt://localhost:7687"
user
=
"neo4j"
password
=
"admin"
>
node_labels=Ci
merge_columns=logicalName
empty_string_is_null=true
timestamp_properties=_created,_updated
csv_path=$csv.file1
</
connection
>
<
connection
id
=
"relationshipImport"
driver
=
"neo4jCsvRelationships"
url
=
"bolt://localhost:7687"
user
=
"neo4j"
password
=
"admin"
>
source_node_labels=Ci
target_node_labels=Ci
relationship_type=RELATED
source_node_match_columns=logicalName:sourceLogicalName
target_node_match_columns=logicalName:targetLogicalName
relationship_merge_columns=relationshipName
empty_string_is_null=true
timestamp_properties=_created,_updated
csv_path=$csv.file2
</
connection
>
<
script
connection-id
=
"nodeImport"
/>
<
script
connection-id
=
"relationshipImport"
/>
</
etl
>
Import using Neo4j driver without LOAD CSV
This import uses the Neo4j driver which can execute a Cypher command. To import CSV we don't use the LOAD CSV command in this case.
Advantage: You can make some custom CREATE or MERGE statements. For example to customize what node label will be created or what relationship type will be created. Node label and relationship type can't be customized in LOAD CSV (label and relationship type has to be a constant string). For example, you can write Cypher query:
Disadvantage: This approach is not very performant and can be used only on small datasets (less than 50.000 nodes and 50.000 relationships).
MERGE (n:
'$type'
{id:toint(
'$id'
)}) SET n.logicalName=
'$logicalName'
Disadvantage: This import is very inefficient because every record is inserted separately and this cause poor performance.
<!DOCTYPE etl SYSTEM
"https://scriptella.org/dtd/etl.dtd"
>
<etl>
<description>CSV to Neo4j. Insert every record from CSV by own Cypher</description>
<properties>
csv.nodes=/nodes.csv
csv.rels=/rels.csv
</properties>
<connection id=
"csv_nodes"
driver=
"csv"
url=
"$csv.nodes"
>
separator=;
</connection>
<connection id=
"csv_rels"
driver=
"csv"
url=
"$csv.rels"
>
separator=;
</connection>
<connection id=
"neo4j"
driver=
"neo4j"
url=
"bolt://localhost:7687"
user=
"neo4j"
password=
"admin"
/>
<query connection-id=
"csv_nodes"
>
<script connection-id=
"neo4j"
>
MERGE (n:Ci{logicalName:
'$logicalName'
})
SET
n.logicalName=
'$logicalName'
,
n.type=
'$type'
,
n.subtype=
'$subtype'
,
n.comment=
'$comment'
,
n.vendor=
'$vendor'
,
n.responsible=
'$responsible'
,
n._created=timestamp(),
n._updated=timestamp()
</script>
</query>
<query connection-id=
"csv_rels"
>
<script connection-id=
"neo4j"
>
MATCH (n1:Ci{logicalName:
'$sourceLogicalName'
}),(n2:Ci{logicalName:
'$targetLogicalName'
}) MERGE (n1)-[r:RELATED{relationshipName:
'$relationshipName'
}]->(n2)
SET
r.type=
'$type'
,
r.subtype=
'$subtype'
,
r._created=timestamp(),
r._updated=timestamp()
</script>
</query>
</etl>