Neo4j: an example of loading a CSV file

2014-2-19 11:55:14

I spent some time earlier in the week trying to import a CSV file extracted from Hadoop into Neo4j using Cypher’s LOAD CSV command and initially struggled due to some rogue characters.
The CSV file looked like this:
$ cat foo.csv
foo,bar,baz
1,2,3

千问 · 2014-2-19 11:55:14

I wrote the following LOAD CSV query to extract some of the fields and compare others:
load csv with headers from "file:/Users/markneedham/Downloads/foo.csv" AS line
RETURN line.foo, line.bar, line.bar = "2"
==> +--------------------------------------+
==> | line.foo | line.bar | line.bar = "2" |
==> +--------------------------------------+
==> | <null>   | "2"      | false          |
==> +--------------------------------------+
==> 1 row

I had expected to see a “1” in the first column and a ‘true’ in the third column, neither of which happened.
I initially didn’t have a text editor with a hex mode available so I tried checking the length of the entry in the ‘bar’ field:
load csv with headers from "file:/Users/markneedham/Downloads/foo.csv" AS line
RETURN line.foo, line.bar, line.bar = "2", length(line.bar)
==> +---------------------------------------------------------+
==> | line.foo | line.bar | line.bar = "2" | length(line.bar) |
==> +---------------------------------------------------------+
==> | <null>   | "2"      | false          | 2                |
==> +---------------------------------------------------------+
==> 1 row

The length of that value is 2 when we’d expect it to be 1 given it’s a single character.
I tried trimming the field to see if that made any difference…
load csv with headers from "file:/Users/markneedham/Downloads/foo.csv" AS line
RETURN line.foo, trim(line.bar), trim(line.bar) = "2", length(line.bar)
==> +---------------------------------------------------------------------+
==> | line.foo | trim(line.bar) | trim(line.bar) = "2" | length(line.bar) |
==> +---------------------------------------------------------------------+
==> | <null>   | "2"            | true                 | 2                |
==> +---------------------------------------------------------------------+
==> 1 row

…and it did! I thought there was probably a trailing whitespace character after the “2” which trim had removed, and that the ‘foo’ column in the header row had the same issue.
I was able to see that this was the case by extracting the JSON dump of the query via the Neo4j browser:
{
  "table":{
    "_response":{
      "columns":[
        "line"
      ],
      "data":[
        {
          "row":[
            {
              "foo\u0000":"1\u0000",
              "bar":"2\u0000",
              "baz":"3"
            }
          ],
          "graph":{
            "nodes":[
            ],
            "relationships":[
            ]
          }
        }
      ],
...
}
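
The same rogue characters can also be spotted from the shell when no hex editor is at hand. The following is only a sketch: sample.csv is a hypothetical stand-in built with printf to mimic the null bytes seen in the JSON dump, not the original file.

```shell
# Recreate a file shaped like foo.csv, with NUL bytes after "foo", "1" and "2".
# (sample.csv is a hypothetical name used only for this sketch.)
printf 'foo\000,bar,baz\n1\000,2\000,3\n' > sample.csv

# od -c prints the file byte by byte; NUL bytes show up as \0.
od -c sample.csv

# Count the NUL bytes: delete everything that is not NUL, then count the rest.
nul_count=$(tr -cd '\000' < sample.csv | wc -c | tr -d ' ')
echo "NUL bytes: $nul_count"   # prints: NUL bytes: 3
```

A non-zero count is the cue that plain cat output is hiding something.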

It turns out there were null characters scattered around the file so I needed to pre-process the file to get rid of them:
$ tr -d '\000' < foo.csv > bar.csv
Now if we process bar.csv it’s a much smoother process:
load csv with headers from "file:/Users/markneedham/Downloads/bar.csv" AS line
RETURN line.foo, line.bar, line.bar = "2", length(line.bar)

==> +---------------------------------------------------------+
==> | line.foo | line.bar | line.bar = "2" | length(line.bar) |
==> +---------------------------------------------------------+
==> | "1"      | "2"      | true           | 1                |
==> +---------------------------------------------------------+
==> 1 row
Note to self: don’t expect data to be clean, inspect it first!
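
That inspection step could be scripted so it runs before every import. A minimal sketch, assuming null bytes are the only problem to guard against; clean_csv, dirty.csv and clean.csv are hypothetical names made up for illustration.

```shell
# Hypothetical helper: copy a CSV, stripping NUL bytes if it contains any.
# Usage: clean_csv SOURCE DESTINATION
clean_csv() {
  src=$1
  dst=$2
  # Keep only the NUL bytes and count them.
  nuls=$(tr -cd '\000' < "$src" | wc -c | tr -d ' ')
  if [ "$nuls" -gt 0 ]; then
    echo "$src: removing $nuls NUL byte(s)" >&2
    tr -d '\000' < "$src" > "$dst"
  else
    cp "$src" "$dst"
  fi
}

# Demo on a throwaway file shaped like foo.csv.
printf 'foo\000,bar,baz\n1\000,2\000,3\n' > dirty.csv
clean_csv dirty.csv clean.csv
cat clean.csv
```

Writing to a separate output file keeps the original around for comparison, the same way bar.csv was kept apart from foo.csv.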
