Trying Out Neo4j

1. Importing the demo data

The LOAD CSV approach (suited to hot updates; no need to stop the service)

// Example: importing the demo data

LOAD CSV WITH HEADERS FROM "file:///RetailRecommendationsDemoDataProduct.csv" AS row
MERGE (parent_category:Category {name: row.parent_category})
MERGE (category:Category {name: row.category})
MERGE (category)-[:PARENT_CATEGORY]->(parent_category)
MERGE (p:Product {sku: toString(row.sku)})
SET p.name = row.name,
p.price = toFloat(row.price)
MERGE (p)-[:IN_CATEGORY]->(category)
MERGE (d:Designer {name: row.designer})
MERGE (p)-[:DESIGNED_BY]->(d)
RETURN *;
// Example: incremental data import for the Zhizi recommender

LOAD CSV WITH HEADERS FROM "file:///sales_xxxxxxxx.csv" AS row
MERGE (buyer:User {buyer_nick: row.buyer_nick})
MERGE (plat:Platform {platform_code: row.platform_code})
SET plat.platform_name = row.platform_name
MERGE (brand:Brand {brand_id: row.brand_id})
SET brand.brand_name = row.brand_name
MERGE (store:Store {store_id: row.store_id})
SET store.store_name = row.store_name
MERGE (c:Category {category_code: row.category_code})
MERGE (s:Season {season: row.season})
SET s.season_name = row.season_name
MERGE (p:Product {product_code: row.product_code})
SET p.product_name = row.product_name

MERGE (p)-[:IN_CATEGORY]->(c)
MERGE (p)-[:IN_SEASON]->(s)
MERGE (p)-[:IN_BRAND]->(brand)
MERGE (p)-[:IN_PLATFORM]->(plat)
MERGE (p)-[:IN_STORE]->(store)
MERGE (buyer)-[:BUY]->(p)
MERGE (p)-[:BE_BOUGHT]->(buyer)

RETURN *;

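A few points worth noting:

  1. The local path given to LOAD CSV is resolved relative to the import directory (a network file URL can be passed instead), so the data file has to be copied into that directory first. For example, on macOS: /Users/alithink/Library/Application\ Support/Neo4j\ Desktop/Application/neo4jDatabases/database-4a70fe04-c5a9-41f4-9b8e-5e5c52e283dd/installation-3.5.0/import
  2. LOAD CSV is still too slow to be practical for importing a large body of existing data. Its advantage is that the import needs neither a service stop nor a database reset, which makes it a good fit for incremental updates.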
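To make the first point concrete, here is a minimal sketch of how a local file maps to the URL used in LOAD CSV. The helper name `to_load_csv_url` is hypothetical (it is not part of Neo4j), and the sketch assumes file names without spaces or other characters that would need URL encoding.

```python
from pathlib import Path


def to_load_csv_url(csv_path, import_dir):
    """Return the file:/// URL that LOAD CSV would use for csv_path.

    Hypothetical helper: raises ValueError when csv_path is not located
    inside import_dir, because LOAD CSV resolves local paths relative to it.
    """
    csv_path = Path(csv_path).resolve()
    import_dir = Path(import_dir).resolve()
    try:
        relative = csv_path.relative_to(import_dir)
    except ValueError:
        raise ValueError(
            f"{csv_path} is not inside {import_dir}; copy it there first"
        )
    # No URL encoding is applied; names with spaces are out of scope here
    return "file:///" + relative.as_posix()
```

A file stored directly in the import directory as `sales.csv` would come back as `file:///sales.csv`, which matches the form used in the statements above.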

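The neo4j-import approach (requires stopping the service and rebuilding the database; lightning fast)

Preparing the Zhizi data for Neo4j: a data preprocessing program was built in the intranet environment.

First, convert the data into a format that the import tool accepts.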

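# Node data preparation
# (sale_df is the sales DataFrame loaded earlier by the preprocessing program)

# Select the columns for this node type; copy to avoid SettingWithCopyWarning
season_df = sale_df[["season", "season_name"]].copy()
# Attach the node label
season_df[":LABEL"] = "Season"
# Prefix the id to avoid collisions (the Neo4j id namespace must not contain duplicates)
season_df["season"] = ["season_%i" % i for i in season_df["season"]]
# Mark the id column
season_df = season_df.rename(columns={"season": "season:ID"})
# Deduplicate
season_df = season_df.drop_duplicates()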
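# Relationship data preparation

# Select the columns for this relationship; copy to avoid SettingWithCopyWarning
product_user_df = sale_df[['product_code', 'buyer_nick', 'quantity']].copy()

# Strip surrounding whitespace (dirty-data handling; Neo4j ignores such whitespace by default).
# The .str accessor leaves missing values as NaN instead of raising.
product_user_df['buyer_nick'] = product_user_df['buyer_nick'].str.strip()

# Aggregate: total quantity per (product, buyer) pair
product_user_df = (
    product_user_df
    .groupby(['product_code', 'buyer_nick'])
    .agg({'quantity': 'sum'})
    .reset_index()
)

# Attach the relationship type
product_user_df[':TYPE'] = 'BE_BOUGHT'

# Mark the start and end ids and name the relationship property
product_user_df = product_user_df.rename(columns={
    "product_code": ":START_ID",
    "quantity": "buy_num",
    "buyer_nick": ":END_ID",
})

product_user_df = product_user_df[[':START_ID', 'buy_num', ':END_ID', ':TYPE']]

# Drop rows with missing values
product_user_df = product_user_df.dropna()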
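# Export the data
season_df.to_csv('neo4j/season.csv', index=False)
product_user_df.to_csv('neo4j/product_user.csv', index=False)
# user_product_df is not built in the snippets above
user_product_df.to_csv('neo4j/user_product.csv', index=False)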
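To show what the resulting files look like, here is a small standard-library sketch that produces the same header layout (`season:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`) without pandas. The sample records and the `write_csv` helper are made up for illustration; only the column names and the product code IG6431 come from this post.

```python
import csv
import io


def write_csv(header, rows):
    """Serialize rows to CSV text with the given header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample records: (season code, season name)
seasons = [(1, "Spring"), (2, "Summer"), (1, "Spring")]

# Node file: prefix the id, attach the label, deduplicate while keeping order
season_rows = list(dict.fromkeys(
    ("season_%i" % code, name, "Season") for code, name in seasons
))
season_csv = write_csv(["season:ID", "season_name", ":LABEL"], season_rows)

# Hypothetical sample sales: (product_code, buyer_nick, quantity)
sales = [("IG6431", " alice ", 2), ("IG6431", "alice", 1), ("IG6431", "bob", 1)]

# Relationship file: strip the ids, then sum quantity per (product, buyer) pair
totals = {}
for product_code, buyer_nick, quantity in sales:
    key = (product_code.strip(), buyer_nick.strip())
    totals[key] = totals.get(key, 0) + quantity
rel_rows = [(start, num, end, "BE_BOUGHT") for (start, end), num in totals.items()]
rel_csv = write_csv([":START_ID", "buy_num", ":END_ID", ":TYPE"], rel_rows)

print(season_csv)
print(rel_csv)
```

Note how " alice " and "alice" collapse into one relationship row once the whitespace is stripped, which is the reason for the strip step in the pandas version.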

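Running the import

Checklist:

  • First, copy the data files into the import folder under the Neo4j home directory
  • Confirm that the Neo4j service has been stopped
  • Delete data/databases/graph.db under the Neo4j home directory

Run the following command: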
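The first and third checklist items can be scripted. The following is only a sketch: `prepare_offline_import` is a hypothetical helper, it does not stop the service for you, and removing graph.db is destructive, so it should only be pointed at a Neo4j home whose database you really intend to rebuild.

```python
import shutil
from pathlib import Path


def prepare_offline_import(neo4j_home, csv_files):
    """Copy CSV files into <neo4j_home>/import and remove the old graph.db.

    Does NOT stop the Neo4j service; do that before calling this.
    Returns the list of copied file paths.
    """
    home = Path(neo4j_home)
    import_dir = home / "import"
    import_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy returns the destination path of each copied file
    copied = [Path(shutil.copy(f, import_dir)) for f in csv_files]

    graph_db = home / "data" / "databases" / "graph.db"
    if graph_db.exists():
        shutil.rmtree(graph_db)  # destructive: the existing database is lost
    return copied
```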

./bin/neo4j-admin import \
  --nodes import/brand.csv \
  --nodes import/buyer.csv \
  --nodes import/category.csv \
  --nodes import/platform.csv \
  --nodes import/product.csv \
  --nodes import/season.csv \
  --nodes import/store.csv \
  --relationships import/product_brand.csv \
  --relationships import/product_category.csv \
  --relationships import/product_platform.csv \
  --relationships import/product_season.csv \
  --relationships import/product_store.csv \
  --relationships import/user_product.csv \
  --relationships import/product_user.csv \
  --delimiter "," --array-delimiter "|" --quote "'"
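With this many files the command gets long, so it may be easier to assemble it from lists. A minimal sketch, where `build_import_command` is a hypothetical helper and the flags are exactly the ones used in the command above:

```python
def build_import_command(node_files, relationship_files,
                         admin="./bin/neo4j-admin"):
    """Return the argv list for a neo4j-admin import call.

    Hypothetical helper; it only assembles arguments and runs nothing.
    """
    cmd = [admin, "import"]
    for path in node_files:
        cmd += ["--nodes", path]
    for path in relationship_files:
        cmd += ["--relationships", path]
    # Same delimiter and quote settings as the command in this post
    cmd += ["--delimiter", ",", "--array-delimiter", "|", "--quote", "'"]
    return cmd
```

The returned list could then be handed to `subprocess.run(cmd, cwd=neo4j_home, check=True)`, assuming `neo4j_home` points at the Neo4j home directory. Passing a list avoids shell quoting issues with the `|` and `'` arguments.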

Comparison of the import methods

LOAD CSV speed reference

neo4j-import speed reference

2. Inspecting the data

Start the Neo4j service

$NEO4J_HOME/bin/neo4j console

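A few points worth noting:

  • External access (conf/neo4j.conf): dbms.connector.http.listen_address=0.0.0.0:7474
  • Query logging has to be enabled separately in the configuration.
  • The default username/password is neo4j/neo4j (regenerating the db files does not reset the username or password)

Viewing the data schema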

call db.schema()

The demo result looks like this:
[Screenshot, 2019-01-24 11:23:41 AM]

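The result for the Zhizi recommendation db looks like this (for some reason the text inside some of the node circles is not displayed; it seems the Neo4j front end still has bugs):
[Screenshot, 2019-01-26 12:14:44 PM]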

A first taste of recommendations

// Zhizi association-rule recommendation

match (s_product:Product {product_code: "IG6431"})-[:BE_BOUGHT]->(s_user:User)<-[:BE_BOUGHT]-(rec:Product)
return rec.product_code, count(s_user) as `Score`
order by count(s_user) desc limit 1000

The result looks like this:
[Screenshot, 2019-01-26 12:18:51 PM]
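For readers who want to see what the query computes without a running database, here is a small in-memory sketch of the same co-purchase scoring. It assumes one BE_BOUGHT relationship per (product, user) pair, which holds for the aggregated import data. The `recommend` function and the sample purchases are hypothetical; only the product code IG6431 comes from the query.

```python
from collections import Counter


def recommend(bought_by, product_code, limit=1000):
    """Rank other products by how many buyers they share with product_code.

    bought_by maps a product code to the set of users who bought it,
    mirroring the (Product)-[:BE_BOUGHT]->(User) relationships.
    """
    seed_buyers = bought_by.get(product_code, set())
    scores = Counter()
    for other, buyers in bought_by.items():
        if other == product_code:
            continue
        shared = len(seed_buyers & buyers)
        if shared:
            scores[other] = shared
    # Highest score first, like ORDER BY count(s_user) DESC LIMIT 1000
    return scores.most_common(limit)


# Hypothetical purchase data; only IG6431 appears in the original query
bought_by = {
    "IG6431": {"alice", "bob", "carol"},
    "P-A": {"alice", "bob"},
    "P-B": {"carol"},
    "P-C": {"dave"},
}
print(recommend(bought_by, "IG6431"))  # [('P-A', 2), ('P-B', 1)]
```

P-C gets no score because it shares no buyer with IG6431, just as it would not match the graph pattern in the Cypher version.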

Tips

  • Press Shift+Enter to insert a line break
  • Once in multi-line mode, press Command+Enter to run the statement

References