Imagine you drop all your files into a single "bucket," and this file store understands:
- How to best store all your files
- The content of the files
- How to find similar and related files
- How to establish data lineage
And much, much more. GraphFS leverages the power of graph and vector databases to accomplish just that.
Install Docker on your platform.
Create the data directory:
mkdir ./volumes/graphfs
This is where GraphFS stores its data internally, so make sure the volume has enough free space.
Create the config file by cloning etc/config.yml.sample
and then customize based on your situation:
environments:
  DEV:
    BINSTORE:
      path: /mnt/data
    NEO4J:
      password: binstore
      username: neo4j
    local_url: <ip_addr>
    milvus_url: <ip_addr>
    graphfs_host: localhost
    graphfs_port: 9000
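A minimal sketch of reading this config from Python (the GRAPHFS_ENV variable name and loader shape are illustrative assumptions, not GraphFS's actual loader):
import os
import yaml  # pip install pyyaml

with open("etc/config.yml") as f:
    config = yaml.safe_load(f)

# Select the active environment block; GRAPHFS_ENV is a hypothetical variable name.
env = config["environments"][os.environ.get("GRAPHFS_ENV", "DEV")]
print(env["BINSTORE"]["path"], env["NEO4J"]["username"])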
Build the GraphFS image from the Dockerfile:
docker build -t graphfs .
Run the GraphFS server in a container. The server runs on port 9000:
docker run -it --rm --name graphfssrv -p 127.0.0.1:9000:9000 -v ./volumes/graphfs:/mnt/data graphfs
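To confirm the server is accepting connections, a quick sanity check from Python:
import socket

# The docker run command above maps the server to 127.0.0.1:9000.
socket.create_connection(("127.0.0.1", 9000), timeout=5).close()
print("GraphFS server is up")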
- Python 3.12.3 (on Windows, the Python executable is py.exe rather than python.exe)
- pip 24.0 (on Windows, see https://www.geeksforgeeks.org/how-to-install-pip-on-windows/)
Python virtual environment:
python3 -m venv graphfs-env
source graphfs-env/bin/activate
Install libmagic (on macOS, via Homebrew):
brew install libmagic
Run the server locally:
uvicorn main:app --host 0.0.0.0 --port 9000 --reload
Build and run as a Docker image:
docker build -t graphfs .
docker run -it --rm --name graphfssrv -p 127.0.0.1:9000:9000 graphfs
Neo4j Indexes
CREATE INDEX FOR (fn:FileNode) ON (fn.sha256)
CREATE INDEX FOR (fn:FileNode) ON (fn.size)
CREATE INDEX FOR (fn:FileNode) ON (fn.mime)
CREATE INDEX FOR (c:Container) ON (c.sha256)
CREATE INDEX FOR (c:Container) ON (c.size)
CREATE INDEX FOR (f:Regular) ON (f.name)
CREATE INDEX FOR (d:Directory) ON (d.name)
CREATE INDEX FOR ()-[s:STORED_IN]-() ON (s.idx)
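A minimal sketch of applying such indexes with the Neo4j Python driver (the Bolt URL and credentials are assumptions mirroring the sample config above):
from neo4j import GraphDatabase

# A few of the statements above; IF NOT EXISTS makes the script re-runnable.
INDEXES = [
    "CREATE INDEX IF NOT EXISTS FOR (fn:FileNode) ON (fn.sha256)",
    "CREATE INDEX IF NOT EXISTS FOR (c:Container) ON (c.sha256)",
    "CREATE INDEX IF NOT EXISTS FOR ()-[s:STORED_IN]-() ON (s.idx)",
]

driver = GraphDatabase.driver("bolt://<ip_addr>:7687", auth=("neo4j", "binstore"))
with driver.session() as session:
    for statement in INDEXES:
        session.run(statement)
driver.close()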
Useful Cypher Statements
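Check memory usage; one option is Neo4j's built-in JMX procedure:
CALL dbms.queryJmx("java.lang:type=Memory") YIELD name, attributes RETURN name, attributes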
Percentage of Containers similar to other Containers:
MATCH (c:Container) WITH COUNT(c) AS total MATCH (c1:Container)-[r:SIMILAR_TO]->(c2:Container) WITH COUNT(r) AS similar, total RETURN similar, total, round(toFloat(similar)/toFloat(total), 2) AS similar_percent
Reverse a SIMILAR_TO relationship:
MATCH (c1:Container {sha256:"3121dde47289a0b742e4f3e0e28d95e6cc417cc5ff929cea2483447264c68c37"})-[r:SIMILAR_TO]->(c2:Container {sha256:"7c45a86584f192733f2dd8f7c99f3c9d9e127f2643ef71c0ea687ef38d9a63fe"})
MERGE (c2)-[rp:SIMILAR_TO]->(c1) SET rp.delta=r.delta, rp.ctime=r.ctime DELETE r
Find all SIMILAR_TO chains:
MATCH (c1:Container)-[:SIMILAR_TO*]->(c2:Container) RETURN c1, c2
List files with the specified MIME type:
MATCH (f:Regular)-[:REFERENCES]-(fn:FileNode {mime:"text/x-Algol68"}) WITH f
MATCH p=shortestPath((r:Root)-[:HARD_LINK*]-(f:Regular)) RETURN [n in nodes(p) | n.name] AS path ORDER BY path
List the number of files of each MIME type:
MATCH (fn:FileNode) WHERE fn.mime IS NOT NULL
RETURN fn.mime AS mime, COUNT(fn) AS count ORDER BY count DESC
Similarity Progress Stats:
MATCH (c:Container) WITH COUNT(c) AS Total
MATCH (c:Container) WHERE c.simsearch IS NOT NULL OR (c)-[:SIMILAR_TO]->(:Container) WITH Total, COUNT(c) AS SimSearched
MATCH (c:Container)-[:SIMILAR_TO]-(:Container) WITH Total, SimSearched, COUNT(DISTINCT c) AS Similar
RETURN Total, SimSearched, round(100.0*SimSearched/Total,2) AS Progress, Similar, round(100.0*Similar/Total,2) AS Similarity
Find Files that share Containers with the given File:
MATCH (f:Regular)-[:REFERENCES]->(fn:FileNode)-[s:STORED_IN]->(c:Container) WHERE elementId(f)="4:e7e7f16b-f67f-4cbe-900f-1aec09af7472:1352649" RETURN fn.sha256, COUNT(c)
MATCH (fn:FileNode {sha256:"b8c56bbfe8ac994db94f5702b48beb9ca64f9d003785388f5dd23fc05d81c932"})-[s:STORED_IN]->(c:Container) WHERE s.idx IN range(0,255) WITH fn, c AS containers, s.idx AS i ORDER BY s.idx UNWIND containers AS c MATCH (c)<-[:STORED_IN]-(t:FileNode) WHERE fn<>t WITH i, c, COUNT(t) AS t WHERE t > 1 RETURN i, c.sha256, t
List N FileNodes that haven't been searched for similarity yet:
MATCH (fn:FileNode) WHERE fn.simsearch IS NULL AND NOT (fn)-[:SIMILAR_TO]-(:FileNode)
RETURN fn LIMIT 10
Determine if all Containers of a given FileNode have been processed for similarity:
MATCH (fn:FileNode {sha256:"b8c56bbfe8ac994db94f5702b48beb9ca64f9d003785388f5dd23fc05d81c932"})-[:STORED_IN]->(c:Container) WITH fn, COUNT(c) AS total MATCH (fn)-[:STORED_IN]-(c:Container) WHERE c.simsearch IS NOT NULL OR (c)-[:SIMILAR_TO]-(:Container) RETURN fn.sha256, total, COUNT(c) AS processed
Example:
╒══════════════════════════════════════════════════════════════════╤═════╤═════════╕
│fn.sha256 │total│processed│
╞══════════════════════════════════════════════════════════════════╪═════╪═════════╡
│"b8c56bbfe8ac994db94f5702b48beb9ca64f9d003785388f5dd23fc05d81c932"│6076 │1948 │
└──────────────────────────────────────────────────────────────────┴─────┴─────────┘
Select the first $n FileNodes that haven't been searched for similarity yet but have all their Containers searched for similarity:
MATCH (fn:FileNode) WHERE fn.simsearch IS NULL AND NOT (fn)-[:SIMILAR_TO]-(:FileNode)
MATCH (fn)-[:STORED_IN]->(c:Container) WHERE c.size=1024 WITH fn, COUNT(c) AS total
MATCH (fn)-[:STORED_IN]-(c:Container) WHERE c.simsearch IS NOT NULL OR (c)-[:SIMILAR_TO]-(:Container)
WITH fn, total, COUNT(c) AS processed WHERE total=processed
RETURN fn.sha256 AS sha256, total, processed LIMIT $n
For a given FileNode, find all other FileNodes that share Containers with it:
MATCH (fn1:FileNode {sha256:"ff5da9779f55390b4c69847407d74d4436067703a0a8a35865500831044c1b6f"})-[:STORED_IN]->(c:Container)<-[:STORED_IN]-(fn2:FileNode) WHERE fn1<>fn2 RETURN DISTINCT fn2.sha256
Then we can find how many bytes, and therefore what percentage, a FileNode pair has in common:
MATCH (fn1:FileNode {sha256:"ff5da9779f55390b4c69847407d74d4436067703a0a8a35865500831044c1b6f"})-[s1:STORED_IN]->(c:Container)<-[:STORED_IN]-(fn2:FileNode {sha256:"a8b24184cb44357671c072be7d64bd260d6e2f683665b4d45417664e44aee724"})
WITH DISTINCT fn1, fn2, s1.idx AS idx, c.sha256 AS csha, c.size AS size ORDER BY idx
RETURN fn1.size, SUM(size), fn2.size
Recommended Neo4j memory config:
neo4j-admin server memory-recommendation
Then apply the recommended values in neo4j.conf, for example:
server.memory.heap.initial_size=3g
server.memory.heap.max_size=3g
server.memory.pagecache.size=1g
pip3 install yt-dlp
Check the downloadable file formats:
yt-dlp -F https://youtu.be/iM3kjbbKHQU?si=eyd-etzPFMe2uCkA
Download the selected format:
yt-dlp -f 399 https://youtu.be/iM3kjbbKHQU?si=eyd-etzPFMe2uCkA
Split the downloaded MP4 into frames (images). See https://youtu.be/GrLQQVL4aKE?si=aPa3b8H4S2NrAsTV
ffmpeg -i Modern\ Graphical\ User\ Interfaces\ in\ Python\ \[iM3kjbbKHQU\].mp4 -filter:v fps=1 frames/%06d.png
The -filter:v fps=1 option sets how many frames per second to capture; fps=0.5, for example, captures one frame every two seconds.
Convert PNGs to BMPs:
from PIL import Image
import os

# Convert every PNG in the current directory to a BMP alongside it.
for png in os.listdir('.'):
    if not png.endswith('.png'):
        continue
    bmp = f"{os.path.splitext(png)[0]}.bmp"
    Image.open(png).save(bmp)
Unpack each archive in place, replacing the archive with a directory of the same name:
for f in *.gz; do (mkdir tmp && cd tmp && tar xvfz ../$f && cd .. && rm $f && mv tmp $f); done
for f in *.zip; do (mkdir tmp && cd tmp && unzip ../$f && cd .. && rm $f && mv tmp $f); done
List the directories that contain Accumulo tarballs:
find . -maxdepth 2 -type f -name "accumulo-*.tar.gz" | xargs dirname | sort -u
Flatten a Git repository into demo data (run from the ~/git/graphfs/demo directory):
nohup python3 -u flatten-git-repo.py -d ../demo-data -r https://github.com/hub4j/github-api -m 1000 > /mnt/volumes/graphfs/log/flatten-git-repo.log 2>&1 &
Containerize:
nohup python3 -u binstore/src/graphfs/containerizer.py > /mnt/volumes/graphfs/containerizer.log 2>&1 &
Scrub: