Imagine you drop all your files into a single "bucket," and this file store understands:
- How to best store all your files
- The content of the files
- How to find similar and related files
- How to establish data lineage
And much, much more. GraphFS leverages the power of graph and vector databases to accomplish just that.
Install Docker on your platform.
Create the data directory:
mkdir ./volumes/graphfs
This is where GraphFS stores its data internally, so make sure the volume has enough free space.
Create the config file by cloning etc/config.yml.sample
and then customize based on your situation:
environments:
  DEV:
    BINSTORE:
      path: /mnt/data
    NEO4J:
      password: binstore
      username: neo4j
    local_url: <ip_addr>
    milvus_url: <ip_addr>
    graphfs_host: localhost
    graphfs_port: 9000
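A minimal sketch of reading this config from Python (the GRAPHFS_ENV variable name and loader shape are illustrative assumptions, not GraphFS's actual loader):
import os
import yaml  # pip install pyyaml

with open("etc/config.yml") as f:
    config = yaml.safe_load(f)

# Select the active environment block; GRAPHFS_ENV is a hypothetical variable name.
env = config["environments"][os.environ.get("GRAPHFS_ENV", "DEV")]
print(env["BINSTORE"]["path"], env["NEO4J"]["username"])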
Build the GraphFS image from the Dockerfile:
docker build -t graphfs .
Run the GraphFS server in a container. The server runs on port 9000:
docker run -it --rm --name graphfssrv -p 127.0.0.1:9000:9000 -v ./volumes/graphfs:/mnt/data graphfs
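To confirm the server is accepting connections, a quick sanity check from Python:
import socket

# The docker run command above maps the server to 127.0.0.1:9000.
socket.create_connection(("127.0.0.1", 9000), timeout=5).close()
print("GraphFS server is up")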
- Python 3.12.3 (on Windows, the Python executable is py.exe rather than python.exe)
- pip 24.0 (on Windows, see https://www.geeksforgeeks.org/how-to-install-pip-on-windows/)
Python virtual environment:
python3 -m venv graphfs-env
source graphfs-env/bin/activate
Install libmagic (on macOS, via Homebrew):
brew install libmagic
Run the server locally:
uvicorn main:app --host 0.0.0.0 --port 9000 --reload
Build and run as a Docker image:
docker build -t graphfs .
docker run -it --rm --name graphfssrv -p 127.0.0.1:9000:9000 graphfs
Neo4j Indexes
CREATE INDEX FOR (fn:FileNode) ON (fn.sha256)
CREATE INDEX FOR (fn:FileNode) ON (fn.size)
CREATE INDEX FOR (fn:FileNode) ON (fn.mime)
CREATE INDEX FOR (c:Container) ON (c.sha256)
CREATE INDEX FOR (c:Container) ON (c.size)
CREATE INDEX FOR (f:Regular) ON (f.name)
CREATE INDEX FOR (d:Directory) ON (d.name)
CREATE INDEX FOR ()-[s:STORED_IN]-() ON (s.idx)
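A minimal sketch of applying such indexes with the Neo4j Python driver (the Bolt URL and credentials are assumptions mirroring the sample config above):
from neo4j import GraphDatabase

# A few of the statements above; IF NOT EXISTS makes the script re-runnable.
INDEXES = [
    "CREATE INDEX IF NOT EXISTS FOR (fn:FileNode) ON (fn.sha256)",
    "CREATE INDEX IF NOT EXISTS FOR (c:Container) ON (c.sha256)",
    "CREATE INDEX IF NOT EXISTS FOR ()-[s:STORED_IN]-() ON (s.idx)",
]

driver = GraphDatabase.driver("bolt://<ip_addr>:7687", auth=("neo4j", "binstore"))
with driver.session() as session:
    for statement in INDEXES:
        session.run(statement)
driver.close()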
Useful Cypher Statements
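Check memory usage; one option is Neo4j's built-in JMX procedure:
CALL dbms.queryJmx("java.lang:type=Memory") YIELD name, attributes RETURN name, attributes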
Percentage of Containers similar to other Containers:
MATCH (c:Container) WITH COUNT(c) AS total MATCH (c1:Container)-[r:SIMILAR_TO]->(c2:Container) WITH COUNT(r) AS similar, total RETURN similar, total, round(toFloat(similar)/toFloat(total), 2) AS similar_percent
Reverse a SIMILAR_TO relationship:
MATCH (c1:Container {sha256:"3121dde47289a0b742e4f3e0e28d95e6cc417cc5ff929cea2483447264c68c37"})-[r:SIMILAR_TO]->(c2:Container {sha256:"7c45a86584f192733f2dd8f7c99f3c9d9e127f2643ef71c0ea687ef38d9a63fe"})
MERGE (c2)-[rp:SIMILAR_TO]->(c1) SET rp.delta=r.delta, rp.ctime=r.ctime DELETE r
Find all SIMILAR_TO chains:
MATCH (c1:Container)-[:SIMILAR_TO*]->(c2:Container) RETURN c1, c2
List files with the specified MIME type:
MATCH (f:Regular)-[:REFERENCES]-(fn:FileNode {mime:"text/x-Algol68"}) WITH f
MATCH p=shortestPath((r:Root)-[:HARD_LINK*]-(f:Regular)) RETURN [n in nodes(p) | n.name] AS path ORDER BY path
List the number of files of each MIME type:
MATCH (fn:FileNode) WHERE fn.mime IS NOT NULL
RETURN fn.mime AS mime, COUNT(fn) AS count ORDER BY count DESC
Similarity Progress Stats:
MATCH (c:Container) WITH COUNT(c) AS Total
MATCH (c:Container) WHERE c.simsearch IS NOT NULL OR (c)-[:SIMILAR_TO]->(:Container) WITH Total, COUNT(c) AS SimSearched
MATCH (c:Container)-[:SIMILAR_TO]-(:Container) WITH Total, SimSearched, COUNT(DISTINCT c) AS Similar
RETURN Total, SimSearched, round(100.0*SimSearched/Total,2) AS Progress, Similar, round(100.0*Similar/Total,2) AS Similarity
Find Files that share Containers with the given File:
MATCH (f:Regular)-[:REFERENCES]->(fn:FileNode)-[s:STORED_IN]->(c:Container) WHERE elementId(f)="4:e7e7f16b-f67f-4cbe-900f-1aec09af7472:1352649" RETURN fn.sha256, COUNT(c)
MATCH (fn:FileNode {sha256:"b8c56bbfe8ac994db94f5702b48beb9ca64f9d003785388f5dd23fc05d81c932"})-[s:STORED_IN]->(c:Container) WHERE s.idx IN range(0,255) WITH fn, c AS containers, s.idx AS i ORDER BY s.idx UNWIND containers AS c MATCH (c)<-[:STORED_IN]-(t:FileNode) WHERE fn<>t WITH i, c, COUNT(t) AS t WHERE t > 1 RETURN i, c.sha256, t
List N FileNodes that haven't been searched for similarity yet:
MATCH (fn:FileNode) WHERE fn.simsearch IS NULL AND NOT (fn)-[:SIMILAR_TO]-(:FileNode)
RETURN fn LIMIT 10
Determine if all Containers of a given FileNode have been processed for similarity:
MATCH (fn:FileNode {sha256:"b8c56bbfe8ac994db94f5702b48beb9ca64f9d003785388f5dd23fc05d81c932"})-[:STORED_IN]->(c:Container) WITH fn, COUNT(c) AS total MATCH (fn)-[:STORED_IN]-(c:Container) WHERE c.simsearch IS NOT NULL OR (c)-[:SIMILAR_TO]-(:Container) RETURN fn.sha256, total, COUNT(c) AS processed
Example:
╒══════════════════════════════════════════════════════════════════╤═════╤═════════╕
│fn.sha256 │total│processed│
╞══════════════════════════════════════════════════════════════════╪═════╪═════════╡
│"b8c56bbfe8ac994db94f5702b48beb9ca64f9d003785388f5dd23fc05d81c932"│6076 │1948 │
└──────────────────────────────────────────────────────────────────┴─────┴─────────┘
Select the first $n FileNodes that haven't been searched for similarity yet but have all their Containers searched for similarity:
MATCH (fn:FileNode) WHERE fn.simsearch IS NULL AND NOT (fn)-[:SIMILAR_TO]-(:FileNode)
MATCH (fn)-[:STORED_IN]->(c:Container) WHERE c.size=1024 WITH fn, COUNT(c) AS total
MATCH (fn)-[:STORED_IN]-(c:Container) WHERE c.simsearch IS NOT NULL OR (c)-[:SIMILAR_TO]-(:Container)
WITH fn, total, COUNT(c) AS processed WHERE total=processed
RETURN fn.sha256 AS sha256, total, processed LIMIT $n
For a given FileNode, find all other FileNodes that share Containers with it:
MATCH (fn1:FileNode {sha256:"ff5da9779f55390b4c69847407d74d4436067703a0a8a35865500831044c1b6f"})-[:STORED_IN]->(c:Container)<-[:STORED_IN]-(fn2:FileNode) WHERE fn1<>fn2 RETURN DISTINCT fn2.sha256
Then we can find how many bytes, and therefore what percentage, a FileNode pair has in common:
MATCH (fn1:FileNode {sha256:"ff5da9779f55390b4c69847407d74d4436067703a0a8a35865500831044c1b6f"})-[s1:STORED_IN]->(c:Container)<-[:STORED_IN]-(fn2:FileNode {sha256:"a8b24184cb44357671c072be7d64bd260d6e2f683665b4d45417664e44aee724"})
WITH DISTINCT fn1, fn2, s1.idx AS idx, c.sha256 AS csha, c.size AS size ORDER BY idx
RETURN fn1.size, SUM(size), fn2.size
Recommended Neo4j memory config:
neo4j-admin server memory-recommendation
Then apply the recommended values in neo4j.conf, for example:
server.memory.heap.initial_size=3g
server.memory.heap.max_size=3g
server.memory.pagecache.size=1g
pip3 install yt-dlp
Check the downloadable file formats:
yt-dlp -F https://youtu.be/iM3kjbbKHQU?si=eyd-etzPFMe2uCkA
Download the selected format:
yt-dlp -f 399 https://youtu.be/iM3kjbbKHQU?si=eyd-etzPFMe2uCkA
Split the downloaded MP4 into frames (images). See https://youtu.be/GrLQQVL4aKE?si=aPa3b8H4S2NrAsTV
ffmpeg -i Modern\ Graphical\ User\ Interfaces\ in\ Python\ \[iM3kjbbKHQU\].mp4 -filter:v fps=1 frames/%06d.png
The -filter:v fps=1 option sets how many frames per second to capture; fps=0.5, for example, captures one frame every two seconds.
Convert PNGs to BMPs:
from PIL import Image
import os

# Convert every PNG in the current directory to a BMP alongside it.
for png in os.listdir('.'):
    if not png.endswith('.png'):
        continue
    bmp = f"{os.path.splitext(png)[0]}.bmp"
    Image.open(png).save(bmp)
Unpack each archive in place, replacing the archive with a directory of the same name:
for f in *.gz; do (mkdir tmp && cd tmp && tar xvfz ../$f && cd .. && rm $f && mv tmp $f); done
for f in *.zip; do (mkdir tmp && cd tmp && unzip ../$f && cd .. && rm $f && mv tmp $f); done
List the directories that contain Accumulo tarballs:
find . -maxdepth 2 -type f -name "accumulo-*.tar.gz" | xargs dirname | sort -u
Flatten a Git repository into demo data (run from the ~/git/graphfs/demo directory):
nohup python3 -u flatten-git-repo.py -d ../demo-data -r https://github.com/hub4j/github-api -m 1000 > /mnt/volumes/graphfs/log/flatten-git-repo.log 2>&1 &
Containerize:
nohup python3 -u binstore/src/graphfs/containerizer.py > /mnt/volumes/graphfs/containerizer.log 2>&1 &
Scrub: