Support for .epub text extraction? #106
Comments
That's an excellent idea! Have you used any of the npm packages? Which one(s) would you recommend?
I had a look at the packages, and then realised that it'd make much more sense just to extract the epub (it's a zip archive) and pull the text out of the HTML files directly:

    var AdmZip = require('adm-zip');
    var cheerio = require('cheerio');

    function epubToText(path) {
      let zip = new AdmZip(path);
      let zipEntries = zip.getEntries(); // an array of ZipEntry records
      let output = "";
      // look through all files in zip (archive order, not reading order)
      for(let entry of zipEntries) {
        // get file extension:
        let nameParts = entry.entryName.split(".");
        let lastPart = nameParts[nameParts.length-1];
        // many epubs store their content as .xhtml, so accept that too
        if(lastPart === "html" || lastPart === "htm" || lastPart === "xhtml") {
          // extract text with cheerio
          let $ = cheerio.load( zip.readAsText(entry.entryName) );
          output += $("body").text();
        }
      }
      return output;
    }

There's probably a much more efficient way to do a lot of that, but as a prototype it works fine.
@josephrocca sounds pretty straightforward based on https://en.wikipedia.org/wiki/EPUB: just parse the container.xml, figure out the order of the files, then use cheerio (already a dependency) to process. It's probably better to rely on a module, but if I have time I'll try to whip up something.
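For reference, here is a minimal sketch of the "figure out the order of the files" step. In EPUB, container.xml only points at the OPF package file; the reading order comes from the OPF's spine, whose itemref entries refer to manifest items by id. The sketch assumes the manifest items and spine idrefs have already been pulled out of the OPF (for example with cheerio), and `orderBySpine` and its parameter names are hypothetical, not part of textract or any npm package:

```javascript
// Hypothetical helper (not from textract): order manifest hrefs by the OPF spine.
// manifestItems: [{ id, href }] taken from <manifest><item id=".." href="..">
// spineIdrefs:   idref values from <spine><itemref idref="..">, in document order
function orderBySpine(manifestItems, spineIdrefs) {
  // Build an id -> href lookup from the manifest.
  const hrefById = new Map(manifestItems.map((item) => [item.id, item.href]));
  // Walk the spine in order, skipping idrefs that have no manifest entry.
  return spineIdrefs
    .filter((idref) => hrefById.has(idref))
    .map((idref) => hrefById.get(idref));
}

// Example: the manifest lists chapter 2 before chapter 1, the spine fixes the order.
const manifest = [
  { id: 'css', href: 'style.css' },
  { id: 'ch2', href: 'text/ch2.xhtml' },
  { id: 'ch1', href: 'text/ch1.xhtml' },
];
const spine = ['ch1', 'ch2'];
console.log(orderBySpine(manifest, spine));
```

Because the spine only references content documents, this also drops stylesheets and images without any file-extension check.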
@jsinmotion Whoops! You're right, I completely forgot about the ordering of the files. Here's a mock-up of what it might look like with proper ordering:

    var AdmZip = require('adm-zip');
    var cheerio = require('cheerio');

    function epubToText(path) {
      let zip = new AdmZip(path);

      // get content.opf path and containing folder ("" if the OPF is at the archive root)
      let $ = cheerio.load( zip.readAsText('META-INF/container.xml'), { xmlMode: true } );
      let contentOpfPath = $("container rootfiles rootfile").attr("full-path");
      let contentOpfFolder = contentOpfPath.split("/");
      contentOpfFolder.pop();
      contentOpfFolder = contentOpfFolder.join("/"); // keep the separators for nested folders

      // push html/htm/xhtml files into our array of paths to convert to text
      // (this follows manifest order; strictly, the OPF <spine> defines reading order)
      $ = cheerio.load( zip.readAsText(contentOpfPath), { xmlMode: true } );
      let contentFilePaths = [];
      $("package manifest item").each((i, el) => {
        let href = $(el).attr("href");
        let hrefParts = href.split(".");
        let lastPart = hrefParts[hrefParts.length-1];
        if(lastPart === "html" || lastPart === "htm" || lastPart === "xhtml") {
          // hrefs are relative to the OPF file's folder
          contentFilePaths.push(contentOpfFolder ? contentOpfFolder+"/"+href : href);
        }
      });

      // extract text from each file with cheerio
      let output = "";
      for(let filePath of contentFilePaths) {
        let $content = cheerio.load( zip.readAsText(filePath) );
        output += $content("body").text();
      }
      return output;
    }

And fixed demo:
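One detail the mock-up depends on is turning a manifest href into a zip entry name: hrefs are relative to the folder containing the OPF file, may be percent-encoded, and may carry a fragment. A rough sketch of that step using only Node's built-in `path` module follows; `resolveManifestHref` is a hypothetical name, and it assumes zip entry names use forward slashes:

```javascript
const path = require('path');

// Hypothetical helper (not from textract): map a manifest href to a zip entry name.
// opfPath: the "full-path" value from META-INF/container.xml, e.g. "OEBPS/content.opf"
// href:    the href attribute of a manifest item, relative to the OPF's folder
function resolveManifestHref(opfPath, href) {
  // Drop any "#fragment" suffix; it is not part of the file name.
  const withoutFragment = href.split('#')[0];
  // hrefs are URL-style, so decode escapes such as %20.
  // Note: decodeURIComponent throws a URIError on malformed escapes.
  const decoded = decodeURIComponent(withoutFragment);
  // Join with the OPF's folder and normalise "." and ".." segments.
  return path.posix.join(path.posix.dirname(opfPath), decoded);
}

console.log(resolveManifestHref('OEBPS/content.opf', 'text/ch1.xhtml'));
console.log(resolveManifestHref('content.opf', 'ch1.html'));
```

Using `path.posix` rather than `path` keeps the separator as `/` on Windows too, which matches how entries are named inside the archive.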
Will take a look! No one has clamored for it, and when I looked a year or so ago there weren't a lot of good options available (and I wasn't in the mood to figure it out myself given the lack of interest).
epub support would be really helpful. I think @josephrocca's mockup looks quite good already.
Sorry, this totally fell off my radar. Happy to take a PR! But I may be able to dig in in the next few weeks.
I'd have thought the .epub format would be popular enough to make it into this lib. Is there any reason why it's missing? There are already a few good npm packages for parsing epub files, so it wouldn't be hard to integrate it into textract. Great lib in any case - thanks for releasing it 👍