option --media-only for the cli
sokomishalov committed May 15, 2020
1 parent ce6b6b8 commit 4e36776
Showing 63 changed files with 1,938 additions and 460 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -17,7 +17,8 @@ target/
.sts4-cache

### IntelliJ IDEA ###
.idea
.idea/*
!/.idea/runConfigurations
*.iws
*.iml
*.ipr
115 changes: 100 additions & 15 deletions README.md
@@ -29,10 +29,14 @@ Current list of implemented sources:
- [Pikabu](https://pikabu.ru)

# Bugs
Unfortunately, each web-site is subject to change without any notice so the tool may work incorrectly because of that.
Unfortunately, each web-site is subject to change without any notice, so the tool may work incorrectly because of that.
If that happens, please let me know via an issue or some message.

# Cli tool
The CLI tool allows you to:
- download media from almost all of the listed sources with the `--media-only` flag
- scrape post meta information

Requirements:
- Java: 1.8 +
- Maven (optional)
@@ -48,32 +52,61 @@ Usage:
```

```text
usage: [-h] PROVIDER PATH [-n LIMIT] [-t TYPE] [-o OUTPUT]
usage: [-h] PROVIDER PATH [-n LIMIT] [-t TYPE] [-o OUTPUT] [-m]
[--parallel-downloads PARALLEL_DOWNLOADS]
optional arguments:
-h, --help show this help message and exit
-h, --help show this help message and exit
-n LIMIT, --limit LIMIT posts limit (50 by default)
-t TYPE, --type TYPE output type, options: [log, csv, json, xml, yaml]
-n LIMIT, posts limit (50 by default)
--limit LIMIT
-o OUTPUT, --output OUTPUT output path
-t TYPE, output type, options: [log, csv, json, xml, yaml]
--type TYPE
-m, --media-only scrape media only
-o OUTPUT, output path
--output OUTPUT
--parallel-downloads PARALLEL_DOWNLOADS amount of parallel downloads for media items if
enabled flag --media-only (4 by default)
positional arguments:
PROVIDER skraper provider, options: [facebook, instagram, twitter, youtube, twitch, reddit,
ninegag, pinterest, flickr, tumblr, ifunny, vk, pikabu]
PROVIDER skraper provider, options: [facebook, instagram,
twitter, youtube, twitch, reddit, ninegag, pinterest,
flickr, tumblr, ifunny, vk, pikabu]
PATH path to user/community/channel/topic/trend
usage: [-h] PROVIDER PATH [-n LIMIT] [-t TYPE] [-o OUTPUT] [-m]
[--parallel-downloads PARALLEL_DOWNLOADS]
optional arguments:
-h, --help show this help message and exit
-n LIMIT, --limit LIMIT posts limit (50 by default)
-t TYPE, --type TYPE output type, options: [log, csv, json, xml, yaml]
-o OUTPUT, --output OUTPUT output path
-m, --media-only scrape media only
PATH path to user/community/channel/topic/trend
--parallel-downloads PARALLEL_DOWNLOADS amount of parallel downloads for media items if
enabled flag --media-only (4 by default)
positional arguments:
PROVIDER skraper provider, options: [facebook, instagram,
twitter, youtube, twitch, reddit, ninegag, pinterest,
flickr, tumblr, ifunny, vk, pikabu]
PATH path to user/community/channel/topic/trend
```

Examples:
```bash
./skraper ninegag /hot
./skraper reddit /r/memes -n 5 -t csv -o ./reddit/posts
./skraper youtube /user/JetBrainsTV/videos --media-only -n 2
```
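
The `--parallel-downloads` option caps how many media items are fetched at once. The commit's own download code is not shown in this excerpt, so the following is only a minimal sketch of the general idea, using a plain JVM fixed thread pool rather than whatever the tool actually uses. `fakeDownload` and `downloadAll` are hypothetical names, and no network access takes place:

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors

// Hypothetical stand-in for fetching one media item; it only derives a file name.
fun fakeDownload(url: String): String = "saved:" + url.substringAfterLast('/')

// Runs every download on a pool capped at `parallelism` threads.
// invokeAll blocks until all tasks finish and keeps the input order.
fun downloadAll(urls: List<String>, parallelism: Int = 4): List<String> {
    val pool = Executors.newFixedThreadPool(parallelism)
    try {
        val tasks = urls.map { url -> Callable { fakeDownload(url) } }
        return pool.invokeAll(tasks).map { it.get() }
    } finally {
        pool.shutdown()
    }
}

fun main() {
    val urls = (1..6).map { "https://example.com/media/${it}.mp4" }
    downloadAll(urls, parallelism = 2).forEach(::println)
}
```

With `parallelism = 2`, at most two of the six items are processed at the same time; the default of 4 mirrors the default stated in the help text.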

# Kotlin Library
@@ -90,7 +123,7 @@ Maven:
<dependency>
<groupId>com.github.sokomishalov.skraper</groupId>
<artifactId>skrapers</artifactId>
<version>0.3.0</version>
<version>0.4.0</version>
</dependency>
</dependencies>
```
@@ -101,7 +134,7 @@ repositories {
maven { url("https://jitpack.io") }
}
dependencies {
implementation("com.github.sokomishalov.skraper:skrapers:0.3.0")
implementation("com.github.sokomishalov.skraper:skrapers:0.4.0")
}
```

@@ -136,8 +169,8 @@ To use them you just have to put required dependencies in the classpath.

Current http-client implementation list:
- [DefaultBlockingClient](skrapers/src/main/kotlin/ru/sokomishalov/skraper/client/jdk/DefaultBlockingSkraperClient.kt) - simple java.net.* blocking api implementation
- [ReactorNettySkraperClient](skrapers/src/main/kotlin/ru/sokomishalov/skraper/client/reactornetty/ReactorNettySkraperClient.kt) - [reactor-netty](https://mvnrepository.com/artifact/io.projectreactor.netty/reactor-netty) implementation
- [OkHttp3SkraperClient](skrapers/src/main/kotlin/ru/sokomishalov/skraper/client/okhttp3/OkHttp3SkraperClient.kt) - [okhttp3](https://mvnrepository.com/artifact/com.squareup.okhttp3/okhttp) implementation
- [ReactorNettySkraperClient](skrapers/src/main/kotlin/ru/sokomishalov/skraper/client/reactornetty/ReactorNettySkraperClient.kt) - [reactor-netty](https://mvnrepository.com/artifact/io.projectreactor.netty/reactor-netty) implementation
- [SpringReactiveSkraperClient](skrapers/src/main/kotlin/ru/sokomishalov/skraper/client/spring/SpringReactiveSkraperClient.kt) - [spring-webflux client](https://mvnrepository.com/artifact/org.springframework/spring-webflux) implementation
- [KtorSkraperClient](skrapers/src/main/kotlin/ru/sokomishalov/skraper/client/ktor/KtorSkraperClient.kt) - [ktor-client-jvm](https://mvnrepository.com/artifact/io.ktor/ktor-client-core-jvm) implementation

@@ -150,6 +183,7 @@ interface Skraper {
suspend fun getProviderInfo(): ProviderInfo?
suspend fun getPageInfo(path: String): PageInfo?
suspend fun getPosts(path: String, limit: Int = DEFAULT_POSTS_LIMIT): List<Post>
suspend fun resolve(media: Media): Media
}
```

@@ -238,6 +272,57 @@ Output:
}
```

### Resolve provider relative url
Sometimes you need to know the direct media link:
```kotlin
fun main() = runBlocking {
val skraper = InstagramSkraper()
val info = skraper.resolve(Video(url = "https://www.instagram.com/p/B-flad2F5o7/"))
println(JsonMapper().writerWithDefaultPrettyPrinter().writeValueAsString(info))
}
```

Output:
```json5
{
"url" : "https://scontent-amt2-1.cdninstagram.com/v/t50.2886-16/91508191_213297693225472_2759719910220905597_n.mp4?_nc_ht=scontent-amt2-1.cdninstagram.com&_nc_cat=104&_nc_ohc=27bC52qar_oAX-7J2Zh&oe=5EC0BC52&oh=0aafee2860c540452b76e7b8e336147d",
"aspectRatio" : 0.8010012515644556,
"thumbnail" : {
"url" : "https://scontent-amt2-1.cdninstagram.com/v/t51.2885-15/e35/91435498_533808773845524_5302421141680378393_n.jpg?_nc_ht=scontent-amt2-1.cdninstagram.com&_nc_cat=100&_nc_ohc=8gPAcByc6YAAX_kDBWm&oh=5edf6b9d90d606f9c0e055b7dbcbfa45&oe=5EC0DDE8",
"aspectRatio" : 0.8010012515644556
}
}
```

### Download media
There is a "static" method that allows you to download any media from all known implemented sources:
```kotlin
fun main() = runBlocking {
val tmpDir = Files.createTempDirectory("skraper").toFile()

val testVideo = Skraper.download(
media = Video("https://youtu.be/fjUO7xaUHJQ"),
destDir = tmpDir,
filename = "Gandalf"
)

val testImage = Skraper.download(
media = Image("https://www.pinterest.ru/pin/89509111320495523/"),
destDir = tmpDir,
filename = "Do_no_harm"
)

println(testVideo)
println(testImage)
}
```

Output:
```log
/var/folders/sf/hm2h5chx5fl4f70bj77xccsc0000gp/T/skraper8377953374796527777/Gandalf.mp4
/var/folders/sf/hm2h5chx5fl4f70bj77xccsc0000gp/T/skraper8377953374796527777/Do_no_harm.jpg
```

### Scrape provider logo
It is also possible to scrape provider info for some purposes:

6 changes: 3 additions & 3 deletions TODO.md
@@ -1,9 +1,9 @@
### TODO list
- [ ] Thumbnails to the videos
- [ ] Add option to the cli tool to download media
### Roadmap
- [ ] Telegram bot
- [ ] Client gui
- [ ] Replace java.time.* from jdk 1.8 to lower jdk date-time api
to be more android-friendly.
- [ ] Implement [LinkedIn](https://linkedin.com) - branch (origin/feature/linkedin)
- [ ] Implement [Snapchat stories](https://story.snapchat.com/) - branch (origin/feature/snapchat)
- [ ] Implement [Imgur](https://imgur.com/)
- [ ] Implement [Tiktok](https://tiktok.com) - branch (origin/feature/tiktok)
23 changes: 9 additions & 14 deletions cli/pom.xml
@@ -7,11 +7,11 @@
<parent>
<groupId>ru.sokomishalov.skraper</groupId>
<artifactId>skraper-parent</artifactId>
<version>0.3.0</version>
<version>${revision}</version>
</parent>

<artifactId>cli</artifactId>
<version>0.3.0</version>
<version>${revision}</version>

<properties>
<maven.skip.deploy>true</maven.skip.deploy>
@@ -21,7 +21,7 @@
<dependency>
<groupId>ru.sokomishalov.skraper</groupId>
<artifactId>skrapers</artifactId>
<version>0.3.0</version>
<version>${revision}</version>
</dependency>
<dependency>
<groupId>com.xenomachina</groupId>
@@ -33,6 +33,11 @@
<artifactId>kolor</artifactId>
<version>1.0.0</version>
</dependency>
<dependency>
<groupId>io.ktor</groupId>
<artifactId>ktor-client-cio</artifactId>
<version>${ktor.version}</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.module</groupId>
<artifactId>jackson-module-kotlin</artifactId>
@@ -53,16 +58,6 @@
<artifactId>jackson-dataformat-yaml</artifactId>
<version>${jackson.version}</version>
</dependency>
<dependency>
<groupId>io.ktor</groupId>
<artifactId>ktor-client-core-jvm</artifactId>
<version>${ktor.version}</version>
</dependency>
<dependency>
<groupId>io.ktor</groupId>
<artifactId>ktor-client-cio</artifactId>
<version>${ktor.version}</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.datatype</groupId>
<artifactId>jackson-datatype-jdk8</artifactId>
@@ -84,7 +79,7 @@
<artifactId>kotlin-maven-plugin</artifactId>
<version>${kotlin.version}</version>
<configuration>
<jvmTarget>1.8</jvmTarget>
<jvmTarget>${java.version}</jvmTarget>
</configuration>
<executions>
<execution>
38 changes: 31 additions & 7 deletions cli/src/main/kotlin/ru/sokomishalov/skraper/cli/Args.kt
@@ -17,14 +17,28 @@ package ru.sokomishalov.skraper.cli

import com.xenomachina.argparser.ArgParser
import com.xenomachina.argparser.default
import ru.sokomishalov.skraper.cli.OutputType.LOG
import ru.sokomishalov.skraper.Skraper
import ru.sokomishalov.skraper.cli.Serialization.LOG
import ru.sokomishalov.skraper.client.ktor.KtorSkraperClient
import ru.sokomishalov.skraper.knownList
import ru.sokomishalov.skraper.name
import java.io.File

class Args(parser: ArgParser) {
val provider by parser.positional(

companion object {
internal val DEFAULT_CLIENT = KtorSkraperClient()
internal val SKRAPERS = Skraper.knownList(client = DEFAULT_CLIENT)
}

val skraper by parser.positional(
name = "PROVIDER",
help = "skraper provider, options: ${Provider.values().contentToString().toLowerCase()}"
) { Provider.valueOf(toUpperCase()) }
help = "skraper provider, options: ${SKRAPERS.joinToString { it.name }}"
) {
SKRAPERS
.find { this == it.name }
.let { requireNotNull(it) { "Unknown provider" } }
}

val path by parser.positional(
name = "PATH",
@@ -34,15 +48,25 @@ class Args(parser: ArgParser) {
val amount by parser.storing(
"-n", "--limit",
help = "posts limit (50 by default)"
) { toInt() }.default(50)
) { toInt() }.default { 50 }

val outputType by parser.storing(
"-t", "--type",
help = "output type, options: ${OutputType.values().contentToString().toLowerCase()}"
) { OutputType.valueOf(toUpperCase()) }.default(LOG)
help = "output type, options: ${Serialization.values().joinToString().toLowerCase()}"
) { Serialization.valueOf(toUpperCase()) }.default { LOG }

val output by parser.storing(
"-o", "--output",
help = "output path"
) { File(this) }.default { File("") }

val onlyMedia by parser.flagging(
"-m", "--media-only",
help = "scrape media only"
)

val parallelDownloads by parser.storing(
"--parallel-downloads",
help = "amount of parallel downloads for media items if enabled flag --media-only (4 by default)"
) { toInt() }.default { 4 }
}
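
The new `PROVIDER` transform looks a provider up by name with `find` and fails through `requireNotNull`. The sketch below isolates that lookup shape without the argument parser or the real `Skraper` types. `NamedProvider`, `providers` and `resolveProvider` are hypothetical names, and the sample list is just a few of the providers named in the help text:

```kotlin
// Hypothetical stand-in for a provider; the real SKRAPERS list holds Skraper instances.
data class NamedProvider(val name: String)

// Assumed sample of provider names, copied from the CLI help text.
val providers = listOf("facebook", "instagram", "reddit", "ninegag").map(::NamedProvider)

// Same find + requireNotNull shape as the positional argument transform.
fun resolveProvider(arg: String): NamedProvider =
    requireNotNull(providers.find { arg == it.name }) { "Unknown provider: $arg" }

fun main() {
    println(resolveProvider("reddit").name)
    // A name that does not match exactly fails with IllegalArgumentException.
    println(runCatching { resolveProvider("Reddit") }.exceptionOrNull()?.message)
}
```

Because the comparison is an exact string match, a lookup of this shape is case-sensitive, whereas the removed `Provider.valueOf(toUpperCase())` accepted any casing. If the real provider names are lowercase, as the help text suggests, `Reddit` would be rejected where it was previously accepted.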