
Obtaining congestion window measurements #5

Open
03k64 opened this issue Feb 7, 2019 · 5 comments


03k64 commented Feb 7, 2019

I am currently using CCP as part of a research project, but I am encountering problems in logging congestion control information correctly: no logs are produced beyond roughly 30 seconds after starting the userspace agent. I am using the reno binary generated from this project; details of the modifications I made to support obtaining measurements are given below.

In impl GenericCongAvoidAlg for Reno in src/reno.rs, a reference to the DatapathInfo struct is passed and additional information about the flow is stored to be logged out later:

fn new_flow(&self, logger: Option<slog::Logger>, info: &DatapathInfo) -> Self::Flow {
    // DatapathInfo carries the addresses as network-order u32s; byte-swap
    // them into the order Ipv4Addr::from(u32) expects before formatting.
    let src_ip = Ipv4Addr::from(info.src_ip.to_be());
    let dst_ip = Ipv4Addr::from(info.dst_ip.to_be());

    Reno {
        logger,
        mss: info.mss,
        init_cwnd: f64::from(info.init_cwnd),
        cwnd: f64::from(info.init_cwnd),
        src_ip: Some(src_ip.to_string()),
        src_port: Some(info.src_port),
        dst_ip: Some(dst_ip.to_string()),
        dst_port: Some(info.dst_port),
    }
}

In impl GenericCongAvoidFlow for Reno in src/reno.rs the following log statement is added in each of the implemented functions in order to provide information about the congestion state of the flow:

if let Some(log) = self.logger.as_ref() {
    info!(log, "curr_cwnd()";
        "curr_cwnd (bytes)" => self.cwnd,
        "mss (bytes)" => self.mss,
        "src_ip" => self.src_ip.as_ref(),
        "src_port" => self.src_port,
        "dst_ip" => self.dst_ip.as_ref(),
        "dst_port" => self.dst_port,
    );
}

In addition, the trait declaration for GenericCongAvoidAlg in src/lib.rs is modified to mirror the implementation above:

fn new_flow(&self, logger: Option<slog::Logger>, info: &DatapathInfo) -> Self::Flow;

Some minor modifications were made to src/cubic.rs in order to comply with the modified trait declaration, but these compile successfully and the cubic binary has not been used.

Initially I used the default logging setup, which is initialised in src/bin/reno.rs as:

let log = portus::algs::make_logger();

I redirected the standard out and standard error streams to a file and observed the problem with logs stopping after around 30 seconds. I then added the following to src/bin_helper.rs to instead log directly to a file. The chan_size of 2048 was arrived at through trial and error; with this figure, no messages advising of message loss appear in the resulting file:

pub fn make_logger() -> slog::Logger {
    let now = std::time::SystemTime::now();
    let ts = now.duration_since(std::time::SystemTime::UNIX_EPOCH)
        .expect("Time went backwards");
    let ms_ts = ts.as_millis();

    let log_path = format!("/usr/src/output/server/cwnd/{}.txt", ms_ts);
    
    let file = std::fs::OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(true)
        .open(log_path)
        .unwrap();

    let decorator = slog_term::PlainDecorator::new(file);
    let drain = slog_term::FullFormat::new(decorator).build().fuse();
    let drain = slog_async::Async::new(drain).chan_size(2048).build().fuse();
    slog::Logger::root(drain, o!())
}

So far as I can tell, the main userspace agent is running correctly, since the data flows measured at the client side show an approximately equal share of bandwidth being assigned to each, but for some reason the log output stops after around 30 seconds. In an attempt to eliminate the slog library as a cause, I replaced the above statements with simple println!(...) and eprintln!(...) lines, but the same behaviour was observed.

Given this doesn't seem to be a problem with the logging method, is there some behaviour (either here or in portus) that causes the four functions in impl GenericCongAvoidFlow for Reno to no longer be called after a certain point in time?

Additionally, given the example script in the eval-scripts repository uses tcpprobe to measure the congestion window, is there a reason you chose not to instrument the userspace code to obtain this measurement?

I've previously been using the tcp_probe tracepoint to obtain measurements of the congestion window for the kernel datapath using the in-kernel implementations of Reno & Cubic. I'd like to include the QUIC datapath as part of my research and I intended to use the instrumentation detailed above as a single solution for obtaining necessary measurements.

In summary:

  1. Is the absence of further calls to the four functions in the impl GenericCongAvoidFlow for Reno block in src/reno.rs deliberate?
  2. Is there a reason why measurements of the congestion window should not be obtained here?
  3. If so, could you provide details of how you obtained the congestion window measurements for QUIC in Figure 8 of this paper?
  4. If not, would you be open to collaborating on a way to obtain these measurements in the userspace code?
fcangialosi (Member) commented Feb 7, 2019

Hm, that's very strange; there's definitely no reason the algorithm should stop receiving callbacks after any amount of time. The only reason you should stop receiving callbacks is if the traffic stops.
Could you give us some more details about the experiment you're running?

(We've also used slog as-is in the repo for all of our experiments and haven't run into any issues with it, so I don't think you should need to change anything there. If slog is unable to keep up, it will print a warning message saying so.)

The script you linked to runs a comparison between algorithms on CCP and their in-kernel implementations. We only use tcpprobe in that script so that the measurements come from the same place in the non-CCP case; it's not a limitation at all. The CCP algorithm is the one setting the congestion window itself, so the window doesn't need to be "measured" per se: it can be printed directly by the algorithm (e.g. reno.rs). To produce Figure 8 we manually modified QUIC to write the cwnd to a file once per RTT, but again this was just for consistency when comparing against experiments without CCP. In either case, now that we are confident CCP works, we could just print it from within the algorithm.
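
For example (the struct and method names below are purely illustrative, not the actual generic-cong-avoid API), an algorithm that sets the window itself can simply emit the value at the moment it changes, with no external measurement:

```rust
// Hypothetical flow state, standing in for a CCP algorithm's per-flow struct.
struct Flow {
    cwnd: f64, // congestion window, bytes
    mss: u32,  // maximum segment size, bytes
}

impl Flow {
    // Reno-style congestion avoidance: grow by mss * (acked / cwnd) per
    // ACK, i.e. roughly one MSS per RTT.
    fn on_ack(&mut self, acked_bytes: u32) {
        self.cwnd += f64::from(self.mss) * f64::from(acked_bytes) / self.cwnd;
        // Print the window every time it is updated; redirecting stdout
        // yields a cwnd trace without tcpprobe.
        println!("cwnd_bytes={:.0} mss_bytes={}", self.cwnd, self.mss);
    }
}

fn main() {
    let mut flow = Flow { cwnd: 14600.0, mss: 1460 };
    flow.on_ack(1460); // one full segment acked
    assert_eq!(flow.cwnd, 14746.0); // grew by one tenth of an MSS
}
```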

Technically there are a few places in the kernel (or more generally any datapath) where the datapath updates snd_cwnd on its own, outside of CCP's control (e.g. here), but CCP overwrites it, so that shouldn't really matter.

03k64 (Author) commented Feb 9, 2019

Sure, my experiments are comparing streaming adaptation mechanisms with respect to various congestion control algorithms.

I did see some warning messages from slog about dropped messages; I gradually increased the .chan_size(...) in the declaration until they stopped. I'm reasonably confident this problem isn't due to slog, since I observed the same behaviour when using only println!(...) and eprintln!(...) statements.

That makes sense; it's good to know I should be able to simply print the congestion window from the algorithm. I'll continue to investigate why I'm unable to see the expected output. I'm reasonably sure CCP doesn't unexpectedly terminate, since I monitor the pid of the process and can see from logging on the client side that data continues to be received.

Interestingly, the bitrate received by the client ceases to show any variation from around the same point in the experiment that logging stops; it may be that CCP no longer modifies the congestion window after this point. I will re-run my experiments using an unmodified version of this library and re-examine the client logs to see if the same behaviour occurs.

03k64 (Author) commented Feb 10, 2019

Having re-examined the client logs generated when re-running my experiments with this library at commit 6f23aa, I see the same issue: a lack of variance in the congestion window from very early in the tests. I chose that commit because it is pinned as a submodule in the eval-scripts repository, so I presume it is a working version.

Looking at portus, I wonder whether this issue may also be the cause of my problem.

03k64 (Author) commented Feb 12, 2019

Another update: I believe the lack of variance in requested bitrate was caused by a problem with my experimental setup. My only remaining issue is the missing logs, which I believe may be related to the issue I linked to in my previous comment.

Let me know if there is anything else I can do to help identify the root cause; I'd be happy to contribute to a solution.

akshayknarayan (Member) commented
Hi @darArch,
Could you provide the contents of /var/log/syslog during an event like this, where the CWND stops being updated?
