Made a few minor tidyups to the TCPPing test. The main change was to pad IPv4 SYNs with 20 bytes of TCP NOOP options to ensure IPv4 and IPv6 tests to the same target will have the same packet size. Otherwise this could get confusing for users when they choose a packet size on the graph modal and find that they can't see IPv6 (or IPv4) results.
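The size arithmetic behind the padding can be sketched as below. This is only an illustration in Python, not the AMP test code: the function name `build_syn_header`, the window value and the zeroed checksum are assumptions made for the example. It assumes an IPv4 header without IP options (20 bytes) and an IPv6 header without extension headers (40 bytes).

```python
import struct

TCP_OPT_NOP = 1      # single-byte "no operation" TCP option kind
IPV4_HEADER_LEN = 20  # assumes no IP options
IPV6_HEADER_LEN = 40  # assumes no extension headers


def build_syn_header(src_port, dst_port, seq, pad_options=20):
    """Build a TCP SYN header followed by `pad_options` NOP option bytes.

    Hypothetical helper: the checksum is left as zero and the window size
    is arbitrary, so this is not a complete, sendable segment.
    """
    options = bytes([TCP_OPT_NOP]) * pad_options
    if len(options) % 4:
        raise ValueError("options must keep the header a multiple of 4 bytes")
    data_offset = (20 + len(options)) // 4  # header length in 32-bit words
    if data_offset > 15:
        raise ValueError("TCP header cannot exceed 60 bytes")
    offset_and_flags = (data_offset << 12) | 0x002  # 0x002 is the SYN flag
    header = struct.pack("!HHIIHHHH", src_port, dst_port, seq, 0,
                         offset_and_flags, 65535, 0, 0)
    return header + options


# IPv4 SYN padded with 20 NOPs versus an unpadded IPv6 SYN
ipv4_size = IPV4_HEADER_LEN + len(build_syn_header(40000, 80, 1, pad_options=20))
ipv6_size = IPV6_HEADER_LEN + len(build_syn_header(40000, 80, 1, pad_options=0))
```

With these assumptions both `ipv4_size` and `ipv6_size` come to 60 bytes, which is why the same packet size selection matches results from both address families.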
Now that we have three AMP tests that measure latency, we decided that it would be best if all of the latency tests could be viewed on the same graph, rather than there being a separate graph for each of DNS, ICMP and TCPPing. This required a fair amount of re-architecting of ampy to support views that span multiple collections -- we now have an 'amp-latency' view that can contain groups from any of the 'amp-dns', 'amp-icmp' and 'amp-tcpping' collections.
Added support for the amp-latency view to the website. The most time-consuming changes were re-designing the modal dialog for choosing which test results to add to an amp-latency graph, as now it needed to support all three latency collections (which all have quite different test options) on the same dialog. It gets quite complicated when you consider that we won't necessarily run all three tests to every target, e.g. no point in running a DNS test to www.wand.net.nz as it isn't a DNS server, so the dialog must ensure that all valid selections and no invalid selections are presented to the user. As a result, there's a lot of hiding and showing of modal components required based on what option the user has just changed.
Managed to get amp-latency views working on the website for the existing amp-icmp and amp-dns collections, but it should be a straightforward task to add amp-tcpping as well.
I haven't made much progress this week. I remembered that before I could get going with writing a CoAP application, I have to actually get the LoWPAN integrated with a proper network (rather than just a PTP link between the 6lbr and some computer), which I had tried earlier and I was confused at the time as to why it didn't just work. I've been trying to figure it out but to no avail.
Since I've just been piggybacking on the redcables network thus far, I think my next step will probably be to take the equipment back home where I have root access to the router and more control over the network configuration.
In happier although not entirely useful news, I did find a nice 802.15.4 packet sniffer implementation for Contiki which specifically works with the stm32w radio (which might have been really helpful a couple of weeks ago). It outputs 802.15.4 frames in pcap format, which can be piped into Wireshark to view a live capture.
The Doubletree simulation using the same data files as the event based simulator from Tony McGregor took 2 weeks to run. It turned out that the stored value for number of probes sent was incorrectly recorded as zero in all cases, so this value was determined using the hop information recorded in the traces. Four runs were started using this fix, to run four categories of simulations using this data. It should once again be a two week cycle.
The data from the Megatree simulator is now quite good, however snapshots for stepwise increases in source count are still incorrectly calibrated. Further steps have been taken to remedy this and another run to gather this data has been initiated.
The black hole detector has finished another cycle, so further post-collection analysis is required. Steps are being taken to confirm that the transitory black holes found withstand manual scrutiny of the warts dump data.
Tidied up the traceroute stopset code to add addresses in a more
consistent manner regardless of whether an address in the stopset was
found or if the TTL hit 1. This also allowed me to more easily check
that parts of a completed path don't already exist in the stopset (they
might have been added since they were last checked for) to prevent
Added the ability to look up and report AS numbers for all addresses seen
in the paths (using the Team Cymru data). This currently works for the
standalone test (which doesn't have access to the built-in DNS cache)
but requires some slight modification to run as part of amplet itself.
I've started the week working on the parallel/live formats, mainly ensuring that the statistic counters for them are correct. For this I stored copies of some statistics from the 'format' against the 'trace' itself. This deals with the case where the format is paused or stopped when it is closed, as happens when interrupting a live format. Otherwise these would typically be lost because the format is closed and loses its statistics. I also moved the accepted counter into each thread so that it doesn't require a lock. I still need to verify I'm grabbing the right statistics from the live formats, that is to say summing each thread's counters if necessary.
Later in the week I started to look at managing memory better and avoiding excessive calls to malloc and processing packets in batches. This is coming along well :).
I have added a batch size to the reads from a single threaded format, to reduce lock contention and improve performance. Accompanying this I've finally got a nicer generic memory recycler, which is currently used for packets. This takes much of its design from the DPDK memory buffers, which include a thread local cache to reduce accesses to the main pool. However, unlike DPDK this is all dynamic: any new thread will automagically create a thread local cache with very little overhead, and the cache tidies itself up when the thread exits. This also allows the total number of packets to be limited, effectively putting a cap on memory usage, although caution must be applied to ensure the pool doesn't run out.
This is still work in progress but is showing a significant improvement.
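The general shape of such a recycler might look like the sketch below. It is written in Python purely to illustrate the idea and is not the libtrace implementation: the class name `PacketRecycler` and its parameters are hypothetical, and it leaves out the tidy-up on thread exit, so anything left in an exited thread's cache is simply lost here.

```python
import threading


class PacketRecycler:
    """Illustrative object pool with a per-thread cache and a hard cap."""

    def __init__(self, factory, limit, cache_size=4):
        self._factory = factory        # creates a new buffer when needed
        self._limit = limit            # maximum number of buffers ever created
        self._cache_size = cache_size  # buffers each thread may hold locally
        self._created = 0
        self._pool = []                # shared pool, protected by the lock
        self._lock = threading.Lock()
        self._local = threading.local()

    def _cache(self):
        # The cache is created lazily the first time a thread uses the pool.
        cache = getattr(self._local, "cache", None)
        if cache is None:
            cache = self._local.cache = []
        return cache

    def get(self):
        cache = self._cache()
        if cache:                      # fast path: no lock taken
            return cache.pop()
        with self._lock:
            if self._pool:
                return self._pool.pop()
            if self._created >= self._limit:
                return None            # cap reached: the caller must cope
            self._created += 1
        return self._factory()

    def put(self, obj):
        cache = self._cache()
        if len(cache) < self._cache_size:
            cache.append(obj)          # fast path: no lock taken
        else:
            with self._lock:
                self._pool.append(obj)
```

Returning `None` once the cap is reached is one possible policy. It makes the "don't run out" concern visible, since buffers parked in other threads' caches are not available to a thread that finds the main pool empty.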
Wrote a script to update an existing NNTSC database to add the necessary tables and columns for storing AS path data. Tested it on my existing test database and will roll it out to prophet once we're collecting AS path data and are sure that our database schema covers everything we want to store.
Added a TCP ping test to AMP. This turned out to be a lot more complicated than I had first anticipated, but I'm reasonably confident that we've got something working now. The test works by sending a TCP SYN to a predefined port on the target and measures how long it takes to get a TCP response (either a SYN ACK or a RST). We can also get an ICMP response, so we need to listen for that and report a failed result in that case. The complications arise in that the operating system typically handles the TCP handshake, so we have to pull a number of tricks to be able to send and receive TCP SYN and SYN ACK packets inside our test code.
Sending a SYN is easy enough using a raw socket, although we have to make sure we bind the source port using a separate socket to prevent the OS from allowing other applications to use it, which would screw with our responses. Getting the response is a lot harder -- we have to work out which interface our SYN is going to use and attach a pcap live capture to that interface (filtering on traffic for our known source and dest ports, plus ICMP). We find the interface by creating a UDP socket to our target and seeing which source address it binds to, then check the list of addresses returned by getifaddrs() to find a match. The match will tell us the name of the interface that the address belongs to.
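The interface lookup trick can be sketched as follows. This is a Python illustration rather than the C code in the test. Python's standard library has no direct getifaddrs() equivalent, so the second helper takes the interface-to-address mapping as an argument; both function names and the mapping are hypothetical.

```python
import socket


def source_address_for(target, port=80, family=socket.AF_INET):
    """Return the local address the OS would use to reach `target`.

    Connecting a UDP socket sends no packets; it only makes the kernel
    pick a route and bind a source address, which we then read back.
    """
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.connect((target, port))
        return sock.getsockname()[0]


def interface_for(address, interface_addresses):
    """Find the interface owning `address`.

    `interface_addresses` stands in for the list getifaddrs() would
    return, as a dict of interface name -> list of address strings.
    """
    for name, addresses in interface_addresses.items():
        if address in addresses:
            return name
    return None
```

For example, `interface_for(source_address_for("127.0.0.1"), {"lo": ["127.0.0.1"]})` would give the loopback interface name on a typical Linux host, and that name is what would be handed to the pcap live capture.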
Any packets received on our pcap capture are checked to see if they match any of the SYNs that we had sent out. This is done by parsing the packet headers -- I felt dirty writing non-libtrace packet parsing code -- and looking to see if the ACK matched the sequence number of the packet we had sent (in the case of TCP) or if the embedded TCP header matched our original SYN (in the case of ICMP).
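A rough sketch of the TCP half of that matching is shown below, again in Python rather than the C used by the test. It assumes the usual TCP behaviour where a SYN ACK or RST acknowledges the SYN's sequence number plus one; the function names and the layout of the `outstanding` table are made up for the example, and the ICMP case is not covered.

```python
import struct

SYN, RST, ACK = 0x02, 0x04, 0x10


def parse_tcp(segment):
    """Pull the fields we care about out of a raw TCP header."""
    src, dst, seq, ack, offset_flags = struct.unpack("!HHIIH", segment[:14])
    return {"src": src, "dst": dst, "seq": seq, "ack": ack,
            "flags": offset_flags & 0x3F}


def match_response(segment, outstanding):
    """Classify a captured segment against the SYNs we have sent.

    `outstanding` is a hypothetical table mapping
    (our source port, target port) -> sequence number of the SYN sent.
    Returns "synack", "rst" or None if the segment isn't a response.
    """
    tcp = parse_tcp(segment)
    # Ports are swapped in the response relative to our SYN.
    sent_seq = outstanding.get((tcp["dst"], tcp["src"]))
    if sent_seq is None or not tcp["flags"] & ACK:
        return None
    # Sequence numbers wrap at 32 bits.
    if tcp["ack"] != (sent_seq + 1) & 0xFFFFFFFF:
        return None
    if tcp["flags"] & RST:
        return "rst"
    if tcp["flags"] & SYN:
        return "synack"
    return None
```

Keying the table on the port pair keeps the lookup cheap, and the acknowledgement check guards against stray packets that happen to reuse the same ports.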
The test still has a few annoying limitations due to the nature of firewalls on the Internet these days. I had originally intended to allow the test to vary the packet size by adding payload to the SYN, which is technically legal TCP behaviour, but in testing I found that SYNs with extra payload will often be dropped and we'll get no response. Transparent proxies on the monitor side are also problematic, in that they will pre-emptively respond to SYNs on port 80 and therefore mess with our latency measurement, e.g. the Fortigate here at Waikato does this, which initially made me think I had a bug in my timestamping since I was getting sub-1 ms results for targets I knew were hundreds of ms away.
Deployed the TCP ping test on our Centos VM successfully and was able to collect some test data in a NNTSC database. Also updated netevmon to be able to process TCP ping latency measurements.
I fixed the issue I'd been stuck with and got my motes communicating this week!
I received a response from Mátyás Kiss re stm32w support in 6lbr, but this unfortunately wasn't very helpful. I was certain that I had already set my PAN IDs to be identical between my slip-radio (tunnel mote) and client mote. He mentioned the need for a "driver modification" but I am still not sure what he was referring to by this as he never got back to me again after I asked.
I had assumed that the issue was with the slip-radio application, because I could see using Wireshark that DIOs were being received from the remote mote on the tunnel interface of the raspberry pi, but there was no outbound traffic from the router. I figured the problem had to be to do with the stm32w chip somehow, since the router is already known to work on the pi!
While testing several different configurations with the 6lbr router set to debug verbosely, I noticed the slip-radio would drop packets at regular intervals. I had previously read that packet drop logs were normal to see because of corruption over the wireless medium and they could essentially be ignored, but I hadn't noticed at the time that drops were occurring every 60 seconds, likely due to testing many different configurations and not leaving them each running for long enough to obviously identify a pattern. I realised that the frequency lined up with when DIOs are sent out by the remote mote, but this left me confused for days as to how/why they were being dropped at the slip-radio because the devices' PAN IDs matched.
However I eventually realised that it was neither of the motes that were the issue - there is also a PAN ID hardcoded into the router itself; it turned out that was set in a different place and for some reason inconsistent. I had assumed the slip-radio handled everything to do with the LoWPAN, but the router must formulate the packets to begin with and pass them off to be sent.
Time to do some CoAP!
I have continued to analyse the results from the fast mapping black hole detector. It looks like counting hop distances in the MDA traces with more than one node is the easiest way to determine if a given stop point from Paris traceroute is inside a load balancer. A small number of transitory black holes have been found in the first lot of load balancer black hole data from Planetlab.
Some of the non event based simulators have long run times now. I need to see if there is a way to speed them up, perhaps using pre compiled lists of addresses to avoid the need for sorting. However the sources window analysis is running in a short time because it only uses the many set where the sources occur more than twice. Last time I ran this I found a bug in the calculations so I have fixed that and set it running again. At this stage I can't work on the event based simulator because I need the results from these former simulators first, and also they cannot fit on the same computer at the same time.
Added local stop sets to the traceroute test to record paths near to the
source and prevent them from being reprobed for every single
destination. Due to the highly parallel nature of the test this
initially had only a very minimal impact on the number of probes
required. At a suggestion from Shane I began probing destinations using
a smaller, fixed-size window rather than all at once in order to
populate the stopset early on in the test. With the current destination
list, this reduced the number of probes required by about 30% without
any real impact on the duration of the test.
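The effect of the window size can be seen in a toy model like the one below. It is a simplified Python sketch, not the AMP traceroute test: it assumes each destination is probed backwards from the destination towards the source, that every destination in a window sees the stopset as it was when the window started (to mimic probing them in parallel), and that paths are already known lists of addresses. The function name and the example paths are made up.

```python
def count_probes(paths, window):
    """Count probes needed when destinations are probed `window` at a time.

    Each path is a list of addresses ordered from source to destination.
    """
    stopset = set()
    probes = 0
    for start in range(0, len(paths), window):
        discovered = set()
        for path in paths[start:start + window]:
            # Walk from the destination back towards the source.
            for address in reversed(path):
                probes += 1
                if address in stopset:
                    break          # the rest of the path is already known
                discovered.add(address)
        # Only later windows benefit from what this window learned.
        stopset |= discovered
    return probes


# Four destinations that share the same two hops near the source.
paths = [["r1", "r2", "d%d" % i] for i in range(4)]
all_at_once = count_probes(paths, window=4)
windowed = count_probes(paths, window=2)
```

In this toy case probing all four destinations at once costs 12 probes, because the stopset is empty for all of them, while a window of two costs 10, since the second window stops as soon as it reaches the shared hop `r2`.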
Spent some time confirming that the results were the same as the
original test produced, and that they matched the results of other
traceroute programs. Found some slightly different behaviours where I
was treating certain ICMP error codes incorrectly, which I fixed.
Started to look at doing optional AS lookups for addresses on the path.
Appears the easiest solution is to have the test itself look them up
before returning results. Using something like the Team Cymru IP to AS
mappings (which are available over DNS) is simple and would make good
use of caching to minimise the number of queries.
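As a rough illustration of the DNS-based lookup, the sketch below builds the query name and parses a reply, with no network access involved. It is in Python rather than the language of the test, the helper names are hypothetical, and the zone names and the `|`-separated TXT layout reflect my understanding of the Team Cymru service, so they should be checked against its documentation. The sample reply in the usage note is made up using documentation address ranges.

```python
import ipaddress


def cymru_query_name(address):
    """Build the TXT query name used to map an IP address to its origin AS."""
    ip = ipaddress.ip_address(address)
    if ip.version == 4:
        labels = reversed(str(ip).split("."))            # reversed octets
        zone = "origin.asn.cymru.com"
    else:
        labels = reversed(ip.exploded.replace(":", ""))  # reversed nibbles
        zone = "origin6.asn.cymru.com"
    return ".".join(labels) + "." + zone


def parse_cymru_txt(txt):
    """Split a TXT answer of the assumed form 'ASN | prefix | ...'."""
    fields = [field.strip() for field in txt.strip('"').split("|")]
    return fields[0], ipaddress.ip_network(fields[1])


class AsCache:
    """Cache answers by prefix so nearby addresses need no new query."""

    def __init__(self):
        self._prefixes = {}

    def store(self, txt):
        asn, prefix = parse_cymru_txt(txt)
        self._prefixes[prefix] = asn

    def lookup(self, address):
        ip = ipaddress.ip_address(address)
        for prefix, asn in self._prefixes.items():
            if ip.version == prefix.version and ip in prefix:
                return asn
        return None  # cache miss: a DNS query would be needed
```

After storing a made-up reply such as `"64496 | 192.0.2.0/24 | ZZ | example | 2000-01-01"`, a lookup for any other address in 192.0.2.0/24 is answered from the cache. Since consecutive hops on a path often share a prefix, that is where the saving in queries would come from.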
Finished the DPDK format register and unregister threads as I talked about in my last report. This means that the parallel DPDK should be thread safe now and still benefit from thread local memory caches. This whole thing is quite dependent on the version of the DPDK library in use, but I don't expect that this part of DPDK is going to change. I still need to provide a public interface to the user so they can register their own threads, but I'll deal with that when it comes up next.
Added the storage for per thread format data against libtrace threads and passed the thread to read_packet. This replaces an old thread lookup function which was unnecessary and slow.
Found and fixed a handful of bugs in my parallel implementation of int:.
Added some tests for live Linux formats (int:, ring: and pcapint:) to vanilla libtrace; I will port these to parallel this week. These use network namespaces and virtual Ethernet adapters to simulate a direct connection between two ports. From this I found and fixed a bug with the wire length reported by int: when using TRACE_OPTION_SNAPLEN, and also fixed a minor bug in pcapint:.
I will work on porting these to test my live parallel formats. I found that I will need to store format stats against the trace when it ends (before it is paused) since a pause may lose this information. I will also look at batch processing packets from single threaded traces to reduce locking overhead.