Added support for the new amp-tcpping test to ampy and amp-web.
Started on yet another major database schema change. This time, we're getting rid of address-based streams for amp collections and instead having one stream per address family per target. For example, instead of having an amp-icmp stream for every google address we observed, we'll just have two: one for ipv4 and one for ipv6.
This will hopefully result in some performance improvements. Firstly, we'll be doing a maximum of 2 inserts per test/source/dest combination, rather than anywhere up to 20 for some targets. We'll also have far fewer streams to search and process when starting up a NNTSC client. Finally, we should save a lot of time when querying for data, as almost all of our use cases were taking the old stream data and aggregating it based on address family anyway. Now our data is effectively pre-aggregated, and we will also need far fewer joins and unions across multiple tables.
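As a rough illustration of the idea, the sketch below collapses the addresses observed for one target into at most two stream keys, one per address family. The key layout and the function names are hypothetical and are not taken from the NNTSC schema.

```python
import ipaddress
from collections import defaultdict


def family_of(address):
    """Return 'ipv4' or 'ipv6' for a textual IP address."""
    return "ipv4" if ipaddress.ip_address(address).version == 4 else "ipv6"


def group_by_family(source, target, addresses):
    """Map each (source, target, family) stream key to the addresses it covers.

    The tuple key is a hypothetical stand-in for a real stream identifier.
    """
    streams = defaultdict(list)
    for addr in addresses:
        streams[(source, target, family_of(addr))].append(addr)
    return dict(streams)


if __name__ == "__main__":
    observed = ["8.8.8.8", "8.8.4.4", "2001:4860:4860::8888"]
    for key, addrs in group_by_family("amp-a", "google.com", observed).items():
        print(key, addrs)
```

However many addresses a target resolves to, the result never holds more than two keys per source/target pair, which is where the cap of 2 inserts comes from.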
By the end of the week, my test NNTSC was successfully collecting and storing data using this new schema. I also had ampy fetching data for amp-icmp and amp-tcpping, with amp-traceroute most of the way towards working. The main complexity with amp-traceroute is that we should be deploying Brendon's AS path traceroute next week, so I'm changing the rainbow graph to fetch AS path data and adding a method to query the IP path data that will support the monitor map graph that was implemented last summer.
Spent a day working on libtrace following some bug reports from Mike Schiffman at Farsight Security. Fixed some tricky bugs that popped up when using BPF filters with the event API.
Finally deployed the update-less version of NNTSC on skeptic. Unfortunately this initially made the performance even worse, as we were trying to keep the last timestamp cache up to date after every message. Changed it so that NNTSC only writes to the cache once every 5 minutes of real time, which seems to have solved the problem. In fact, we are now finally starting to (slowly) catch up on the message queue on skeptic.
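The throttling approach can be sketched roughly as follows. This is not the NNTSC code: the class name, the `write_fn` callback and the injectable clock are all assumptions made so the example is self-contained and testable.

```python
import time


class ThrottledCache:
    """Buffer last-timestamp updates and write them out at most once per interval."""

    def __init__(self, write_fn, interval=300.0, clock=time.monotonic):
        self.write_fn = write_fn      # hypothetical callback that persists a dict
        self.interval = interval      # seconds between writes (5 minutes here)
        self.clock = clock            # injectable so the behaviour can be tested
        self.pending = {}             # stream id -> most recent timestamp seen
        self.last_flush = clock()

    def update(self, stream_id, timestamp):
        """Record the newest timestamp; only write if the interval has elapsed."""
        self.pending[stream_id] = timestamp
        now = self.clock()
        if now - self.last_flush >= self.interval:
            self.flush(now)

    def flush(self, now=None):
        """Write all buffered timestamps in one go and restart the interval."""
        if self.pending:
            self.write_fn(dict(self.pending))
            self.pending.clear()
        self.last_flush = self.clock() if now is None else now
```

Each message then costs only a dictionary assignment, and the expensive cache write happens once per interval regardless of the message rate. The trade-off is that up to one interval of timestamp updates is lost if the process dies before a flush.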
Work has been carried out on the introduction of the churn paper. In particular the beginning of the introduction now states the intent of the paper and explains the relevance of the following topics that are discussed. Version control file permissions have also been set up to allow Richard Nelson to commit changes to the paper.
The next round of black hole detector data has been downloaded and processed. This time there have been some transitory black holes and some that persisted longer than the data collection. The next round of data collection has been started.
For the Internet simulator for Megatree that is based on MDA data, the control packet analysis using many sources to few destinations, where each source occurs more than twice, has been upgraded. In the current run, each window of sources receives a full set of global load balancer data as it begins to run Megatree. This is slightly wasteful, as some nodes have already received some of this data, so a more complex upgrade has also been developed in which each source node receives the same global set data only once. To achieve this, a preliminary run is carried out to identify the sources in each window, as the program cannot otherwise know in advance what these will be from a single run. The data are stored as lists of grouped source addresses in text files.
Fixed the AS lookups in the traceroute test to ignore RFC1918 addresses.
This wouldn't have been too much of a problem, but the NXDOMAIN
responses were only being cached for 5 minutes, so the lookups ended up
generating too many extra queries. Also tidied up some checks for
stopset membership to use the TTL to match more accurately and to avoid
inserting pointless addresses.
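A minimal sketch of the RFC1918 check, written in Python rather than the C used by the test itself. The function name is hypothetical; the three prefixes are the private ranges defined by RFC 1918.

```python
import ipaddress

# The three private IPv4 ranges reserved by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network(prefix)
    for prefix in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def needs_as_lookup(address):
    """Return False for RFC1918 addresses, which have no meaningful origin AS."""
    ip = ipaddress.ip_address(address)
    if ip.version != 4:
        return True  # RFC 1918 only covers IPv4
    return not any(ip in net for net in RFC1918_NETWORKS)
```

Skipping these addresses up front avoids sending queries that can only ever come back as NXDOMAIN.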
Merged in the timing changes I made the other week into the main branch
so that they can be used.
Had a good meeting with Shane and Brad about where we need to go with
AMP in the next few months. After this, started looking at better ways
to represent and generate test schedule files so they are easier to
understand and edit. Spent some time looking into YAML and how an
example schedule might look, and experimented with the libyaml parser to
see how the data looks.
Spent most of the week working on memory management, which I have now committed. Mostly spent time refactoring the code and fixing bugs.
Got told I could purchase a 10Gig testbed on Friday, so I had a look at the Intel 10Gig NICs, mainly due to their compatibility with DPDK. The main downside is that they don't support timestamping every packet like the 1Gig cards do. Instead they will only timestamp PTP packets.
Made a few minor tidyups to the TCPPing test. The main change was to pad IPv4 SYNs with 20 bytes of TCP NOOP options to ensure IPv4 and IPv6 tests to the same target will have the same packet size. Otherwise this could get confusing for users when they choose a packet size on the graph modal and find that they can't see IPv6 (or IPv4) results.
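The padding arithmetic works out as below. The sketch assumes sizes are measured at the IP layer, with no IPv4 options and no IPv6 extension headers; the function names are illustrative only.

```python
IPV4_HEADER = 20   # bytes, without IP options
IPV6_HEADER = 40   # bytes, fixed header only
TCP_HEADER = 20    # bytes, without TCP options
TCP_OPT_NOP = 0x01  # single-byte "no operation" TCP option


def syn_padding(family):
    """Return the TCP option bytes appended to a SYN for the given family."""
    if family == "ipv4":
        # Make up the 20 byte difference between the two IP header sizes.
        return bytes([TCP_OPT_NOP]) * (IPV6_HEADER - IPV4_HEADER)
    return b""


def syn_size(family):
    """Total IP-layer size of the SYN, including any padding options."""
    ip_header = IPV4_HEADER if family == "ipv4" else IPV6_HEADER
    return ip_header + TCP_HEADER + len(syn_padding(family))


if __name__ == "__main__":
    print(syn_size("ipv4"), syn_size("ipv6"))
```

Both families come out at 60 bytes. The padded IPv4 TCP header is 40 bytes, which is still within the 60 byte maximum that the TCP data offset field allows.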
Now that we have three AMP tests that measure latency, we decided that it would be best if all of the latency tests could be viewed on the same graph, rather than there being a separate graph for each of DNS, ICMP and TCPPing. This required a fair amount of re-architecting of ampy to support views that span multiple collections -- we now have an 'amp-latency' view that can contain groups from any of the 'amp-dns', 'amp-icmp' and 'amp-tcpping' collections.
Added support for the amp-latency view to the website. The most time-consuming change was re-designing the modal dialog for choosing which test results to add to an amp-latency graph, as it now needs to support all three latency collections (which all have quite different test options) on the same dialog. It gets quite complicated when you consider that we won't necessarily run all three tests to every target, e.g. there is no point in running a DNS test to www.wand.net.nz as it isn't a DNS server, so the dialog must ensure that all valid selections and no invalid selections are presented to the user. As a result, there's a lot of hiding and showing of modal components required based on what option the user has just changed.
Managed to get amp-latency views working on the website for the existing amp-icmp and amp-dns collections. Adding amp-tcpping as well should be a straightforward task.
I haven't made much progress this week. I remembered that before I could get going with writing a CoAP application, I have to actually get the LoWPAN integrated with a proper network (rather than just a PTP link between the 6lbr and some computer), which I had tried earlier and I was confused at the time as to why it didn't just work. I've been trying to figure it out but to no avail.
Since I've just been piggybacking on the redcables network thus far, I think my next step will probably be to take the equipment back home where I have root access to the router and more control over the network configuration.
In happier although not entirely useful news, I did find a nice 802.15.4 packet sniffer implementation for Contiki which specifically works with the stm32w radio (which might have been really helpful a couple of weeks ago). It outputs 802.15.4 frames in pcap format, which can be piped into Wireshark to view a live capture.
The Doubletree simulation using the same data files as the event based simulator from Tony McGregor took 2 weeks to run. It turned out that the stored value for number of probes sent was incorrectly recorded as zero in all cases, so this value was determined using the hop information recorded in the traces. Four runs were started using this fix, to run four categories of simulations using this data. It should once again be a two week cycle.
The data from the Megatree simulator is now quite good; however, snapshots for stepwise increases in source count are still incorrectly calibrated. Further steps have been taken to remedy this and another run to gather this data has been initiated.
The black hole detector has finished another cycle and so further post-collection analyses are required. Steps are being taken to confirm that the transitory black holes found withstand manual evaluation of the warts dump data.
Tidied up the traceroute stopset code to add addresses in a more
consistent manner regardless of whether an address in the stopset was
found or if the TTL hit 1. This also allowed me to more easily check
that parts of a completed path don't already exist in the stopset (they
might have been added since they were last checked for) to prevent
Added the ability to lookup and report AS numbers for all addresses seen
in the paths (using the Team Cymru data). This currently works for the
standalone test (which doesn't have access to the built-in DNS cache)
but requires some slight modification to run as part of amplet itself.
I started the week working on the parallel/live formats, mainly ensuring that their statistic counters are correct. For this I stored copies of some statistics from the 'format' against the 'trace' itself. This deals with the case where the format is paused or stopped when it is closed, such as when a live format is interrupted. Otherwise these statistics would typically be lost, because the format is closed and loses its statistics. I also moved the accepted counter into each thread so that it doesn't require a lock. I still need to verify that I'm grabbing the right statistics from the live formats, that is to say summing each thread's counters if necessary.
Later in the week I started to look at managing memory better, avoiding excessive calls to malloc, and processing packets in batches. This is coming along well :).
I have added a batch size to the reads from a single threaded format, to reduce lock contention and improve performance. Accompanying this I've finally got a nicer generic memory recycler, which is currently used for packets. This takes much of its design from the DPDK memory buffers, which include a thread local cache to reduce accesses to the main pool. However, unlike DPDK this is all dynamic: any new thread will automagically create a thread local cache with very little overhead, and the cache tidies itself up when the thread exits. This also allows the total number of packets to be limited, effectively putting a cap on memory usage, although caution must be applied to ensure the pool doesn't run out.
This is still work in progress but is showing a significant improvement.
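The general shape of a recycler with per-thread caches and a hard cap is sketched below in Python. It is not the libtrace implementation (which is C): the class and method names are made up, cleanup of a cache on thread exit is not modelled, and raising an error when the cap is reached is just one possible policy.

```python
import threading


class Recycler:
    """Object pool with a small lock-free cache per thread and a total cap."""

    def __init__(self, factory, limit, cache_size=8):
        self._factory = factory        # creates a new object when none is free
        self._limit = limit            # maximum number of objects ever created
        self._cache_size = cache_size  # objects kept locally per thread
        self._lock = threading.Lock()  # protects the shared pool and counter
        self._pool = []
        self._created = 0
        self._local = threading.local()

    def _cache(self):
        """Return this thread's cache, creating it on first use."""
        cache = getattr(self._local, "cache", None)
        if cache is None:
            cache = self._local.cache = []
        return cache

    def acquire(self):
        cache = self._cache()
        if cache:
            return cache.pop()         # fast path: no lock taken
        with self._lock:
            if self._pool:
                return self._pool.pop()
            if self._created >= self._limit:
                raise MemoryError("recycler limit reached")
            self._created += 1
        return self._factory()

    def release(self, obj):
        cache = self._cache()
        if len(cache) < self._cache_size:
            cache.append(obj)          # fast path: no lock taken
        else:
            with self._lock:
                self._pool.append(obj)
```

Most acquire and release calls only touch the calling thread's own list, so the lock is taken only when a local cache is empty or full. The cap bounds memory use, but objects parked in an idle thread's cache still count against it, which is one way such a pool can appear to run out.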
Wrote a script to update an existing NNTSC database to add the necessary tables and columns for storing AS path data. Tested it on my existing test database and will roll it out to prophet once we're collecting AS path data and are sure that our database schema covers everything we want to store.
Added a TCP ping test to AMP. This turned out to be a lot more complicated than I had first anticipated, but I'm reasonably confident that we've got something working now. The test works by sending a TCP SYN to a predefined port on the target and measures how long it takes to get a TCP response (either a SYN ACK or a RST). We can also get an ICMP response, so we need to listen for that and report a failed result in that case. The complications arise in that the operating system typically handles the TCP handshake, so we have to pull a number of tricks to be able to send and receive TCP SYN and SYN ACK packets inside our test code.
Sending a SYN is easy enough using a raw socket, although we have to make sure we bind the source port using a separate socket to prevent the OS from allowing other applications to use it, which would screw with our responses. Getting the response is a lot harder -- we have to work out which interface our SYN is going to use and attach a pcap live capture to that interface (filtering on traffic for our known source and dest ports + ICMP). We find the interface by creating a UDP socket to our target and seeing which source address it binds to, then check the list of addresses returned by getifaddrs() to find a match. The match will tell us the name of the interface that the address belongs to.
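The first half of the interface lookup can be sketched in Python as follows. The test itself is written in C, so this is only an illustration: the function name and the default port are arbitrary choices. The second half, matching the address against getifaddrs() output, is not shown because the Python standard library does not wrap getifaddrs().

```python
import ipaddress
import socket


def source_address_for(target, port=53):
    """Return the local address the kernel would use to reach target.

    Connecting a UDP socket sends no packets; it only makes the kernel
    perform a route lookup and pick a source address for the socket.
    """
    version = ipaddress.ip_address(target).version
    family = socket.AF_INET6 if version == 6 else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.connect((target, port))
        return sock.getsockname()[0]


if __name__ == "__main__":
    print(source_address_for("127.0.0.1"))
```

The returned address is then the value to look for among the interface addresses, which in turn gives the interface name to pass to the capture.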
Any packets received on our pcap capture are checked to see if they match any of the SYNs that we had sent out. This is done by parsing the packet headers -- I felt dirty writing non-libtrace packet parsing code -- and looking to see if the ACK matched the sequence number of the packet we had sent (in the case of TCP) or if the embedded TCP header matched our original SYN (in the case of ICMP).
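For the TCP case, the matching step might look roughly like the sketch below. The probe records are hypothetical dictionaries rather than the structures used by the test. The check relies on standard TCP behaviour: a SYN consumes one sequence number, so both a SYN ACK and a RST sent in reply carry an acknowledgement number of the SYN's sequence number plus one, modulo 2^32.

```python
def expected_ack(seq):
    """Acknowledgement number a reply to a SYN with this sequence should carry."""
    return (seq + 1) % 2**32


def match_response(sent, reply_src_port, reply_dst_port, reply_ack):
    """Return the outstanding probe that a TCP reply answers, or None.

    `sent` is a list of dicts with 'src_port', 'dst_port' and 'seq' keys
    (a hypothetical layout). The reply's ports are swapped relative to ours.
    """
    for probe in sent:
        if (probe["src_port"] == reply_dst_port
                and probe["dst_port"] == reply_src_port
                and expected_ack(probe["seq"]) == reply_ack):
            return probe
    return None
```

The ICMP case would instead compare the ports and sequence number in the embedded TCP header directly against the original SYN, since that header is a copy of what we sent.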
The test still has a few annoying limitations due to the nature of firewalls on the Internet these days. I had originally intended to allow the test to vary the packet size by adding payload to the SYN, which is technically legal TCP behaviour, but in testing I found that SYNs with extra payload will often be dropped and we'll get no response. Transparent proxies on the monitor side are also problematic, in that they will pre-emptively respond to SYNs on port 80 and therefore mess with our latency measurement, e.g. the Fortigate here at Waikato does this, which initially made me think I had a bug in my timestamping since I was getting sub-1 ms results for targets I knew were hundreds of ms away.
Deployed the TCP ping test on our CentOS VM successfully and was able to collect some test data in a NNTSC database. Also updated netevmon to be able to process TCP ping latency measurements.