A chat from opn:p2p-hackers on the attack-resistance of Google PageRank, anonymous remailer networks, and lots of stuff in between. [2002-03-02]

<raph> in any case, i spent quite a bit of time today analyzing google's attack resistance
<raph> motivated in large part by this scieno thing
<AaronSw> ooh, do share
<tav> gpl + commercial ?
<AaronSw> like sleepycat
<AaronSw> GPLing your software allows you to sell it to companies
<raph> also like ghostscript :)
<raph> well, has anyone here read the pagerank paper?
<AaronSw> Yeah, but I assume that things have been greatly changed since then.
<raph> i don't think so
<raph> it's hard to say for sure, but the original PR is very smart stuff
<raph> and my analysis indicates that it has a good attack-resistance property
<raph> the other thing about PR is that it's quite tunable
<raph> through the E(u) mechanism
<tav> E(u) mechanism ?
<raph> the paper talks about this a little, but obviously they had only started thinking about it
<raph> if i had to guess, it would be more or less vanilla PR with a well-tuned E(u)
<tav> .google pagerank e(u)
<xena> pagerank e(u): http://directory.google.com/Top/Arts/Music/Magazines_and_E-zines/U
<raph> heh, no
<raph> you need the 1998 paper by page
<raph> copy at http://www.levien.com/thesis/refs/pageranksub.ps.gz
<raph> (url typed from memory, lemme know if it 404s)
<tav> works, thanks
<AaronSw> aka http://www-db.stanford.edu/~backrub/pageranksub.ps
<raph> oh good
<raph> so i can explain the central concept in a few minutes, if you'd like
<tav> please
<raph> ok
<raph> so the PR algorithm can be modelled as a random walk
<raph> essentially, what's in the page paper is the following:
<raph> pick a random web page from your list of 3 billion or whatever urls
<raph> pick a link at random, and follow it
<raph> now, roll a 6-sided die
<raph> if it's 6, stop, otherwise go to "pick a link"
<raph> so you have a nice exponential tail distribution of walk lengths
<raph> ok, so now when you stop, mark the url you stop on
<raph> after a huge number of runs, the number of marks on a url corresponds to its rank
<raph> it's a beautiful and simple idea
<raph> E(u) corresponds to the initial step of picking a page at random
<raph> this _doesn't_ have to be a uniform distribution
<raph> and therein lies the interesting part :)
<raph> so, here's my analysis
<raph> (assuming i haven't put you guys to sleep yet)
<tav> listening
<raph> you do the same thing as the analysis of the advogato trust metric
* AaronSw fires up MacGhostViewX and skims the paper
<raph> all web pages are either "good" or "bad"
<raph> the algorithm doesn't know, only santa claus
<raph> the idea is to see if you can prove something about bad pages having low rank and good pages a high rank
<raph> that's the goal of a good attack-resistant algorithm
<tav> please define 'good' / 'bad'
<raph> ah, ok
<raph> that's an interesting question
<raph> the mathematical part of the analysis doesn't care _how_ you define it
<raph> so there are a couple of interesting ways of doing it
<raph> one is by how interesting/relevant the page is
<raph> this is very subjective, but is still the goal of the thing
<raph> another is by assuming an attack
<tav> so, the trick is to define E(u) in such a way so that page rank correlates with your concept of "good" ?
<raph> so the "bad" pages are the ones under the attacker's control
<raph> in this case, the question is: how well can the algo resist assigning a high rank to pages under an attacker's control
<tav> *nod*
<raph> straight keyword searching, for example, has no attack-resistance
<raph> all an attacker needs to do is put the juicy terms in the searchable part of the page, and bingo
<raph> so, the no-attack case is also interesting
<raph> ie, assuming non-adversarial producers of web pages, how can you get the most relevant results?
<raph> this is a question that a lot of people have thought about
<raph> but i'm only really considering the attack case
<raph> an interesting hypothesis is that an attack resistant ranking algo will give good subjective results
<raph> in the non-attack case as well,
<raph> but that would be hard to substantiate :)
<AaronSw> Interesting. Conventional wisdom would seem to indicate that E is based on how often a page changes and the pagerank of the pages that it links to...
<raph> ok, forget e for now
<raph> one of the interesting things is that PR is not hugely sensitive to E
<tav> AaronSw: yea, i thought pagerank was more complex than this
<raph> they ran it with a uniform distribution and also a single start page
<raph> and got results that were surprisingly not all that different
<raph> i am simplifying a bit
<raph> but not _that_ much
<tav> assumed that
<tav> anyways, back to the attack resistance
<raph> yes
<raph> so, as an intermediate step in the analysis, assume that E is only good pages
<raph> or, more precisely, is 0 for all bad pages
<tav> k
<raph> obviously, if you have this assumption, there is a trivially attack resistant ranking algo
<raph> just pick a page from E :)
<tav> heh, how resistant is E though ;p
<raph> but we'll assume that we're doing PR, and it's doing the random walk thing
<raph> tav: the assumption is 100%
<raph> ok, so now we further subdivide the good pages into "good" and "confused"
<raph> a confused page is one that is itself good, but contains a link to a bad page
<tav> right
<raph> actually, it might make more sense to talk about the links
* tav sees now
<raph> good links, evil links, and confused links
<raph> which are g->g, b->whatever, and g->b, resp.
<raph> so in this framework, the analysis is actually easy
<raph> any given walk that crosses a confused or evil link can result in a bad page
<raph> but a walk that crosses only good links results in a good page
<tav> devilishly simple and elegant
<raph> so, for a walk of length k, and the probability of a page containing a confused link p, you get a chance of a good link at least p^k
<raph> (1-p)^k sorry
<raph> what's more interesting is that it's not just the probability of a random good page containing a confused link, it's biased toward high-ranked pages
<tav> ?
<tav> i didn't get that last line
<raph> ok, lemme spell that out, 'coz it's interesting
<raph> assume that we have 'god's web'
<raph> consisting of only the good pages
<raph> now, we run PR on this dataset, resulting in a rank for each page
<raph> (actually, as a minor technicality, you want a rank for each walk length)
<raph> but close enough :)
<tav> heh, k
<raph> i mean, there are going to be some "good" pages that are boring and nobody links to
<raph> and others, such as advogato, that get a huge rank
<raph> (advogato gets a _lot_ of google-juice)
<raph> so now, phase back into the real world
<raph> which contains G's W as a subset
<tav> (only 7 ;p)
<raph> so now we can talk about the real-world rank in terms of god's rank
<raph> g's r being the output of the PR algorithm when run over g's w
<tav> sure
<raph> an attack-resistant ranking algo would have bounds
<raph> ok, so here's the relationship:
<tav> but, google doesn't have a stale control set afaict?
<raph> at each step in the walk, you look at the probabiity of a confused set, weighted by the rank at that step
<raph> confused link, sorry
<raph> weighted by the g's rank at that step
<raph> and, for a walk of length k, the final probability of a good link is the product, over step index [0..k) of 1-(confused prob at that step)
<raph> so, under the assumption that high-ranked pages are less likely to have confused links, you'll actually do better than (1-p)^k
<tav> hmz
<raph> shall we look at E now?
<tav> k...
<raph> ok, so the effect on the ranking results is clear
<raph> you just multiply in 1-(the probability that a page in E is bad)
<raph> ...to the probability that you end up with a good rank
<tav> huh?
<raph> so the implications are quit clear
<raph> ok, a little slower...
<tav> surely that's (probability of a good rank)^2 ?
<raph> E now contains both good and bad pages
<raph> so, choosing a page from E, if it's bad, you get a bad page at the end of the walk
<raph> if it's good, you get a good page with exactly the same probability as if you had started from god's E
<raph> so the real-world good prob = E's good prob * the good prob of running PR over god's E
<tav> right
<raph> the latter meaning running PR over the real web, starting with god's e
<raph> so: "how to optimize google" is really clear now
<raph> E is your whitelist
<raph> ie, you try to keep bad pages out of it
<raph> and it can be much smaller
<raph> than the whole web
<raph> so, the _other_ tunable parameter is that 1/6 chance of stopping the random walk
<raph> (the paper gives 15%, but close enough)
<raph> and now the meaning of that is clear as well
<tav> any reason for the value?
<raph> yes
<raph> the larger it is, the better the attack resistance
<raph> but the smaller it is, the more of the web it covers
<raph> ie, with a value of 1, you have exactly e
<raph> but if e is a whitelist of a few thousand pages, then you're not doing a good job
<raph> so you assume that most interesting pages are within a few hops of a page in e
<raph> and the 15% value corresponds to "a few hops"
*** dev0 has quit IRC (Read error: 110 (Connection timed out))
<raph> ie, at 15%, half of your walks are <= (>) 4 hops
<raph> i'm sorry, but this is fucking brilliant
<raph> (google's original idea)
<tav> heh. i saw the brilliance earlier, but you lost me towards the latter half
<raph> it's _really_ attack resistant,
<raph> and it has a nice way of getting a lot of leverage from human input
<tav> you going to write up your analysis?
<raph> yes indeedy
<raph> chapter 6 of ye olde thessys
<tav> i'd love to read it after reading the original paper
<raph> and, in around 2019 or so, we'll all be able to implement it
<raph> (assuming they get a valid patent)
<raph> i'm fishing for people to tell me that :)
* raph finds it difficult to find time for the thesis with all the other interesting bits of work & play
<raph> so the more motivation, the better :)
<tav> hehe
<tav> well, i've always like reading your work
*** xena has quit IRC ("you son of a bitch")
*** xena (xena@mewtwo.espnow.com) has joined the channel
<raph> ok, so now i'm thinking about something else
<raph> which is the relationship between PR and advogato
<raph> there are a lot of similarities:
<raph> they both work from the same graph structure
<raph> (the fact that nodes in advo are people, and nodes in PR are web pages, is irrelevant to the math)
<tav> *nod*
<raph> and they both have an attack resistant property
<raph> ...which is probably quite similar
<raph> ("resistance", pardon my syntax)
<raph> ie, the class of attacks resisted by advo and pr are probably about the same
<raph> but there is a fundamental difference
<raph> network flow vs random walks
<raph> (the evaluation of pr is not done with walks, it's done with some fun linear algebra)
<raph> so advo gives you a yes/no, but it's possible to enrich that by doing several runs
<raph> which is what master/journeyman/apprentice is all about
<raph> that gives you a [0..4) range
<raph> pr, by contrast gives you a real-valued rank
<raph> perhaps the deepest difference is that pr is deterministic
<raph> and, even more so, not sensitive to small perturbations
<raph> two runs of pr on very similar graphs will give you very similar ranks
<raph> as i'm sure you guys have noticed with actual google searches at different times
<raph> it's a feature; advo's nondeterminism is a misfeature
<tav> oh?
<tav> hmz
<raph> but it does raise some interesting questions; most notably, how would the algos perform on each other's turf?
<raph> running pr on the advo graph is very, very very doable
<raph> the graph is only a few thousand nodes, it would be no sweat to just compute the principal eigenvalue, probably with an off-the-shelf linear algebra package
<tav> not being overly familiar with either, i feel i'm presently ill equipped to answer that
<tav> what's your email address?
<raph> raph@acm.org among others
<raph> that's my "academic" address
<raph> running the advo algorithm on google's graph presents a bigger challenge
<tav> i should join the acm sometime. now, if only i had some money
<raph> i mean, they have a really impressive supercomputer to implement pr, and no doubt a super-nicely tuned impl
<raph> the pubs are not that interesting :(
* tav mentions the google contest to raph
<tav> .google programming contest google
<xena> programming contest google: http://www.google.com/programming-contest
<raph> yeah, i would win, wouldn't i?
<tav> ;p
<tav> most blatantly
<raph> but the program actually has to run to completion, yes?
* raph wonders if he's the only person who sees that as a blatant recruitment scheme
<tav> recruitment / r&d
<raph> oh, only 900k pages, that's interesting
<raph> 'coz advo evaluates 7k pages in about 60s on a modest computer
<tav> google aren't the first to do a contest
<raph> (modest by today's standards)
<raph> the advo impl is in c and pretty well tuned
<tav> anyways, i'll be sure to email you my thoughts as soon as i clear the pagerank paper
<raph> i think a 900k dataset might be doable in an overnight run
<raph> s/7k pages/7k nodes/
<raph> nonetheless, you have tempted me
* raph has no interest in being recruited by google
<raph> but this looks easy and fun
<raph> yah, you'll find it worth reading
<raph> google gets the ip of the submissions
<raph> wow, 10k is cheap
<raph> that's like the cost of filing one patent
<raph> not an issue for me tho - the advo trust metric is public domain
<raph> really great chatting w/ you guys
<raph> about time for bed for me
<tav> .time pst
<xena> Mar. 2, 2002 11:13 pm US/Pacific
<tav> .time gmt
<xena> Mar. 3, 2002 7:13 am GMT
<raph> tav: where are you?
<tav> london
<arma> (yes, raph, you should write all of this. we want to read it :)
<raph> arma!
<raph> start a petition campaign
<tav> eta on your write up?
<raph> hey, if it gets ms to open kerberos, it might work on me getting the thesis done
<tav> anti google patent?
<tav> heh
<raph> tav: i'm planning on spending massive time this week
<raph> my job lets me do that :)
<arma> yeah, don't do the google programming contest. they're really smart to do it,
<arma> because it *will *work.
<raph> so expect it soon
<arma> excellent.
<raph> arma: why not?
<arma> i've been being lame about paying attention to your stuff lately. :(
* raph doesn't see the downside
<raph> ditto :(
<arma> er, they get the IP, right?
* raph doesn't care
<arma> hum. ok.
<raph> or, to put it more precisely
<arma> i'd say do it outside of their parameters.
<raph> there is no ip involved other than what's already been released in mod_virgule
<raph> no, wait
<raph> they would be getting, as far as i can see it, the rights to use the m/ code commercially
<raph> that i don't care _much_ about, but it's worth _something_
<raph> i have a better idea
<arma> hey, they'll fly you to san francisco if you win! :)
<raph> i'll post the idea on advo
<raph> arma: yeah, that's a major draw
* raph is only 2 degrees of separation from google top brass anyway
<raph> erm, 3
<raph> advisor's wife's students = founders of the company
<raph> arma, so what you been up to?
<raph> is reputation still alive?
<raph> one never knows in this post.com env
<arma> reputation is still alive. we will be until june.
<arma> we've got a bunch of almost-customers, but we've had them for 2 years now.
*** coderman has quit IRC (Read error: 110 (Connection timed out))
<arma> if any of them turn into actual customers in the next month or two,
<arma> then we continue until december
<raph> right
* raph wishes you luck with that
<raph> i know it's tough
<arma> if several of them convert, then we continue for at least several years, which is Good.
<raph> i assume that's been taking more attention, and chord/etc less?
<raph> freehaven
<arma> we wrote a cute little blurb on reputation and privacy enhancing technologies, for the CFP proceedings
<arma> http://freehaven.net/~arma/cfp02.html
<arma> it's got my latest thoughts on free haven, in a broad sense.
<arma> and has brief summaries of the two recent mix-net reputation papers, too
<arma> i still haven't read the oceanstore paper. the p2p workshop starts thursday.
<arma> presenting at financial crypto next monday. need to plan a talk.
<arma> i've mostly been doing damage control to make sure nothing gets too little attention,
<raph> fair enough
<arma> and putting the rest of it into making sure pet2002.org works.
<raph> hm, the fc one looks interesting
<raph> not that i've carefully read your earlier ones :)
<raph> so arma:
<arma> how carefully did you read this one? (casc-rep)
<raph> not at all :(
<arma> o. so it looks interesting from the title then ;)
<raph> one of the reasons why I'm so excited about this google thing is that it might be the analysis tool i need for gznork exchange rates
<raph> from the summary, dingbat!
<arma> gznork is the name for that thing?
<raph> gznork is my p2p network based on stamp-trading
<arma> right. that thing we were arguing about a year ago
<raph> yup
<raph> the code's been moving forward slowly
<arma> it was really cool, but had so many holes in it, i couldn't wrap my brain around it
<raph> i'm thinking of redoing it in python
<raph> (you can tell i've been spending time with zooko)
<raph> arma: yes, i would be the only one in the world who can wrap his brain around it,
<raph> ...if only i could
<raph> :)
<arma> hm. so while you're adding things to your list of hard problems,
<arma> i'll reiterate my problem from earlier
<raph> ok
<raph> (the secret to research success is to do the problems you think are easy, but everybody else thinks are hard)
<arma> in casc-rep, i suggest that we use an advogato-like metric to limit the number of Bad mix participants
<raph> cha-ching
<raph> (another cite! :)
<arma> basically, if the proportion of bad nodes is too high, the adversary can just take over the system
<raph> right
<arma> so in my analysis, i'd like to limit the fraction of bad nodes
<raph> that model is intuitively familiar to me
<arma> but advogato doesn't let me do that. it just lets me limit the *number*.
<arma> i need to start talking about distance from seeds, etc, if i want to start thinking about a fraction.
<raph> you should be able to get to fraction
<raph> yes
<raph> distance from seeds is, i believe, essential
<raph> unfortunately
<arma> i can say sigma over various distances provides a bound
<arma> yeah. i haven't thought about this one in a while. hrm.
<raph> it's interesting to note that the "seed" in advo corresponds almost exactly to E(u) in PageRank
<arma> my earlier attempt was to come up with a "proof of bandwidth" operation,
<arma> similar to proof of work,
<raph> and, in Google's case, that's actually useful
<arma> with the assumption that the adversary won't be able to pass "too many" proofs of bandwidth at once
<raph> hmm, interesting
<arma> but that turned out to be really nasty too,
<arma> because either you have a central site that 'gets' the packets to verify they all showed up
<arma> (obviously not a good plan),
<arma> or you pair people up (or put them in groups of n, etc),
<raph> yeah, proof of work is much easier
<raph> and assuming that both work and bandwidth have a fixed cost per unit,
<arma> and any time most/all of the group is bad,
<arma> then they can lie.
<raph> you really get the same result in terms of costs
<arma> they're different assumptions tho
<raph> yeah
<arma> i eventually decided that if my adversary was the nsa,
<arma> and my 'good' nodes are volunteers,
<arma> then i really better stop looking down that path.
<raph> yes
<raph> because cost is not a big problem for them
* raph tries to come up with a pun on "cost plus"
<raph> but n/m
* arma doesn't get out much :)
<raph> ok, so in the remailer network it's all source-routed, right?
<arma> in the current one, yes
<raph> (this is certainly the case for the classic mix)
<arma> in casc-rep, we build cascades out of participating nodes
<raph> oh
<arma> that is, we define fixed routes,
<arma> and traffic goes through those routes
<arma> it helps the intersection attack tremendously
<raph> ok, i need to read the paper then
<raph> yes
<arma> in some sense it's still source-routed, though,
* raph added random path choosing to premail at the request of cpunks
<arma> because if we're trying to get odds of a bad path down to one in a million,
<raph> and i always had the feeling that it _decreased_ security
<arma> then people will need to chain cascades
<arma> it sort of got fuzzy at that point
<arma> but obviously if your path is only 4 or 5 hops,
<arma> you're not going to get that level of security with any non-trivial adversary
* raph thinks
<arma> ?
<raph> i mean, any mix-like thing is still going to have the problem that if the adv can monitor bandwidth on entry and exit to the net, the intersection attack is going to work
<raph> assuming detectable traffic patterns, which i think is a _very_ good assumption in r/l
<raph> so, to me, this places an upper bound on how good you need the mix itself to be
<arma> i handwave some about approaches for having the cascades themselves send dummy messages.
<raph> ok
<arma> i think it could actually work
<raph> so... let's consider a simpler network
<raph> plain old mixes
<arma> basically the cascades need to be sending test messages anyway,
<raph> plain old source routing
<arma> and they simply address them to/from people who've previously used the cascade
<arma> ok. plain mix net.
* raph is not convinced by dummy messages
<raph> _at all_
<arma> at all? i think they raise the bar..
<raph> not very much
<raph> i get the feeling they raise it from the equivalent of 32 crypto to the equivalent of 40 bit crypto
<arma> fair enough
<arma> so. plain mix net.
<raph> yes
<raph> so the application of an advo-like tm is _really_ easy to see
<raph> the remops themselves do peer certs
<raph> ... and in this case, the cert basically means "cert subject is not nsa"
<arma> right
<arma> right. it is *not* based on observed performance
<raph> yeah, so observed performance is interesting
<raph> you can also do that the same way
<raph> if the net is going to remain fairly small, then you just do all-to-all pinging
<raph> with each note publishing its own list
<arma> well, the premise of casc-rep is a bit different:
<raph> the client just retrieves all n^2 values, and evaluates those according to trust
<arma> my goal is to allow more dynamic participants,
<raph> right
<arma> so random people can volunteer for a day, not volunteer the next, etc
<arma> i want all sorts of people to be able to join,
<arma> and the trust metric needs to handle this
<raph> so you're really building a full p2p network :)
<raph> so the trust graph should be reasonably stable even if you have join/leave
<arma> (right. it's possible i'm just nuts and the real reason the network is small is because nobody wants to be an exit relay....but i'll solve that down the road.)
<raph> well, if we're talking about the real remailer net, the deeper problem is that port 25 sucks
<raph> if you had a spam-resistant messaging system, a pseudonymity service would be quite valuable
<raph> but that's a different story...
<raph> let's keep focussed on trust
<raph> so if you do source routing, you're very dependent on your choice of seed
<raph> in fact, that becomes your killer critical problem
<arma> right
<raph> it's likely that the "trustworthy seed" problem has no technological solution
<raph> only (perhaps) social ones
<arma> speaking of,
<arma> i worry that needing to play the trust network game will seriously slow down the dynamic part of the network
<arma> you can't just grab the rpm and go
* raph notes that the recent morpheus problem can be seen as a seed failure
<arma> you need to go out for a beer with somebody, etc.
<raph> yes, you really don't have throwaway participation in a trust network
<arma> and i think that will keep the list of partipants small.
<raph> to me, the answer to that problem is clear:
<raph> you offer long-term valuable services to those who do participate
<arma> oo. another question. you know how seti@home gets people to do it because of the rankings? is there a comparable thing here?
<raph> rankings meaning amount of cycles people are burning?
<arma> right. "my team is #3"
<raph> aha
<raph> that's an interesting q
<raph> primarily social
<arma> nobody would do seti@home if there were no stats page.
<arma> that's a social thing. but it's critical.
<raph> right
<raph> it's like the peacock's tail
<raph> it has a very high cost, but _showing_ that you can afford that cost is valuable
<raph> for getting peafucked
<arma> interesting.
<raph> (it's a standard model in evo psych, if you've been following that)
<arma> i'd followed them separately, but hadn't tried connecting them.
* raph spends too much of his time thinking about this stuff :)
<arma> hey, i'm going to be in SF apr 13 through 20ish
<arma> you going to be around?
<raph> think so
<arma> great
<raph> i mean, almost certainly, for at least a subset of that time
* raph frequently has 1 or 2 day obligations like staff meetings etc)
<arma> i'm doing two workshops and a conference while i'm there
<raph> so let's get together fer sherr
<arma> but i'll have an evening free sometime.
*** coderman (~fibrill@12-224-237-178.client.attbi.com) has joined the channel
<raph> yes, i'd go out of my way to insure that
<raph> too bad it's before the google contest
<raph> 'coz otherwise we could get them to fly me to sf :)
<coderman> lol, you in the google contest raph?
<arma> so i can't talk you into going to pet2002? it'll be fun :)
<raph> maybe one day
<arma> it's a sunday and monday
<arma> i dunno, look at the list of accepted papers and you'll decide either way
<arma> so back to casc-rep
<raph> yes
<arma> there was another really interesting thing we noticed when working on it
<arma> the idea is to group mixes into cascades,
<raph> ok,
<raph> so tell me what a cascade is
<arma> and if anybody in the cascade screws up, they all suffer
<raph> in 5 words or less
<arma> a fixed-order set of mixes.
<arma> it's a path. lots of people use it.
<raph> so it's a data structure?
<raph> an ordered list of nodes?
<arma> it's the statement "use A first, then B, then C, then D."
<raph> ok
<arma> all the traffic goes into A, from there to B, ...
<raph> so, basically, you're still source-routing, but you expect many sources to choose the same route
<arma> and each of them reorders it
<raph> reordering in the classic mix sense, yes?
<arma> right. there are a limited number of routes people can pick
<arma> right. batching/permuting/etc.
<raph> ok, i think i get it
<raph> so the advantage is clear
<arma> so the deal there is that the nodes in a cascade watch each other,
<arma> and if one of them says the cascade failed, it did.
<raph> aha
<arma> all nodes in a failed cascade lose a reputation point
<raph> that's cool
<arma> all nodes in a non-failed cascade (they last a day) gain a reputation point
<arma> now the *problem* here,
<raph> it makes it v. difficult for a node to lie about being reliable
<arma> (right)
<arma> is that the adversary's strategy is to make a bunch of nodes, get them put around into various cascades,
<arma> and fail when it's advantageous.
<arma> for instance, if cascades are size 4,
<raph> right
<raph> i get it
<arma> then he will fail cascades where he has only one node, and not fail others.
* raph is fast at this kind of analysis :)
<arma> so with only a small percentage of nodes,
<arma> he can move his nodes anywhere reputation-wise.
<raph> right
<arma> suck. :)
<arma> we kludge it in casc-rep with two things:
<arma> a) a web of trust to limit the number of bad nodes
<raph> that won't work
<raph> because a small number of bad nodes is a problem
<arma> b) we pick cascades such that even if the adversary put *all* of his nodes into the range from which you're picking the 4 nodes, it's still within acceptable safety bounds.
<raph> you only need |cascade|
<arma> well, if you have only |cascade|,
<raph> ok, b is more plausible
<arma> then you haven't broken any anonymity.
<raph> really?
<arma> that was a very fuzzy statement. it needs refining. read the paper. ;)
<raph> presumably, the goal of the attacker is to create a cascade of all-bad nodes
<raph> i will
<raph> i promise
<raph> if you have an all-bad cascade, the attacker wins, yes?
<arma> yes.
<raph> ok
<raph> so the non-independence is interesting
<arma> (it can be complicated into an all-bad path rather than just all-bad cascade, if users can chain cascades. but we can ignore that for now.)
<raph> and you're assuming that cascade/path length never gets _very_ long
<raph> right?
<arma> well, we posit paths of around 12 hops, in the paper.
<raph> i was gonna say, above 10, assuring reliability is a serious problem
<arma> well, isn't that based on current mix-nets?
<raph> yes, true
<arma> in theory, cascades that aren't reliable will die early
<arma> so if you use 3 of the functional cascades,
<arma> and they stay functional, you're fine
<arma> if they don't, future people are more likely to be ok. etc.
<raph> ok, but we're talking about a relatively small constant, i think
<arma> right.
<raph> 12 is still small enough that it's realistic for an attacker to get that # accepted
<arma> yes.
<raph> so the interesting thing, to me,
<arma> but the chance we'll build a path out of only his nodes is, well, a parameter we can work with.
<raph> is that you win by making the path more heterogenous
<arma> http://freehaven.net/doc/casc-rep/casc-rep.ps btw. hm. i should commit the newest version, shouldn't it.
<raph> ie, one node with an aclu affiliation, another with cia, another with swiss govt, etc
<arma> true.
<raph> but we'll ignore that
<arma> yeah. that's hard to quantify.
<raph> so, i think you might want to kiss
<raph> in terms of trust
* raph thinks
<raph> ok, this is just off the top of my head
<raph> but having the client evaluate a simple advo trust metric on the graph (which is published) seems like the right thing
<raph> so you use the reputations for reliability only
<arma> correct
<raph> and the trust metric for trust only
<arma> yes
<raph> aha, here's a thought
<raph> if you don't trust a node, you ignore all reputation ++'s and --'s that that node participated in
<raph> hmm, no, that's problematic
<arma> how so?
<raph> the probabilities go the wrong way as lengths scale
<arma> true
<arma> well,
<arma> the *path* length goes up to 12,
<arma> but the cascade length remains fixed at 4 or 5
<raph> ah
<arma> so it could still work
<raph> that's better
<arma> we left the cascade length low for that reason
<raph> obviously, i need to read the paper
<arma> so a single bad node would have limited influence on how many others it could take down with it.
<raph> you've convinced me it's worth reading :)
<arma> in the extreme case,
<arma> it's simply free-route mixing with a buddy system
<arma> (mixes come in pairs which watch each other)
<raph> mm-hmm
<arma> http://freehaven.net/doc/casc-rep/casc-rep.ps. the current one should be fine to read. we just finished the preproceedings copy, but it's not much different.
<raph> wgotten
<arma> skip section 4 on your first reading, if you like.
<raph> fyi, gznork is completely non-anonymous
<arma> that was one of my first reaalizations
<arma> i couldn't solve double-spending without full traceability
<raph> even though it basically grew out of my thinking about how to create an anonymous system
<raph> it won't even be a good system for pirating _music_, much less terrorist planning
<arma> what'll it be for?
<raph> because if the riaa wants to be super-nasty and issue subpoenas, the (publicly readable) trust graph will tell them exactly who to go after
<raph> even if there is a provision for pseudonymous nyms
<raph> well, people will use it for that anyway, even though it's not a good idea :)
<raph> i'm kinda thinking one of the biggest legit uses is free software isos and the like
<arma> ah. so software redundancy and availability
<arma> that was cfs's stated goal too
<arma> and i think tangler's?
<raph> right
* raph is not familiar with tangler
<arma> really? o, you should look at it sometime
<arma> (in your copious free time)
<raph> is this one of your things
* raph recalls you talking about it
<arma> no, david mazieres and marc waldman, nyu
<raph> ah
<arma> .google tangler nyu
<xena> tangler nyu: http://www.cs.nyu.edu/~waldman
<arma> anyway, it's not as important as other things
<arma> but it's along the lines of free haven, but slightly more feasible.
<raph> aha
<raph> so, basically gznork is this:
<raph> an underlying stamp-trading network
<raph> the goal of that is to be able to discover trading partners
<arma> ok
<raph> and to enable trading with those, and shut out everyone else
<raph> the pattern of trading partners is chord-like
<raph> on top of that, i see implementing about 5 services
<raph> the first is vanilla bulk file transfer
<raph> and the reason why it's the first is that it's relatively easy
<raph> ie, you're not critically dependent on trust
<raph> or, to be more precise,
<arma> you don't lose much. yea
<raph> trust is still needed to keep the underlying network healthy
<raph> but it's not needed at all to provide the service itself
<raph> (we can assume that it maps some kind of hash to the file itself, avoiding file naming issues)
<raph> so it's relatively easy, and also something there's a high demand for
<arma> o right, you've got this nasty issue of actually deploying it and making people want to run it.
<arma> that's a hard enough problem in its own right. :)
<raph> well, that's not my main concern, actually :)
<raph> but yeah
<raph> so the other services are harder
<raph> #2 is probably email
<raph> and there we have a number of interesting goals
<raph> primarily spam-resistance
<arma> right
<raph> but i'll throw a couple of other things in for good measure, like transparent pk encryption and reliability
* arma chants 'pki'
<raph> that's #3
<raph> a name server, pretty much exactly the same as my thesis design
<raph> but an important point is that #2 will run without a name service
* raph chants 'ssh'
<arma> uh.. huh?
<raph> ssh works without a pki
<arma> what's a pki? what's a name service?
<raph> ok, i have no idea what a pki is
<raph> the name service is much easier
<arma> or rather, how the (*# can you do email without naming?
<raph> it's a namespace global to gznork, first-come, first-serve policy
<arma> ok.
<raph> hashes of public keys :)
<arma> hm.
<raph> which you need to instantiate the trust graph
<arma> ok. so like pgp then.
<arma> there will be several keys which claim to be arma@mit.edu,
<raph> so a <big>big</big> part of gznork is making those hashes pretty and accessible
<arma> and you'll pick the one whose trust characteristics lead you to believe it's me.
<raph> actually, no
<arma> good. then how?
<raph> ok
<raph> in the non-pki case, i'll just pick 4ab3 776a 1253
<raph> based on you having given me a beautifully printed card with that information, last time we had a beer together
<arma> business card pki. ok.
<raph> or, perhaps, the other way around, and me responding to an email you sent me
<raph> with the name service, it's a bit better
<raph> gznork enforces the 1st-come, 1st-serve policy
<raph> so you tell me that your gznork name is "arma", and i'll believe you
<raph> if somebody else took "arma" first, then you'll tell me it's "arma2" or whatever
<raph> then, i just type in "arma", and it resolves the key
<arma> right. and lying gets me nowhere,
<raph> and, when you send me mail, it says "arma"
<arma> because either i have the private key or not.
<raph> yeah, the issue is not lying, it's being fooled
<raph> but my feeling is that people can _understand_ this
<raph> as opposed to just about everything else that's said in pki-land
<arma> ah. the usability ogre.
<raph> yes, gznork is as much about usability as science
<arma> they'll understand that when mail arrives from arma2, with roger's name at the bottom,
<arma> it'll be clear that it might not be from roger?
<arma> hrm.
<raph> so, unfortunately, the one area where this falls down is trademarks
<raph> arma: some people will be able to understand that
<raph> the remainder will never, ever, ever, have the opportunity to experience secure email
<arma> others will get burned. i guess we can't win all the battles at once. right.
<raph> i imagine that the first thing that will happen when gznork catches on is for somebody to register keys for all the trademarks
<arma> cause hey, registration is free, just like the old internic
<raph> this may cause a problem if you get an email from "wells fargo bank"
<raph> actually, it would be better if registration cost $10 or so, payable to me,
<raph> but i can't figure out how to do that and make the network decentralized, attack-resistant, and so on
<arma> you could make registration free but you can send a $10 and trump an address.
<arma> if you don't want your address trumped, you better pay first.
<raph> right
<arma> evil. but anyway. :)
<arma> hey, you could even have multi-tier trumping
<raph> so, yeah, probably the right thing to do is have hierarchical name spaces, and brands
<arma> so if you really don't want it trumped, you pay $1000. etc.
<arma> "how much is it worth to you to become wells fargo bank?"
<raph> interesting concept, but i find it a bit distasteful
<raph> probably just too reminiscent of the current dns policy :)
<arma> heh
<raph> so, i create a new brand, say "faligent"
<raph> because it has to not be an existing trademark :)
<raph> and i promote the brand as being tm-friendly
<raph> so "faligent.coke" is coke, not a cocaine dealer
<raph> "faligent.ford" is not a hitchhiker's fansite
<raph> etc
<arma> and you need to know faligent's key to generate a faligent.* name?
<raph> faligent sells them
<arma> and other people can't forge them, i mean
<raph> ah, how do you enforce the policy?
<raph> there are a few ways to do that
<raph> the most dnssec-like is to have clients do the verification
<raph> so you look up "faligent" in your global namespace, then verify the chain from there
<raph> ie, faligent signs the coke cert, etc
<arma> so faligent.coke is effectively 'signed' by faligent
<arma> ok
<raph> but there are other interesting ways of doing it
<raph> this will be chapter 5 of ye olde thessys
<arma> there's a crypto way on the tip of my mind
<raph> ?
<arma> basically hashes and stuff
<arma> can't generate a faligent.foo without faligent's key as part of the process
<raph> hmm
<arma> erm. would have to get some sleep for it first.
<arma> but i'm confident it can be done.
<arma> (and has been done)
<raph> intriguing
<raph> that's faligent's private key, yes?
<arma> right.
<arma> i'd probably start looking at sfs first, to try to remind myself.
<raph> anyway, i'm not that concerned about that
<arma> are you familiar with sfs?
<raph> only vaguely
<raph> the more challenging problem is the root namespace
<arma> ok. if you're not concerned, i'll just dump that topic and move on. :)
<raph> i think if you can solve that, you can get hierarch delegation for nearly-free
<raph> ok, so services #4 and #5
<raph> #4 is firefly
<raph> but not sucky like firefly
<raph> this was going to be chapter 6 of my thesis
<raph> but ch6 is turning into an analysis of google
<raph> and #5 is perhaps the most fun
<raph> it's a mud
<raph> more specifically, you have objects for "things", "rooms", and "avatars"
<raph> and you implement semantics for them, like a thing can be inside a room
<raph> have you done any inform/zcode hacking?
<arma> no, but i wrote a mud a while back
<raph> righto
<arma> it's how i learned c. that must have been... 10 years ago?
<raph> so doing it distributed is really challenging
<arma> right
<raph> in fact, basically what you want is transactions
<arma> yep
<raph> ie, if you move a thing from room a to room b, then you want that to be atomic
<raph> so doing it as a mud is a way to keep it fun
<arma> hey, that reminds me. hm.
<raph> i'm sure that financial transactions would be equally valid
<raph> but that'll be for somebody else
<raph> btw, have you looked at miguel castro's work under liskov on byzantine transactions?
<arma> yes
<raph> ah, good
<raph> i _think_ that can be generalized to the trust-metric world
<arma> that would be neat.
<arma> it's hard stuff, from my perspective
<raph> the problem is that not all nodes have the same view of what the other nodes are
<arma> i haven't looked at it enough to be able to extend it
<arma> castro's a smart guy.
<raph> but there's an intuition why i believe it should be possible
<raph> consider the set of all nodes trusted by the good set
<raph> (ie that includes some bad ones)
<arma> the good set ?
<raph> ok
<raph> nodes are good or bad
<raph> each node trusts a set of other nodes
<arma> ok. sounds good.
<raph> in the classic byz model, the trust set is the complete set
<arma> now we have defined good, bad, confused
<raph> and is the same for all nodes
<raph> and you win as long as 2/3 of the nodes in the set are good
<raph> so my intuition is to consider any node which is trusted by only some of the good nodes as bad
<raph> and if you still have a 2/3 majority, i think you still win
<raph> but i'm not sure
<raph> as you say, byzatine transaction foo is hard
<raph> (it's interesting how many ppl i know who got into hardcore programming through implementing a mud)
<raph> (at least one programming languages guy at berkeley, and several others i know through free software)
<arma> hey, bbs, play mud, run mud, move on
<arma> it's how we all did it, right?
<raph> not me
<raph> but i'm wierd
<raph> ok, i really need to get to bed now
<arma>  http://mit.edu/arma/Public/6.313f.tex is a paper i wrote for tom knight ~4 years ago,
<arma> on implementing a distributed mud
<arma> anyway. :) yes, sleep.
<arma> and read casc-rep and tell me if you buy it.
<raph> yes
<arma> (wow. i hadn't thought of that as being p2p. but it needs to be.)
<raph> i probably won't
<raph> buy casc-rep, that is
<arma> (the mud thing, i mean)
<arma> ok. good. :)
<raph> but this is high praise
<raph> there is no other work i've seen on anonymity that rates that high
<raph> oh cool, you really have thought about this stuff
<arma> hm?
<raph> no wonder you're at least familiar with castro
<arma> i used castro's byzantine toolkit in the submission to info hiding 4
<arma> then we found a better way to solve our problem
<arma> (that submission is [7] in casc-rep)
<raph> aha
<arma> (er, turned into [7], that is. it changed a *lot* between acceptance and preproceedings.)
<raph> ok, will probably read casc-rep tomorrow
<arma> those were the days. we wrote that paper in 10 days,
<raph> unless i get completely caught up in writing
<arma> from deciding to write a paper, to finding a topic, to solving it, to submitting the paper.
<arma> what did you mean 'rates this high'?
<arma> in your eyes, or is there some global chart i should know about?
<raph> in my eyes :)
<arma> (damn these decentralized approaches ;)
<raph> there's quite a bit of work on anonymity that's utter crap
<raph> but i don't have to tell you this
<raph> s/work/"work"/ :)
<arma> yeah. we're on the right track. we're still partially crap,
<arma> but i'm coming to understand this,
<arma> so hopefully i can do better next time.
<raph> yup
<raph> this is why i do graphics
<raph> the problems are soooo much easier
<arma> ...and then realize that *that* is crap, and do better the next, ...
<raph> and it's much easier to convince people that you've solved them :)
<raph> ok now bed
<raph> hope to talk to you again soon!
<arma> good night.
*** raph has quit IRC ("sleep")
<arma> yep. you too
<arma> raph always distracts me so well. i've gotten nothing done the past hour. foo.