Schrodinger's Cat Strikes Back

Nice to have features of the data import facility


As already explained, the initial database of the new physics site will contain all of the Theoretical Physics SE questions, plus selected questions from elsewhere (mostly Physics SE for now, I guess). There are some attribution requirements we have to respect (some of the boring tasks can hopefully be automated), and importing Theoretical Physics questions already works reasonably well 🙂

Polarkernel now kindly asks us to specify the most important features we’d like the data import facility to have, so that he does not have to provide everything imaginable by default. As I see it, it would be nice to

  • get some attribution tasks (adding the links and the Physics SE symbol) automated.
  • let the former TP users access their old posts and accounts. This can probably be resolved by setting up an appropriate meta thread where they can reclaim their accounts (if the forgotten password facility cannot be used for some reason).
  • be able to select questions to import from Physics SE data dumps by tag and/or user, to get the most interesting ones without introducing too much low-level noise.
  • be able to import questions from SE data dumps in “real time” while the site is running online.
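To make the third wish concrete, here is a minimal sketch of selecting questions by tag and/or user. It assumes the dump's Posts.xml consists of `row` elements with the attributes `Id`, `PostTypeId`, `OwnerUserId` and `Tags`, and that tags are stored as `<tag-a><tag-b>`; the helper names `parse_tags` and `select_questions` and the sample rows are made up for illustration, and none of this is Polarkernel's actual import code.

```python
import re
import xml.etree.ElementTree as ET


def parse_tags(raw):
    """Split a tag string such as '<string-theory><homework>' into a set."""
    return set(re.findall(r"<([^<>]+)>", raw or ""))


def select_questions(rows, wanted_tags=(), wanted_users=()):
    """Return the Ids of questions matching any wanted tag or any wanted user."""
    wanted_tags = set(wanted_tags)
    wanted_users = {str(u) for u in wanted_users}
    selected = []
    for row in rows:
        # Assumption: PostTypeId "1" marks a question, "2" an answer.
        if row.get("PostTypeId") != "1":
            continue
        tag_hit = bool(wanted_tags & parse_tags(row.get("Tags")))
        user_hit = row.get("OwnerUserId") in wanted_users
        if tag_hit or user_hit:
            selected.append(row.get("Id"))
    return selected


# Tiny made-up stand-in for Posts.xml (not real dump content).
SAMPLE = """<posts>
  <row Id="1" PostTypeId="1" OwnerUserId="10" Tags="&lt;string-theory&gt;&lt;homework&gt;" />
  <row Id="2" PostTypeId="2" OwnerUserId="11" ParentId="1" />
  <row Id="3" PostTypeId="1" OwnerUserId="12" Tags="&lt;kinematics&gt;" />
  <row Id="4" PostTypeId="1" OwnerUserId="11" Tags="&lt;newtonian-mechanics&gt;" />
</posts>"""

rows = list(ET.fromstring(SAMPLE).iter("row"))
print(select_questions(rows, wanted_tags=["string-theory"], wanted_users=[11]))
```

For a full-size dump, streaming the file with `xml.etree.ElementTree.iterparse` instead of loading it whole would probably be the more practical choice.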

Unfortunately, the ability to salvage Physics SE questions in real time while the new physics site is running seems to be becoming more and more important. In the context of upcoming policy changes, Physics SE seems to be gravitating towards closing and deleting everything that is currently tagged homework, including very high-level technical questions on advanced topics, listed in this chatroom, that have wrongly had this tag attached to them (!). The new books policy is finally being implemented; it replaces the rule, enforced for a year, that forbade any questions asking for study material / references as naturally defined and needed by serious students and researchers. However, the relaxed rule applies only to future questions. Old reference / study material questions (again as defined by researchers instead of SE politicians …) will most probably get destroyed, and the information they contained hidden (from the related question detector and the search facility of Physics SE) in graveyards called tag-wikis. Even questions that are only 10 days old (!) and that were asked correctly according to the new books policy will get destroyed rather than reopened … The terrible idea of hiding very useful content in tag-wikis that almost nobody looks at, and that are not searchable for logged-in users (though googlable), came up on MSO of course. Some people, who for some reason are always fond of the worst ideas and policies posted on MSO, from unhelpful to outright destructive, cannot let go and accept that some of these fads may fail on some (non-Trilogy) sites in the network …

… but back to business now ;-). I’d like to ask anybody interested to add further ideas and thoughts in the comment discussion about which features would be nice and important for our data import facility to have.



  1. My answers:

    1. Obviously. We don’t want to get sued.
    2. Obviously. PSE users too, actually. I’d want to be notified if someone comments on one of my answers, or replies to my comments. But note, I think that the idea of having a “meta thread” doesn’t make much sense (it does, actually, but it’s not the entire thing) (contd later).
    3. Obviously.
    4. I don’t think that’s all that important. We can handpick some questions regularly to add to PO. Downloading the entire dump again, un-7zipping it, filtering by tag, and so on, is next to impossible, and boring. Also, the site should be self-sustaining. You can’t have a continuous stream of PSE questions coming in. Some handpicked questions are okay, but not such a stream of questions.

    Now some comments (the third one is more important, actually it’s a continuation):

    * Here’s a design for the PSEsame banner:
    * Here’s one for the TPSEsame one:
    * Here are some tags to filter:
    * When you talk about having a meta thread for people to reclaim the accounts, I don’t understand what you mean (contd later)
    * We should pick certain questions from PhysSEs, and only take those that are either (a) endangered, or (b) really high-level. It’s pretty useless to take a question on, say, basic advanced GR which people are OK with (lots of the polite-icians are not *so* stupid, e.g. even ManishEarth knows GR/basic QFT, I think). But if it’s a question about a certain interesting conjecture or observation (maybe in string theory, there have been a couple of them, I think Mitchell Porter has posted quite a few) or a new Ron Maimon posting ideas on cold fusion : ) , that is going to be pretty interesting, and VERY ENDANGERED. So such a question should be urgently copied over, maybe even tagged as [tag:stackexchange-emergency] : ) .

    Ok, now continuing about the meta thread thing,

    I thought that we’d be having a meta thread to redirect users to a page where they could submit their e-mail, then the script would hash it and check it against the data dump, and then e-mail them their account details?

    Does this not work ?!

    I don’t get your idea. Are we going to blindly give them their account details without actually confirming that they’re genuine, and not hackers?! That’s obviously very dangerous, and probably illegal, too.

    Or do you mean that we let them provide their e-mail, then we manually MD5-hash it and manually send them a confirmation e-mail with their account details? That would be fine, but why wouldn’t that work with an automatic script?

    I’m sure it’s MD5, by the way. I’ve already tested that using my own e-mail.

    • polarkernel says:

      Let me clarify some aspects of migration to Q2A from the technical point of view, in order to give you an impression of the implementation effort that will be required.

      *Migration of selected questions
      Let us assume that a handpicked endangered question has to be imported to Q2A. By question I mean the question itself and possibly a thread of answers and comments. Each of these parts has a corresponding user (maybe not yet created in the Q2A site). Additionally, if you would like to keep the voting data from SE, there will be further users who voted on the question or its answers. Therefore, importing such a thread would mean providing the question, its answers and votes, and all the corresponding data of the different users. I think the extraction of this bunch of data could be complicated, and I do not yet have a solution at hand that would facilitate this task. Inserting this data into Q2A and checking whether the users already exist would be feasible, maybe by developing a plug-in.
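The extraction step described above can be sketched roughly as follows. This is only an illustration under assumptions: posts and comments are taken to be already parsed into dictionaries, answers are assumed to point at their question through a `ParentId` field, comments through `PostId`, and the function name `collect_thread` and the sample records are invented. Voting data is left out of the sketch.

```python
def collect_thread(question_id, posts, comments):
    """Gather a question, its answers, their comments and the user ids involved."""
    question_id = str(question_id)
    # The question itself plus every post whose (assumed) ParentId points at it.
    thread_posts = [
        p for p in posts
        if p.get("Id") == question_id or p.get("ParentId") == question_id
    ]
    post_ids = {p["Id"] for p in thread_posts}
    # Comments attached to any post of the thread.
    thread_comments = [c for c in comments if c.get("PostId") in post_ids]
    # Every user who would have to exist on the target site before the import.
    user_ids = {p.get("OwnerUserId") for p in thread_posts}
    user_ids |= {c.get("UserId") for c in thread_comments}
    user_ids.discard(None)  # posts by deleted users may carry no user id
    return {"posts": thread_posts, "comments": thread_comments, "user_ids": user_ids}


# Made-up sample records, not real dump content.
posts = [
    {"Id": "1", "PostTypeId": "1", "OwnerUserId": "10"},
    {"Id": "2", "PostTypeId": "2", "ParentId": "1", "OwnerUserId": "11"},
    {"Id": "3", "PostTypeId": "1", "OwnerUserId": "12"},
]
comments = [
    {"Id": "100", "PostId": "2", "UserId": "13"},
    {"Id": "101", "PostId": "3", "UserId": "14"},
]

thread = collect_thread(1, posts, comments)
print(sorted(thread["user_ids"]))
```

The returned `user_ids` set is what an import plug-in could check against the existing Q2A users before inserting the posts.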

      *Reclaim of user account
      Users who would like to reclaim their account would have to provide their display name (the original name is not exported to an SE dump) and their email address. Technically, this data could be collected manually in a list of any format (CSV, Excel, XML, …) or by a plug-in developed by us and provided to the user. As proposed by Dimension10, a simple identity check should be preferred. Hashing the email address with MD5 and checking the result against the entry in the SE dump could be done automatically. If the address is found to be valid, it could also be inserted automatically into the corresponding Q2A database field. Q2A then provides a “forgotten password” utility. Usually, the user announces that he has forgotten his password and Q2A sends a random password to the email address stored in the user part of the Q2A database. The user may then access his account with the display name and this password, and of course he can then easily change his password. I have not tried it yet, but maybe we could trigger this utility automatically once the email address has been checked.

      The only situation where the authentication will not work is when users no longer own their email account from 2009. I think there will be only a few of them, and we will find some pragmatic way to set their new email by hand.
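The automatic check described above might look like the following minimal sketch. It assumes the dump stores, per user, a `DisplayName` and an `EmailHash` that is the MD5 hex digest of the trimmed, lowercased address; that normalization is an assumption, and only the use of MD5 itself is reported in the discussion. The names `email_hash` and `verify_claim` and the sample user are hypothetical.

```python
import hashlib


def email_hash(email):
    """MD5 hex digest of the address, trimmed and lowercased (assumed convention)."""
    normalized = email.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def verify_claim(display_name, email, dump_users):
    """Return the dump user Id if name and email hash match an entry, else None."""
    candidate = email_hash(email)
    for user in dump_users:
        same_name = user.get("DisplayName") == display_name
        same_hash = user.get("EmailHash", "").lower() == candidate
        if same_name and same_hash:
            return user.get("Id")
    return None


# Made-up dump entry; the hash is computed here only to build the example.
dump_users = [
    {"Id": "42", "DisplayName": "Alice", "EmailHash": email_hash("alice@example.org")},
]

print(verify_claim("Alice", " Alice@Example.org ", dump_users))    # matching claim
print(verify_claim("Alice", "mallory@example.org", dump_users))    # wrong address
```

A match only shows that the claimant knows the address, so the actual proof of ownership would still come from sending the new password to that address, as in the “forgotten password” flow.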

      *Additional Request
      If possible, it would help me to know whether your requirements apply only during the transient phase while the site is being set up, or for the longer term, maybe even for the lifetime of the site. In the transient phase I can give any sort of advice or even access the database at a low level to make the project run. Later, the state should be stable and aimed at a long and successful life, even if the administrators eventually change.

      • Ah thanks.

        As for migrating endangered questions over once the site is running, I meant doing it manually using a single account (the posts could then be reassigned to the real users if they make an account), since the data dump is released only once every 3 months.

        As for users who no longer have access to their old 2009 email accounts, they would most likely be researchers who changed their affiliation, so I guess the new account could be manually verified.

    • Dilaton says:

      Concerning the reclaiming of the accounts, I first thought that we could do it similarly to what MO did after the move to SE …

      But there may be simpler ways as you and Polarkernel say 😉

      Huh, the link to the PSE banner seems to be the same as the one to the TPSE one …?

      Yes, of course the new site should NOT be a subset of PSE …

      So in the transition period we might need to extract a larger number of questions from PSE (I’d like to back up my posts, for example), and then I agree that we can pick interesting or endangered ones….

      At present it seems that high-level technical questions tagged homework and old book questions are (more or less strongly) endangered. Maybe we could even make an asylum category for them (half joking) or yes a tag …

  2. About endangered questions, here’s another endangered question (or answer):

    Just because it says “Cold fusion is real”, even my comments supporting the answer (like “+1 seems like a good answer, though I can’t fully understand it”) and promoting BPS Overflow (like “Unfortunately, The recent censorship at Physics.SE could get your answer deleted,[…] you may be interested in joining a higher level Physics site, where these kinds of answers are safe. See“) get deleted!

    I am worried about the answer. I should save it in the Wayback Machine.

    • Dilaton says:

      As this question and its answers do not contain too much LaTeX (which does not work properly yet), I could copy the whole thread, together with for example the links to the profiles of the posters, into my test Q2A site, to at least roughly save the content.

      If you do not have a better and more efficient idea, I could maintain a category in my test site for endangered questions which may not be in any data dump if deleted (as data dumps are released only once in a blue moon) …
